Friday, 1 February 2013

I'm doing it again. The pdf library organisation, I mean.

As the long-term readers among you might know, I am using EndNote for my citation and bibliography needs. Theoretically, I also keep all the .pdf files of articles that I have in one folder on my system, and all the titles of these articles in the EndNote database.

Theoretically. Which, unfortunately, is not at all what I really have - over time, a large stack of .pdfs has accumulated that are not yet in the database, because adding each one meant opening the file, copying the relevant bibliographical data into the database, then saving the file under a new (and unique) name that also had to be written into the database. Yes, the programme can theoretically attach files to database entries - but I have never gotten the hang of that, and I also prefer to keep things separate in case of disasters.

So here I was, with a stack of pdf files - unsorted, and with quite a few non-articles that had crept in among them - and my database. Enter Qiqqa, a database/citation tool geared entirely towards .pdf collections. (If it were a little less geared towards those, and a bit more open and more import-friendly from EndNote's end, I would have considered switching to it completely.) While checking out Qiqqa, I had already tried to use it for sorting, organising, and EndNote-ing my pdf files, but it turned out to be a tiny bit less trivial than I had thought.

So I have made a clean slate in Qiqqa and have now tackled (again) the task of sorting my files and inputting them into EndNote. It's still a multi-step process, but much less tedious than before. The preparation step was to make three new folders for sorting: one to hold the batch of pdfs for processing, one to hold the exported files, and one for the "rejects" - files with obscure bibliographical data that will have to be entered by hand.

Step one: Move a batch of pdf files from the big heap into the processing folder.
Step two: Import that folder into Qiqqa (or set it as watch folder). The programme will now index (and, if necessary, OCR) those files.
Step three: Use the built-in BibTeX Sniffer to match bibliographical data to the individual files, and delete all the non-articles from the library.
Step four: Make sure to move all the files for hand processing from the processing folder to the "rejects" folder (else they will be lost), then delete them from the library.
Step five: Export bibliographical data to a .bib file.
Step six: Export complete library to the export folder.
Step seven: Convert the BibTeX file to an EndNote .xml file using this nifty little programme.
Step eight: Import bibliographical data into EndNote (excluding duplicates). (I have only had one minor glitch with importing so far, which seems to have been caused by an incompatible record type number.)
Step nine: Add "pdf file available" or something similar into a suitable field of each of the new references (this can be done quickly with "change and move fields").
Step ten: Move all the exported files from the "doc" subfolder in the export folder into the regular folder for referenced pdf files.
Step eleven: Delete everything from the export folder and the processing folder.
Step twelve: Delete all the entries of the library.
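
The purely mechanical file shuffling in steps one, ten, and eleven could also be scripted. What follows is only a minimal sketch in Python, not something I actually use: the function names, the batch size of 20, and all folder paths are made up for illustration. The only detail taken from the steps above is that the exported files sit in a "doc" subfolder of the export folder. Everything that happens inside Qiqqa and EndNote stays manual.

```python
from pathlib import Path
import shutil


def move_batch(heap: Path, processing: Path, batch_size: int = 20) -> list[Path]:
    """Step one: move up to batch_size PDFs from the heap into the processing folder."""
    processing.mkdir(parents=True, exist_ok=True)
    # Only regular files with a .pdf extension (any capitalisation) count.
    pdfs = sorted(
        p for p in heap.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"
    )
    moved = []
    for pdf in pdfs[:batch_size]:
        target = processing / pdf.name
        shutil.move(str(pdf), str(target))
        moved.append(target)
    return moved


def collect_exports(export_docs: Path, library: Path) -> list[Path]:
    """Step ten: move exported PDFs into the folder for referenced PDFs."""
    library.mkdir(parents=True, exist_ok=True)
    moved = []
    for pdf in sorted(export_docs.glob("*.pdf")):
        target = library / pdf.name
        if target.exists():
            # Never overwrite: leave name clashes where they are for a manual look.
            continue
        shutil.move(str(pdf), str(target))
        moved.append(target)
    return moved


def clear_folder(folder: Path) -> None:
    """Step eleven: delete everything inside folder, keeping the folder itself."""
    for item in folder.iterdir():
        if item.is_dir():
            shutil.rmtree(item)
        else:
            item.unlink()
```

Since collect_exports skips files whose names already exist in the target folder, it would be wise to look at what is left in the "doc" subfolder before calling clear_folder on the export folder - otherwise those leftovers are deleted along with everything else.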

Then start over... until everything is processed. It takes some time, but on the other hand, it allows me to be sure I get everything referenced and lets me clear out all the other pdfs that crept in without too many woes. And since I can do this in smaller batches, it's also less overwhelming than adding hundreds of BibTeX entries at once.

(And if this blogpost has made you want a bibliography programme/database, here is a list of those currently available, including EN and Qiqqa.)


A Life Long Scholar said...

My pdf organization system is more basic than that. I do use EndNote to keep track of my references, but instead of the complicated steps you do, I simply save the pdfs to a folder with a name that is relevant to its topic, then within EndNote I record the folder name in the "notes" field, and I have assigned another EndNote field (I chose "name of database" since it was one I didn't think I would need) to record the format of the reference--this is where I tell it if it is pdf, paper photocopy, book, awaiting arrival from an ILL request, or if it was "borrowed from Ron", or "haven't actually seen it".

Later, when I want to find it, I look in EndNote, see which folder it is in (most articles could be in any number of folders, really, as how many touch on only one topic?) and I go straight there.

a stitch in time said...

And that would again mean filling in all the reference details by hand plus recording the folder name. Plus presumably changing the name of the file by hand, since most downloaded articles have a lovely name such as s(number) or a string of numbers... and that is not helpful if I want to open one article. Plus I have lots of stuff on similar topics... that would mean a big folder.

I use the system because it a) saves me from entering most of the reference data by hand and b) takes care of the re-naming of the pdf files, with the system author-title.pdf. And that suits me really well. It's still work, but work of a kind that gets on my nerves less than filling stuff in by hand.