Amtrak, B-Movies, Web Development, and other nonsense

Tag: zotero

How I file things

At the end of my blog post about Hammerspoon and case conversion I had this aside: “…it works well for storing electronic articles and books.” I’m going to expand a little on that and how it fits into my scheme for organizing information. Hat-tip to nateofnine for prodding me into sharing this.

The Library of Zotero

A typical Zotero collection. This one contains everything tagged for the GE E60 electric locomotive. Exciting, right?

I use Zotero to keep track of information, mostly related to railroading history. I wrote a post last year about developing a custom translator for importing articles. As of writing I have 1200 items indexed in Zotero: books, chapters, journal and newspaper articles, blog posts, doctoral theses, maps, etc.

Zotero lets you attach things to an entry such as notes or tags. You can also link an external resource. This might be a website, or a link to a local copy of the document if one exists. Zotero lets you sync content to its cloud, but the free tier is limited to 300 MB. If you’re storing article PDFs you’ll exceed that pretty quickly. Leaving aside HTML snapshots, I have 249 articles consuming ~2.5 GB of space.

Let the data flow

Electronic documents come from all over the place: interlibrary loan, online databases, websites, scans I’ve done myself of physical media that I possess but need to store. The only common factor is that they become a PDF and I need to organize them.

I start by having a folder structure with a top-level folder named Articles. Beneath that I organize by publication, with definite articles removed to permit natural sorting (thus The New York Times becomes New York Times). Within each publication I adopt a naming convention of DATE-NAME. Date is of the format YYYY, YYYYMM, or YYYYMMDD, depending on the granularity of the publication. A monthly journal, for example, will go no further than YYYYMM. The purpose of this is to ensure chronological sorting when looking at the articles outside of Zotero.
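As a rough sketch of that convention in Python: the function names here are made up for illustration and aren't part of any tool mentioned in this post. It assumes the only article worth stripping is a leading “The”.

```python
import re
from datetime import date
from typing import Optional


def publication_folder(name: str) -> str:
    """Drop a leading 'The' so publication folders sort naturally."""
    return re.sub(r"^the\s+", "", name.strip(), flags=re.IGNORECASE)


def date_prefix(year: int, month: Optional[int] = None, day: Optional[int] = None) -> str:
    """Return YYYY, YYYYMM, or YYYYMMDD, depending on how much is known."""
    if month is None:
        return f"{year:04d}"
    if day is None:
        date(year, month, 1)  # raises ValueError for an impossible month
        return f"{year:04d}{month:02d}"
    date(year, month, day)  # raises ValueError for an impossible date
    return f"{year:04d}{month:02d}{day:02d}"


print(publication_folder("The New York Times"))  # New York Times
print(date_prefix(1979, 5))                      # 197905
```

Because the prefix is zero-padded and runs from largest unit to smallest, a plain alphabetical sort of the filenames is also a chronological sort.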

For the NAME, I fall back on my Hammerspoon case conversion module. I’ve already created an entry for the item in Zotero, with its full title. I throw that title into the case conversion, get the slug, and append it to the date. This gives me a filename that sorts by date, is human-readable if need be, and is easy to manipulate from the command line. For example, J. David Ingles’ article in Trains magazine from the May 1979 issue entitled “How super are the Superliners?” becomes 197905-how-super-are-the-superliners.
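If you don’t have Hammerspoon handy, the same idea can be approximated in a few lines of Python. This is a sketch only: the slug rules below (lowercase, runs of non-alphanumerics collapsed to a single hyphen) are an assumption and may not match the Hammerspoon module exactly, and slugify and article_filename are hypothetical names.

```python
import re
import unicodedata


def slugify(title: str) -> str:
    """Lowercase the title and join its words with hyphens (assumed rules)."""
    # Fold accented characters to ASCII where possible, drop the rest.
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of non-alphanumeric characters into one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")


def article_filename(date_prefix: str, title: str) -> str:
    """Combine the date prefix and the slug into DATE-NAME."""
    return f"{date_prefix}-{slugify(title)}"


print(article_filename("197905", "How super are the Superliners?"))
# 197905-how-super-are-the-superliners
```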

Putting it all together

The file created, I drop it into the appropriate publication folder, then use the “Add Attachment > Attach Link to File” option in Zotero to associate it with the index. Now, when I double-click on the item in Zotero, it’ll open the file for me to read. The Articles directory tree lives on my Nextcloud which means that (a) there’s redundancy in case something happens to my laptop and (b) I can read the articles on my phone, without needing to have Zotero installed.

Featured image by Sue Peacock (card catalogs) [CC BY-SA 2.0], via Wikimedia Commons

Writing a Zotero translator

Sometimes I joke that I do web development to support my railfanning habit. It’s not entirely true, but it’s always pleasant when the two intersect.

I’m a Trains subscriber. Trains is a monthly publication which serves both those who actually work in the railroad industry and enthusiasts (railfans) like me. Beyond the monthly print publication (which I get electronically, but never mind), Trains publishes a daily news feed called News Wire. There’s lots of good information here on the various comings and goings in the industry, though US-centric.

I use Zotero to index information for research projects–mostly railroading, but other topics as well. There’s an extension for Chrome, Zotero Connector, which lets you import web content directly into Zotero, saving a lot of manual entry. Many publications like The Atlantic and The Washington Post are natively supported. When one isn’t, Zotero makes a best guess based on page structure and metadata. How well that works depends on how well-formed the page is.

This is where we have a problem. The News Wire postings don’t have proper metadata at all–you need to scrape the page to find all the relevant parts. Zotero doesn’t know how to do that. The result is that one article imported with the following values:

Title: Amtrak AEM-7 arrives in Strasburg | Trains Magazine
Author: 12, Wayne Laepple | June
Author: 2015
Website Title: TrainsMag.com
URL: http://trn.trains.com/news/news-wire/2015/06/12strasburg

Not so much. This is all on the page. The title comes from the page title, probably because there are multiple h1 headers defined. The date and the author are commingled in a “byline” div.

Fortunately you can roll your own definition; Zotero calls these “Translators”. There’s a primer which I found useful, though it omitted some steps. The easiest way to proceed is to use Zotero Scaffold, which is an IDE for Mozilla Firefox. I use standalone Zotero with Chrome so I didn’t have Zotero for Firefox installed. If you don’t have it installed, Scaffold will install but will not work. No error messages; it just sits there. This was incredibly frustrating until I realized my error.

Zotero Scaffold will write out a completed definition to the translators directory inside your Zotero data directory. On OS X mine was in /Users/foo/Library/Application Support/Zotero/Profiles/random string/zotero/translators. I’ve posted one to GitHub as a gist: https://gist.github.com/mackensen/981b1d5393e07e8435798eaee843e3fc. A few comments on this:

  • Scaffold takes care of all the front matter, including the GUID.
  • detectWeb and doWeb can be more complex if there are different types of data (such as a search results page). I deliberately provided a narrow page target so we’re only handling single posts.
  • All the terror is in scrape, where I used xpath queries to extract the parts of the page I needed and then format them appropriately. Don’t overlook the utility function cleanAuthor(), which takes an author string and breaks out component parts. In my first iteration I was reading the author and everything seemed fine but it didn’t import into Zotero.
  • Translators are loaded into memory by the browser. If you make a change, you’ll need to reload the browser (boo) or disable and reenable the extension (yay).
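The real translator is JavaScript and lives in the gist above. To show the kind of untangling scrape has to do, here is a sketch of the same logic in Python. It assumes the byline text looks something like “Wayne Laepple | June 12, 2015”, which is a guess based on the mangled import, and every function name below is hypothetical. split_author only imitates what cleanAuthor() does for a simple “First Last” name.

```python
from typing import Dict, Tuple


def clean_title(page_title: str, suffix: str = " | Trains Magazine") -> str:
    """Strip the site name that the page title tacks onto the headline."""
    if page_title.endswith(suffix):
        return page_title[: -len(suffix)]
    return page_title


def split_byline(byline: str) -> Tuple[str, str]:
    """Separate the author from the date (byline format is assumed)."""
    text = byline.strip()
    if text.lower().startswith("by "):
        text = text[3:]
    author, _, date_text = text.partition("|")
    return author.strip(), date_text.strip()


def split_author(name: str) -> Dict[str, str]:
    """Break 'First Last' into parts; a single-word name becomes the last name."""
    first, _, last = name.strip().rpartition(" ")
    return {"firstName": first, "lastName": last, "creatorType": "author"}


author, published = split_byline("Wayne Laepple | June 12, 2015")
print(clean_title("Amtrak AEM-7 arrives in Strasburg | Trains Magazine"))
print(split_author(author), published)
```

The point of the last step is that Zotero wants an author as separate name parts rather than one string, which is why skipping cleanAuthor() can look fine in testing and still fail on import.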

New result, same article:

Title: Amtrak AEM-7 arrives in Strasburg
Author: Laepple, Wayne
Blog Title: Trains News Wire
Date: June 12, 2015
URL: http://trn.trains.com/news/news-wire/2015/06/12strasburg

Yeah, that’s much better.