Month: March 2017

Character recognition with PDFs in OS X

March 29, 2017 / Charles Fulton

I’ve started using Tesseract to add an optical character recognition (OCR) layer to PDFs. What follows are my notes on getting this to a reasonable state, and a word of warning about Preview on Sierra.

Background

I’ve written about my collection of articles before. They’re all PDFs and indexed in Zotero. Their sources vary: some are distillations of digital documents, some are scans from print or microfilm. Some, especially the latter, haven’t been through OCR so they aren’t searchable. That’s not a big deal in a 1-2 page article, but in longer works it’s obnoxious. Adding OCR also exposes the text to Spotlight, OS X’s internal search.

Tesseract

Tesseract is the de facto standard for open source OCR solutions. It’s installable via homebrew. That’s easy. I discovered pretty quickly that Tesseract doesn’t work with PDFs out of the box. I wasn’t averse to building a script that pulled apart the PDF, converted each page to a TIFF, did the OCR, and then put it all back together, but I figured someone had crafted a more elegant solution.

pdfsandwich

Per Cornelius’ excellent tutorial, the more elegant solution is Tobias Elze’s pdfsandwich, which wraps all that, plus plenty of additional functionality which I don’t fully grok yet. The code is available via SVN: svn checkout svn://svn.code.sf.net/p/pdfsandwich/code/trunk/src pdfsandwich

You’re going to need a few more dependencies from homebrew, in addition to tesseract:

poppler: provides the pdfunite utility
gawk: used during compilation
ocaml: used during compilation
unpaper: handles OCR preprocessing and other things

Those are all the dependencies I needed to install; a bare system probably needs more.

Examples

You invoke pdfsandwich from the command line. pdfsandwich --help gives a rundown of the many options. This is the most basic invocation:

pdfsandwich -lang eng <filename>

This will process the given file and output a new file with the OCR layer included. By default the new file is named after the input, with _ocr appended to the base name. If the file has images you’ll want an additional argument to ensure they’re overlaid correctly. I used the -gray flag for grayscale documents and -rgb for color. Note that while pdfsandwich uses multiple threads, processing a large document (50+ pages) will take at least several minutes on a pretty fast MacBook.

By default the command spits out quite a bit of information which you’re free to ignore. Once it’s done you’ll have a new PDF and the text is searchable.
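If you have a whole folder to process, the invocation above is easy to script. What follows is a minimal sketch in Python, not something from my actual setup: build_command and ocr_folder are made-up names, it assumes pdfsandwich is on your PATH, and it only uses the -lang, -gray, and -rgb flags mentioned above. It also assumes earlier output files carry an _ocr suffix, so that a second run skips them.

```python
import subprocess
from pathlib import Path


def build_command(pdf, lang="eng", color=None):
    """Return the pdfsandwich argument list for one file.

    color may be None, "gray", or "rgb", matching the -gray and -rgb flags.
    """
    if color not in (None, "gray", "rgb"):
        raise ValueError(f"unsupported color mode: {color!r}")
    command = ["pdfsandwich", "-lang", lang]
    if color is not None:
        command.append(f"-{color}")
    # Options go first, the input file last.
    command.append(str(pdf))
    return command


def ocr_folder(folder, lang="eng", color=None, dry_run=False):
    """Run pdfsandwich on every PDF in folder, one file at a time.

    Returns the commands that were (or, with dry_run, would be) executed.
    """
    commands = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        # Assumption: output from an earlier run ends in _ocr; skip it.
        if pdf.stem.endswith("_ocr"):
            continue
        command = build_command(pdf, lang=lang, color=color)
        commands.append(command)
        if not dry_run:
            # check=True stops the batch if pdfsandwich reports a failure.
            subprocess.run(command, check=True)
    return commands
```

Calling ocr_folder("Articles/Trains", color="gray", dry_run=True) returns the commands without running anything, which is a cheap way to check the file list before committing to a long run.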

Beware Preview

There’s a pretty nasty bug on OS X which, depending on who you talk to, was introduced in Sierra or has been around for years. In a nutshell, saving a PDF in Preview can corrupt the OCR layer. The text will no longer be searchable or copyable. Despite reports to the contrary it’s not fixed (at least not for me) in 10.12.3. I’ve taken the precaution of locking the PDFs for now. I’m sure I’ll forget in two months’ time, but it’ll solve the immediate problem.

How I file things

March 22, 2017 / Charles Fulton

At the end of my blog post about Hammerspoon and case conversion I had this aside: “…it works well for storing electronic articles and books.” I’m going to expand a little on that and how it fits into my scheme for organizing information. Hat-tip to nateofnine for prodding me into sharing this.

The Library of Zotero

A typical Zotero collection. This one contains everything tagged for the GE E60 electric locomotive. Exciting, right?

I use Zotero to keep track of information, mostly related to railroading history. I wrote a post last year about developing a custom translator for importing articles. As of writing I have ~~1193~~ 1200 items indexed in Zotero: books, chapters, journal and newspaper articles, blog posts, doctoral theses, maps, etc.

Zotero lets you attach things to an entry such as notes or tags. You can also link an external resource. This might be a website, or a link to a local copy of the document if one exists. Zotero lets you sync content to its cloud, but the free tier is limited to 300 MB. If you’re storing article PDFs you’ll exceed that pretty quickly. Leaving aside HTML snapshots, I have 249 articles consuming ~2.5 GB of space.

Let the data flow

Electronic documents come from all over the place: interlibrary loan, online databases, websites, scans I’ve done myself of physical media that I possess but need to store. The only common factor is that they all become PDFs and I need to organize them.

I start by having a folder structure with a top-level folder named Articles. Beneath that I organize by publication, with definite articles removed to permit natural sorting (thus The New York Times becomes New York Times). Within each publication I adopt a naming convention of DATE-NAME. Date is of the format YYYY, YYYYMM, or YYYYMMDD, depending on the granularity of the publication. A monthly journal, for example, will go no further than YYYYMM. The purpose of this is to ensure chronological sorting when looking at the articles outside of Zotero.

For the NAME, I fall back on my Hammerspoon case conversion module. I’ve already created an entry for the item in Zotero, with its full title. I throw that title into the case conversion, get the slug, and append it to the date. This gives me a filename that sorts by date, is human-readable if need be, and is easy to manipulate from the command line. For example, J. David Ingles’ article in Trains magazine from the May 1979 issue entitled “How super are the Superliners?” becomes 197905-how-super-are-the-superliners.
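The whole convention fits in a few lines of code. The sketch below is in Python rather than the Lua of my Hammerspoon module, and it’s an approximation of the idea, not the module itself: slugify, date_prefix, publication_folder, and article_stem are names invented for this example, and the slug rules (lowercase, ASCII only, hyphens between words) are my assumption about what a reasonable slug looks like.

```python
import re
import unicodedata
from datetime import date


def slugify(title):
    """Lowercase the title and join its words with hyphens."""
    # Fold accented characters to ASCII; anything else non-ASCII is dropped.
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Drop apostrophes so a possessive does not split into two words.
    ascii_title = ascii_title.replace("'", "")
    return "-".join(re.findall(r"[a-z0-9]+", ascii_title.lower()))


def date_prefix(year, month=None, day=None):
    """Return YYYY, YYYYMM, or YYYYMMDD depending on what is known."""
    if day is not None and month is None:
        raise ValueError("a day needs a month")
    # Let datetime reject impossible dates; missing parts default to 1.
    date(year, 1 if month is None else month, 1 if day is None else day)
    prefix = f"{year:04d}"
    if month is not None:
        prefix += f"{month:02d}"
    if day is not None:
        prefix += f"{day:02d}"
    return prefix


def publication_folder(name):
    """Strip a leading "The" so publication folders sort naturally."""
    return re.sub(r"^the\s+", "", name.strip(), flags=re.IGNORECASE)


def article_stem(title, year, month=None, day=None):
    """Build the DATE-NAME part of the filename, without an extension."""
    return f"{date_prefix(year, month, day)}-{slugify(title)}"
```

With these, article_stem("How super are the Superliners?", 1979, 5) gives 197905-how-super-are-the-superliners, and publication_folder("The New York Times") gives New York Times.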

Putting it all together

The file created, I drop it into the appropriate publication folder, then use the “Add Attachment > Attach Link to File” option in Zotero to associate it with the index. Now, when I double-click on the item in Zotero, it’ll open the file for me to read. The Articles directory tree lives on my Nextcloud which means that (a) there’s redundancy in case something happens to my laptop and (b) I can read the articles on my phone, without needing to have Zotero installed.

Featured image by Sue Peacock (card catalogs) [CC BY-SA 2.0], via Wikimedia Commons

Writing LDAP unit tests for a Moodle plugin

March 8, 2017 / Charles Fulton

In 2016 Lafayette College began maintaining the LDAP Syncing Scripts (local_ldap) plugin after the tragic death of the previous maintainer, Patrick Pollet.

I didn’t know Patrick but he had a strong reputation in the Moodle community. I’m pleased to say that we made few substantive changes to his code. Most of the changes were simple updates, such as migrating the command-line/cron scripts to Moodle’s task infrastructure, and various nit-picky code standards issues which didn’t affect functionality.

PHPUnit

The biggest lift was implementing PHPUnit test coverage for the plugin. I started out with the following requirements:

Fully-scripted setup for OpenLDAP, so that the tests can run inside a continuous integration environment
Test coverage for group synchronization
Test coverage for attribute synchronization

I started this project by building an OpenLDAP environment inside Moodle Hat, the Vagrant development profile I maintain. Implementing a configuration in Puppet is good practice for wrestling with Travis.

Starting from scratch with OpenLDAP (every time!) presents certain challenges that you don’t encounter in a mature environment. A few I encountered:

When you bootstrap OpenLDAP it has a completely empty schema. PHP’s ldap libraries can’t talk to it in that state. You have to populate it with some data, even if it’s completely arbitrary.
The choice of backend database matters. LDIF is the quick and easy path, but it doesn’t support pagination and Moodle will break in obscure ways. I chose bdb because it’s available in most repositories and it worked.
When you’re setting a generic testing password in your slapd.conf you can just dump in rootpw SomeArbitraryPlaintextPassword and it’ll work. Don’t run in production! Or, really, anywhere that has state.

Once I’d worked through those issues Christian Weiske’s invaluable blog post provided everything I needed for implementing on Travis.

Travis

LDAP Syncing Scripts leverages Moodlerooms’ excellent moodle-plugin-ci plugin for travis-ci integration, with a few tweaks. The full .travis.yml file is visible on the GitHub repository; let me walk through a few things.

We need the slapd and ldap-utils packages installed. To use Moodle’s built-in LDAP PHPUnit testing we need to define the location of the test server in the config file:

define("TEST_AUTH_LDAP_HOST_URL", "ldap://localhost:3389");
define("TEST_AUTH_LDAP_BIND_DN", "cn=admin,dc=example,dc=com");
define("TEST_AUTH_LDAP_BIND_PW", "password");
define("TEST_AUTH_LDAP_DOMAIN", "dc=example,dc=com");

We need to create an INI file to force PHP (on Travis) to load the ldap extension, and a slapd.conf file to define how our OpenLDAP environment will function. The schema settings need to match what you added to Moodle’s config.php. We start slapd and then, as the final step, import our default data. This data isn’t used, but it gets around the problem of an empty schema. Note that while this data is stored as an ldif file for readability, the backend is bdb.
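To show which values have to line up, here is a rough sketch that renders a slapd.conf and a seed LDIF from the same suffix, bind DN, and password used in the define() lines above. It is not the configuration from the repository. The function names, the /tmp/slapd-test directory, and the Debian-style schema path are all assumptions made for the example, and a packaged slapd may also need modulepath and moduleload lines before it accepts the bdb backend.

```python
def render_slapd_conf(suffix, root_dn, root_pw, directory,
                      schema_dir="/etc/ldap/schema"):
    """Render a bare-bones slapd.conf for a throwaway test server."""
    lines = [
        # Assumption: these schema files live under schema_dir.
        f"include {schema_dir}/core.schema",
        f"include {schema_dir}/cosine.schema",
        f"include {schema_dir}/inetorgperson.schema",
        f"include {schema_dir}/nis.schema",
        "",
        # bdb rather than ldif, because the ldif backend lacks pagination.
        "database bdb",
        f'suffix "{suffix}"',
        f'rootdn "{root_dn}"',
        # Plaintext password: acceptable only for a disposable test server.
        f"rootpw {root_pw}",
        f"directory {directory}",
    ]
    return "\n".join(lines) + "\n"


def render_seed_ldif(suffix, organization="Example"):
    """Render one arbitrary entry so the directory is not empty."""
    attribute, _, value = suffix.split(",")[0].partition("=")
    if attribute.strip().lower() != "dc" or not value:
        raise ValueError("this sketch expects a suffix that starts with dc=")
    lines = [
        f"dn: {suffix}",
        "objectClass: top",
        "objectClass: dcObject",
        "objectClass: organization",
        f"o: {organization}",
        f"dc: {value.strip()}",
    ]
    return "\n".join(lines) + "\n"


# The same values as TEST_AUTH_LDAP_DOMAIN, _BIND_DN, and _BIND_PW.
SUFFIX = "dc=example,dc=com"
conf = render_slapd_conf(SUFFIX, "cn=admin," + SUFFIX, "password",
                         "/tmp/slapd-test")
ldif = render_seed_ldif(SUFFIX)
```

The port is not part of slapd.conf; it is set when slapd is started, and it has to agree with the ldap://localhost:3389 URL in TEST_AUTH_LDAP_HOST_URL.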

Tests

The actual tests I derived from the tests for Moodle’s auth_ldap plugin. The code is long but self-documenting. There are no particular gotchas, though I found it helpful to extend auth_ldap_plugin_testcase instead of starting fresh.

Remembering the Hoosier State

March 1, 2017 / Charles Fulton

Today Amtrak resumes full operation of the Hoosier State, ending an 18-month experiment in which Iowa Pacific provided rolling stock and on-board services. I had an opportunity to take this unusual train in June 2016 while on a business trip to Indianapolis. What follows are my notes on the experience.

Aboard the Hoosier State

We’ve just wrapped up CLAMP’s Hack/Doc Fest at Butler University in Indianapolis, Indiana. You can read Ken Newquist’s daily updates to see how the conference went, including all the gory details on the updated annotation in Moodle 3.1. Short version: it needs love. I’d like to talk about a most unusual aspect of the conference: Amtrak and Iowa Pacific’s Hoosier State, which runs between Chicago and Indianapolis.

Background

Horizon coaches on the Lincoln Service in 2009. These are a common sight on the Hoosier State. Photo by Jeramey Jannene from Milwaukee, WI, United States of America (Train Boarding) [CC BY 2.0], via Wikimedia Commons

Train service between Chicago and Indianapolis has a tortured history in the Amtrak era. Amtrak does not, except in a few cases, own its own track, and must therefore rely on access to rails owned by private freight companies. The most direct routes to Indianapolis were abandoned or downgraded in the 1970s and 1980s as freight traffic moved elsewhere. Amtrak trains which serve Indiana have repeatedly moved to less desirable routes in order to maintain service.

By the 2010s the Hoosier State operated with a couple coaches and no food service on a slow, bumpy, five-hour journey. The train ran quad-weekly; on the other days the long-distance Cardinal ran over the route.

Indiana takes control

The federal Passenger Rail Investment and Improvement Act 2008 (PRIIA 2008) required states to provide funding for short-haul trains operating within their borders beginning in 2013. Indiana cobbled together an agreement involving state and local money for 2013, but only after prolonged debate. It was clear that Indiana wasn’t satisfied with its level of service and sought alternatives.

For 2014 Indiana decided to make a change. The discussions, negotiations, false starts, and accusations would take forever to recount here and make my head hurt. The end result is that Indiana brought in a private company, Iowa Pacific, to provide rolling stock, marketing, and on-board service personnel. Amtrak personnel operate the train itself; Amtrak sells tickets and the train remains part of Amtrak’s system. The new train began running on August 2, 2015 and after some hiccups it seems to be a success. According to a recent press release, ticket revenue is up, customer satisfaction is up, and delays are down.

Hack/Doc at Butler gave me a perfect opportunity to inspect this strange new service.

First impressions

Iowa Pacific coaches on the Hoosier State. Photo by David Wilson from Oak Park, Illinois, USA (20170205 13 Hoosier State @ Rensselaer, Indiana) [CC BY 2.0], via Wikimedia Commons

I haven’t been on this route in six years. In 2010, after the Hack/Doc at DePauw, I caught the Cardinal from Indianapolis back up to Chicago. I remember an unpleasant station, a crowded cafe car, and a slow slog through rail yards south of Chicago. The brioche French toast was pretty good.

We board from the Metropolitan Lounge in Chicago, and you can’t miss the Hoosier State. Most of Amtrak’s equipment is stainless steel with red, white, and blue striping. Iowa Pacific has painted the cars for this service in classic black-and-orange Illinois Central Railroad livery. The cars themselves are unusual: an old Santa Fe Big Dome, the Summit View, and old Budd coaches with large picture windows.

Inside I’m impressed by the leg room. We’re in the Du Quoin, a 44-seat leg-rest coach. The leg room is comparable to a long-distance Amfleet coach, maybe even better. There’s a proper 120V wall outlet and folding tray table. The picture windows have blinds which you can pull down. The upholstery is the sort I’d expect to find on a couch or easy chair in my grandfather’s living room.

The bathroom is most unusual. It’s off to the side, about the size of a bedroom. You enter and there’s a couch, and two sinks. The toilet and urinal are in a separate room which locks. Ken likened it to a receiving room.

Crawling out of Chicago

The creature comforts on Iowa Pacific can’t do much about the route. You really do feel as though you’re sneaking out of Chicago. It’s 28 miles from Chicago to the first station stop in Dyer, Indiana. It’s timetabled for 90 minutes. A car could do it in half the time, barring shenanigans on the Dan Ryan Expressway.

After a brief run down the Chicago Line and Metra track we hit the Belt Railway of Chicago at 75th Street and slow to a crawl. I took these notes at the time:

“Pretty good run here, but we’re doomed once we hit the Belt. There’s no escape from the BRC. Yep, we’re on the Belt and everything is slow and grinding. No one around me shares my concerns about the Belt.”

We use the Belt to reach the Union Pacific Railroad’s Villa Grove Subdivision. This is a slow crawl through the south side, and it reaches its nadir at Riverdale. We have to cross the Little Calumet River, but there’s a freight train ahead of us which has to be recrewed and we need to cross a major interlocking. Having done all that, we will crawl past Dolton Yard.

We hit Dyer a little early at 6:50 PM. I can’t emphasize enough how frustrating the stretch on Union Pacific is.

Dinner time

The dining area on the lower level of the dome car. Photo by David Wilson from Oak Park, Illinois, USA (20170205 09 on board Hoosier State) [CC BY 2.0], via Wikimedia Commons

Business class passengers get meals as part of their booking (and sit in the dome!), but the dining car is open to everyone. We’re summoned by destination, with the Indianapolis passengers going last. As with standard Amtrak trains, Iowa Pacific practices community seating. This is usually a high point of the trip. You never know who you’re going to be with.

As we hit Dyer I find myself seated in the lower level of the Summit View with my boss and a newlywed couple from Marion, Ohio. Their kids are down in Texas so they honeymooned in Chicago for the weekend. I had the pork chop, asparagus, potatoes, and a side salad. IPA to drink. Overall it was pretty good. I found the food service a little more competent than average Amtrak. I think they were a little overwhelmed—too many people summoned at once. No real complaint though.

What is to be done?

We made good time all the way to Indianapolis. The coaches rode well and we didn’t encounter many problems once we got on to CSX in Indiana. The A/C wasn’t working in our coach but it didn’t bother me much. We arrived at our destination before midnight and headed off to our hotel while the Hoosier State headed for the yard.

Fast-forward to March 2017. I appreciated the improvements in on-board service but wondered whether Iowa Pacific could really be making a go of it. The answer is that they couldn’t, although that may be more due to the parent company’s problems than the Hoosier State itself. Certainly they couldn’t have done it without Amtrak’s incremental access rights and logistical support.

What’s next? Amtrak probably puts Horizon coaches back on the Hoosier State. With everything else that’s going on, a quad-weekly train from Chicago to Indianapolis isn’t high on their list of priorities. Iowa Pacific could afford to give this train individual attention; Amtrak can’t. The real question is this: what will Indiana do next? They don’t want to kill the train but they don’t like the level of service Amtrak gives them.

What needs to happen, but won’t, is finding a better route out of Chicago. That means real money; tens if not hundreds of millions of dollars.

Featured image by David Wilson from Oak Park, Illinois [CC BY 2.0], via Wikimedia Commons
