March 29, 2017 – Terrestrial Navigation

I’ve started using Tesseract to add an optical character recognition (OCR) layer to PDFs. What follows are my notes on getting this to a reasonable state, and a word of warning about Preview on Sierra.

Background

I’ve written about my collection of articles before. They’re all PDFs and indexed in Zotero. Their source various: some are distillations of digital documents, some are scans from print or microfilm. Some, but especially the latter, haven’t been through OCR so they aren’t searchable. That’s not a big deal in a 1-2 page article, but in longer works it’s obnoxious. Adding OCR also exposes the text to Spotlight, OS X’s internal search.

Tesseract

Tesseract is the de facto standard for open source OCR solutions. It’s installable via homebrew. That’s easy. I discovered pretty quickly that Tesseract doesn’t work with PDFs out of the box. I wasn’t averse to building a script that pulled apart the PDF, converted each page to a TIFF, did the OCR, and then put it all back together, but I figured someone had crafted a more elegant solution.

pdfsandwich

Per Cornelius’ excellent tutorial, the more elegant solution is Tobias Elze’s pdfsandwich, which wraps all that, plus plenty of additional functionality which I don’t fully grok yet. The code is available via SVN: svn checkout svn://svn.code.sf.net/p/pdfsandwich/code/trunk/src pdfsandwich

You’re going to need a few more dependencies from homebrew, in addition to tesseract:

poppler: provides the pdfunite package
gawk: used during compilation
ocaml: used during compilation
unpaper: handles OCR preprocessing and other things

Those are all the dependencies I needed to install; a bare system probably needs more.

Examples

You invoke pdfsandwich from the command line. pdfsandwich —help gives a rundown of the many options. This is the most basic invocation:

pdfsandwich -lang eng <filename>

This will process the given file and output a new file with the OCR layer included. By default the new file is named . If the file has images you’ll want an additional argument to ensure they’re overlaid correctly. I used the -gray flag for grayscale documents and -rgb for color. Note that while pdfsandwich use multiple threads for processing a large document (50+ pages) will take at least several minutes on a pretty fast MacBook.

By default the command spits out quite a bit of information which you’re free to ignore. Once it’s done you’ll have a new PDF and the text is searchable.

Beware Preview

There’s a pretty nasty bug on OS X which, depending on who you talk to, was introduced in Sierra or has been around for years. In a nutshell, saving a PDF in Preview can corrupt the OCR layer. The text will no longer be searchable nor copyable. Despite reports to the contrary it’s not fixed (at least not for me) in 10.12.3. I’ve taken the precaution of locking the PDFs for now. I’m sure I’ll forget in two months time, but it’ll solve the immediate problem.

S	M	T	W	T	F	S
			1	2	3	4
5	6	7	8	9	10	11
12	13	14	15	16	17	18
19	20	21	22	23	24	25
26	27	28	29	30	31

Day: March 29, 2017

Character recognition with PDFs in OS X

Background

Tesseract

pdfsandwich

Examples

Beware Preview

Blogroll

Arrivals

Tracks

Schedule