Terrestrial Navigation

Amtrak, B-Movies, Web Development, and other nonsense


Running Moodle CI tests in GitLab

I maintain about a dozen plugins in the Moodle plugins repository. I use Moodlerooms’ moodle-plugin-ci project to test all of them on Travis CI. It’s a fantastic tool and I rely heavily on it.

This fall I’ve been working on a plugin which, because of various hooks into Lafayette’s internal infrastructure, I’m not releasing publicly. I’d still like to test it in the usual way, so I decided to run the tests on our internal GitLab infrastructure.

Continue reading

Implementing a course archiving strategy in Moodle

A course archiving strategy is the white whale of higher education. I can remember being part of discussions a decade ago about how to keep a Moodle instance at a manageable size while preserving information. There were two challenges: come up with a policy that faculty could support, and execute a reasonable technical implementation of that policy. In this post I’ll discuss the tools we built to implement our chosen policy.

Continue reading

Measuring activity in Moodle

It’s a simple question with a complex answer: in a given academic term, what percentage of our Moodle courses are “active” (used by a faculty member in the teaching of their course)? We have to start by figuring out what counts as a “course” in a term, and then come up with an inclusive measurement of activity.

Continue reading

Pick a date, any date

Moodle 3.2 introduced the concept of end dates for courses. Moodle 3.3 added a new Course Overview block which uses end dates to determine whether a course is in progress, in the past, or in the future. This is pretty great, unless you’re in the following situation:

  • Your school has five years worth of courses
  • Those courses don’t have end dates

Congratulations: you now have five years of courses in progress. Your faculty will have five pages’ worth of past courses cluttering the Course Overview block! That’s probably undesirable. To avoid it, I’m writing a plugin that lets an administrator set course start and end dates at the category level. While working on it, I ran into an interesting edge case with Behat acceptance tests, reminding me that you’d best treat Behat like it’s a real user.

Continue reading

WordPress and partial content

Eighteen months ago we had an anomalous problem where video playback didn’t work on some, but not all, of our WordPress multisites. Videos wouldn’t play, or would play but wouldn’t seek. The problem was confined to local uploads embedded in a page. Videos from YouTube played fine; if you viewed the video directly playback worked as expected.

The problem turned out to be a long-standing issue with how ms-files.php served up files from pre-WordPress 3.5 multisites. Solutions had floated around for years. Our challenge was describing the problem with enough specificity to actually find the right solution.

Continue reading

Overlaying dependency management

I’ve described how Lafayette’s deployment strategy involved pushing rocks uphill. A key change in our thinking came when we started treating each of our WordPress multisite installations as its own software project, with its own dependencies and lifecycle. Enabling this pivot was a technology which wasn’t mature in 2013: Composer.

What is Composer?

Composer is a package manager for PHP. It fills a role similar to npm for Node.js and bundler for Ruby. It uses a JSON file to capture metadata about the project and the project’s dependencies. If you have a custom application built on Symfony and Silex, your composer.json file might look like this:


{
    "name": "outlandish/wpackagist",
    "description": "Install and manage WordPress plugins with Composer",
    "require": {
        "php": ">=5.3.2",
        "composer/composer": "1.3.*",
        "silex/silex": "~1.1",
        "twig/twig": ">=1.8,<2.0-dev",
        "symfony/console": "*",
        "symfony/filesystem":"*",
        "symfony/twig-bridge": "~2.3",
        "symfony/form": "~2.3",
        "symfony/security-csrf": "~2.3",
        "symfony/locale": "~2.3",
        "symfony/config": "~2.3",
        "symfony/translation": "~2.3",
        "pagerfanta/pagerfanta": "dev-master",
        "franmomu/silex-pagerfanta-provider": "dev-master",
        "doctrine/dbal": "2.5.*",
        "knplabs/console-service-provider": "1.0",
        "rarst/wporg-client": "dev-master",
        "guzzlehttp/guzzle-services": "1.0.*"
    },
    "config": {
        "bin-dir": "bin"
    },
    "autoload": {
        "psr-4": {
            "Outlandish\\Wpackagist\\": "src/"
        }
    }
}

The real action here is in the require block, where we spell out all the different packages that compose our application. Each key/value pair is an individual package and its version constraint. Composer uses semantic versioning, and supports a wide range of constraint expressions. Here are some quick examples:

  • “*”: wildcard; highest version possible (probably don’t ever do this)
  • “2.*”: highest version within the 2.x major version
  • “~1.2”: highest version at least 1.2 but below 2.0

The key is the package name, in vendor/project format. By default Composer assumes that you’re installing these packages from Packagist, but as we’ll see that’s just the beginning of the Composer ecosystem.
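
The same constraints can be exercised from the command line with composer require. The package names below come from the example above; the versions are chosen purely for illustration:

# Versions here are illustrative
# Newest release in the 2.5.x series
composer require "doctrine/dbal:2.5.*"

# At least 1.2, but below 2.0
composer require "pagerfanta/pagerfanta:~1.2"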

What is Packagist?

Packagist is a centralized Composer repository which anyone can register packages on. It’s full of packages like the ones listed in the example above. Given a specific package and version constraint, it returns the matching files. Packagist is special inasmuch as it’s available by default for every Composer project, but it’s possible to define additional repositories if you have packages which aren’t in Packagist.

What Repositories, where?

Let’s say you’ve got a private project that you can’t publish to Packagist, but you’d like to make it available to an internal Composer project. Composer has a number of options for doing this. The simplest is adding your version control system (VCS) repository to your project:


{
  "name": "yourcompany/yourproject",
  "description": "Your sample project",
  "repositories": [
    {
      "type": "vcs",
      "url": "https://github.com/yourcompany/someotherproject"
    }
  ]
}

This means that Composer will scan that repository, in addition to Packagist, when searching for packages. All you need to do to make this work is to add a reasonable composer.json file to that private project.
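
For illustration, the composer.json in that private project can stay minimal. The package name and namespace below are hypothetical, matching the repository URL above:

{
    "name": "yourcompany/someotherproject",
    "description": "Hypothetical private package served from the VCS repository above",
    "require": {
        "php": ">=5.3.2"
    },
    "autoload": {
        "psr-4": {
            "YourCompany\\SomeOtherProject\\": "src/"
        }
    }
}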

This is fine for one or two packages, but becomes unwieldy with a dozen or more. This is where a project like Satis becomes useful. Satis takes a list of repositories and builds a lightweight Composer repository out of them. This way your internal projects only need to include the Satis repository; as you add new VCS repositories to Satis, they become available to your downstream projects:


{
  "name": "yourcompany/yourproject",
  "description": "Your sample project",
  "repositories": [
    {
      "type": "composer",
      "url": "https://satis.yourcompany.com"
    }
  ]
}
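
On the Satis side, the configuration is itself a small JSON file listing the VCS repositories to scan. A minimal sketch, with placeholder names and URLs, might look like this:

{
    "name": "yourcompany/satis",
    "homepage": "https://satis.yourcompany.com",
    "repositories": [
        { "type": "vcs", "url": "https://github.com/yourcompany/someotherproject" },
        { "type": "vcs", "url": "https://github.com/yourcompany/anotherproject" }
    ],
    "require-all": true
}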

What’s this got to do with WordPress?

Composer’s structure lets you overlay a dependency management system on existing code without too much pain. With your own private packages, adding composer.json files is straightforward. Obviously you’re not going to get composer.json files added to every package in the WordPress theme and plugin repositories.

Fortunately, this is a solved problem. The fine folks at Outlandish maintain the WordPress Packagist Composer repository. WPackagist scans the WordPress theme and plugin Subversion repositories, mapping projects and versions in a way that Composer can understand. The themes and plugins themselves are downloaded as zip files and extracted. At a stroke, all themes and plugins from WordPress.org are available to your project along with your private projects:


{
  "name": "yourcompany/yourproject",
  "description": "Your sample project",
  "repositories": [
    {
      "type": "composer",
      "url": "https://satis.yourcompany.com"
    },
    {
      "type": "composer",
      "url": "https://wpackagist.org"
    }
  ]
}
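
With both repositories declared, the require block can mix your private packages with WordPress.org code; WPackagist publishes the latter under the wpackagist-plugin and wpackagist-theme vendor names. The particular packages and constraints below are only examples, so the relevant portion of composer.json might read:

{
    "require": {
        "yourcompany/someotherproject": "~1.0",
        "wpackagist-plugin/contact-form-7": "4.4.*",
        "wpackagist-theme/twentyseventeen": "*"
    }
}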

This is far more efficient than converting repositories to git, and you arrive at the same end result: specifying a known version constraint for a given project. Next up: how you actually deploy a WordPress site with this stuff.

Don’t push rocks uphill

Pictured: developer pushing rocks uphill

For three years Lafayette’s official WordPress deployment strategy was to push rocks uphill. This was a doubtful plan, but it represented an improvement over its predecessor, which was to stare at the rocks doubtfully, then roll them around a field at random. Here follows a warning to others.

Git all the things!

In 2013 we had embraced git with the fervor of the converted. Applying this to WordPress was difficult. WordPress.org gave us two options for getting themes and plugins:

  • Download them as a ZIP file manually
  • Clone them from subversion

These weren’t great options. We weren’t going to adopt a strategy which incorporated subversion as part of the deployment itself, and we didn’t want to lose revision information with a manual download process.

We hit upon the strategy of pushing rocks uphill. We set up a platform to clone the WordPress themes and plugins we needed and then convert them from SVN to git, using the svn2git Ruby gem. We pushed the result into a private git repository. This git repository was then added as a submodule to our WordPress deployment repository.

This was cumbersome and time-consuming. The WordPress theme and plugin SVN repositories are massive. The initial conversion of a module could take hours, or just fail. Repository structures varied according to the whims of the plugin maintainers. Tagging was inconsistent. WordPress doesn’t encourage atomic committing to SVN, which undercut the value of having commit messages. Maintaining a private repository for each theme and plugin added significant overhead.

Submodules: threat or menace?

Deployment in progress. Taken by Peter Standing [CC BY-SA 2.0], via Wikimedia Commons

We haven’t even talked about submodules. With a submodule you nest a git repository inside a git repository. The top-level repository has a .gitmodules file which tracks the location of the submodule remote; the revision history tracks which commit should be checked out in the submodule.

In a sample WordPress deployment, you would have your WordPress git repository, and you would attach your theme and plugin submodules. You then clone this repository onto your web server, and update from git as needed. This works, but it’s not as slick as it sounds.

Time was I wouldn’t hear a word said against them. That time is past. They’re a kludge and should be used sparingly. Most of the “Submodules considered harmful” think pieces focus on their use in development. Here are a couple: Why your company shouldn’t use Git submodules and Git: why submodules are evil. Their use is more defensible in a deployment context, but there are still problems:

  1. When you clone a git repository which has submodules, the submodules have to be initialized and updated separately.
  2. When you update a repository, the submodules have to be updated as well. A sample command would be git submodule foreach git fetch --all; a fuller sequence is sketched after this list.
  3. With deployments, you’re now worried about the state of each git repository, submodule or no.
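
In practice, a deployment under this scheme looked something like the following; the repository URL and target path are placeholders:

# Initial deployment: clone the top-level repo, then pull in every submodule
# (URL and path are placeholders)
git clone https://git.example.com/wordpress-deploy.git /var/www/wordpress
cd /var/www/wordpress
git submodule update --init --recursive

# Subsequent updates: update the top-level repo, then move submodules to the recorded commits
git pull
git submodule sync
git submodule update --init --recursive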

With a collection of shell scripts this is manageable, but again it’s a lot of overhead. It also doesn’t self-document very well. Looking at my project repo, I can run git submodule status and get a mixture of useful and not useful information:


 20e6e064792e9735157d88f97eece9c5aef826a8 wp-content/plugins/conditional-widgets (2.2)
 90c74decfb020fdaa255bba68acb142550dfac35 wp-content/plugins/contact-form-7 (4.4.1)
 13cfa86ceb8001438b0fec9ea3a5a094d42e2397 wp-content/plugins/custom-field-template (2.3.4)
 292e34607378de0b4352ba164ccf7e1ecdaa44e9 wp-content/plugins/mathjax-latex (v1.1-59-g292e346)

This is okay as far as it goes, but I’m at the mercy of what’s in the submodules: I can’t rely on those version labels being human-readable, and I can’t do anything useful with them. If I want to update a module, I’ve got to push the new code through the pipeline, then update the submodule on my local machine, commit, and then send it out to the web server.

Also, you can’t easily remove a submodule, which you might want to do if you’re deleting a plugin. The last time I did this, I had to follow this process (the equivalent commands are sketched after the list):

  1. Using a text editor, delete its entry from .gitmodules
  2. Stage that change
  3. Using a text editor, delete its entry from .git/config
  4. Run git rm --cached path/to/submodule
  5. Run rm -rf path/to/submodule
  6. Commit everything
  7. Push to your repository
  8. Repeat steps 3-5 when updating on the remote
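
In command form, those steps come out roughly as follows, using git config in place of a text editor for the two config edits; the plugin path is hypothetical:

# Remove the submodule's entries from .gitmodules and .git/config (path is hypothetical)
git config -f .gitmodules --remove-section submodule.wp-content/plugins/old-plugin
git add .gitmodules
git config --remove-section submodule.wp-content/plugins/old-plugin

# Drop the submodule from the index and the working tree, then ship it
git rm --cached wp-content/plugins/old-plugin
rm -rf wp-content/plugins/old-plugin
git commit -m "Remove old-plugin submodule"
git push

# On the web server, repeat the .git/config edit and the rm -rf after pulling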

A developer removes a submodule

Rocks. Uphill. Pushing. Allegedly this has gotten better in the last few years, but (a) we’ve moved on and (b) the version of git on our servers is old enough that it wouldn’t matter anyway.

Do you have a better idea?

The mistake here was forcing the WordPress modules into an unnatural path. They have versions on the WordPress.org repository; we should have adopted a method that leveraged those versions, rather than re-creating them in our own private repositories. Next up: overlaying dependency management.

Apparating a source

This is a story about adding a little knowledge to the public internet.

Three M-K TE70-4S locomotives on their initial run in 1978. Photo by Roger Puta.

A few weeks ago I went looking for information on the Morrison-Knudsen TE70-4S diesel locomotive. It’s an oddball: a rebuild of a General Electric U25B with a Sulzer engine. Only four conversions were done, all for the Southern Pacific Railroad, and they were unsuccessful. There isn’t much information on the internet about them.

Googling turned up a curious reference: Application of Sulzer 12ASV 25/30 Diesel Engines to M-K TE70-4S Locomotives. 12 pages by a J. G. Fearon in 1979 devoted to the subject, published at the time of the conversion. Sounds great. Except…

  • This is the only search result for this title
  • No one offers this “book” for sale
  • No library counts it among its holdings, although it’s in WorldCat

What the hell is this thing?

In the course of researching the Turboliner and the General Electric E60 electric locomotive, I spent some time with industry journals such as Mechanical Engineering, Transportation Research Record, and the various conference proceedings that are published by IEEE. The functional title certainly sounded like a conference paper.

I drew a blank on IEEE Xplore, but its full-text coverage only reaches back to 1988, with exceptions. Next up was HathiTrust. HathiTrust can be irritating to work with: it doesn’t offer full text for many things, and when searching journals the results aren’t more specific than the volume, which makes it hard to identify items for interlibrary loan.

The first interesting result was for Conference papers index, which reported a hit for 1979. That confirmed my guess (without details) that Fearon, who worked for Morrison-Knudsen, presented on the topic.

The second was a hit for both the author name (common enough) and the locomotive name (very uncommon) in the same volume of Mechanical Engineering. A scrolled reel of microfilm later, I had an entry in the “Technical Digest” (meeting abstracts) from ASME, and there it was: J. G. Fearon, “Application of Sulzer 12ASV 25/30 Diesel Engines to M-K TE70-4S Locomotives.” Presented at the ASME Energy Technology Conference, November 5-9, 1978. Either it was also issued in pamphlet form, or someone at Google erred. Either way, I was grateful for the pointer.

The next step was requesting the conference proceedings. Time passed. The proceedings arrived and…no joy. Turns out there’s an important difference between the “Petroleum Division” and the “Diesel & Gas Engine Power Division” of the ASME, although both were present at this conference. I’m sure there’s a story. I put in a request for those proceedings and crossed my fingers.

No dice; only one library has those proceedings in that format, and they weren’t giving them up. After various blind alleys I discovered, somehow, that “Paper”, published by the ASME, is recognized as a serial by a couple dozen libraries in the United States (ISSN 0402-1215). This serial encompasses all the papers given at ASME conferences throughout the year. You can see this record from Cornell for an example of what a nightmare the various ASME codes and numbers are.

The next step was identifying a library which had holdings for DGP from 1978, including paper No. 15. At this point I cheated: I knew someone at a library which held it and was able to bypass the ILL process. I’m pretty confident, however, that an ILL request for that paper and author, specifying the ISSN, year, and paper code, would ultimately have been successful.

The moral of the story is that there’s a wealth of information which, even now, is effectively buried unless you really know where to look. For that unusual Google Books entry, I’m not sure I’d have thought to throw the query at HathiTrust, though I will from now on. Even then, I caught a break that someone had digitized the conference paper index. There are plenty of journals and books which have never been digitized and are undiscoverable via the public internet.

Character recognition with PDFs in OS X

I’ve started using Tesseract to add an optical character recognition (OCR) layer to PDFs. What follows are my notes on getting this to a reasonable state, and a word of warning about Preview on Sierra.

Background

I’ve written about my collection of articles before. They’re all PDFs and indexed in Zotero. Their sources vary: some are distillations of digital documents, some are scans from print or microfilm. Some, especially the latter, haven’t been through OCR, so they aren’t searchable. That’s not a big deal in a 1-2 page article, but in longer works it’s obnoxious. Adding OCR also exposes the text to Spotlight, OS X’s internal search.

Tesseract

Tesseract is the de facto standard for open source OCR solutions. It’s installable via homebrew. That’s easy. I discovered pretty quickly that Tesseract doesn’t work with PDFs out of the box. I wasn’t averse to building a script that pulled apart the PDF, converted each page to a TIFF, did the OCR, and then put it all back together, but I figured someone had crafted a more elegant solution.
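
If you don’t already have it, the install is a one-liner:

brew install tesseract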

pdfsandwich

Per Cornelius’ excellent tutorial, the more elegant solution is Tobias Elze’s pdfsandwich, which wraps all that, plus plenty of additional functionality which I don’t fully grok yet. The code is available via SVN: svn checkout svn://svn.code.sf.net/p/pdfsandwich/code/trunk/src pdfsandwich

You’re going to need a few more dependencies from homebrew, in addition to tesseract.
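
pdfsandwich shells out to ImageMagick, unpaper, Ghostscript, and poppler for the conversion and reassembly steps, so the incantation is roughly the following (exact formula names may vary with your Homebrew setup):

# Typical pdfsandwich helpers; exact formula names may vary
brew install imagemagick unpaper ghostscript poppler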

Those are all the dependencies I needed to install; a bare system probably needs more.

Examples

You invoke pdfsandwich from the command line. pdfsandwich --help gives a rundown of the many options. This is the most basic invocation:

pdfsandwich -lang eng <filename>

This will process the given file and output a new file with the OCR layer included. By default the new file is named after the original, with an _ocr suffix. If the file has images you’ll want an additional argument to ensure they’re overlaid correctly: I used the -gray flag for grayscale documents and -rgb for color. Note that while pdfsandwich uses multiple threads, processing a large document (50+ pages) will take at least several minutes on a pretty fast MacBook.
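
For a color scan the call might look like this; the filename is arbitrary:

# Filename is arbitrary; the result lands in scanned-article_ocr.pdf next to the original
pdfsandwich -lang eng -rgb scanned-article.pdf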

By default the command spits out quite a bit of information which you’re free to ignore. Once it’s done you’ll have a new PDF and the text is searchable.

Beware Preview

There’s a pretty nasty bug on OS X which, depending on who you talk to, was introduced in Sierra or has been around for years. In a nutshell, saving a PDF in Preview can corrupt the OCR layer. The text will no longer be searchable or copyable. Despite reports to the contrary it’s not fixed (at least not for me) in 10.12.3. I’ve taken the precaution of locking the PDFs for now. I’m sure I’ll forget in two months’ time, but it’ll solve the immediate problem.
