Terrestrial Navigation

Amtrak, B-Movies, Web Development, and other nonsense

Apache and HTTP/2 on AWS

I’ve spent the last few months building up a generic highly-available EC2 stack for our various on-premises applications. We’re using Apache as the webserver on EC2 since that’s what we’re familiar with. An interesting wrinkle is that, out-of-the-box, cURL and Safari didn’t work with WordPress running on this stack. cURL would return this cryptic error message:

curl: (92) HTTP/2 stream 1 was not closed cleanly: PROTOCOL_ERROR (err 1)

Safari refused to render the page at all, with this equally (un)helpful message:

"Safari can't open the page. The error is "The operation couldn't be completed. Protocol error" (NSPOSIXErrorDomain:100)"

Okay, that’s not great. Chrome and Firefox were fine; what’s going on?

Architecture

Let’s take a step back and discuss the architecture. We have a CloudFront distribution forwarding HTTP and HTTPS requests to an Application Load Balancer in a public subnet. The load balancer supports both HTTP/1.1 and HTTP/2, and forwards all traffic over port 443 to the EC2 instances in a private subnet. The EC2 instances are responsible for TLS termination; Apache on those instances supports both HTTP/1.1 and HTTP/2.

Troubleshooting

I started with the cURL problem, which led me down a number of blind alleys before someone (I forget where) suggested using the --http1.1 flag, which confirmed that the problem was an HTTP/2 issue. That was helpful to a point, but with three separate pieces of infrastructure in the mix (CloudFront, the ALB, and EC2) I wasn’t sure where the underlying problem lay.
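
To reproduce that comparison yourself, forcing HTTP/1.1 is the quickest test; something like the following, with example.com standing in for the real site:

curl -v https://example.com/
curl -v --http1.1 https://example.com/

The first form failed with the PROTOCOL_ERROR above; the second succeeded, which pointed squarely at HTTP/2.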

I backed out and started researching the Safari failure mode instead, and ran across a good discussion on Server Fault which explained the messages from both Safari and cURL. The underlying issue is that Apache sends an Upgrade: h2c header on a connection that is already speaking HTTP/2. That’s invalid under RFC 7540, but how clients handle the violation varies widely: Chrome and Firefox ignore the header, while Safari and cURL treat it as a protocol error. Given that in our configuration all client traffic is over TLS, there’s no reason to send that header in the first place.
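
You can also spot the offending header by making an HTTP/1.1 request and looking for Upgrade in the response; a rough check:

curl -sI --http1.1 https://example.com/ | grep -i '^upgrade'

An Upgrade: h2c line coming back over a TLS connection is the tell.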

Resolution

All signs pointed to Apache. We ship a slightly modified version of the default Apache configuration from the Amazon Linux 2 AMI. The HTTP/2 module is enabled by default, with the following configuration:

<IfModule mod_http2.c>
    Protocols h2 h2c http/1.1
</IfModule>

There it is: h2, h2c, and http/1.1. Per the Apache documentation, if you’re serving HTTP/2 over TLS your Protocols line should omit h2c. I made that change, and also explicitly unset the Upgrade header:

<IfModule mod_http2.c>
    Protocols h2 http/1.1
    Header unset Upgrade
</IfModule>
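
Deploying the change is a config check and a service restart (httpd is the service name on Amazon Linux 2); the same grep as above confirms the header is gone:

sudo apachectl configtest
sudo systemctl restart httpd
curl -sI --http1.1 https://example.com/ | grep -i '^upgrade'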

With that change and an Apache restart, all is well. The lesson here is that the default Apache configuration is a sane starting point, but you still need to go through it and confirm that it’s appropriate for your environment. We didn’t have much experience with HTTP/2 at Lafayette prior to this project, or we’d have caught this sooner.

Creating an AWS Backup Plan using CDK

I’ve been doing a lot with the Amazon Web Services Cloud Development Kit (CDK) over the last few months. CDK lets you define all your infrastructure in code, such as TypeScript, which is then compiled into a CloudFormation template. It’s pretty cool, but it’s also a new product changing rapidly, and the documentation hasn’t always kept up. Below are my notes for setting up a backup strategy for an Elastic File System (EFS) using AWS Backup.
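
For context, the CDK workflow itself is just a few CLI commands; these are the standard ones, with MyBackupStack as a placeholder stack name:

npm install -g aws-cdk
cdk synth MyBackupStack     # compile the TypeScript into a CloudFormation template
cdk diff MyBackupStack      # preview changes against the deployed stack
cdk deploy MyBackupStack    # deploy through CloudFormation

Everything below assumes that loop: edit the stack definition, synth, diff, deploy.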

High-level overview

With AWS Backup, you define a Backup Plan with one or more rules targeting various resources, and then store the resulting backups in a Backup Vault. This means you’re going to need the following:

  1. A Backup Vault
  2. A Backup Plan
  3. A Backup Selection
  4. A resource to be backed up. For this example, an EFS instance.

You’re also going to need an Identity and Access Management (IAM) role with sufficient permissions to provision all the backup infrastructure.

IAM permissions

This was the trickiest aspect of the implementation. CDK deployment failures do not generally indicate which permission check failed; sometimes they don’t even indicate which resource was in play. I encountered this error repeatedly trying to create a Backup Vault:

 2/16 | 3:03:21 PM | CREATE_FAILED | AWS::Backup::BackupVault | <VaultName> Insufficient privileges to perform this action. (Service: AWSBackup; Status Code: 403; Error Code: AccessDeniedException; Request ID: <id>)

Not much to go on, and especially puzzling given that I’d granted my IAM user the backup:CreateBackupVault permission. Amazon has a page describing the permissions needed to manage AWS Backup programmatically, and there are actually seven different permissions involved. Confusingly, several of the ARN examples given on that page are incorrect. To create the Backup Vault, the Backup Plan, and the Backup Selection, and to delegate the backup role correctly, I eventually crafted this policy snippet:

        {
            "Sid": "BackupPolicy",
            "Effect": "Allow",
            "Action": [
                "backup:CreateBackupPlan",
                "backup:CreateBackupSelection",
                "backup:CreateBackupVault",
                "backup:DeleteBackupPlan",
                "backup:DeleteBackupSelection",
                "backup:DeleteBackupVault",
                "backup:DescribeBackupVault",
                "backup:GetBackupPlan",
                "backup:UpdateBackupPlan",
                "iam:PassRole",
                "kms:CreateGrant",
                "kms:GenerateDataKey",
                "kms:Decrypt",
                "kms:RetireGrant",
                "kms:DescribeKey"
            ],
            "Resource": [
                "arn:aws:backup:<region>:<account-id>:backup-plan:*",
                "arn:aws:backup:<region>:<account-id>:backup-vault:<Name>*",
                "arn:aws:backup:<region>:<account-id>:key:*",
                "arn:aws:iam::<region>:<account-id>:role/<Name>*"
            ]
        },
        {
            "Sid": "BackupStoragePolicy",
            "Effect": "Allow",
            "Action": [
                "backup-storage:MountCapsule"
            ],
            "Resource": "*"
        }

Something to note here is that you can’t qualify everything by ARN, not easily anyway. The generated ARNs for backup plans appear to be completely random.
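
You can see those generated ARNs with the AWS CLI, which is a quick way to check what a resource pattern would have to match:

aws backup list-backup-plans --query 'BackupPlansList[].BackupPlanArn'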

CDK code

This example presupposes that you have an EFS filesystem defined elsewhere, named efsUploads. The following code creates a Backup Vault, a Backup Plan, and targets that EFS filesystem by ARN. It also creates an IAM service role for running backups. The backup rules create daily backups in warm storage, retained for 35 days, and monthly backups in cold storage, retained for a year.

    const efsVault = new backup.CfnBackupVault(this, opts.siteName + 'BackupVault', {
      backupVaultName: opts.siteName,
    });
  
    const efsBackupRole = new iam.Role(this, opts.siteName + 'EFSBackupRole', {
      assumedBy: new iam.ServicePrincipal('backup.amazonaws.com'),
      managedPolicies: [
        iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AWSBackupServiceRolePolicyForBackup'),
        iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AWSBackupServiceRolePolicyForRestores'),
      ],
    });
    const efsBackup = new backup.CfnBackupPlan(this, opts.siteName + 'EFSBackupPlan', {
      backupPlan: {
        backupPlanName: opts.siteName + 'EFSBackupPlan',
        backupPlanRule: [
          {
            ruleName: opts.siteName + 'DailyWarmBackup',
            lifecycle: {
              deleteAfterDays: 35
            },
            targetBackupVault: efsVault.attrBackupVaultName,
            scheduleExpression: 'cron(0 8 * * ? *)',
          },
          {
            ruleName: opts.siteName + 'MonthlyColdBackup',
            lifecycle: {
              deleteAfterDays: 365,
              moveToColdStorageAfterDays: 30
            },
            targetBackupVault: efsVault.attrBackupVaultName,
            scheduleExpression: 'cron(0 8 1 * ? *)',
          }
        ]
      }
    });
    const efsBackupPlanSelection = new backup.CfnBackupSelection(this, opts.siteName + 'EFSBackupPlanSelection', {
      backupPlanId: efsBackup.attrBackupPlanId,
      backupSelection: {
        iamRoleArn: efsBackupRole.roleArn,
        selectionName: opts.siteName + 'EFS',
        resources: [
          'arn:aws:elasticfilesystem:' + this.region + ':' + this.account + ':file-system/' + efsUploads.ref,
        ]
      }
    });

There’s no way (that I know of) to get the ARN of the EFS instance programmatically, so you have to construct it. The AWS Backup library in CDK is still in developer preview, and all the constructs are still pretty low-level.
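
If you’d rather not concatenate the ARN by hand, Stack.formatArn will do the assembly for you; this is a sketch using the same efsUploads reference (region and account default to the stack’s own):

    const efsArn = this.formatArn({
      service: 'elasticfilesystem',
      resource: 'file-system',
      resourceName: efsUploads.ref,
    });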

Using TLS-ALPN-01 on a Raspberry Pi

I’ll always read ALPN as Alpine

I run Nextcloud on my Raspberry Pi with a certificate from Let’s Encrypt. I set this up a couple of years ago with certbot and it’s been fine. I don’t expose port 80 in my environment, so I’ve relied on the TLS-SNI-01 challenge using the certbot client that ships with Raspbian Jessie.

TLS-SNI-01 is now end-of-life because of security vulnerabilities, so I needed to find an alternative. There are a couple of different challenge methods available. Let’s Encrypt’s certbot supports two:

  • HTTP-01: a random file is placed in the webroot of your server
  • DNS-01: a random TXT record is added to your DNS

Neither of these was a great option for me. Routing port 80 traffic really wasn’t something I wanted to do. The DNS option is feasible, but my current registrar doesn’t offer a good API, so I’d either have to change registrars or introduce a manual step into the renewal process.

Fortunately, there’s another option, though it involves a little setup work: TLS-ALPN-01. Certbot doesn’t support it, but other clients do. After some trial and error, I settled on dehydrated after reading a great Medium post by Sam Decrock.

Getting the Pi ready

One thing I’d noted during this whole process, though in the end it wasn’t relevant, is that the certbot package shipped with Raspbian Jessie was really old; newer versions require Python 3.5 or higher, but Jessie is stuck at 3.4. Dehydrated’s sample ALPN responder also depends on the ALPN support added to the ssl module in Python 3.5.

Stretch had been available for a year and a half; I’d built my Pi maybe six months prior. The standard instructions for a dist-upgrade went smoothly, though it took a couple of hours. The only oddity was that the wired network interface wasn’t available at boot; I added auto eth0 to /etc/network/interfaces to resolve that.
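
The upgrade itself was the standard Debian procedure; roughly the following, adjusted for any extra entries under /etc/apt/sources.list.d:

sudo sed -i 's/jessie/stretch/g' /etc/apt/sources.list
sudo apt-get update
sudo apt-get upgrade
sudo apt-get dist-upgrade
echo 'auto eth0' | sudo tee -a /etc/network/interfaces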

Setting up dehydrated

I followed Decrock’s post closely for configuring dehydrated, dumping everything into /etc/dehydrated. The sample domains.txt has various example configurations; I replaced all of it with a single line containing my hostname. The sample responder worked out of the box once I had Python 3.5 available. The one gotcha is that you have to stop Apache/Nginx/whichever before running the responder so that it can listen on port 443.
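
For reference, the relevant pieces end up looking something like this (the hostname is a placeholder; CHALLENGETYPE is the dehydrated setting that selects TLS-ALPN-01):

# /etc/dehydrated/domains.txt
cloud.example.com

# /etc/dehydrated/config (excerpt)
CHALLENGETYPE="tls-alpn-01"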

Migrating Apache

Apache was already configured to use the certbot-issued certificates. Those were in /etc/letsencrypt/live/yourdomain/; dehydrated’s live in /etc/dehydrated/certs/yourdomain. Changing the paths in the default host configuration worked fine. I noted that dehydrated didn’t appear to have anything comparable to the configuration block in /etc/letsencrypt/options-ssl-apache.conf (ciphers and such), so I copied that directly into the virtual host configuration. Apache started cleanly on the first try.
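
Concretely, the virtual host change boils down to pointing the certificate directives at the new paths, something like:

SSLCertificateFile /etc/dehydrated/certs/yourdomain/fullchain.pem
SSLCertificateKeyFile /etc/dehydrated/certs/yourdomain/privkey.pem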

Automation and cleanup

Let’s Encrypt certificates have a shelf life of three months, and renewing them is a recurring task I don’t want to do by hand. The renewal process looks like this:

  1. Stop apache
  2. Start the responder on port 443
  3. Execute the renewal request
  4. Stop the responder
  5. Start apache

This isn’t as simple as the TLS-SNI-01 challenge and does involve a little bit of downtime. If we’re not in the renewal window this takes ~3-5 seconds; if it renews the certificate it’s more like 20 seconds. I wrapped up the whole process in a shell script, using nohup to background the responder task:

#!/bin/bash
/bin/systemctl stop apache2
/usr/bin/nohup /usr/bin/python3 /etc/dehydrated/alpn-responder.py > /dev/null 2>&1 &
alpnPID=$!
/etc/dehydrated/dehydrated -c -f /etc/dehydrated/config
kill $alpnPID
/bin/systemctl start apache2

The magic part here is getting the PID from the responder process so that we can safely kill it (and not anything else!) once the renewal task is complete. I scheduled this for the wee hours of the morning. Finally, I uninstalled the now-updated certbot package because I don’t need it anymore and it won’t work going forward anyway.
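
Scheduling it is a single cron entry; the script path here is wherever you saved the wrapper above (mine below is just an example):

# /etc/cron.d/dehydrated-renew
30 3 * * 1 root /usr/local/sbin/renew-certs.sh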

Node on RHEL7

In August 2017 we implemented Swarthmore College’s PDF accessibility tool for Moodle. This required us to stand up a Node.js application, which was a new experience for us. Our environment was RHEL7, and our preferred web server was Apache.

Application deployment

We followed our usual Capistrano principles for deploying the application. We created a simple project with all the Capistrano configuration and then mounted the Swarthmore project as a git submodule in a top-level directory named public. We configured the Capistrano npm module to use public as its working directory to ensure that the various node modules are installed on deployment.

PM2

PM2 is a Node process manager; its role is to ensure that the application runs at boot. To use it, we first need to install it globally:

sudo npm install -g pm2

Next, we create an ecosystem.json file. This needs to be in the root of the project repository; since we’re using Capistrano we define it in shared/public and symlink it on deploy. This is what ours looked like:

{
  "apps": [{
    "name": "{NAME}",
    "script": "./index.js",
    "cwd": "/var/www/{NAME}/current/public",
    "error_file": "/var/www/{NAME}/current/logs/{NAME}.err.log",
    "out_file": "/var/www/{NAME}/current/logs/app.out.log",
    "exec_mode": "fork_mode"
  }]
}

All straightforward. We create a new user on the system to own this job and have it start the process:

sudo -u {USER} pm2 start ecosystem.json

We then run a second command, which generates the necessary syntax for setting up the systemd integration:

sudo -u {USER} pm2 startup
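
After running the command that pm2 startup prints out, it’s also worth freezing the current process list so PM2 restores it on reboot:

sudo -u {USER} pm2 save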

Apache

Having done all that, the Node application is happily running on port 8080. We’re not interested in exposing that port in our environment, so we add a reverse proxy to our standard Apache configuration for that virtual host:

        # Reverse proxy only; forward proxying stays off.
        ProxyRequests Off
        ProxyPass / http://localhost:8080/
        ProxyPassReverse / http://localhost:8080/

We’ll have to revisit this if we ever want to have a second node application on the system, but for now it works.

Featured image by Hermann Luyken [CC BY-SA 3.0 or GFDL], from Wikimedia Commons.

Critical pagination

This is a story about Moodle, PHP, Active Directory, OpenLDAP, and how I stared a problem in the face for two days without realizing what I was looking at.

Pagination

I assumed maintenance of the LDAP syncing scripts plugin (local_ldap) in 2016. One thing I did was to add PHPUnit coverage, which I wrote about in Writing LDAP unit tests for a Moodle plugin.

I’ve received reports about a possible bug with the plugin, Active Directory, and large numbers of users. After standing up a test Active Directory server (which is a story for another day), I’ve been extending the unit tests for the local_ldap module to take advantage of pagination.

PHP added support for LDAP pagination in PHP 5.4. Moodle added support soon after in Moodle 2.4 (MDL-36119). Beyond some range queries on the Active Directory side, there wasn’t anything in the plugin using pagination. The extended tests quickly revealed problems with the Active Directory code:

1) local_ldap_sync_testcase::test_cohort_group_sync
ldap_list(): Partial search results returned: Sizelimit exceeded

/var/www/moodle/htdocs/local/ldap/locallib.php:144
/var/www/moodle/htdocs/local/ldap/locallib.php:626
/var/www/moodle/htdocs/local/ldap/tests/sync_test.php:176
/var/www/moodle/htdocs/lib/phpunit/classes/advanced_testcase.php:80

Adding pagination is fairly straightforward, but you have to take care because not all LDAP implementations support it. Here’s an example cribbed from Moodle’s implementation:

$connection = $this->ldap_connect(); // Connection and bind.
$ldappagedresults = ldap_paged_results_supported($this->config->ldap_version, $connection);
$ldapcookie = '';
do {
    if ($ldappagedresults) {
        ldap_control_paged_result($connection, $this->config->pagesize, true, $ldapcookie);
    }
    // ... whatever LDAP task you're doing.
    if ($ldappagedresults) {
        ldap_control_paged_result_response($connection, $ldapresult, $ldapcookie);
    }
} while ($ldappagedresults && $ldapcookie !== null && $ldapcookie != '');

If the LDAP server doesn’t support pagination all the extra code is a no-op and you pass through the do…while loop once.

I wound up adding pagination in three places: the code that gets all the groups from the LDAP server, the code that searches for all the distinct attribute values (e.g. all possible values of eduPersonAffiliation), and the code in the unit test itself which tears down the test container in LDAP. Everything passed in Active Directory, so I made what I figured would be a pro forma push to Travis to test the code against OpenLDAP. I came back from lunch to read an error message I’d never even heard of:

ldap_delete(): Delete: Critical extension is unavailable

Critically unavailable

I’ve spent a fair amount of time debugging bizarre LDAP problems but “Critical extension is unavailable” was a new one on me:

1) local_ldap_sync_testcase::test_cohort_group_sync
ldap_delete(): Delete: Critical extension is unavailable
/home/travis/build/moodle/local/ldap/tests/sync_test.php:405
/home/travis/build/moodle/local/ldap/tests/sync_test.php:67
/home/travis/build/moodle/lib/phpunit/classes/advanced_testcase.php:80

Researching this phrase led me to a discovery: if you’ve run a paginated query against LDAP in PHP, you need to tear down that connection and reconnect to the server afterwards. The code in question was in the PHPUnit code which tore down the environment between tests. It runs two queries; one deletes all the users and groups while the second deletes the organizational units. This was code I’d taken from the Moodle LDAP authentication module (auth_ldap) and extended with pagination when the Active Directory tests failed.

Mocked up, this is what the code did before I modified it:

Establish LDAP connection
Get the top-level information about the test container
Get the users and groups from the test container
Delete those users and groups from the test container
Get the organizational units from the test container
Delete the organizational units from the test container
Delete the test container

After adding pagination and connection closures, the code did this:

Establish LDAP connection
Get the top-level information about the test container
Do
	Get the users and groups from the test container
	Delete those users and groups from the test container
Loop
Close LDAP connection
Establish LDAP connection
Do
	Get the organizational units from the test container
	Delete the organizational units from the test container
Loop
Close LDAP connection
Establish LDAP connection
Delete the test container

And it still didn’t work. I played around with various permutations for hours but didn’t make any progress. It was Friday, I was tired, I went home for a three-day weekend and didn’t think about it at all (maybe a little). When I got back in the office the problem was staring me right in the face.

In the old, unpaginated code it wasn’t a problem to invoke ldap_delete after retrieving results on the same LDAP connection. With pagination that logic no longer works; we need to collect all the results first, recreate the connection, and then run the deletes separately. Thus modified, the logic looks like this (the outline is followed by a rough code sketch):

Establish LDAP connection
Get the top-level information about the test container
Do
	Get the users and groups from the test container
Loop
Close LDAP connection
Establish LDAP connection
Delete those users and groups from the test container
Do
	Get the organizational units from the test container
Loop
Close LDAP connection
Establish LDAP connection
Delete the organizational units from the test container
Delete the test container
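
Sketched in code, the pattern is roughly the following. This is a simplified illustration rather than the actual plugin code; it assumes the server supports pagination, and $basedn stands in for the test container:

$connection = $this->ldap_connect();
$ldapcookie = '';
$dns = array();
do {
    ldap_control_paged_result($connection, $this->config->pagesize, true, $ldapcookie);
    $result = ldap_list($connection, $basedn, '(objectClass=*)', array('dn'));
    foreach (ldap_get_entries($connection, $result) as $entry) {
        if (is_array($entry) && isset($entry['dn'])) {
            $dns[] = $entry['dn']; // Collect DNs only; no deletes yet.
        }
    }
    ldap_control_paged_result_response($connection, $result, $ldapcookie);
} while ($ldapcookie !== null && $ldapcookie != '');
ldap_close($connection);

// Run the deletes on a fresh, unpaginated connection.
$connection = $this->ldap_connect();
foreach ($dns as $dn) {
    ldap_delete($connection, $dn);
}
ldap_close($connection);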

The tests pass now. It’s interesting to me that this was never an issue with Active Directory, only with OpenLDAP. It’s a valuable lesson about when to refactor your code and check your assumptions.

Featured image by Nevit Dilmen [GFDL or CC-BY-SA-3.0], from Wikimedia Commons

ImageMagick’s convert command handles image-to-PDF just fine, but by default can spit out a very small image–almost as though it’s “zoomed in.” To get around that, set the density flag when invoking it: convert -density 100% foo.jpg foo.pdf. You may have to play with it a little to get a usable result.

No, that’s not it

This is a story of how a log file that got too large degraded a production system for a couple days. It illustrates what happens if you dive into a problem without stepping back and asking basic questions.

We use Redmine as an issue tracking/project management platform. It has the capability to ingest emails from standard input. Late last week, we realized that this feature had stopped working. What followed was a lot of time in the weeds which could have been avoided if I’d just stopped to work the problem.

