This is a story of how a log file that grew too large degraded a production system for a couple of days. It illustrates what happens when you dive into a problem without stepping back and asking basic questions.

We use Redmine as an issue tracking/project management platform. It has the capability to ingest emails from standard input. Late last week, we realized that this feature had stopped working. What followed was a lot of time in the weeds which could have been avoided if I’d just stopped to work the problem.

Bouncy bouncy

We use Postfix as our mail handler, with the aliases defined in /etc/aliases. Incoming email is piped to a Rake task, like so:

redmine:        "|BUNDLE_GEMFILE=/path/to/installation/Gemfile /usr/local/bin/bundle exec rake -f /path/to/installation/Rakefile redmine:email:read tracker=Bug status=New allow_override=tracker,status,priority,done_ratio,assigned_to RAILS_ENV=\"production\""
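
(A side note rather than part of this failure: Postfix reads the compiled aliases database, not /etc/aliases itself, so after editing the file you have to rebuild it.)

newaliases    # or: postalias /etc/aliases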

Our WordPress server is configured to send certain notifications to Redmine. Here’s the actual failure as it appeared in /var/log/maillog:

Oct 17 18:05:16 <redmine server> postfix/local[23999]: E79001805D1: to=<classroom@...>, relay=local, delay=8.5, delays=0.03/0.02/0/8.5, dsn=5.3.0, status=bounced (Command died with status 153: "(cat && echo "Assigned to: Classroom support")|BUNDLE_GEMFILE=/path/to/installation/Gemfile /usr/local/bin/bundle exec rake -f /path/to/installation/Rakefile redmine:email:read status=New tracker=Support project=classroom-support unknown_user=accept no_permission_check=1 allow_override=all RAILS_ENV="production"". Command output: sh: line 1: 24001 Done                    ( cat && echo "Assigned to: Classroom support" )      24002 File size limit exceeded| BUNDLE_GEMFILE=/path/to/installation/Gemfile /usr/local/bin/bundle exec rake -f /path/to/installation/Rakefile redmine:email:read status=New tracker=Support project=classroom-support unknown_user=accept no_permission_check=1 allow_override=all RAILS_ENV="production" )

The actual problem is staring me right in the face: “File size limit exceeded”. The thing is, it doesn’t make any sense to me. The server isn’t out of disk space. The emails don’t exceed Postfix’s message size limits (configured at 50 MB). A quick round of googling doesn’t turn up an obvious answer in connection with Postfix. I do, however, see these entries at about the same time:

Oct 17 18:05:16 <redmine server> postfix/smtp[24011]: connect to <wordpress server>[X.X.X.X]:25: No route to host
Oct 17 18:05:16 <redmine server> postfix/smtp[24011]: 6BBCB1805E3: to=<apache@<wordpress server>>, relay=none, delay=0.02, delays=0/0.01/0/0, dsn=4.4.1, status=deferred (connect to <wordpress server>[X.X.X.X]:25: No route to host)

Here’s where I go astray. The failure of the Rake task generates a bounce-back; the bounce-back can’t be delivered. That’s a symptom, not a cause, but I start chasing it. We’d moved our WordPress environment to a new server a few weeks ago, so maybe there was a network topology issue? It’s not implausible, as Lafayette’s network is heavily segmented. It is, however, wrong and easy to rule out.

When?

The first thing you should do with any problem is establish when it actually stopped working. There are several sources of email to Redmine: the WordPress server, our hosted Qualtrics survey platform, and people replying to Redmine notifications. All of these were broken, but because the WordPress failure was reported first, I treated it as a WordPress problem. If I’d grepped /var/log/maillog for errors, I would have learned the following:

  1. Ingesting emails from standard input began failing on October 17, weeks after the WordPress server move.
  2. Email ingestion was failing from all of the sources above, with the same behavior.
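
Both facts jump out of a quick search of the maillog. Here’s a sketch of what that could look like (the exact phrasing in the log may vary a bit between Postfix versions):

# When did the failures start, and how many happened each day?
grep "File size limit exceeded" /var/log/maillog | awk '{print $1, $2}' | sort | uniq -c

# Which aliases were bouncing?
grep "File size limit exceeded" /var/log/maillog | grep -o "to=<[^>]*>" | sort | uniq -c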

I could have verified this myself by sending an email directly to one of the aliases, which I did in fact do after wasting lots of time chasing the network theory.

Memory?

We’ve had an ongoing issue with memory usage on some of our RHEL7 servers, particularly those with Gluster mounts, so it’s in our headspace as a possible source of problems. Could this be a memory issue? It seemed unlikely; the server in question had an uptime of 220 days and all its vitals (monitored by Nagios) were healthy. Nevertheless, I gave it a reboot, which solved nothing but took up more time. It did establish that whatever was wrong was a permanent feature, not a transient glitch.

Spare

We have a hot spare Redmine server; it’s identical in every way to the primary except that the uploads aren’t synchronized. We redirected email to that box and it worked like a champ, which only reinforced that the problem, whatever it was, was local to the server in question.

Log, it’s log…

Redmine keeps a log of all its actions in /path/to/installation/log/production.log. I’d noted during troubleshooting that the bounced emails left no trace in that log, but I hadn’t drawn the obvious conclusion: the log itself was the source of the problem. We weren’t rotating the log (that’s going to change), and it had grown to over 50 MB:

-rw-rw----+ 1 apache apache 52945825 Oct 20 15:26 production.log

As far as log files go, that’s not terrible, and the web application handled it fine. Postfix, on the other hand? That size dovetailed with the message and mailbox size limits we’d configured for Postfix:

message_size_limit=52428800
mailbox_size_limit=52428800

It’s not clear to me why those limits would affect Postfix’s ability to complete the task, but that couldn’t be a coincidence. I manually rotated the log, bounced Apache, and voilà: service was restored.
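
In hindsight there’s a plausible mechanism. The postconf(5) documentation notes that mailbox_size_limit caps the size of any file written during local delivery, including files written by external commands that the local(8) delivery agent runs, and a process that trips a file size limit is killed by the kernel with SIGXFSZ. That lines up with the “Command died with status 153” in the maillog: SIGXFSZ is signal 25 on Linux, and 128 + 25 = 153. Here’s a throwaway demo of the same failure mode (the file path and sizes are made up):

# Cap file size at 512 KB in a disposable shell (ulimit -f counts 512-byte blocks),
# then try to write 2 MB; the write gets cut short by SIGXFSZ.
bash -c 'ulimit -f 1024; dd if=/dev/zero of=/tmp/limit-demo bs=1M count=2'
# the shell reports something like "File size limit exceeded (core dumped)"
echo $?
# prints 153: 128 + 25 (SIGXFSZ), the same status Postfix logged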

Lessons

This took longer to resolve than it should have. A few takeaways:

  1. Always rotate your logs. Most applications aren’t designed to grapple with large logs, and you can encounter odd edge cases as they grow; a sketch of a logrotate stanza follows this list.
  2. When troubleshooting, figure out your boundaries first. When did it stop working? How broad is the problem?
  3. Set up monitoring for all sources of failure. This feature died quietly on a Tuesday evening; a colleague notified me the next morning. Repeated failure notifications would have helped me realize that the problem wasn’t as isolated as I thought.
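
For the first item, something like the following logrotate stanza would have headed this off. Treat it as a sketch, not our actual config; the path is the same placeholder used above, and copytruncate is one way to avoid having to restart the application on every rotation:

/path/to/installation/log/production.log {
    weekly
    rotate 8
    compress
    missingok
    copytruncate
}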

For the last item, that means adding the server’s maillogs to Splunk and setting up an alert that fires when a Rake command dies. It’s not as heart-stopping as a text message from Nagios, but at least we’ll know in the future that there’s a problem.
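
The alert itself will be configured in Splunk, but the search is just a couple of literal strings from the bounce above (the index name here is a placeholder):

index=mail "postfix/local" "Command died with status"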