For the last few days, one of the websites I work on was having a strange issue. Every now and then, the whole server would grind to a halt and stop serving traffic. This would last for a couple of minutes, then suddenly everything would "magically" pick up and we'd be back to normal. For a while...
Simple website monitoring
The first step in figuring out what was going on was coming up with a way to be notified while the issue was occurring. It's hard to debug these kinds of issues after the fact, so I wanted something that would ping the server and email me whenever it was taking too long to respond. I knocked up the following script (I may have inserted line-continuation backslashes in odd places, sorry, that's just to ease reading on the blog):
#!/bin/bash

# Time a request to the site; GNU time writes the elapsed seconds to stderr,
# which the 2>&1 captures into TIME.
TIME=$( { /usr/bin/time -f %e wget -q -O /dev/null \
    http://www.example.com/; } 2>&1 )

# awk prints 1 if the request took longer than 2.5 seconds, 0 otherwise.
TOOSLOW=$(awk "BEGIN{ print ($TIME > 2.5) }")

if [ "$TOOSLOW" -eq 1 ]; then
    echo "The time for this request, $TIME, was greater than 2.5 seconds!" \
        | mail -s "Server ping ($TIME sec)" "me@me.com"
fi
I set this up as a cron job on my media centre PC (high-tech, I know) to run every 5 minutes. It would email me whenever the website took longer than 2.5 seconds to respond (a "normal" response time is < 0.5 seconds, so I figured 5 times longer was enough).
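For completeness, the cron entry for this is a one-liner; the path below is just a placeholder for wherever the script lives:

# m h dom mon dow   command
*/5 * * * * /usr/local/bin/check-response-time.sh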
It didn't take long for the emails to start coming through. Then it was a matter of jumping on the server and trying to figure out what the problem was.
First steps
Once the problem was actually happening, there are a couple of "obvious" first things to try. The first thing I always do is run top and see what's going on:
top - 08:51:03 up 73 days, 7:45, 1 user, load average: 69.00, 58.31, 46.89
Tasks: 316 total, 2 running, 314 sleeping, 0 stopped, 0 zombie
Cpu(s): 11.0%us, 1.3%sy, 0.0%ni, 15.2%id, 72.0%wa, 0.0%hi, 0.5%si, 0.0%st
Mem: 8299364k total, 7998520k used, 300844k free, 15480k buffers
Swap: 16779884k total, 4788k used, 16775096k free, 6547860k cached
Check out that load! A one-minute load average of 69.00, that's massive! Also of concern is the 72% next to "wa", which means 72% of the CPU time was spent in uninterruptible wait. There aren't many things that run in uninterruptible wait (usually kernel threads), and it's almost always some I/O sort of thing.
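Incidentally, a quick way to see which processes are currently stuck in that uninterruptible "D" state is to filter ps on its state column:

# Show the header plus any process whose state starts with "D" (uninterruptible sleep).
ps -eo pid,stat,pcpu,comm | awk 'NR == 1 || $2 ~ /^D/'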
So let's see what iotop (which is like top, but for I/O) reports:
Total DISK READ: 77.37 K/s | Total DISK WRITE: 15.81 M/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
25647 be/4 apache 73.50 K/s 0.00 B/s 0.00 % 99.99 % httpd
24387 be/4 root 0.00 B/s 0.00 B/s 99.99 % 99.99 % [pdflush]
23813 be/4 root 0.00 B/s 0.00 B/s 0.00 % 99.99 % [pdflush]
25094 be/4 root 0.00 B/s 0.00 B/s 96.72 % 99.99 % [pdflush]
25093 be/4 root 0.00 B/s 0.00 B/s 99.99 % 99.99 % [pdflush]
25095 be/4 root 0.00 B/s 0.00 B/s 99.99 % 99.99 % [pdflush]
25091 be/4 root 0.00 B/s 0.00 B/s 0.00 % 99.99 % [pdflush]
24389 be/4 root 0.00 B/s 0.00 B/s 99.99 % 99.99 % [pdflush]
24563 be/4 root 0.00 B/s 0.00 B/s 99.99 % 99.99 % [pdflush]
24390 be/4 apache 0.00 B/s 23.21 K/s 96.71 % 99.99 % httpd
24148 be/4 apache 0.00 B/s 0.00 B/s 96.71 % 99.99 % httpd
24699 be/4 apache 0.00 B/s 0.00 B/s 99.99 % 99.99 % httpd
23973 be/4 apache 0.00 B/s 0.00 B/s 99.99 % 99.99 % httpd
24270 be/4 apache 0.00 B/s 0.00 B/s 99.99 % 99.99 % httpd
24298 be/4 apache 0.00 B/s 1918.82 K/s 96.71 % 99.02 % httpd
628 be/3 root 0.00 B/s 0.00 B/s 0.00 % 97.51 % [kjournald]
25092 be/4 root 0.00 B/s 0.00 B/s 0.00 % 96.72 % [pdflush]
24258 be/4 root 0.00 B/s 0.00 B/s 99.99 % 96.71 % [pdflush]
23814 be/4 root 0.00 B/s 0.00 B/s 0.00 % 96.71 % [pdflush]
24388 be/4 root 0.00 B/s 0.00 B/s 99.02 % 96.71 % [pdflush]
25545 be/4 apache 0.00 B/s 0.00 B/s 0.19 % 92.73 % httpd
25274 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 92.38 % httpd
24801 be/4 apache 0.00 B/s 5.84 M/s 99.99 % 91.63 % httpd
25281 be/4 apache 0.00 B/s 5.75 M/s 0.00 % 91.33 % httpd
26115 be/4 apache 0.00 B/s 0.00 B/s 9.60 % 19.26 % httpd
25561 be/4 apache 0.00 B/s 3.87 K/s 0.00 % 9.66 % httpd
26035 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 9.63 % httpd
So all those pdflush threads looked suspicious to me. pdflush is a kernel thread that's responsible for writing dirty pages of memory out to disk in order to free up memory.
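If you want to see how much dirty data the kernel is sitting on, and the thresholds that trigger writeback, both are exposed directly:

# How much dirty/writeback data is pending right now?
grep -E 'Dirty|Writeback' /proc/meminfo

# The thresholds (as a percentage of memory) that control when writeback kicks in.
sysctl vm.dirty_background_ratio vm.dirty_ratio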
It was at this point that I started to suspect some kind of hardware failure. Checking the output of sar -d 5 0, I could see this:
Linux 2.6.18-308.1.1.el5PAE (XXX) 23/08/12
08:55:45 DEV tps ... await svctm %util
08:55:50 dev8-0 877.25 ... 179.28 1.14 99.84
08:55:50 dev8-1 0.00 ... 0.00 0.00 0.00
08:55:50 dev8-2 0.00 ... 0.00 0.00 0.00
08:55:50 dev8-3 877.25 ... 179.28 1.14 99.84
Check out that utilization column! 99.84% is really bad (anything above roughly 70% is where you'd usually start to see problems).
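(If sar isn't available, iostat from the same sysstat package reports the same extended per-device statistics, including await and %util:)

# Extended device statistics every 5 seconds; watch the await and %util columns.
iostat -dx 5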
I was at a bit of a loss, because I'm not too familiar with the hardware running this server (it's not mine), but I knew the disks were in a hardware RAID and smartctl wasn't being helpful at all, so I posted a question on Server Fault. At that point I was thinking it was a hardware problem, but I wasn't sure where to go from there.
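(As an aside, smartmontools can often see through a hardware RAID controller if you tell it the controller type. This is only a sketch, assuming an LSI MegaRAID controller and the first physical disk; I don't actually know what controller is in this box:)

# Query SMART data for the first physical disk behind a MegaRAID controller (assumption).
smartctl -a -d megaraid,0 /dev/sda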
My first hint
My first hint was a comment by Mark Wagner:
Apache PIDs 24801 and 25281 are doing by far the most I/O: 5.84 M/s and 5.75 M/s, respectively. I use iotop -o to exclude processes not doing I/O
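iotop can also run in batch mode, which makes it easy to capture these snapshots to a file for later analysis; something like:

# Log only processes doing I/O: 12 samples, 5 seconds apart (the log path is just an example).
iotop -o -b -n 12 -d 5 >> /var/log/iotop.log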
What were those two processes doing? I opened one of them up with strace:
# strace -p 24801
[sudo] password for dean:
Process 24801 attached - interrupt to quit
write(26, "F\0\0\1\215\242\2\0\0\0\0@\10\0\0\0\0\0\0\0"..., 4194304) = 4194304
write(26, "F\0\0\1\215\242\2\0\0\0\0@\10\0\0\0\0\0\0\0"..., 4194304) = 4194304
write(26, "F\0\0\1\215\242\2\0\0\0\0@\10\0\0\0\0\0\0\0"..., 4194304) = 4194304
write(26, "F\0\0\1\215\242\2\0\0\0\0@\10\0\0\0\0\0\0\0"..., 4194304) = 4194304
write(26, "F\0\0\1\215\242\2\0\0\0\0@\10\0\0\0\0\0\0\0"..., 4194304) = 4194304
It was just spewing those lines out, writing huge amounts of data to file descriptor 26. But what is file descriptor 26? For that, we use the handy-dandy lsof:
# lsof -p 24801
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
httpd 24801 apache cwd DIR 8,3 4096 2 /
httpd 24801 apache rtd DIR 8,3 4096 2 /
httpd 24801 apache txt REG 8,3 319380 6339228 /usr/sbin/httpd
. . .
httpd 24801 apache 26w 0000 8,3 0 41713666 /tmp/magick-XXgvRhzG
So at the end of the output from lsof, under the "FD" (file descriptor) column, we see 26w (the "w" means it's open for writing), and the file is actually /tmp/magick-XXgvRhzG.
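Incidentally, you can also resolve a file descriptor without lsof: procfs exposes every open descriptor as a symlink, so this would have shown the same thing:

# File descriptor 26 of process 24801 is a symlink to the actual file.
ls -l /proc/24801/fd/26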
A quick ls on the /tmp directory, and I'm shocked:
-rw------- 1 apache apache 1854881318400 Aug 20 04:26 /tmp/magick-XXrQahSe
-rw------- 1 apache apache 1854881318400 Aug 20 04:26 /tmp/magick-XXTaXatz
-rw------- 1 apache apache 1854881318400 Aug 20 04:26 /tmp/magick-XXtf25pe
These files are 1.6 terabytes each! Luckily(?) they're sparse files, so they don't actually take up that much physical space (the disks aren't even that big), but that's definitely not good.
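(You can confirm a file is sparse by comparing its apparent size with the space it actually allocates on disk:)

# The first column is blocks actually allocated; the size column is the apparent size.
ls -lhs /tmp/magick-*

# Or compare the two views directly with du.
du -h --apparent-size /tmp/magick-*
du -h /tmp/magick-*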
The last piece of the puzzle was figuring out which images were being processed that needed such enormous temporary files. I was thinking maybe a corrupt .png or something was causing ImageMagick to freak out, but after firing up Apache's wonderful mod_status I could immediately see that the problem was my own dumb self. The URL being requested was:
/photos/view/size:320200/4cc41224cae04e52b76041be767f1340-q
Now, in case you don't spot it right away, it's the "size" parameter: size:320200 is supposed to be size:320,200. If you leave off the height, my code assumes you want the image at the specified width with the original's aspect ratio preserved. So it was trying to generate an image that was 320200x200125 pixels, rather than 320x200!
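Here's a quick sketch of the arithmetic that produces that monster. The resizer itself isn't shown here, and the original's dimensions below are just an assumption with the same 1.6:1 aspect ratio:

WIDTH=320200                  # what the mangled URL asked for ("320,200" minus the comma)
ORIG_W=1600; ORIG_H=1000      # assumed original dimensions (1.6:1 aspect ratio)
HEIGHT=$(( WIDTH * ORIG_H / ORIG_W ))
echo "${WIDTH}x${HEIGHT}"     # prints 320200x200125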
The Solution
The solution was, as is often the case, extremely simple once I'd figured out the problem. I just made sure the image resizer never resized an image to be larger than the original (our originals are almost always bigger than what's displayed on the website).
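The site's resizing code isn't shown here, but if you're calling ImageMagick directly, the ">" geometry modifier gives you the same guard for free: it only shrinks images larger than the requested size and never enlarges them. For example:

# Fit within 320x200 only if the original is bigger; never upscale.
convert original.jpg -resize '320x200>' resized.jpg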
The only remaining question was where this request was coming from. The output of mod_status showed an IP address belonging to Google, so it must've been the Googlebot crawling the site. But a quick search through the database showed no links to an image with the invalid size:320200 parameter.
At this stage, it's still an open mystery where that link was coming from. The image in question was from an article written in 2010, so it's not something current. In any case, where the link was coming from is of less concern than the fact that it no longer causes the whole server to grind to a halt whenever someone requests it.
So I'm happy to leave that one open for now.