This blog entry is about modern Linuxes – in other words, RHEL6 equivalents with 2.6.3x kernels, and not the ancient RHEL5 with its 2.6.18 kernel (wtf?!), which unfortunately is still the most common in enterprises. And no, I’m not going to use kernel debuggers or SystemTap scripts here, just plain old “cat /proc/PID/xyz” commands against some useful /proc filesystem entries.
Troubleshooting a “slow” process
Here’s one systematic troubleshooting example I reproduced on my laptop. A DBA was wondering why their find command had been running “much slower”, without returning any results for a while. Knowing the environment, we had a hunch, but I got asked what the systematic approach would be for troubleshooting this – already ongoing – problem right now.
Luckily the system was running OEL6, so it had a pretty new kernel – the 2.6.39 UEK2, actually.
So, let’s do some troubleshooting. First let’s see whether that find process is still alive:
[root@oel6 ~]# ps -ef | grep find
root     27288 27245  4 11:57 pts/0    00:00:01 find . -type f
root     27334 27315  0 11:57 pts/1    00:00:00 grep find
Yep it’s there – PID 27288 (I’ll use that pid throughout the troubleshooting example).
Let’s start from the basics and take a quick look at what the bottleneck is for this process. If it’s not blocked by anything (for example, if it’s reading everything it needs from cache), it should be 100% on CPU. If it’s bottlenecked by some IO or contention issue, CPU usage should be lower – or even 0%.
[root@oel6 ~]# top -cbp 27288
top - 11:58:15 up 7 days,  3:38,  2 users,  load average: 1.21, 0.65, 0.47
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.1%us,  0.1%sy,  0.0%ni, 99.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   2026460k total,  1935780k used,    90680k free,    64416k buffers
Swap:  4128764k total,   251004k used,  3877760k free,   662280k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
27288 root      20   0  109m 1160  844 D  0.0  0.1   0:01.11 find . -type f
Top tells me this process is either at 0% CPU or so close to zero that it gets rounded down to 0% in the output. There’s an important difference, though, between a process that is completely stuck, never getting a chance to get onto CPU at all, and a process that gets out of its wait state every now and then (for example, when some polling operation times out and the process chooses to go back to sleep). Top on Linux is not a good enough tool to show that difference for sure – but at least we know that this process is not burning serious amounts of CPU time.
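One quick way to tell the two cases apart without attaching anything to the process is to sample the context switch counters in /proc/PID/status twice: if voluntary_ctxt_switches keeps growing between samples, the process is waking up from its sleeps every now and then; if the counters are frozen, it’s completely stuck. A minimal sketch (I’m using $$, the current shell’s own PID, just so the example runs anywhere – substitute the target PID, like 27288, on a real system):

```shell
# Sample the context switch counters twice; if voluntary_ctxt_switches
# grows between samples, the process does wake up occasionally.
pid=$$                              # substitute the target PID, e.g. 27288
grep ctxt /proc/$pid/status         # first sample
sleep 1                             # wait between samples (longer on a real system)
grep ctxt /proc/$pid/status         # second sample - compare the numbers
```

The same counters are also what tools like pidstat -w report as cswch/s, but reading /proc directly works even on a minimal install.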
Let’s use something else then. Normally when a process seems to be stuck like that (0% CPU usually means the process is stuck in some blocking system call, which causes the kernel to put it to sleep), I run strace on that process to see which system call it is currently stuck in. Also, if the process is not actually completely stuck but returns from a system call and wakes up briefly every now and then, that would show up in strace too (the blocking system call would complete and be entered again a little later):
[root@oel6 ~]# strace -cp 27288
Process 27288 attached - interrupt to quit
^C
^Z
[1]+  Stopped                 strace -cp 27288

[root@oel6 ~]# kill -9 %%
[1]+  Stopped                 strace -cp 27288
[root@oel6 ~]#
[1]+  Killed                  strace -cp 27288
Oops, the strace command itself got hung too! It didn’t print any output for a long time and didn’t respond to CTRL+C, so I had to put it into the background with CTRL+Z and kill it from there. So much for easy diagnosis.
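When even strace hangs like this, you can still read the process state straight out of /proc without attaching to the process at all – which is the whole theme of this post. The third field of /proc/PID/stat (and the State: line of /proc/PID/status) shows the scheduler state: R is running, S is interruptible sleep, and D is the uninterruptible sleep that top already hinted at. A sketch, again using $$ so it runs anywhere (substitute 27288):

```shell
# Read the process state without attaching to it:
#   R = running, S = interruptible sleep, D = uninterruptible sleep, etc.
pid=$$                              # substitute the hung process PID, e.g. 27288
awk '{print "state:", $3}' /proc/$pid/stat   # field 3 is the state letter
grep '^State:' /proc/$pid/status             # same info, human-readable
```

Note that a process stuck in D state ignores signals entirely, which is exactly why CTRL+C did nothing here – strace itself was blocked in the kernel while trying to attach.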