Systematic application troubleshooting in Unix

How many times have you seen a following case, where a user or developer complains that their Oracle session is stuck or running very slowly and the person who starts investigating the issue does following:

Checks the database for locks
Checks free disk space
Checks alert log
Goes back to the client saying “we did a healthcheck and everything looks ok” and closes the case or asks the user/developer to contact application support team or tune their SQL

The point here is that what the heck do the database locks, alert log or disk space have to do with first round session troubleshooting, when Oracle provides just about everything you need in one simple view?

Yes, I am talking about sampling V$SESSION_WAIT here. Database locks, free space and potential errors in alert log may have something to do with your users problems, but not necessarily. As there are many more causes, like network issues etc which could affect your user (and the whole database), it doesn’t make sense to go through all those random “healthchecks” every time you receive a user phone call. Moreover, even if you identify that there is shortage of disk space or there are many database locks – so what? They may not have anything to do with the users problem.

The issue here is that still many people do not know about V$SESSION_WAIT which in most cases shows your problem immediately or at least points you to right direction (e.g. there’s no need to check for locks if your session is waiting on “log file switch (archiving needed)” wait – and vice versa). Even if “these people” have heard of V$SESSION_WAIT and may be able to drop this in during their job interview, they may not know how to use it in systematic troubleshooting context. Many hours of service downtime and user frustration would be saved if all DBAs knew this extremely simple concept of looking at V$SESSION_WAIT.

This blog entry is not about Oracle though, so I will leave this rant for a future blog post.

This post is about a similar problem in Unix world. Having been involved with resolving some serious production issues lately I have been surprised quite many times by the corporate Unix support people who seem to do behave in similar manner. For example, there is a user calling in saying that their scheduled Unix job, which normally takes 5 minutes, has been running for hours now. The “senior unix support analyst” will do following:

Check for free swap space
Check for free disk space
Check for number of network connections
Maybe runs top, sar or vmstat to see what is the system-wide CPU utilization

And reply goes as: “We did a healtcheck and everything looks OK from our side”.

By now I know that the phrase above really means that “We have no idea how to check what’s wrong with your job and we don’t really care”

Ok, in order to avoid my blog becoming a collection of essays about the essence of life rather than a technical information source, here comes the tech part. Its about an issue I hit today.

I was copying one directory with lots of files from one Solaris 10 box to another – using scp. The scp did output all file names copied so I saw the progress. After many files were copied, the copy process suddenly got stuck.

Now, I could have started checking for random things like disk space, swap space, CPU or memory utilization, which would have led me nowhere… Instead I chose the simple and systematic approach which allowed me to diagnose the issue with 2 commands only:

$ ps -ef | grep scp
  oracle  1768   694   0 20:00:59 pts/3       0:00 grep scp
  oracle  1602  1601   0 19:13:09 ?           3:11 scp -r -p -f copy
$

Ok, my scp process is there (it hasn’t died or anything).

Let’s check what this actual process is doing (rather than checking some system-wide aggregations which don’t show anything about individual processes):

$ truss -p 1602
open64("copy/tmp/pipe", O_RDONLY) (sleeping...)

<...no further lines returned...>

^C$
$ </code>

The above command returned only one line – showing that my scp process was stuck trying to open a named pipe (named copy/tmp/pipe)… which blocked my scp from proceeding as there was noone writing into the other end of this pipe. Apparently there’s a problem with the scp I was using, that it didn’t know how to handle pipes.

As I did not want to kill and restart my scp process I resolved the issue in a simple way:

$ echo blah > copy/tmp/pipe
$
 </code>

This command above allowed the open64() syscall to complete, my scp read the “blah” string – reached EOF and knew to close this “file” and proceed to next.

So, the point of this post is – you need to use right tool for the right problem. A single session or process problem can not be diagnosed using systemwide tools.

Whenever you diagnose a single session hang or performance problem in Oracle, you should first look into V$SESSION_WAIT (sample it few times to see whether the SEQ# and P1,2,3 values change). If you see the SEQ# value changing fast, you can sample and calculate wait time deltas from V$SESSION_EVENT as Snapper does.
Whenever you diagnose a single process hang or performance problem in Unix, you should first use ps, prstat or top to see whether the process still exists and whether it’s mostly on CPU or not and use truss/strace or pstack as next steps for diagnosing what the process is doing exactly.

There will be follow-up blog posts on usage details of those tools…

Update:

Here’s a link to the version of v$session_wait sampling script I use:

https://github.com/tanelpoder/tpt-oracle/blob/master/sw.sql

See the comments of this blog entry for usage examples.

Systematic application troubleshooting in Unix

Tanel Poder

2008-01-05