Systematic application troubleshooting in Unix

How many times have you seen a following case, where a user or developer complains that their Oracle session is stuck or running very slowly and the person who starts investigating the issue does following:

  1. Checks the database for locks
  2. Checks free disk space
  3. Checks alert log
  4. Goes back to the client saying “we did a healthcheck and everything looks ok” and closes the case or asks the user/developer to contact application support team or tune their SQL

The point here is that what the heck do the database locks, alert log or disk space have to do with first round session troubleshooting, when Oracle provides just about everything you need in one simple view?

Yes, I am talking about sampling V$SESSION_WAIT here. Database locks, free space and potential errors in alert log may have something to do with your users problems, but not necessarily. As there are many more causes, like network issues etc which could affect your user (and the whole database), it doesn’t make sense to go through all those random “healthchecks” every time you receive a user phone call. Moreover, even if you identify that there is shortage of disk space or there are many database locks – so what? They may not have anything to do with the users problem.

The issue here is that still many people do not know about V$SESSION_WAIT which in most cases shows your problem immediately or at least points you to right direction (e.g. there’s no need to check for locks if your session is waiting on “log file switch (archiving needed)” wait – and vice versa). Even if “these people” have heard of V$SESSION_WAIT and may be able to drop this in during their job interview, they may not know how to use it in systematic troubleshooting context. Many hours of service downtime and user frustration would be saved if all DBAs knew this extremely simple concept of looking at V$SESSION_WAIT.

This blog entry is not about Oracle though, so I will leave this rant for a future blog post.

This post is about a similar problem in Unix world. Having been involved with resolving some serious production issues lately I have been surprised quite many times by the corporate Unix support people who seem to do behave in similar manner. For example, there is a user calling in saying that their scheduled Unix job, which normally takes 5 minutes, has been running for hours now. The “senior unix support analyst” will do following:

  1. Check for free swap space
  2. Check for free disk space
  3. Check for number of network connections
  4. Maybe runs top, sar or vmstat to see what is the system-wide CPU utilization

And reply goes as: “We did a healtcheck and everything looks OK from our side”.

By now I know that the phrase above really means that “We have no idea how to check what’s wrong with your job and we don’t really care”

Ok, in order to avoid my blog becoming a collection of essays about the essence of life rather than a technical information source, here comes the tech part. Its about an issue I hit today.

I was copying one directory with lots of files from one Solaris 10 box to another – using scp. The scp did output all file names copied so I saw the progress. After many files were copied, the copy process suddenly got stuck.

Now, I could have started checking for random things like disk space, swap space, CPU or memory utilization, which would have led me nowhere… Instead I chose the simple and systematic approach which allowed me to diagnose the issue with 2 commands only:

$ ps -ef | grep scp
  oracle  1768   694   0 20:00:59 pts/3       0:00 grep scp
  oracle  1602  1601   0 19:13:09 ?           3:11 scp -r -p -f copy
$

Ok, my scp process is there (it hasn’t died or anything).

Let’s check what this actual process is doing (rather than checking some system-wide aggregations which don’t show anything about individual processes):

$ truss -p 1602
open64("copy/tmp/pipe", O_RDONLY) (sleeping...)

<...no further lines returned...>

^C$
$  
 

The above command returned only one line – showing that my scp process was stuck trying to open a named pipe (named copy/tmp/pipe)… which blocked my scp from proceeding as there was noone writing into the other end of this pipe. Apparently there’s a problem with the scp I was using, that it didn’t know how to handle pipes.

As I did not want to kill and restart my scp process I resolved the issue in a simple way:

$ echo blah > copy/tmp/pipe
$
   
 

This command above allowed the open64() syscall to complete, my scp read the “blah” string – reached EOF and knew to close this “file” and proceed to next.

So, the point of this post is – you need to use right tool for the right problem. A single session or process problem can not be diagnosed using systemwide tools.

  1. Whenever you diagnose a single session hang or performance problem in Oracle, you should first look into V$SESSION_WAIT (sample it few times to see whether the SEQ# and P1,2,3 values change). If you see the SEQ# value changing fast, you can sample and calculate wait time deltas from V$SESSION_EVENT as Snapper does.
  2. Whenever you diagnose a single process hang or performance problem in Unix, you should first use ps, prstat or top to see whether the process still exists and whether it’s mostly on CPU or not and use truss/strace or pstack as next steps for diagnosing what the process is doing exactly.

  

There will be follow-up blog posts on usage details of those tools…

 Update:

Here’s a link to the version of v$session_wait sampling script I use:

http://www.tanelpoder.com/files/scripts/sw.sql

See the comments of this blog entry for usage examples.

Note that this year’s only Advanced Oracle Troubleshooting class takes place in the end of April/May 2014, so sign up now if you plan to attend this year!

This entry was posted in Uncategorized and tagged , . Bookmark the permalink.

9 Responses to Systematic application troubleshooting in Unix

  1. Bas says:

    From 10.1 and up most columns of v$session_wait can also be found in v$session, it saves a join. My first query when troubleshooting a current performance problem is:

    SELECT wait_class
    , event
    , COUNT (1) wait_count
    FROM v$session
    WHERE NOT username IS NULL
    AND wait_class ‘Idle’
    AND audsid SYS_CONTEXT (‘USERENV’, ‘SESSIONID’)
    GROUP BY event
    , wait_class
    ORDER BY wait_class
    , event
    /

  2. tanelp says:

    Thanks Bas, yes 10g makes life easier thanks to v$session_wait contnts being in v$session.

    I’m using this script in addition to Snapper for checking session wait info:

    http://www.tanelpoder.com/files/scripts/sw.sql

    My script works on 9i and 10g/11g. It also decodes the enqueue and latch names the session is waiting on, on 9i.


    SQL> @sw 16

    SID STATE EVENT SEQ# SECONDS_IN_WAIT P1 P2 P3 P1TRANSL
    ---------- ------- ------------------- ---------- --------------- ---------- ---------- ---------- ----------------------
    16 WAITING enqueue 26 18 1415053318 262156 5249 0x54580006: TX mode 6

    I’ve written the script the waythat you can write the SID parameter as a subquery. This gives pretty good flexibility, especially when you want to report a group of sessions at a time:


    SQL> @sw "select sid from v$session where status = 'ACTIVE'"

    SID STATE EVENT SEQ# SECONDS_IN_WAIT P1 P2 P3 P1TRANSL
    ---------- ------- ------------------- ---------- --------------- ---------- ---------- ---------- -------------------------
    1 WAITING pmon timer 113 305 300 0 0
    2 WAITING rdbms ipc message 147 296 300 0 0
    3 WAITING rdbms ipc message 2029 159 300 0 0
    4 WAITING rdbms ipc message 261 0 300 0 0
    5 WAITING smon timer 208 239 300 0 0
    6 WAITING rdbms ipc message 5 242 180000 0 0
    7 WAITING rdbms ipc message 74 278 500 0 0
    13 WORKING On CPU / runqueue 53 0 1413697536 1 0
    16 WAITING enqueue 74 162 1415053318 262156 5249 0x54580006: TX mode 6

    9 rows selected.

  3. Neat ! Couldn’t find “P1TRANSL” in the code in sw.sql

  4. tanelp says:

    Thanks Hemant,

    I forgot to upload the most recent version. it’s done now + i wrote a another quick blog entry about it!

  5. Virag Sharma says:

    “Whenever you diagnose a single session hang or performance problem in Oracle, you should first look into V$SESSION_WAIT” ..

    I think DBA still need to see OS level stats before peeping into V$SESSION_WAIT / v$session because it was noticed some time other processes consume most of CPU i.e. CPU usage 100% … and oracle processes not getting enough CPU resource …. Some time found sql execution plan changed due to stats collection / DB upgrade and session start consuming 70-80 % CPU …… in that case also OS command like top / ps help to immediately find out PID->SID-> SQL_ID -> execution plan and if needed set event on session and get all wait event details

    As Old oracle Performance document suggest , if you are on production go left to right

    OS-> Wait-> Lock/latch etc-> Design

    Correct me , if am wrong :-)

  6. tanelp says:

    Hi Virag,

    Note the context of my post, I deliberately mentioned “first round session troubleshooting” in italic in my post.

    E.g. the first thing you check when you want to troubleshoot a specific session.

    You can get there both ways, either start from v$session_wait or OS level stats.

    The difference is that the v$session_wait will give you better, narrower focus (in other words it rules out more potential causes).

    When you check for CPU usage of a process, then no matter what result you get (either the process is sleeping, is 50% on CPU or 100% on CPU), you still have to log on to database and continue your investigation there.

    If you start from logging on to database and look into v$session_wait, then it can rule out many things which would require logging on to OS and using OS tools. For example (again) if you see the session waiting for TX enqueue for last 30minutes or for “log file switch (archiving needed”) then you probably do not care about the current CPU usage of the problem session.

    Of course, checking “environmental factors” like current CPU runqueue lengths, disk IO queuing and service times, can very well be necessary in many cases – but it’s worth taking your time to check these only when a simple query against v$session_wait has ruled out many other “usual suspects”.

    Note that after I wrote Snapper, I use sw.sql much less. Depending on the type of problem (and previous experience regarding that environment) I may start with Snapper instead as it gives a better overview of what’s going on when a session is actually doing something and not hung.

  7. Virag Sharma says:

    Thanks Tanel , your script sw.sql is very useful.

  8. Pingback: 8 Useful Technical Posts … - Sinoptix

  9. Thanks Hemant, i wrote a another quick blog entry about it!

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>