Advanced Oracle Troubleshooting Guide, Part 3: More adventures in process stack

…or rather thread stack as nowadays decent operating systems execute threads (or tasks as they’re called in Linux kernel).

Anyway, stack trace gives you the ultimate truth on what your program is doing, exactly right now. There are couple of but’s like stack corruptions and missing symbol information which may make the traces less useful for us, but for detailed hang & performance troubleshooting the stack traces are a goldmine.

So, I present another case study – how to diagnose a complete database hang when you can’t even log on to the database.

I was doing some performance diagnosis work and wanted to find out where some pointers in some stack traces were pointing in shared pool. The X$KSMSP is the place where you’d find out what type of chunk happens to own the given address and which one is the parent heap of that chunk.

However, at Hotsos Symposium a year or two ago I remember Jonathan Lewis mentioning to be very careful with selecting from X$KSMSP on busy systems with large shared pools.

The reason is that X$KSMSP will take the shared pool latches and start scanning through shared pool heaps, reporting all shared pool chunks in it.

  • This means that shared pool latches are held for very long time (the larger the shared pool, the more chunks you’ve got in SP, the more it takes).
  • This in turn means that you will not be able to allocate or deallocate any chunk from these pools during there latches are held
  • This in turn means that you will not be able to log on to database (as it allocates space in shared pool for session parameter values), you will not be able to hard parse a statement (as it needs to allocate SP space for your cursor and execution plan), you will not be able to even do some soft parses (as those may also use shared pool latches in some cases).

The shared pool size was set to 5GB, due some memory leak issues in past.

I decided to not run just select * from x$ksmsp on the full X$ table, I ran a select * from x$ksmsp where rownum <=10 instead, just to see whether this approach of fetching few rows at a time would work better. Obviously I did this in a test environment first… and it got completely hung. Oh man.

So I started up another sqlplus session to see what’s going on – and it got hung too. By then it was pretty clear to me that it was this killer X$KSMSP query, which didn’t care about my “where rownum <= 10″ trick and probably kept sweeping through the whole shared pool – holding the shared pool latch all the time.

As I wasn’t able to log on to Oracle, the next thing was to check what my process was doing from OS level. The problem with that was that I didn’t have an idea what was my sessions process ID, I didn’t know when exactly I had logged on with my session. I didn’t have access to /dev/kmem to use lsof tool either which would have pointed me out which SPIDs correspond to TCP connections from my workstation’s IP.

Note that even though a test environment, this was a busy system, used heavily by various other processes and batch jobs. So I needed to be really sure which one is my process before killing it at OS level.

I started with ps and prstat:

$ prstat 1
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 24331 xxx_user 8312M 8254M cpu515  10    0   1:22.44 4.1% oracle/1
 26136 oracle   8347M 8285M cpu528  10    0   0:10.25 4.1% oracle/1
 23949 xxx_user 8306M 8248M cpu520   0    0   1:23.21 4.1% oracle/1
 21853 oracle     20G   59M sleep   58    0   0:00.00 2.2% oracle/21
 21849 oracle     20G   59M cpu16   31    0   0:19.52 2.2% oracle/21
 21860 oracle     20G   58M cpu512  32    0   0:23.32 2.0% oracle/19
 24681 oracle     20G   35M cpu531  42    0   8:06.28 0.9% oracle/15
 26516 oracle     20G   13M sleep   58    0  11:34.29 0.0% oracle/1
 27522 oracle     20G   11M sleep   58    0   0:57.33 0.0% oracle/152
 13742 oracle   2064K 1800K cpu19   56    0   0:00.00 0.0% prstat/1
 27526 oracle     20G   11M sleep   58    0   0:58.45 0.0% oracle/149
 27528 oracle     20G   11M sleep   58    0   0:57.07 0.0% oracle/148
 27520 oracle     20G   11M sleep   58    0   0:57.28 0.0% oracle/157
 27524 oracle     20G   11M sleep   58    0   0:58.18 0.0% oracle/166

Even though the CPU usage and the process owner names gave me some hints which one could have been the troublemaker, I wanted to get proof before taking any action. And pstack was very helpful for that (note that due this hang I wasn’t even able to log on with another session!).

So I ran pstack on the main suspect process, 26136. I picked this one as it was the only remote Oracle process using 100% CPU (on a 24-CPU server). The other two top processes were spawned by a local user using bequeath, thus their username was different.

$ pstack 26136
26136:  oracleDBNAME01 (LOCAL=NO)
 000000010301f6c0 kghsrch (105068700, 0, ffffffff7bc77a68, 118, ffffffff7bc77ad8, 0) + 20
 0000000103021f20 kghprmalo (7fffffff, 0, ffffffff7bc77a68, 0, ffffffff77130b90, 140) + 460
 0000000103024804 kghalp (ffffffff7fffa898, ffffffff7bc77a68, 58, 7ffffffc, fffffffffffffffc, 104b86ce8) + 884
 0000000100bb78a4 ksmspc (0, 38004d988, ffffffff7bc7ad80, 54e50a7d0, 458, 104e22180) + 44
 000000010303564c kghscn (104e22180, 38004d988, 1000000000000000, 7ffffff8, 105068700, 54e50a7d0) + 68c
 0000000103034750 kghnwscn (0, ffffffff7bc7ad80, 100bb7860, 1858, 2fa0, 105068700) + 270
 0000000100bb7a44 ksmshp (100bb7, 1, 380000030, ffffffff7bc77a68, 100bb7860, ffffffff7bc7ad80) + 64
 00000001024e67a8 qerfxFetch (10449c000, 0, 10506ae10, ffffffff7bc7ad40, 5651c5518, 1043ae680) + 328
 00000001024e7edc qercoFetch (541d07900, 10449dba0, 4e, 1, ffffffff7bc7ad98, 1024e6480) + fc
 0000000101aa6f24 opifch2 (2, 5, 60, 1, 104400, 1050685e8) + a64
 0000000101a4c694 kpoal8 (0, 1, ffffffff7fffdc70, 0, 10434c0a8, 1) + c34
 00000001002d0058 opiodr (14, 10506ae10, 10434ce70, 10506a, 105000, 104000) + 598
 0000000102cded94 ttcpip (105071450, 18, ffffffff7fffdc70, ffffffff7fffcf68, 104229c98, ffffffff7fffcf64) + 694
 00000001002cd3e8 opitsk (1002cf000, 1, 0, ffffffff7fffddc8, 105071450, 105071458) + 428
 0000000101aaf564 opiino (105070000, 105000, 57a321e48, 105000, dc, 105070290) + 404
 00000001002d0058 opiodr (4, 10506ae10, 10434c920, 10000, 105071, 105000) + 598
 00000001002cc174 opidrv (0, 4, 10506a, 105071450, 0, 3c) + 354
 00000001002c9828 sou2o (ffffffff7fffea98, 3c, 4, ffffffff7fffea78, 104aa6000, 104aa6) + 48
 00000001002a7b34 main (2, ffffffff7fffeb78, ffffffff7fffeb90, 0, 0, 100000000) + 94
 00000001002a7a7c _start (0, 0, 0, 0, 0, 0) + 17c

So, reading the bold sections from bottom up (and using the Metalink note 175982.1 I’ve mentioned earlier)

  1. opifch2: this is a FETCH call being executed
  2. qercoFetch: is a COUNT row source in execution plan
  3. qerfxFetch: is a FIXED TABLE row source (in other words, accessing an X$ table)
  4. ksmshp: looks like this is the function which is called under the covers when X$KSMSP is accessed
  5. kghscn: sounds a lot like Kernel Generic Heap SCaN :)

So, this stack trace proves that the process was definitely executing code which was doing a fetch from an X$ table (the qerfxFetch) and this had eventually resulted in an operation which sounded like a heap scanner ( kghscn ). This is exactly what X$KSMSP does, it scans through the entire shared pool heap and returns a row for every memory chunk in there.

Why was there a qercoFetch function involved even though my query did not have a count(*) in it? The explanation is below:

SQL> select * from x$ksmsp where rownum <=1;

ADDR           INDX    INST_ID   KSMCHIDX   KSMCHDUR KSMCHCOM
-------- ---------- ---------- ---------- ---------- ---------------
04AA0A00          0          1          1          1 free memory

SQL> @x

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------

--------------------------------------------------------------------
| Id  | Operation            |  Name       | Rows  | Bytes | Cost  |
--------------------------------------------------------------------
|   0 | SELECT STATEMENT     |             |       |       |       |
|*  1 |  COUNT STOPKEY       |             |       |       |       |
|   2 |   FIXED TABLE FULL   | X$KSMSP     |       |       |       |
--------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWNUM<=1)

The qercoFetch rowsource function was the one which should have enforced the “where rownum <=10″ part of my query. It should have stopped calling the qerfxFetch function after it had received its 10 rows, but for some reason it continued on. Or the reason could be that the qerfxFetch function is not sophisticated enough to return control back to calling function with only part of its resultset, leaving no chance to qercoFetch to decide whether to stop fetching half-way through. (Many row sources have such flexibility of returning data in cascading fashion up in stack. For example nested loops are cascading (qerjoFetch, qerjotFetch), hash joins are cascading starting from the moment the hash join build partition has built in memory (qerhjFetch). Sort merge joins are not cascading from high-level perspective, as both of the joined row sources need to be sorted before any rows can be returned. However when both row sources have been sorted (either in memory or temp tablespace), the matched rows can be returned in a cascading fashion again, meaning that the calling function can fetch only X rows, do its processing and fetch next X rows etc. The X is normally dependent on arraysize – the amount of rows you fetch from Oracle at a time. Anyway, I leave this interesting topic for a next post ;)

Continuing with my hang diagnosis, after I had identified the troublemaker X$KSMSP scanner process, using prstat and pstack, I remembered that I could also have checked the sqlplus.exe’s timestamp against the server process start time (accounting the time zone difference and clock drift of course). This can be helpful when you have all your processes either trying to be on CPU or sleeping – unless you like to pstack through all of them.

I had identified one such process with logon time close to my sqlplus.exe’s start time and I took pstack on it:

$ pstack 4848
4848:   oracleDBNAME01 (LOCAL=NO)
 ffffffff7c9a5288 semsys   (2, 6d0a000a, ffffffff7fffa79c, 1, 10)
 0000000102f782c8 sskgpwwait (ffffffff7fffab48, 10501dba8, 0, 200, ffffffffffffffff, 2b) + 168
 0000000100af4b74 kslges (31c, 0, 57f338090, 0, 876, 578802b68) + 4f4
 0000000102d70ea0 kglhdgn (3000, ffffffff7fffb3d0, 12, 0, 0, 380003910) + 180
 0000000102d58974 kglget (105068700, ffffffff7fffb360, fc, 1000, 0, 1) + b94
 000000010128f454 kkttrex (8, 1050683c0, 10506ae08, 105000, 10506a000, 9) + 4b4
 000000010128e6c8 kktexeevt0 (31a8, 57a3ae0a0, 105068f74, 2, 4, 10506ae08) + 3a8
 0000000101a77b60 kpolon (51, 51, 40002d91, 105071450, 80000, ffffffff7fffdc70) + c0
 00000001002d0058 opiodr (1a, 10506ae10, 10434cc68, 10506a, 105000, 105000) + 598
 0000000102cded94 ttcpip (105071450, 20, ffffffff7fffdc70, ffffffff7fffcf68, 104228a98, ffffffff7fffcf64) + 694
 00000001002cd3e8 opitsk (1002cf000, 1, 0, ffffffff7fffddc8, 105071450, 105071458) + 428
 0000000101aaf564 opiino (105070000, 105000, 57f337fd8, 105000, e3, 105070290) + 404
 00000001002d0058 opiodr (4, 10506ae10, 10434c920, 10000, 105071, 105000) + 598
 00000001002cc174 opidrv (0, 4, 10506a, 105071450, 0, 3c) + 354
 00000001002c9828 sou2o (ffffffff7fffea98, 3c, 4, ffffffff7fffea78, 104aa6000, 104aa6) + 48
 00000001002a7b34 main (2, ffffffff7fffeb78, ffffffff7fffeb90, 0, 0, 100000000) + 94
 00000001002a7a7c _start (0, 0, 0, 0, 0, 0) + 17c

Looks like this one was my other session I tried to start up after my first session got hung.

Let’s read the highlighted sections of stack from the top this time:

  1. semsys is a Solaris system call allowing to wait for semaphore count (value) to increase above 0. In other words this system call allows a thread to sleep until someone posts it through the semaphore
  2. sskgpwwait is an Oracle operating system dependent (OSD) layer function which allows Oracle process to wait for an event
  3. kslges is a latch get function (with sleep and timeout capability)
  4. kglhdgn is a call for creating a new library cache object (handle)
  5. kglget is a lookup call for locating a library cache object (if it doesn’t find one, a library cache miss is incremented and kglhdgn() above is called)
  6. kkt calls are related to firing some internal triggers (logon triggers and auditing, anyone?)
  7. kpolon is related to logon and session setup as far as I know

So this stack indicates ( I might not be entirely correct ) that this process had got stuck while trying to create a library cache object in shared pool for an internal trigger during logon ( this database used session auditing by the way ).

Anyway, when I killed my original session (the one who was scanning through X$KSMSP), the database worked ok again.
The advantage of stack tracing was once again shown in a case where Oracle’s instrumentation was not usable (due my stupid mistake ;). This again illustrates once more that you should test everything out thoroughly in test environments even if you think it should work ok.

This entry was posted in Oracle and tagged , , , , . Bookmark the permalink.

6 Responses to Advanced Oracle Troubleshooting Guide, Part 3: More adventures in process stack

  1. John Ospino says:

    Hi Tanel, I already paid for the seminar in Madrid, so I gonna meet this May 11 .).
    First all, I’m really impressed about the way you treat the problems, and second I have a lot of questions to you, sorry in advance if I don’t let you breath.
    :)
    But, Where did you find defintions like “opidrv”, “ttcpip”, etc.?
    It’s really impresive if you just conclude by yourself witout Oracle support.
    Anyway thank you for share your knowlegde for free :), at least a part.

  2. Tanel Poder says:

    There’s that magic metalink note 175982.1 which documents the function name prefixes pretty well. I’ll speak about few techniques for identifying actual function names at my seminar (make sure you ask for it as sometimes I will skip through that section due lack of time)

  3. Jon Adams says:

    I have seen hangs when querying sys objects before and was able to get a decent response by adding the RULE hint to the query. This is probably due to a lack of system object statistics, which no one seems to want to run anymore.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>