There are two ways for diagnosing problems:
- Checking for usual suspects and hoping to find a matching one
- Following a systematic approach
Checking for usual suspects and hoping to find a matching one
The first approach relies on previous experience (both in particular subject area/technology and about the context/environment the problem occurs). For example if a patient comes to doctor complaining about pain in chest, then for doctor (and also for the patient) it would definitely be beneficial to know more relevant info about the patient – the context. If the patient had just fell off a 10-foot ladder, then it’d be more suitable to look for broken ribs. On the other hand, if the patient has been a long-time smoker and was watching TV on a couch when the pain started, then perhaps it’d be more suitable to start with an EKG (note that I’m not an expert on how human body works so should anybody complain about any pain in their chest to you, send them to real doctor immediately!)
Anyway, if you’ve been administering a database full time for last 5 years, you will probably know where to look immediately when a specific problem occurs. Note that I wrote “where to look immediately” here, not “what to change immediately”. Using previous experience to identify root causes of problems is obviously a perfectly valid approach which may get you to the solution very fast (again and again) – but it can work well only if you do have lots of previous experience in solving problems for that technology and that particular environment (in other words, you know the context). The big risk here is that if a “new” problem expresses itself in similar symptoms like the “old-and-well-known-problem”, we could easily end up looking for and fixing the wrong issue. And if that doesn’t work, continuing to try out a solution which helped with another problem last year. And if that doesn’t help then the cycle continues, we dig up even more unlikely fixes which have been useful once in past and we apply them. And then we resort to googling and trying out whatever solutions anyone has suggested for problems others have experienced.
We have ended up in Desperate Switch Flipping state. Usually this leads to flipping even larger switches, starting from Oracle session/instance parameters and adding random SQL hints up to restarting servers, upgrading databases, operating systems, hardware – usually without any luck. This wastes time, doesn’t solve our problem and may cause even more trouble. This is not good. Checking out the usual suspects may help solving common recurring problems (hey, shouldn’t good specialists avoid recurring problems rather than fixing them again and again?) . However it is very important to draw the line between checking for usual suspects and falling into DSF state, as looks like happened to someone at Oracle-L.
“I ran Statspack reports at the highest level of detail until I was blue in the face. I ran traces. I set events. But I also am by nature intuitive and tend often to use intuition to solve a problem with facts to back up my intuitive conclusion. So after providing all of this stuff to Oracle Support, they were at a loss, well, they were very eager to look at corruption as a cause, because they didn’t have another solution.”
(Btw, I think the DSF state would be a suitable addition to pathological DBA problems list which Gaja once started with CTD)
The other way to diagnose problems is to follow a systematic approach, an appropriate methodology based on knowledge how computers work.
Following a systematic approach
I personally allow myself to check for the usual suspects based on my (sic!) experience in diagnosing problems on that system (or similar systems) and often do a quick Metalink search, but if this doesn’t reveal the root cause immediately, I will get systematic by starting measuring the dynamics of the task I’m interested in. I will not start randomly shooting in the dark or doing DSF based on random internet-posts showing up in search engines.
It’s easy to be systematic in Oracle world. This is mainly for following reasons:
- Computers and computer systems are deterministic – they always generate exactly the same set of output conditions for same set of input conditions. One could argue that human interaction, distributed systems and possible lower-level bugs/corruptions make large computer systems non-deterministic. However when troubleshooting a specific problematic task it is enough if we identify only the immediate points of system interaction which our task is experiencing. This is where instrumentation comes into play.
- Oracle is well-instrumented – it is easy to narrow down problems with your task as with various tracing, oracle wait interface and statistic counters you have good understanding what Oracle is doing when a problem occurs.
- Computer software is not a black box – and Oracle’s no exception. Any piece of software is just a bunch of computer instructions physically clustered into functions which the CPU then executes in a deterministic way (I avoided saying very deterministic here because you can’t have a more or less deterministic system, it’s either deterministic or not. And computer CPUs are definitely deterministic in their machine code execution). So it is fairly easy to follow where in the kernel code path Oracle is, even if the session statistics and wait interface are not enough (as I started blogging in my previous post in this series)
I normally start measuring the problematic task I’m diagnosing from high-level. Based on the results I will decide where to drill down and measure the next set of data, more focused on a specific area. I will look at the data then, measure again where appropriate and usually I get closer to the root cause step-by-step.
So, how does all this look in practice? I think the best way is to provide some examples from real life.There are various types of problems you could experience with Oracle, but one of the most complex (and interesting) type of problem is a performance problem. And this is where I start. I will start from a simple performance problem though – a single identifyable Oracle session experiencing performance issues.
Case study 1: Session hung – no wait no CPU
I often have to deal with issues where a session has apparently got hung and hasn’t responded for a while. Usually these cases are something simple as waiting for some lock or a query plan gone wild, doing nested loop (cartesian) joins over millions of rows. Such cases are easy to solve, however the following case was a pretty interesting one ( Oracle 10.1 on Solaris 8 Sparc 64bit ):
1) Check V$SESSION_WAIT for the session
I queried v$session_wait for that session using my little script sw.sql ( sw = Session Wait ). As my script formats some data for readability I added the output of a plain “raw” query against v$session_wait as well:
SQL> select state, seq#, event, seconds_in_wait from v$session_wait where sid = 197;
STATE SEQ# EVENT SECONDS_IN_WAIT
——————- ———- —————————– —————
WAITED KNOWN TIME 63 SQL*Net message from client 1505
SQL> @sw 197
SID STATE EVENT SEQ# SECONDS_IN_WAIT P1 P2 P3
———- ——- ——————– ——- ————— ———— ———- ———-
197 WORKING On CPU / runqueue 63 1514 1413697536 1 0
From output of both queries we see that Oracle thinks the session is NOT waiting on anything (as status is WAITED not WAITING), thus being on CPU, apparently for last 1500 seconds. In such cases I always check what the OS has to say, as OS tools will have more accurate picture on process’s state (and the reason is that OS tools will query the OS process tables directly, while Oracle’s V$SESSION_WAIT is populated by Oracle itself).
2) Confirm resource utilization from OS level
SQL> select spid from v$process where addr = (select paddr from v$session where sid = 197);
$ prstat -p 9597
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
9597 oracle 1145M 1070M sleep 59 0 0:00.00 0.0% oracle/1
$ ps -flp 9597
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
c S oracle 9597 1 0 40 20 ? 146564 ? 11:09:44 ? 0:00 oracleXXXXXX01 (LOCAL=NO)
Both prstat and ps output show that this process is sleeping and hasn’t used any noticeable CPU time at all, even though it had been active over 1500 seconds.
3) Check where in the Oracle kernel code path the process is sleeping
The two-step diagnosis above has already proven that this session has been waiting for something vast majority of it’s lifetime. And as Oracle’s wait interface has failed to record this wait then there’s no point in looking further into v$session_event and such. We go directly to the source and get what’s going on.
By going to the source I don’t mean looking into source code, as I don’t have it (and never had it). I rather mean that instead of making hypothesis and wild guesses where the process could be stuck, we will just go and see where it is stuck.
Any debugger can provide us this information, also in Solaris and Linux we have a handy pstack utility to use:
$ pstack 9597 9597: oracleXXXXXX01 (LOCAL=NO) ffffffff7c9a7854 sigsuspend (ffffffff7fff6d90) ffffffff7c956d2c _libc_sleep (0, 5, 0, 0, 0, ffffffff7fff6f21) + f8 0000000100cc0188 qm_init_uga_helper (380028000, 380028, 3ba048b68, 380000, 380028000, 380028000) + 6c8 00000001002d4808 rpiswu2 (0, 104556000, ffffffff7fff70d8, 2, 104556418, ffffffff7fff7a80) + 1a8 0000000100cc5af0 qm_run_as_ksys (ffffffff7fff761c, 100cbfac0, ffffffff7fff7541, 1042dbd80, 1042db000, 1042db) + b0 0000000100cc0314 qm_init_uga (3ee8, 1042969f0, 104296000, 104296, 100cbf000, 100cbf) + 14 0000000100d2a5f4 qmeGetResRelByHdl (105000, 0, 105140578, 21, 0, 0) + 54 0000000100cac7e0 qmtkLookupSchemaOIDPath (e, 1428, ffffffff7fff7fb0, 105069b28, 10506ae00, 104400) + 180 0000000100cac5cc qmtkLookupSchemaOID (105068700, 6, d, ffffffff7fff87d8, ffffffff7fff7ce4, 0) + ac 00000001002ba9b0 qmtLookupSchemaElement (104556, ffffffff7b3410d8, d, ffffffff7b340090, 7, ffffffff7fff87d8) + 30 000000010309a834 qmu_preparse_xml2 (42400, 3c0bfcf49, d, 0, e, ffffffff7fff7fb0) + 454 00000001030bccac qmxtgGetUrlAndElNameCS (c, ffffffff7b9701f8, ffffffff7fff87e8, 10501dba8, 10501dba8, ffffffff7fff8808) + 14c 00000001030bd0e0 qmxtgCreateFromLob (105068700, ffffffff7bb61e88, ffffffff7fff87e8, 0, 1a, ffffffff7b9701f8) + 180 0000000102649ca4 qmxtkCreateFromLob (ffffffff7fff9fe8, ffffffff7bb61e88, 0, ffffffff7fff8938, ffffffffffffffff, 0) + 164 0000000102649d30 qmxtkCreateFromClob2 (0, 0, 0, ffffffff7bb69960, ffffffffffffffff, 0) + 50 000000010264ae2c qmxtkConsFromClob (ffffffff7bb69980, 0, ffffffff7bb69960, 0, 0, 0) + 4c 0000000103a21b3c spefcpfa (0, 10264ade0, ffffffff7bb69828, ffffffff7fff9088, ffffffff7bb69960, 103a216a4) + 4dc 00000001039d8a70 spefmccallstd (ffffffff7fffa058, ffffffff7fff9b90, ffffffff7fff9c60, ffffffff7fff9bd0, 2c, ffffffff7fff9bd0) + 1f0 00000001038f42c8 peftrusted (ffffffff7fff9bd0, 104556000, ffffffff7fffac78, ffffffff7fffa058, 105068700, ffffffff7fff9c60) + 88 0000000103a9c678 psdexsp (380007, 104556, ffffffff7fff9278, 10506b000, 10506b, 105068860) + b8 00000001002d4808 rpiswu2 (0, 104556000, ffffffff7fff9438, 2, 104556418, ffffffff7fff98c8) + 1a8 0000000103a9c3c0 psdextp (10506a000, 103a9c000, 0, 0, a, 9) + 180 00000001038f3eac pefccal (ffffffff7fffac78, ffffffff7fffa058, ffffffff7fff9c60, ffffffff7ba5e050, 105068700, 100b0aa80) + 12c 00000001038f3c14 pefcal (ffffffff7fffa058, 1, 105068860, 4004, 4326, 4000) + 94 000000010377407c pevm_FCAL (ffffffff7ba5e050, 3ba9eed48, ffffffff7ba5e0b8, 3ba9f0e18, 1, ffffffff7fff9fe8) + 7c 00000001037400cc pfrinstr_FCAL (ffffffff7ba5e050, 3ba9f0bde, ffffffff7ba5e0b8, 10457d, 104400, 3ba9f0be4) + 4c 00000001037362c8 pfrrun_no_tool (ffffffff7ba5e050, 3ba9f0bdc, ffffffff7ba5e0b8, 10457c9d8, 2001, 2001) + 48 00000001037372d0 pfrrun (ffffffff7ba5e0b8, 200000, 0, 200000, ffffffff7ba5e050, 3ba97022c) + 2f0 0000000103783374 plsql_run (ffffffff7ba6e4a0, 1, 0, ffffdfff, ffffffff7fffac78, 0) + 274 0000000103722554 peicnt (ffffffff7fffac78, 105068860, 6, ffffffff7fffaae8, 41d8, 1050685e8) + d4 000000010327b784 kkxexe (105000, 104000, 105068, 104296000, 1050685e8, ffffffff7ba5e050) + 284 0000000101ad0228 opiexe (4, ffffffff7ba3a6a8, ffffffff7fffb6c0, 0, 0, ffffffff7bb70420) + 33c8 0000000101a4c0a8 kpoal8 (40008, 1, ffffffff7fffe450, 0, 0, 3) + 648 00000001002d0058 opiodr (14, 10506ae10, 10434ce70, 10506a, 105000, 104000) + 598 0000000102cded94 ttcpip (105071450, 18, ffffffff7fffe450, ffffffff7fffd748, 104229c98, ffffffff7fffd744) + 694 00000001002cd3e8 opitsk (1002cf000, 1, 0, ffffffff7fffe5a8, 105071450, 105071458) + 428 0000000101aaf564 opiino (105070000, 105000, 3c0a60c98, 105000, e4, 105070290) + 404 00000001002d0058 opiodr (4, 10506ae10, 10434c920, 10000, 105071, 105000) + 598 00000001002cc174 opidrv (0, 4, 10506a, 105071450, 0, 3c) + 354 00000001002c9828 sou2o (ffffffff7ffff278, 3c, 4, ffffffff7ffff258, 104aa6000, 104aa6) + 48 00000001002a7b34 main (2, ffffffff7ffff358, ffffffff7ffff370, 0, 0, 100000000) + 94 00000001002a7a7c _start (0, 0, 0, 0, 0, 0) + 17c
This is the full stack trace of the current state of server process 9597.
Even though at beginning it may look unreadable, it’s actually easy! Just start reading from bottom up:
- From the lowest line we see that _start() function has been invoked (normally by fork() + exec() syscalls in Unix) and the execution in that function has got to 0x17c bytes from the function start address, where a call to next function up in the list has been made. Note that the function start address is specified in the 1st column of output, immediately before the function name.
- In our case main() was the next function called (sounds familiar? ;) and the program execution has got to 0x94 bytes from its start when it called sou2o().
- sou2o called opidrv() after reaching 0x48th byte from its start
- opidrv (Oracle Program Interface DRiVer) called opiodr() after reaching 0x354th byte from its start. Note that the numbers in brackets are the arguments passed to the called function. These may not always be complete and entirely accurate though, as they are read from thread stack storage but some parameters may be passed in CPU registers, thus never be written to stack at all.
- opiodr called opiino() after reaching 0x598th byte from its start. Note that this doesn’t mean that only 0x598 bytes worth of instructions have been executed, a lot of other function calls may have been made, which returned after doing their work, also there may have been lots of jumping back and forth inside the function itself (looping for example)
- opiino called opitsk(). This is the place where the main low-level task dispatching is done. When there is no work to do, then opitsk normally waits in opikndf2() which in turn waits for n* functions which hold Oracle’s network-handling code. Once the network handling functions return, they usually have some data sent from client, for which opitsk then calls an appropriate handling function. This is usually ttcpip as also seen in stack trace above.
- Despite its name, ttcpip() doesn’t have directly to do with TCP/IP, it actually means Two-Task Common PIPe read/write (TTC pipe). As the name suggests, this is the separating layer and gatekeeper between “inner-side” of the server code and the “outer-side” exposed to the world. I think of it like a system call interface into Oracle kernel. Many sanity checks and also conversions are done in TTC layer.
- Skipping few rows, we see opiexe() which is OPI function for execute requests ( they translate to EXEC lines in sql trace, also there are corresponding opipar and opifch functions for PARSE and FETCH )
- Continuing on, we see peicnt, plsql_run and pfrrun which are PL/SQL related functions, so obviously this process must be running some PL/SQL
- Continuing on, we see rpiswu2 function – RPI stands for Recursive Program Interface and is used for setting up recursive calls like recursive SQL, autonomous transactions, executing SQL from PL/SQL and vice versa etc etc etc. Basically RPI calls set up a separate context in form of variables, state objects etc for calls which need to be executed before the current call can complete.
- Continuing, we see a peftrusted() call. This one is an interesting one, it allows to call external libraries in trusted-mode, meaning that they already are in (or are loaded in to) Oracle server processes address space (as opposed to calling the library function through extproc). Oracle’s XMLType type and oracle_loader & oracle_datapump ODCI cartridges use this feature.
- Continuing, we see a bunch of qm* calls, these handle XML datatypes and XML schemas.
- Continuing towards the top of the stack we see some UGA initialization related functions and a function which name indicates that some code is going to be executed as ksys (I don’t know what ksys means, but as that qm_run_as_ksys function calls rpiswu2 again, this is a good example how Oracle sets up a separate context to execute code in elevated privilege mode).
Now, 0x6c8 bytes into qm_init_uga_helper function we see something different happening – this function calls a next function _libc_sleep() which in turn has issued sigsuspend() system call, but from the first column we see that these functions reside in a way different location in that processes address space (at 0xffffffff7c956d2c as opposed to 0x000000010xxxxxxx…). So what’s going on?
Such difference in function start addresses can be explained by examining the processes address space:
$ pmap 9597 | grep -i ^000000010 0000000100000000 81016K read/exec /apps/oracle/product/10.1.0.3/bin/oracle 000000010501C000 864K read/write/exec /apps/oracle/product/10.1.0.3/bin/oracle 00000001050F4000 432K read/write/exec [ heap ] $ pmap 9597 | grep -i ^ffffffff7c9 FFFFFFFF7C900000 728K read/exec /usr/lib/sparcv9/libc.so.1
From above output we see that Oracle binary itself has been mapped to address 0x100000000 in the process address space
- libc.so.1 has been mapped to ffffffff7c900000 and it’s total size is 728kB, which means that the _libc_sleep function is somewhere in the middle of library there (at address ffffffff7c956d2c – 56d2c is roughly 355k). The function starting address relative to starting address of the library image could be verified using nm utility:
$ nm /usr/lib/sparcv9/libc.so.1 | egrep "Size|_libc_sleep" [Index] Value Size Type Bind Other Shndx Name  | 355380| 396|FUNC |LOCL |0 |9 |_libc_sleep
So, after doing this stack reading we have conclusive evidence where our Oracle process is its execution. It’s stuck – in a sleep system call made by some XML code, and it is not instrumented by Oracle wait interface. This is definitely abnormal and is not just some perfrormance or user code issue. Reading a stack trace became very handy.
Armed with knowledge that this wasn’t just an user process waiting uninstrumented on IO or spinning on the CPU and after seen the rpiswu2 / init_uga / _libc_sleep calls in stack, I started wondering whether this could be something related to memory allocation or system library calls (yes, this is where the previous experience kicked in again).
I looked into Unix environment variables used for starting the Oracle instance and there it was – $ORACLE_HOME/lib32 directory coming before $OH/lib in LD_LIBRARY_PATH. This caused some XMLDB functions to fail (maybe trying to load 32-bit library into 64-bit address space and failing had messed something up, perhaps in that processes PLT section where addresses of shared object-library functions are loaded).
While checking the stack trace in this example did not point the environment variable problem out to me directly, it helped to immediately eliminate many other causes for such hangs like uninstrumented IO or missing an IPC post from another process thus not being woken up etc. Also the Oracle kernel functions reported in top of the stack gave some indication on the nature of work done ( RPI setting up another call environment, UGA initialization – which can mean memory allocation etc).
So, stack tracing can be very helpful where Oracle instrumentation doesn’t deliver.The key thing about stack traces is that from there you can see the ultimate computer reality, the most accurate place to check what an Oracle server process is doing. No statistic or wait event can’t beat the accuracy and reliability of stack traces as this info comes directly from “the source” – the thread stack memory structure organized by hardware ABI (Application Binary Interface) standard.
Note that there are few issues (as always) with getting and interpreting stack traces and there may be cases where even stack traces don’t give you accurate picture (stack corruptions for example). However, I’ve just realized, to my surprise, that it’s 1am already, so I’ll blog a about it some next time ;-)
So, where am I getting at?
Well I just got started, I have couple examples from my experience of diagnosing ASSM performance bugs using stack tracing. When I have time to blog these, it should be evident how crucial tool stack tracing is for advanced and accurate troubleshooting. Nevertheless, stack tracing may not be required at all if Oracle/OS instrumentation can show you the accurate picture. Or even such tools may not always be required – if you have some serious previous experience on the system you’re troubleshooting ( anyone firefought bad query plans due bind variable peeking with comment on table T is ‘blah’ every Monday morning? ;-)
So, where I’m getting at is stated below – and I intend to write more posts about the second point below:
- Checking for the usual suspects may get you there, very fast.
- Following a systematic approach will get you there.
Metalink Note 175982.1 ORA-600 Lookup Error Categories ( many Oracle kernel function prefixes are documented there ). Update: This note has been taken down from MOS (as of 2014), so you can search for “ORA-600 Lookup Error Categories” in Google to find some “cached” copies.
Metalink Note 453521.1 ORA-04031 “KSFQ Buffers” ksmlgpalloc ( some common Oracle kernel functions are documented there )