Advanced Oracle Troubleshooting Guide, Part 8: Even more detailed latch troubleshooting using LatchProfX

Tanel Poder

2008/07/22

In my last AOT post I published my LatchProf script, which is able to sample detailed latch holder data from V$LATCHHOLDER.

LatchProf allows you to drill down into your latching problems at the session level (which V$LATCH, V$LATCH_PARENT and V$LATCH_CHILDREN can’t do). It gives you valuable details about the individual sessions which are holding a latch the most and are therefore likely contributing the most to the latch contention problem.

However, after you have discovered the troublemaking session, what next? One way forward is looking into the V$SESSTAT counters using my Snapper tool. Depending on which latch is the problematic one, you would look for different statistics: various buffer get statistics for cache buffers chains latches, or parsing/executing statistics when looking into library cache latches. However, if those statistics look “normal”, is there any other way to drill down further?

Yeah, there is, so let’s look into it!

A side comment: nowadays I try to avoid using X$ tables in favor of V$ views, because in my experience you don’t always have easy SYS access or X$ table “proxy views” created in client environments. Also, if you can get your job done using simple V$ views, then why bother with the more complex X$ tables? (yeah, it’s much cooler, I know ;)

Anyway, here is an exception: the LatchProfX script, which stands for Latch holder Profiler eXtended, because it provides some very interesting extended information about latch holders: it shows why the currently held latch was taken in the first place. It uses X$KSUPRLAT instead of V$LATCHHOLDER, as the latter doesn’t externalize all the required information. I’m talking about the latch “where” statistics.

By “where” statistics I mean that the coders of the Oracle kernel’s latching module have instrumented their code well enough that whenever they get a latch, they record the code location ID of the latch getter within the latch structure itself.
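To illustrate the core idea, here is a simplified sketch of the kind of join LatchProfX is built around. Note that this is not the actual script code, just a hedged illustration: the column names are as I’ve seen them on 10.2 and may differ between versions, so verify them with DESC x$ksuprlat and DESC x$ksllw before relying on this. X$KSUPRLAT externalizes the current latch holders together with the “where” opcode, and X$KSLLW maps that opcode to a human-readable kernel location name and an optional label:

SQL> select h.ksuprsid  sid
  2       , h.ksuprlnm  latch_name
  3       , w.ksllwnam  ksllwnam
  4       , w.ksllwlbl  ksllwlbl
  5  from   x$ksuprlat h
  6       , x$ksllw    w
  7  where  h.ksulawhr = w.indx;

The real LatchProfX script samples this kind of data in a tight loop and aggregates the results, which is what produces the Held/Gets counters in the output below.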

Let’s not keep you waiting any longer and show some examples. The output is quite wide, so you may need to size your browser window accordingly.

Normally you would run LatchProfX with specific enough parameters to monitor only the one or few latches which you know are the troublemakers. However, I will do the first example run across all the instance’s latches (the third parameter is %) to better illustrate the usefulness of the LatchProfX result data.

LatchProfX syntax is similar to the original LatchProf script, with the addition of the ksllwnam and ksllwlbl field names you can sample. These are exactly the fields (from X$KSLLW) which give us the descriptions for the “where in the kernel was the latch acquired from” opcodes we get from X$KSUPRLAT (see the script itself for syntax and internals).

So (finally), let’s run it:

SQL> @latchprofx name,ksllwnam,ksllwlbl % % 100000

-- LatchProfX 1.06 by Tanel Poder (  )

NAME                                KSLLWNAM                                 KSLLWLBL                   Held       Gets  Held %     Held ms Avg hold ms
----------------------------------- ---------------------------------------- -------------------- ---------- ---------- ------- ----------- -----------
shared pool                         kghalo                                                              9705       5987    9.71     185.366        .031
ges resource hash list              kjuscl: lock close request                                          2429       2350    2.43      46.394        .020
kks stats                           kksAllocChildStat                                                   2001       1460    2.00      38.219        .026
ges resource hash list              kjrmas1: lookup master node                                         1780       1724    1.78      33.998        .020
shared pool                         kghupr1                                  Chunk Header               1618        891    1.62      30.904        .035
row cache objects                   kqrpre: find obj                                                     988        986     .99      18.871        .019
shared pool simulator               kglsim_unpin_simhp                                                   804         97     .80      15.356        .158
ges process parent latch            kjlalc: lock allocation from lmon                                    779        766     .78      14.879        .019
active service list                 kswsgetso: get service object                                        663        658     .66      12.663        .019
shared pool                         kghalp                                                               635        628     .64      12.129        .019
ges process parent latch            kjatioc                                                              518        518     .52       9.894        .019
shared pool simulator               kglsim_upd_newhp                                                     402        401     .40       7.678        .019
ges resource hash list              kjrref: find matched resource                                        374        371     .37       7.143        .019
row cache objects                   kqreqd                                                               360        360     .36       6.876        .019
ges process parent latch            kjlrem: detach lock from group                                       235        229     .24       4.489        .020
enqueues                            ksqdel                                                               162        162     .16       3.094        .019
enqueues                            ksqgel: create enqueue                   parent obj                  148        148     .15       2.827        .019
row cache objects                   kqreqd: reget                                                        108        105     .11       2.063        .020
enqueue hash chains                 ksqrcl                                   resource addr               101        101     .10       1.929        .019
shared pool                         kgh_heap_sizes                                                        99         99     .10       1.891        .019
enqueue hash chains                 ksqgtl3                                  acquiring session            93         93     .09       1.776        .019
ges enqueue table freelist          kjlalc: lock allocation                                               85         83     .09       1.624        .020
shared pool                         kghfre                                   Chunk Header                 74         65     .07       1.413        .022
shared pool simulator               kglsim_scan_lru: scan                                                 69         69     .07       1.318        .019
ges process parent latch            kjucll: closing lock                                                  68         61     .07       1.299        .021
shared pool simulator               kglsim_chg_simhp_free                                                 48         47     .05        .917        .020
ges domain table                    kjdmlad: add a lock to an xid-hashed loc                              42         41     .04        .802        .020
shared pool simulator               kglsim_chg_simhp_noob                                                 41         41     .04        .783        .019
shared pool simulator               kglsim_unpin_simhp: fast path                                         39         39     .04        .745        .019
ksuosstats global area              ksugetosstat                                                          29          1     .03        .554        .554
ges group table                     kjgdgll: move a lock from group lock lis                              13          6     .01        .248        .041
kokc descriptor allocation latch    kokcdlt: regular free                    latch                        11          4     .01        .210        .053
ges enqueue table freelist          kjlfr: remove lock from parent object                                 11         11     .01        .210        .019
kokc descriptor allocation latch    kokcdlt: allocation                      latch                         5          4     .01        .096        .024
cache buffers chains                kcbgtcr: fast path                                                     3          3     .00        .057        .019
ges caches resource lists           kjrchc: cached obj cleanup                                             2          2     .00        .038        .019
process allocation                  ksufap: active procs                                                   2          2     .00        .038        .019
ges group table                     kjgagll: move a lock from process lock l                               2          2     .00        .038        .019
ges domain table                    kjdmlrm: remove a lock from its xid-hash                               1          1     .00        .019        .019
ges caches resource lists           kjchc: clean resource cache                                            1          1     .00        .019        .019
ges resource hash list              kjruch: move resource from cache to free                               1          1     .00        .019        .019
name-service memory objects         kjxgsnep: get a value-block object                                     1          1     .00        .019        .019
active service list                 kswsite: service iterator                                              1          1     .00        .019        .019

43 rows selected.

Elapsed: 00:00:02.06

What do we see? While LatchProf would just have provided latch holder info about which (child) latch was held and by which SID, LatchProfX additionally shows you why the latch was taken.

From the above example you see that there were various reasons why the shared pool latch was taken, but the code location which caused the SP latch to be held the longest was kghalo, which stands for something like Kernel Generic Heap Allocate. This is essentially the function used for searching and allocating free space in the shared pool. That kind of makes sense: operations like freeing a chunk or updating global heap statistics are simple, as Oracle already knows the memory location it needs to update, but searching for a free chunk to allocate can require scanning through a lot of chunks if there are no suitable ones in the freelists and an LRU scan is needed.

Next, there’s the kghupr1 location, which caused the shared pool latch to be held for 1.62% of the total sampling time. This location, in the function kghupr (un-pin recreatable chunk), needs to take the shared pool latch to put the unpinned chunk back onto the heap LRU list. An Oracle kernel coder has also specified a label (action description), “Chunk Header”, which is shown in the KSLLWLBL column. It’s there to state that this kernel location takes the shared pool latch because it needs to update a chunk header: putting a chunk back onto the LRU list means updating the chunk header (and its neighbors’ headers).

There are also the kgh_heap_sizes and kghfre locations, which update the global heap statistics (the stuff you see in v$sgastat, for example) and free a chunk in the heap, respectively.
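By the way, the heap statistics maintained under these locations are externalized in V$SGASTAT, so if you want to see what the shared pool space is currently used for, a plain V$ query does the job (nothing LatchProfX-specific here, just a regular dictionary view):

SQL> select name, bytes
  2  from   v$sgastat
  3  where  pool = 'shared pool'
  4  order by bytes desc;

The “free memory” row in that output is also a useful sanity check when you suspect the long kghalo latch holds are caused by LRU scans due to an exhausted shared pool.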

Once you have established which latch is the troublemaker (under too much contention) and you know the “why” reason for it, you probably also want to know which sessions are contributing to the problem. LatchProfX can show that too:

SQL> @latchprofx sid,name,ksllwnam,ksllwlbl % "shared pool" 100000

-- LatchProfX 1.06 by Tanel Poder (  )

       SID NAME                                KSLLWNAM                                 KSLLWLBL                   Held       Gets  Held %     Held ms Avg hold ms
---------- ----------------------------------- ---------------------------------------- -------------------- ---------- ---------- ------- ----------- -----------
       113 shared pool                         kghalo                                                              5038       2787    5.04      92.699        .033
       112 shared pool                         kghalo                                                              4723       2820    4.72      86.903        .031
       113 shared pool                         kghupr1                                  Chunk Header                875        369     .88      16.100        .044
       113 shared pool simulator               kglsim_unpin_simhp                                                   728         48     .73      13.395        .279
       112 shared pool                         kghupr1                                  Chunk Header                623        438     .62      11.463        .026
       112 shared pool simulator               kglsim_unpin_simhp                                                   273         37     .27       5.023        .136
       112 shared pool                         kghalp                                                               232        232     .23       4.269        .018
       113 shared pool simulator               kglsim_upd_newhp                                                     230        230     .23       4.232        .018
       113 shared pool                         kghalp                                                               224        223     .22       4.122        .018
       112 shared pool simulator               kglsim_upd_newhp                                                     194        194     .19       3.570        .018
       112 shared pool                         kgh_heap_sizes                                                        49         49     .05        .902        .018
       113 shared pool                         kgh_heap_sizes                                                        49         49     .05        .902        .018
       112 shared pool                         kghfre                                   Chunk Header                 37         37     .04        .681        .018
       113 shared pool simulator               kglsim_scan_lru: scan                                                 32         32     .03        .589        .018
       112 shared pool simulator               kglsim_scan_lru: scan                                                 28         27     .03        .515        .019
       112 shared pool simulator               kglsim_chg_simhp_free                                                 27         27     .03        .497        .018
       113 shared pool simulator               kglsim_chg_simhp_free                                                 26         25     .03        .478        .019
       113 shared pool                         kghfre                                   Chunk Header                 24         23     .02        .442        .019
       112 shared pool simulator               kglsim_unpin_simhp: fast path                                         23         23     .02        .423        .018
       113 shared pool simulator               kglsim_chg_simhp_noob                                                 22         22     .02        .405        .018
       112 shared pool simulator               kglsim_chg_simhp_noob                                                 18         18     .02        .331        .018
       113 shared pool simulator               kglsim_unpin_simhp: fast path                                         18         18     .02        .331        .018
       112 shared pool sim alloc               kglsim_chk_objlist: alloc                                              1          1     .00        .018        .018
       122 shared pool                         kghfre                                   Chunk Header                  1          1     .00        .018        .018

24 rows selected.

Elapsed: 00:00:01.93
SQL>

From the above you see there are 2 main sessions contributing to the shared pool latch contention, and both are doing lots of kghalo calls, thus allocating memory. If V$SESSTAT for those sessions shows a high “parse count (hard)” statistic, it becomes obvious that the heavy shared pool activity is due to hard parsing (which is kind of common knowledge, as excessive hard parsing is very often the reason for shared pool latch contention). However, check out another sample with the same parameters:
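Checking that hard parse statistic for the suspect sessions is a one-liner against V$SESSTAT and V$STATNAME (the SIDs below are just the ones from my example output above; substitute your own):

SQL> select ss.sid, sn.name, ss.value
  2  from   v$sesstat ss, v$statname sn
  3  where  ss.statistic# = sn.statistic#
  4  and    sn.name = 'parse count (hard)'
  5  and    ss.sid in (112, 113);

Remember that V$SESSTAT values are cumulative since session start, so sample it twice (or use Snapper, which does the delta computation for you) to see the current rate.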

SQL> @latchprofx sid,name,ksllwnam,ksllwlbl % "shared pool" 100000

-- LatchProfX 1.06 by Tanel Poder (  )

       SID NAME                                KSLLWNAM                                 KSLLWLBL                   Held       Gets  Held %     Held ms Avg hold ms
---------- ----------------------------------- ---------------------------------------- -------------------- ---------- ---------- ------- ----------- -----------
         0 shared pool                         kghalo                                                             10714          5   10.71     212.137      42.427
       122 shared pool                         kghalo                                                              7685         42    7.69     152.163       3.623
       112 shared pool                         kghalo                                                              3874       2305    3.87      76.705        .033
       113 shared pool                         kghalo                                                              3616       2318    3.62      71.597        .031
       112 shared pool                         kghupr1                                  Chunk Header                659        327     .66      13.048        .040
       113 shared pool                         kghupr1                                  Chunk Header                597        353     .60      11.821        .033
       112 shared pool simulator               kglsim_unpin_simhp                                                   416         29     .42       8.237        .284
[...snip...]
       122 shared pool simulator               kglsim_upd_newhp                                                       2          2     .00        .040        .020
       117 shared pool                         kghupr1                                  Chunk Header                  2          2     .00        .040        .020
       122 shared pool                         kghalp                                                                 2          2     .00        .040        .020
       117 shared pool                         kghalp                                                                 1          1     .00        .020        .020

38 rows selected.

Elapsed: 00:00:02.14

SQL>

What’s this? Why is there a session with SID=0 which has called the kghalo function and held the latch much longer on average than any other session? LatchProfX detected only 5 latch gets by that session from the kghalo location, yet these gets have taken the most time in total. The average hold time for that latch is estimated at over 42 milliseconds, compared to most others which are in the tens of microseconds (even though the 3.6 millisecond average hold time seen for SID 122 is quite long for a latch too).

One of the reasons for such long hold times could be CPU preemption: a latch holder process was switched off the CPU, and it took a while before it got back on to complete its work and release the latch again. However, on my 4 CPU test machine there was plenty of CPU time available, so such preemption should not have been the problem. The SID 0 gives a clue. When a new session is started in Oracle, it is apparently reported as SID 0 in some places during the session initialization sequence, which includes the memory allocation for things like the v$sesstat and v$session_event arrays and, in some cases, the session parameter values and set events. Considering that the total size of these structures is quite large in the normal shared pool usage context (tens of kB), then if the freelists are empty and an LRU scan must happen, the allocation inevitably takes longer, and the shared pool latch is held all that time. (phew, I must write shorter statements)

Anyway, I recommend playing around with this script in a test environment: run different workloads like hard parses, soft parses, physical IOs and buffer gets, and you should get quite a good overview of some Oracle internal workings that way. This hopefully also answers the question of how I know some of the Oracle function names and their purposes: Oracle itself exposes quite a lot of this information in the X$ tables I’m using in my script. You just need to experiment, observe and try to figure out what the real names behind function names like kghalo are. Fun :)

A few more notes:

If you want to drill down to the latch child level, then use LADDR as another field (among SID, NAME, etc.) in the first parameter to LatchProfX (I’ve described this in AOT guide part 7).
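For example, a child-level drill-down into cache buffers chains latches could look like this (a hypothetical parameter combination, following the same syntax as in the runs above; LADDR then shows which child latch address each sample belongs to):

SQL> @latchprofx sid,name,laddr,ksllwnam % "cache buffers chains" 100000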

Do not use LatchProf, LatchProfX and WaitProf when your server’s CPUs are 100% utilized, for two reasons:

  1. All those scripts never sleep during their sampling but spin on CPU – thus they would make the CPU starvation even worse
  2. More importantly, if all your CPUs are running at 100% utilization, the latch contention you see is probably just a symptom of the CPU starvation itself! The same goes for wait events: when your CPU runqueues are long, some of the CPU runqueue waiting time can be attributed to wait events such as IO instead (for example, when an IO completes but the process can’t get onto the CPU to “end” the IO wait)

I have updated both LatchProf and LatchProfX to detect whether they’re running on Oracle 9i or 10g+ and to dynamically run different code based on the version (on 9i, there is no GETS column in V$LATCHHOLDER).