Advanced Oracle Troubleshooting Guide, Part 8: Even more detailed latch troubleshooting using LatchProfX

In my last AOT post I published my LatchProf script which is able to sample detailed latchholder data from V$LATCHHOLDER.

LatchProf allows you to drill down into your latching problems at session level (which V$LATCH, V$LATCH_PARENT and V$LATCH_CHILDREN can’t do). It gives you valuable details about the individual sessions that hold a latch the most and are therefore likely contributing the most to the latch contention problem.
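
If you just want to see who is holding latches right now, a single manual sample is trivial – LatchProf essentially runs thousands of these in a tight loop and aggregates the results (note that the GETS column only exists in 10g+, as mentioned later in this post):

SQL> select sid, laddr, name, gets from v$latchholder order by name;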

However, after you have discovered the troublemaking session, what next? One way forward is looking into its V$SESSTAT counters using the Snapper tool. Depending on which latch is the problematic one, you would look for different stats – various buffer get stats for cache buffers chains latches, parsing/executing stats when looking into library cache latches. But if those stats look “normal”, is there any other way to drill down further?
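
For example, if Snapper is not at hand, you could manually snapshot a suspect session’s counters and eyeball the deltas between two runs yourself (a minimal sketch – this is exactly what Snapper automates; SID 113 is just a placeholder):

select sn.name, st.value
  from v$sesstat st, v$statname sn
 where sn.statistic# = st.statistic#
   and st.sid = 113
   and sn.name in ('consistent gets', 'db block gets',
                   'parse count (total)', 'execute count')
 order by sn.name;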

Yeah, there is – let’s look into it!

A side comment: Nowadays I try to avoid using X$ tables in favor of V$ views (because in my experience in many client environments you don’t always have easy SYS access or X$ table “proxy views” created). Also, if you can get your job done using simple V$ views, then why bother with more complex X$ tables? (yeah, it’s much cooler, I know ;)

Anyway, here is an exception: the LatchProfX script, which stands for Latch holder Profiler eXtended, because it provides some very interesting extended information about latch holders – it shows why the currently held latch was taken in the first place. It uses X$KSUPRLAT instead of V$LATCHHOLDER, as the latter doesn’t externalize all the required information. I’m talking about latch Where statistics.

By “Where statistics” I mean that Oracle kernel’s latching module coders have instrumented their code well enough – whenever they get a latch, they mark down the code location ID of the latch getter within the latch structure itself.
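
By the way, a cumulative view over similar “where” data is externalized in the documented V$LATCH_MISSES view, which shows per code location how many latch misses and sleeps have occurred since instance startup. So if you only need the aggregate picture (and not the per-session, per-sample detail LatchProfX gives), something like this works too:

select parent_name, location, sleep_count, wtr_slp_count
  from v$latch_misses
 where parent_name = 'shared pool'
 order by sleep_count desc;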

Let’s not keep you waiting any longer – time for some examples. The output is quite wide, so you may need to size your browser window accordingly.

Normally you would run LatchProfX with specific enough parameters to monitor only the one or few latches which you know are the troublemakers. However, I’ll do the first example run across all instance latches (the third parameter is %) to better illustrate the usefulness of the LatchProfX result data.

LatchProfX syntax is similar to the original LatchProf script, with the addition of the ksllwnam and ksllwlbl field names you can sample. These are exactly the fields (from X$KSLLW) which give us the descriptions of the “where in the kernel was the latch acquired from” opcodes we get from X$KSUPRLAT (see the script itself for syntax and internals).
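
So the calling convention looks like this (the parameter names below are my shorthand, not the script’s own – judging from the example runs in this post, % means “all” for the filter parameters):

@latchprofx <fields to group by> <sid filter | %> <latch name filter | %> <# of samples>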

So (finally), let’s run it:

SQL> @latchprofx name,ksllwnam,ksllwlbl % % 100000

-- LatchProfX 1.06 by Tanel Poder ( http://www.tanelpoder.com )

NAME                                KSLLWNAM                                 KSLLWLBL                   Held       Gets  Held %     Held ms Avg hold ms
----------------------------------- ---------------------------------------- -------------------- ---------- ---------- ------- ----------- -----------
shared pool                         kghalo                                                              9705       5987    9.71     185.366        .031
ges resource hash list              kjuscl: lock close request                                          2429       2350    2.43      46.394        .020
kks stats                           kksAllocChildStat                                                   2001       1460    2.00      38.219        .026
ges resource hash list              kjrmas1: lookup master node                                         1780       1724    1.78      33.998        .020
shared pool                         kghupr1                                  Chunk Header               1618        891    1.62      30.904        .035
row cache objects                   kqrpre: find obj                                                     988        986     .99      18.871        .019
shared pool simulator               kglsim_unpin_simhp                                                   804         97     .80      15.356        .158
ges process parent latch            kjlalc: lock allocation from lmon                                    779        766     .78      14.879        .019
active service list                 kswsgetso: get service object                                        663        658     .66      12.663        .019
shared pool                         kghalp                                                               635        628     .64      12.129        .019
ges process parent latch            kjatioc                                                              518        518     .52       9.894        .019
shared pool simulator               kglsim_upd_newhp                                                     402        401     .40       7.678        .019
ges resource hash list              kjrref: find matched resource                                        374        371     .37       7.143        .019
row cache objects                   kqreqd                                                               360        360     .36       6.876        .019
ges process parent latch            kjlrem: detach lock from group                                       235        229     .24       4.489        .020
enqueues                            ksqdel                                                               162        162     .16       3.094        .019
enqueues                            ksqgel: create enqueue                   parent obj                  148        148     .15       2.827        .019
row cache objects                   kqreqd: reget                                                        108        105     .11       2.063        .020
enqueue hash chains                 ksqrcl                                   resource addr               101        101     .10       1.929        .019
shared pool                         kgh_heap_sizes                                                        99         99     .10       1.891        .019
enqueue hash chains                 ksqgtl3                                  acquiring session            93         93     .09       1.776        .019
ges enqueue table freelist          kjlalc: lock allocation                                               85         83     .09       1.624        .020
shared pool                         kghfre                                   Chunk Header                 74         65     .07       1.413        .022
shared pool simulator               kglsim_scan_lru: scan                                                 69         69     .07       1.318        .019
ges process parent latch            kjucll: closing lock                                                  68         61     .07       1.299        .021
shared pool simulator               kglsim_chg_simhp_free                                                 48         47     .05        .917        .020
ges domain table                    kjdmlad: add a lock to an xid-hashed loc                              42         41     .04        .802        .020
shared pool simulator               kglsim_chg_simhp_noob                                                 41         41     .04        .783        .019
shared pool simulator               kglsim_unpin_simhp: fast path                                         39         39     .04        .745        .019
ksuosstats global area              ksugetosstat                                                          29          1     .03        .554        .554
ges group table                     kjgdgll: move a lock from group lock lis                              13          6     .01        .248        .041
kokc descriptor allocation latch    kokcdlt: regular free                    latch                        11          4     .01        .210        .053
ges enqueue table freelist          kjlfr: remove lock from parent object                                 11         11     .01        .210        .019
kokc descriptor allocation latch    kokcdlt: allocation                      latch                         5          4     .01        .096        .024
cache buffers chains                kcbgtcr: fast path                                                     3          3     .00        .057        .019
ges caches resource lists           kjrchc: cached obj cleanup                                             2          2     .00        .038        .019
process allocation                  ksufap: active procs                                                   2          2     .00        .038        .019
ges group table                     kjgagll: move a lock from process lock l                               2          2     .00        .038        .019
ges domain table                    kjdmlrm: remove a lock from its xid-hash                               1          1     .00        .019        .019
ges caches resource lists           kjchc: clean resource cache                                            1          1     .00        .019        .019
ges resource hash list              kjruch: move resource from cache to free                               1          1     .00        .019        .019
name-service memory objects         kjxgsnep: get a value-block object                                     1          1     .00        .019        .019
active service list                 kswsite: service iterator                                              1          1     .00        .019        .019

43 rows selected.

Elapsed: 00:00:02.06

What do we see? While LatchProf would have provided only latch holder info – which (child) latch was held and by which SID – LatchProfX additionally shows you why the latch was taken.

From the above example you see that there were various reasons why the shared pool latch was taken, but the code location that caused the SP latch to be held for the most time was kghalo, which means something like Kernel Generic Heap Allocate – essentially the function used for searching for and allocating free space from the shared pool. This kind of makes sense: operations like freeing a chunk or updating heap global stats are simple, as Oracle already knows the memory location it needs to update, but searching for free chunks to allocate can require scanning through a lot of chunks if there are no suitable ones in the freelists and an LRU scan is needed.

Next, there’s the kghupr1 location, which caused shared pool latches to be held for 1.62% of total sampling time. This location, in function kghupr (un-pin recreatable), needs to take the shared pool latch to put the unpinned chunk back onto the heap LRU list. An Oracle kernel coder has also specified a label (action description), “Chunk Header”, which is seen in the KSLLWLBL column. It’s there to state that this kernel location takes the shared pool latch because it needs to update a chunk header – putting a chunk back onto the LRU list means updating the chunk header (and its neighbors’ headers).

There are also the kgh_heap_sizes and kghfre locations, which update global heap statistics (the stuff you see in v$sgastat, for example) and free a chunk in the heap, respectively.
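
Those global heap statistics are easy to eyeball yourself, by the way – for example, the biggest shared pool allocations (a simple query, nothing LatchProfX-specific):

select name, bytes
  from v$sgastat
 where pool = 'shared pool'
 order by bytes desc;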

Once you have established which latch is the troublemaker (under too much contention) and you know the “why” reason for it, you probably also want to know the sessions contributing to the problem. And LatchProfX can show that:

SQL> @latchprofx sid,name,ksllwnam,ksllwlbl % "shared pool" 100000

-- LatchProfX 1.06 by Tanel Poder ( http://www.tanelpoder.com )

       SID NAME                                KSLLWNAM                                 KSLLWLBL                   Held       Gets  Held %     Held ms Avg hold ms
---------- ----------------------------------- ---------------------------------------- -------------------- ---------- ---------- ------- ----------- -----------
       113 shared pool                         kghalo                                                              5038       2787    5.04      92.699        .033
       112 shared pool                         kghalo                                                              4723       2820    4.72      86.903        .031
       113 shared pool                         kghupr1                                  Chunk Header                875        369     .88      16.100        .044
       113 shared pool simulator               kglsim_unpin_simhp                                                   728         48     .73      13.395        .279
       112 shared pool                         kghupr1                                  Chunk Header                623        438     .62      11.463        .026
       112 shared pool simulator               kglsim_unpin_simhp                                                   273         37     .27       5.023        .136
       112 shared pool                         kghalp                                                               232        232     .23       4.269        .018
       113 shared pool simulator               kglsim_upd_newhp                                                     230        230     .23       4.232        .018
       113 shared pool                         kghalp                                                               224        223     .22       4.122        .018
       112 shared pool simulator               kglsim_upd_newhp                                                     194        194     .19       3.570        .018
       112 shared pool                         kgh_heap_sizes                                                        49         49     .05        .902        .018
       113 shared pool                         kgh_heap_sizes                                                        49         49     .05        .902        .018
       112 shared pool                         kghfre                                   Chunk Header                 37         37     .04        .681        .018
       113 shared pool simulator               kglsim_scan_lru: scan                                                 32         32     .03        .589        .018
       112 shared pool simulator               kglsim_scan_lru: scan                                                 28         27     .03        .515        .019
       112 shared pool simulator               kglsim_chg_simhp_free                                                 27         27     .03        .497        .018
       113 shared pool simulator               kglsim_chg_simhp_free                                                 26         25     .03        .478        .019
       113 shared pool                         kghfre                                   Chunk Header                 24         23     .02        .442        .019
       112 shared pool simulator               kglsim_unpin_simhp: fast path                                         23         23     .02        .423        .018
       113 shared pool simulator               kglsim_chg_simhp_noob                                                 22         22     .02        .405        .018
       112 shared pool simulator               kglsim_chg_simhp_noob                                                 18         18     .02        .331        .018
       113 shared pool simulator               kglsim_unpin_simhp: fast path                                         18         18     .02        .331        .018
       112 shared pool sim alloc               kglsim_chk_objlist: alloc                                              1          1     .00        .018        .018
       122 shared pool                         kghfre                                   Chunk Header                  1          1     .00        .018        .018

24 rows selected.

Elapsed: 00:00:01.93
SQL>

From the above you see there are 2 main sessions contributing to the shared pool latch contention, and both are doing lots of kghalo calls, thus allocating memory. If V$SESSTAT for those sessions showed a high “parse count (hard)” statistic, it would be obvious that the heavy shared pool activity is due to hard parsing (which is kind of common knowledge, as excessive hard parsing is very often the reason for shared pool latch contention).
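
First you would likely check who those sessions actually are (an illustrative query – SIDs 112 and 113 taken from the output above):

select sid, username, program, module
  from v$session
 where sid in (112, 113);

However, check out another sample with the same parameters: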

SQL> @latchprofx sid,name,ksllwnam,ksllwlbl % "shared pool" 100000

-- LatchProfX 1.06 by Tanel Poder ( http://www.tanelpoder.com )

       SID NAME                                KSLLWNAM                                 KSLLWLBL                   Held       Gets  Held %     Held ms Avg hold ms
---------- ----------------------------------- ---------------------------------------- -------------------- ---------- ---------- ------- ----------- -----------
         0 shared pool                         kghalo                                                             10714          5   10.71     212.137      42.427
       122 shared pool                         kghalo                                                              7685         42    7.69     152.163       3.623
       112 shared pool                         kghalo                                                              3874       2305    3.87      76.705        .033
       113 shared pool                         kghalo                                                              3616       2318    3.62      71.597        .031
       112 shared pool                         kghupr1                                  Chunk Header                659        327     .66      13.048        .040
       113 shared pool                         kghupr1                                  Chunk Header                597        353     .60      11.821        .033
       112 shared pool simulator               kglsim_unpin_simhp                                                   416         29     .42       8.237        .284
[...snip...]
       122 shared pool simulator               kglsim_upd_newhp                                                       2          2     .00        .040        .020
       117 shared pool                         kghupr1                                  Chunk Header                  2          2     .00        .040        .020
       122 shared pool                         kghalp                                                                 2          2     .00        .040        .020
       117 shared pool                         kghalp                                                                 1          1     .00        .020        .020

38 rows selected.

Elapsed: 00:00:02.14

SQL>

What’s this? Why is there a session with SID=0 which has called the kghalo function and held the latch for much longer on average than any other session? LatchProfX detected only 5 latch gets by that session from the kghalo location, yet these have taken the most time in total. And the average hold time for that latch is estimated at over 42 milliseconds, compared to most others which are in the tens of microseconds (even though the 3.6 millisecond average hold time seen for SID 122 is quite long for a latch too).

One of the reasons for such long hold times could be CPU preemption – a latch holder process was switched off the CPU and it took a while before it got back on to complete its work and release the latch again. However, on my 4 CPU test machine there was plenty of CPU time available, so such preemption should not have been the problem. The SID 0 gives a clue. When a new session is started in Oracle, it’s apparently reported as SID 0 during the session initialization sequence in some places, which include the memory allocations for things like the V$SESSTAT and V$SESSION_EVENT arrays, and also the session parameter values and set events in some cases. Considering that the total size of these structures is quite large in a normal shared pool usage context (tens of kB), if the freelists are empty and an LRU scan must happen, this inevitably takes a longer time – and the shared pool latch is held all that time. (Phew, I must write shorter sentences.)
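
One illustrative way to check whether such SID 0 holds line up with session creation (my own ad hoc check, not part of LatchProfX) is to look for fresh logons right after a sampling run:

select sid, username, program, logon_time
  from v$session
 where logon_time > sysdate - 1/1440   -- sessions created within the last minute
 order by logon_time desc;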

Anyway, I recommend playing around with this script in a test environment: run different workloads like hard parses, soft parses, physical IOs and buffer gets, and you should get quite a good overview of some Oracle internal workings that way. This hopefully also answers how I know some of the Oracle function names and their purpose – the answer is that Oracle itself exposes quite a lot of this information in the X$ tables I’m using in my script. You just need to experiment, observe and try to figure out the real names behind function names like kghalo etc. Fun :)

A few more notes:

If you want to drill down to latch child level, then use LADDR as another field (among SID, NAME etc.) in the first parameter to LatchProfX (I’ve described this in AOT guide part 7); see the example below.
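
For instance, an invocation following the same parameter pattern as above (illustrative – pick the latch you are actually investigating):

SQL> @latchprofx sid,name,laddr,ksllwnam % "cache buffers chains" 100000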

Do not use LatchProf, LatchProfX or WaitProf when your server’s CPUs are 100% utilized, for two reasons:

  1. None of these scripts sleep during their sampling – they spin on CPU – thus they would make the CPU starvation even worse
  2. More importantly, if all your CPUs are running at 100% utilization, the latch contention you see is probably just a symptom of the CPU starvation itself! The same goes for wait events – when your CPU runqueues are long, some of the CPU runqueue waiting time can be attributed to wait events such as IO instead (for example, when an IO completes but the process can’t get onto a CPU to “end” the IO wait). A quick sanity check is sketched below.
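
Something as simple as this shows whether the host is already CPU-starved before you start hammering it with sampling (V$OSSTAT exists from 10g onwards; the exact stat names vary somewhat by platform and version, so treat this as a sketch):

select stat_name, value
  from v$osstat
 where stat_name in ('NUM_CPUS', 'LOAD', 'BUSY_TIME', 'IDLE_TIME');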

I have updated both LatchProf and LatchProfX to detect whether they’re running on Oracle 9i or 10g+ and to dynamically run different code based on the version (on 9i, there is no GETS column in V$LATCHHOLDER).

Note that this year’s only Advanced Oracle Troubleshooting class takes place at the end of April/May 2014, so sign up now if you plan to attend this year!

Responses to Advanced Oracle Troubleshooting Guide, Part 8: Even more detailed latch troubleshooting using LatchProfX

  1. Hello Tanel,
    a colleague pointed me to your blog – and it’s an impressively deep look into the Oracle kernel; all these things where I used to close my eyes due to their complexity.
    I will read your blog in the future;

    Thank you and kind regards
    Karl

  2. yasser says:

    Are these scripts compatible with Oracle RAC?

  3. Tanel Poder says:

    Yes. Any script which works in single instance will also work in RAC as RAC is just a bunch of instances…

    But if you ask whether these scripts gather info from all RAC nodes, then no. It’s not needed either as latches aren’t global, they’re instance-local memory structures only…

  4. Raj says:

    Tanel

    Even though your scripts are meant for single instance use, they can very well be used for RAC instances as well as for parallel queries.
    In fact, I regularly use your scripts for troubleshooting. This is what I do.

    1. Store all the relevant scripts like snapper, latchprof, waitprof etc. in a directory on one node [say node 1] of the cluster.
    2. Create tnsnames entries for all the other nodes on the node [node 1] where all the scripts are stored.
    3. Log into the instance on node 1 and find the issue on any node by using views like gv$session, gv$session_wait or gv$session_event.
    For example, this query gives an overview of any wait event in any of the instances:
    select inst_id inst, sid, module, action, osuser, username, logon_time, event, seconds_in_wait sec from gv$session where username is not null and event not like 'SQL*Net%' and seconds_in_wait > 10 order by seconds_in_wait;

    4. Create a remote SQL*Plus session to the affected instance using the tnsnames entry created in step 2, from the node 1 directory where all the scripts were stored in step 1.
    Then use the relevant script to get to the root of the issue.

    In my opinion all your scripts are very much relevant for all environments with just a bit of tweaking here and there.

    —Raj

  5. Richard says:

    Hi Tanel,

    Great script!

    I have a question: any idea why I see (consistently) “.607” for the “Avg hold ms”?

    All other values in the output columns appear normal.

    Regards – Richard

  6. Tanel Poder says:

    @Richard

    Hi Richard,

    Can you paste the output?

    I calculate the “Avg hold ms” by checking how fast the GETS count increases during the sampling loop. If each iteration of the V$LATCHHOLDER/X$KSUPRLAT scan takes 607 microseconds, then the max precision of the “Avg hold ms” I can report is 607 microseconds – so in reality the Avg hold ms may be even smaller. The only way to get better precision would be to sample faster, but my script is already sampling at max speed (the bigger the V$PROCESS array is, the longer each sample takes).

    However, this is not a problem – the main column to look at is how busy a latch is in total (the Held %); the other columns are just extra supporting information.

  7. Richard says:

    An example (avg now .715, in most cases):

    NAME KSLLWNAM KSLLWLBL Held Gets Held % Held ms Avg hold ms
    ----------------------------------- ---------------------------------------- -------------------- ---------- ---------- ------- ----------- -----------
    cache buffers chains kcbgtcr: fast path 1697 1697 16.97 1213.355 .715
    cache buffers chains kcbgtcr: kslbegin excl 551 551 5.51 393.965 .715
    cache buffers chains kcbrls: kslbegin 160 160 1.60 114.400 .715
    resmgr group change latch kskrefreshattr 144 144 1.44 102.960 .715
    cache buffers chains kcbgtcr: kslbegin shared 86 86 .86 61.490 .715
    cache buffers chains kclcls_3 86 86 .86 61.490 .715
    simulator lru latch kcbs_simulate: simulate set setid 85 85 .85 60.775 .715
    redo copy kcrfw_redo_gen: nowait 71 71 .71 50.765 .715
    cache buffers chains kcbgtcr ptr to buffer header 61 61 .61 43.615 .715
    library cache kglpnp: child child 38 38 .38 27.170 .715
    library cache kglpndl: child: after processing latch 37 37 .37 26.455 .715
    library cache kglpndl: child: before processing latch 32 32 .32 22.880 .715
    cache buffers lru chain kcbzgws 30 30 .30 21.450 .715
    session idle bit ksupuc: set busy session ptr 27 27 .27 19.305 .715
    client/application info kskirefrattrmap 26 26 .26 18.590 .715
    ges resource hash list kjakcai: search for resp by resname 25 2 .25 17.875 8.938
    cache buffers chains kcbgcur: kslbegin 19 19 .19 13.585 .715
    ges resource hash list kjucvl: lock convert request 19 19 .19 13.585 .715
    cache buffers chains kcbget: pin buffer 14 14 .14 10.010 .715
    library cache pin kglpnp: child child 14 14 .14 10.010 .715
    ASM map operation hash table kffmTranslate 14 14 .14 10.010 .715
    gcs resource hash kjblocalfile_nolock 14 14 .14 10.010 .715
    ASM map operation hash table kffmDoDone_1 13 13 .13 9.295 .715
    process allocation ksufap: active procs 13 13 .13 9.295 .715
    gcs resource hash kclzcl 11 11 .11 7.865 .715
    row cache objects kqrpre: find obj 11 11 .11 7.865 .715
    cache buffers chains kcbzib: finish free bufs 10 10 .10 7.150 .715
    library cache lock kgllkdl: child: cleanup latch 9 9 .09 6.435 .715
    session idle bit ksupuc: clear busy session ptr 9 9 .09 6.435 .715
    library cache pin kglpndl child 8 8 .08 5.720 .715
    cache buffers chains kcbnew: new latch again 8 8 .08 5.720 .715
    library cache kglLockCursor 7 7 .07 5.005 .715
    gcs resource hash kjbopen:affinity 7 7 .07 5.005 .715
    gcs resource hash kjbcropen:affinity 7 7 .07 5.005 .715
    library cache lock kgllkal: child: multiinstance latch 7 7 .07 5.005 .715
    enqueue hash chains ksqgtl3 acquiring session 7 7 .07 5.005 .715
    library cache kglobpn: child: latch 6 6 .06 4.290 .715
    library cache kgllkdl: child: no lock handle latch 6 6 .06 4.290 .715
    SQL memory manager workarea list la qesmmIRegisterWorkArea 6 6 .06 4.290 .715
    client/application info kqp327i 6 6 .06 4.290 .715
    shared pool simulator kglsim_unpin_simhp: fast path 6 6 .06 4.290 .715
    enqueues ksqgel: create enqueue parent obj 6 6 .06 4.290 .715
    ges resource hash list kjrref: find matched resource 5 5 .05 3.575 .715
    name-service request queue kjxgnapq: process a request 5 5 .05 3.575 .715
    SQL memory manager latch qesmmIDeamonCb 5 3 .05 3.575 1.192
    cache buffers chains kcbchg: kslbegin: bufs not pinned 4 4 .04 2.860 .715
    KCL gc element parent latch kclnfndnewm 4 4 .04 2.860 .715
    cache buffers chains kcbbxsv 4 4 .04 2.860 .715
    gcs partitioned table hash 27 4 4 .04 2.860 .715
    ges process parent latch kjatioc 4 4 .04 2.860 .715
    row cache objects kqreqd 4 4 .04 2.860 .715
    ges resource hash list kjrmas1: lookup master node 3 3 .03 2.145 .715
    ges process parent latch kjlalc: lock allocation from lmon 3 3 .03 2.145 .715
    session allocation ksufap: active sessions 3 3 .03 2.145 .715

  8. Tanel Poder says:

    @Richard
    It’s 0.715 ms as that’s the max precision latchprof could achieve during that run (it depends on how long one scan through x$ksuprlat/v$latchholder takes).

    Note that if some latch is held for a lot of time it’s not necessarily a problem. Only troubleshoot latch contention if anyone is actually waiting for that latch!
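
    For example, taking the numbers from your output above (just to show how the columns relate):

    kcbgtcr: fast path:  Held ms     = 1697 samples * 0.715 ms/scan = 1213.355 ms
                         Avg hold ms = 1213.355 ms / 1697 gets      = 0.715 ms  (the precision floor)
    kjakcai:             Held ms     = 25 samples * 0.715 ms/scan   = 17.875 ms
                         Avg hold ms = 17.875 ms / 2 gets           = 8.938 ms  (above the floor, so a real measurement)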

  9. KG says:

    Hi Tanel
    we have PL/SQL code calling Java which is stored in the database.
    I see the PL/SQL code many times waiting on latches such as ‘library cache pin’ or ‘latch: row cache objects’.
    I tried LatchProf but did not get information related to the query or database object in contention.
    Also, is there any way to find what Java code or Java stored procedure a session is executing?
    When I look at ADDM or AWR reports, Java executions always show at the top,
    but I can’t drill down to the exact Java stored procedure in question.

    Any help regarding this is appreciated

    regards
    KG
