In my last AOT post I published my LatchProf script which is able to sample detailed latchholder data from V$LATCHHOLDER.
Latchprof allows you to drill down into your latching problems at session level (which V$LATCH, V$LATCH_PARENT and V$LATCH_CHILDREN can’t do). It allows you to get valuable details about individual sessions who are holding a latch the most, therefore likely contributing to the latch contention problem the most.
However after you have discovered the troublemaking session, then what next? One way forward is looking into V$SESSTAT counters using Snapper tool. Depending on what latch is the problematic one, you would look for different stats like various buffer get stats for cache buffers chains latches and parsing/executing stats when looking into library cache latches. However if those stats look “normal”, is there any other way do drill down further?
Yeah, there is and lets look into it!
A side comment: Nowadays I try to avoid using X$ tables in favor of V$ views (because in my experience in many client environments you don’t always have easy SYS access or X$ table “proxy views” created). Also, if you can get your job done using simple V$ views, then why bother with more complex X$ tables? (yeah, it’s much cooler, I know ;)
Anyway, here is an exception: LatchProfX script, which means Latch holder Profiler eXtended, because it provides some very interesting extended information about latch holders. It shows why the latch currently held was taken after all. It uses X$KSUPRLAT instead of V$LATCHHOLDER as the latter doesn’t externalize all required information. I’m talking about latch Where statistics.
By “Where statistics” I mean that Oracle kernel’s latching module coders have instrumented their code well enough – whenever they get a latch, they mark down the code location ID of the latch getter within the latch structure itself.
Let’s not keep you waiting anymore and show some examples. The output is quite wide so you may need to size your browser window accordingly.
Normally you would run LatchProfX with specific enough parameters to monitor only one or few latches which you know are the troublemakers. However the first example run I do across all instance latches (the third parameter is %) to illustrate the usefulness of LatchProfX result data better.
LatchProfX syntax is similar to the original LatchProf script, with the addition of ksllwnam and ksllwlbl field names you can sample. These are exactly the fields (from X$KSLLW) which give us the descriptions of “where in kernel the latch was aquired from” opcodes we get from X$KSUPRLAT (see the script itself for syntax and internals).
So (finally), lets run it:
SQL> @latchprofx name,ksllwnam,ksllwlbl % % 100000 -- LatchProfX 1.06 by Tanel Poder ( http://www.tanelpoder.com ) NAME KSLLWNAM KSLLWLBL Held Gets Held % Held ms Avg hold ms ----------------------------------- ---------------------------------------- -------------------- ---------- ---------- ------- ----------- ----------- shared pool kghalo 9705 5987 9.71 185.366 .031 ges resource hash list kjuscl: lock close request 2429 2350 2.43 46.394 .020 kks stats kksAllocChildStat 2001 1460 2.00 38.219 .026 ges resource hash list kjrmas1: lookup master node 1780 1724 1.78 33.998 .020 shared pool kghupr1 Chunk Header 1618 891 1.62 30.904 .035 row cache objects kqrpre: find obj 988 986 .99 18.871 .019 shared pool simulator kglsim_unpin_simhp 804 97 .80 15.356 .158 ges process parent latch kjlalc: lock allocation from lmon 779 766 .78 14.879 .019 active service list kswsgetso: get service object 663 658 .66 12.663 .019 shared pool kghalp 635 628 .64 12.129 .019 ges process parent latch kjatioc 518 518 .52 9.894 .019 shared pool simulator kglsim_upd_newhp 402 401 .40 7.678 .019 ges resource hash list kjrref: find matched resource 374 371 .37 7.143 .019 row cache objects kqreqd 360 360 .36 6.876 .019 ges process parent latch kjlrem: detach lock from group 235 229 .24 4.489 .020 enqueues ksqdel 162 162 .16 3.094 .019 enqueues ksqgel: create enqueue parent obj 148 148 .15 2.827 .019 row cache objects kqreqd: reget 108 105 .11 2.063 .020 enqueue hash chains ksqrcl resource addr 101 101 .10 1.929 .019 shared pool kgh_heap_sizes 99 99 .10 1.891 .019 enqueue hash chains ksqgtl3 acquiring session 93 93 .09 1.776 .019 ges enqueue table freelist kjlalc: lock allocation 85 83 .09 1.624 .020 shared pool kghfre Chunk Header 74 65 .07 1.413 .022 shared pool simulator kglsim_scan_lru: scan 69 69 .07 1.318 .019 ges process parent latch kjucll: closing lock 68 61 .07 1.299 .021 shared pool simulator kglsim_chg_simhp_free 48 47 .05 .917 .020 ges domain table kjdmlad: add a lock to an xid-hashed loc 42 41 .04 .802 .020 shared pool simulator kglsim_chg_simhp_noob 41 41 .04 .783 .019 shared pool simulator kglsim_unpin_simhp: fast path 39 39 .04 .745 .019 ksuosstats global area ksugetosstat 29 1 .03 .554 .554 ges group table kjgdgll: move a lock from group lock lis 13 6 .01 .248 .041 kokc descriptor allocation latch kokcdlt: regular free latch 11 4 .01 .210 .053 ges enqueue table freelist kjlfr: remove lock from parent object 11 11 .01 .210 .019 kokc descriptor allocation latch kokcdlt: allocation latch 5 4 .01 .096 .024 cache buffers chains kcbgtcr: fast path 3 3 .00 .057 .019 ges caches resource lists kjrchc: cached obj cleanup 2 2 .00 .038 .019 process allocation ksufap: active procs 2 2 .00 .038 .019 ges group table kjgagll: move a lock from process lock l 2 2 .00 .038 .019 ges domain table kjdmlrm: remove a lock from its xid-hash 1 1 .00 .019 .019 ges caches resource lists kjchc: clean resource cache 1 1 .00 .019 .019 ges resource hash list kjruch: move resource from cache to free 1 1 .00 .019 .019 name-service memory objects kjxgsnep: get a value-block object 1 1 .00 .019 .019 active service list kswsite: service iterator 1 1 .00 .019 .019 43 rows selected. Elapsed: 00:00:02.06
What do we see? While LatchProf would just have provided latch holder info about which (child) latch was held and by which SID, then LatchProfX will show you additional why this latch was gotten.
From above example you see that there were various reasons why the shared pool latch was gotten, but the reason which caused the SP latch to be held the most time, was a code location kghalo, which means something like Kernel Generic Heap Allocate. Which essentially is the function used fol searching and allocating free space from shared pool. This kind of makes sense as stuff like freeing a chunk and updating heap global stats are simple operations as Oracle already knows the memory location it needs to update, but searching for free chunks for allocation can require scanning through a lot of chunks if there are no suitable ones in freelists and a LRU scan is needed.
Next, there’s a kghupr1 location which has caused the shared pool latches held for 1.62% of total sampling time. This location, in function kghupr (un-pin recreatable) needs to take shared pool latch for putting the unpinned chunk back to heap LRU list. An Oracle kernel coder has also specified a label (action description) as “Chunk Header” which is seen from KSLLWLBL column. Its in place to state that this kernel location gets shared pool latch as it needs to update a chunk header – putting a chunk back to LRU list means updating the chunk header (and its neighbor(s) headers).
There are also kgh_heap_sizes and kghfre locations, which update global heap statistics (the stuff you see from v$sgastat for example) and free a chunk in heap respectively.
Wherever you have estabilished which latch is the troublemaker (under too much contention) and you know the “why” reason for it, you probably also want to know the sessions contributing to the problem. And LatchProfX can show that:
SQL> @latchprofx sid,name,ksllwnam,ksllwlbl % "shared pool" 100000
-- LatchProfX 1.06 by Tanel Poder ( http://www.tanelpoder.com )
SID NAME KSLLWNAM KSLLWLBL Held Gets Held % Held ms Avg hold ms
---------- ----------------------------------- ---------------------------------------- -------------------- ---------- ---------- ------- ----------- -----------
113 shared pool kghalo 5038 2787 5.04 92.699 .033
112 shared pool kghalo 4723 2820 4.72 86.903 .031
113 shared pool kghupr1 Chunk Header 875 369 .88 16.100 .044
113 shared pool simulator kglsim_unpin_simhp 728 48 .73 13.395 .279
112 shared pool kghupr1 Chunk Header 623 438 .62 11.463 .026
112 shared pool simulator kglsim_unpin_simhp 273 37 .27 5.023 .136
112 shared pool kghalp 232 232 .23 4.269 .018
113 shared pool simulator kglsim_upd_newhp 230 230 .23 4.232 .018
113 shared pool kghalp 224 223 .22 4.122 .018
112 shared pool simulator kglsim_upd_newhp 194 194 .19 3.570 .018
112 shared pool kgh_heap_sizes 49 49 .05 .902 .018
113 shared pool kgh_heap_sizes 49 49 .05 .902 .018
112 shared pool kghfre Chunk Header 37 37 .04 .681 .018
113 shared pool simulator kglsim_scan_lru: scan 32 32 .03 .589 .018
112 shared pool simulator kglsim_scan_lru: scan 28 27 .03 .515 .019
112 shared pool simulator kglsim_chg_simhp_free 27 27 .03 .497 .018
113 shared pool simulator kglsim_chg_simhp_free 26 25 .03 .478 .019
113 shared pool kghfre Chunk Header 24 23 .02 .442 .019
112 shared pool simulator kglsim_unpin_simhp: fast path 23 23 .02 .423 .018
113 shared pool simulator kglsim_chg_simhp_noob 22 22 .02 .405 .018
112 shared pool simulator kglsim_chg_simhp_noob 18 18 .02 .331 .018
113 shared pool simulator kglsim_unpin_simhp: fast path 18 18 .02 .331 .018
112 shared pool sim alloc kglsim_chk_objlist: alloc 1 1 .00 .018 .018
122 shared pool kghfre Chunk Header 1 1 .00 .018 .018
24 rows selected.
Elapsed: 00:00:01.93
SQL>
From above you see are 2 main sessions contributing to shared pool latch contention and both are doing lots of kghalo calls, thus allocating memory. When V$SESSTAT for those sessions shows high “parse count (hard)” statistic it would be obvious that the heavy shared pool activity is due hard parsing (which is kind of common knowledge as excessive hard parsing very often the reason for shared pool latch contention). However check out another sample with same parameters:
SQL> @latchprofx sid,name,ksllwnam,ksllwlbl % "shared pool" 100000
-- LatchProfX 1.06 by Tanel Poder ( http://www.tanelpoder.com )
SID NAME KSLLWNAM KSLLWLBL Held Gets Held % Held ms Avg hold ms
---------- ----------------------------------- ---------------------------------------- -------------------- ---------- ---------- ------- ----------- -----------
0 shared pool kghalo 10714 5 10.71 212.137 42.427
122 shared pool kghalo 7685 42 7.69 152.163 3.623
112 shared pool kghalo 3874 2305 3.87 76.705 .033
113 shared pool kghalo 3616 2318 3.62 71.597 .031
112 shared pool kghupr1 Chunk Header 659 327 .66 13.048 .040
113 shared pool kghupr1 Chunk Header 597 353 .60 11.821 .033
112 shared pool simulator kglsim_unpin_simhp 416 29 .42 8.237 .284
[...snip...]
122 shared pool simulator kglsim_upd_newhp 2 2 .00 .040 .020
117 shared pool kghupr1 Chunk Header 2 2 .00 .040 .020
122 shared pool kghalp 2 2 .00 .040 .020
117 shared pool kghalp 1 1 .00 .020 .020
38 rows selected.
Elapsed: 00:00:02.14
SQL>
What’s this? Why is there a session with SID=0 which has called kghalo function and held the latch for much longer on average than any other sessions? LatchProfX has detected only 5 latch gets by that session from kghalo location which still have taken the most time in total. And average hold time for that latch is estimated over 42 milliseconds compared to most others which are in microseconds (even though the 3.6 milliseconds average hold time for a latch is quite long too).
One of the reason for such long hold times could be CPU preemption – a latch holder process was switched off the CPU and it took a while before it got back on it to complete its work and release the latch again. However on my 4 CPU test machine the was plenty of CPU time available so such preemption should have not been the problem. The SID 0 gives a clue. When a new session is started in Oracle, apparently it’s reported as SID 0 during the session initialization sequence in some places, which includes the memory allocation for things like the v$sesstat and v$session_event arrays and also the session parameter values and set events in some cases. Considering that total size of these structures is quite large in normal shared pool usage context (tens of kB), then if freelists are empty and LRU scan must happen, this inevitably takes longer time – and shared pool latch is held all that time. (phew, I must write shorter statements).
Anyway, I recommend playing around with this script in a test environment, run different workloads like hard parses, soft parses, physical IOs and buffer gets and you should get quite a good overview of some Oracle internal workings that way. This hopefully also answers how do I know some of the Oracle function names and what’s their purpose – the answer is that Oracle itself exposes quite a lot of this information in the X$ tables I’m using in my script. You just need to experiment, observe and try to figure out what are the real names behind function names like kghalo etc. Fun :)
Few more notes:
If you want to drill down to latch child level, then use LADDR as another field (among SID,NAME etc) in first parameter to LatchProfX (I’ve described this in AOT guide part 7).
Do not use LatchProf, LatchProfX and WaitProf when your server’s CPUs are 100% utilized, for two reasons:
- All those scripts never sleep during their sampling but spin on CPU – thus they would make the CPU starvation even worse
- More importantly, if all your CPUs are running at 100% utilization, the latch contention you see is probably just a symptom of the CPU starvation itself! The same goes to wait events as well – when your CPU runqueues are long, some of the CPU runqueue waiting time can be attributed to wait events such IO instead (for example when IO completes, but the process can’t get onto CPU to “end” the IO wait
I have updated both LatchProf and LatchProfX do detect whether they’re running on Oracle 9i or 10g+ and dynamically run different code based on version (on 9i, there is no GETS column in V$LATCHHOLDER).




Hello Tanel,
a colleague pointed me to your blog – and it’s an impressing deep look into the oracle kernel; All these things where i used to close my eyes due to it’s complexity.
will read your blog in future;
Thank you and kind regards
Karl
Are these scripts compatible for Oracle RAC??
Yes. Any script which works in single instance will also work in RAC as RAC is just a bunch of instances…
But if you ask whether these scripts gather info from all RAC nodes, then no. It’s not needed either as latches aren’t global, they’re instance-local memory structures only…
Tanel
Even though your scripts are meant for single instance , it can be very well used for RAC instances as well as for parallel queries.
In fact, I regularly use your scripts for troubleshooting. These are what I do.
1. stored all the relevant scripts like snapper , latchprof, waitprof etc.. in a directory of one node [say node 1] of the cluster.
2. created tnsnames entry for all the other nodes from the node [node 1] where all the scripts are stored.
3. Log into the instances in node 1,find the issue in any node by using any of the views like gv$session or gv$session_wait or gv$session_event.
For example , this query gives an overview of any wait event in any of the instance:
select inst_id inst, sid, module, action, osuser, username, logon_time, event, seconds_in_wait sec from gv$session where username is not null and event not like ‘SQL*Net%’ and seconds_in_wait > 10 order by seconds_in_wait;
4. Create a remote sql session to affected instance using the tnsname created in step 2 from the directory of node 1 where all the scripts are stored in step 1.
And use the relevant script to get to the root of the issue.
In my opinion all your scripts are very much relevant for all environments with just a bit tweak here and there.
—Raj
Hi Tanel,
Great script!
I have a question: any idea why I see (consistently) “.607″ for the “Avg hold ms”?
All other values in the output columns appear normal.
Regards – Richard
@Richard
Hi Richard,
Can you paste the output?
I calculate the “Avg hold ms” by checking how fast does the GETS increase during the sampling loop. If each iteration of the V$LATCHHOLDER/X$KSUPRLAT scan takes 607 microseconds, then the max precision of the “Avg hold ms” I can report is 607 microsecond. So in reality the Avg hold ms may be even smaller. The only way to have better precision would be to sample faster, but my script is already sampling at the max speed (the bigger the V$PROCESS array is, the longer it takes to sample).
However, this is not a problem, the main column to look at is how busy a latch is in total (the Held %) and other columns are just extra supporting information.
An example (avg now .715, in most cases):
NAME KSLLWNAM KSLLWLBL Held Gets Held % Held ms Avg hold ms
———————————– —————————————- ——————– ———- ———- ——- ———– ———–
cache buffers chains kcbgtcr: fast path 1697 1697 16.97 1213.355 .715
cache buffers chains kcbgtcr: kslbegin excl 551 551 5.51 393.965 .715
cache buffers chains kcbrls: kslbegin 160 160 1.60 114.400 .715
resmgr group change latch kskrefreshattr 144 144 1.44 102.960 .715
cache buffers chains kcbgtcr: kslbegin shared 86 86 .86 61.490 .715
cache buffers chains kclcls_3 86 86 .86 61.490 .715
simulator lru latch kcbs_simulate: simulate set setid 85 85 .85 60.775 .715
redo copy kcrfw_redo_gen: nowait 71 71 .71 50.765 .715
cache buffers chains kcbgtcr ptr to buffer header 61 61 .61 43.615 .715
library cache kglpnp: child child 38 38 .38 27.170 .715
library cache kglpndl: child: after processing latch 37 37 .37 26.455 .715
library cache kglpndl: child: before processing latch 32 32 .32 22.880 .715
cache buffers lru chain kcbzgws 30 30 .30 21.450 .715
session idle bit ksupuc: set busy session ptr 27 27 .27 19.305 .715
client/application info kskirefrattrmap 26 26 .26 18.590 .715
ges resource hash list kjakcai: search for resp by resname 25 2 .25 17.875 8.938
cache buffers chains kcbgcur: kslbegin 19 19 .19 13.585 .715
ges resource hash list kjucvl: lock convert request 19 19 .19 13.585 .715
cache buffers chains kcbget: pin buffer 14 14 .14 10.010 .715
library cache pin kglpnp: child child 14 14 .14 10.010 .715
ASM map operation hash table kffmTranslate 14 14 .14 10.010 .715
gcs resource hash kjblocalfile_nolock 14 14 .14 10.010 .715
ASM map operation hash table kffmDoDone_1 13 13 .13 9.295 .715
process allocation ksufap: active procs 13 13 .13 9.295 .715
gcs resource hash kclzcl 11 11 .11 7.865 .715
row cache objects kqrpre: find obj 11 11 .11 7.865 .715
cache buffers chains kcbzib: finish free bufs 10 10 .10 7.150 .715
library cache lock kgllkdl: child: cleanup latch 9 9 .09 6.435 .715
session idle bit ksupuc: clear busy session ptr 9 9 .09 6.435 .715
library cache pin kglpndl child 8 8 .08 5.720 .715
cache buffers chains kcbnew: new latch again 8 8 .08 5.720 .715
library cache kglLockCursor 7 7 .07 5.005 .715
gcs resource hash kjbopen:affinity 7 7 .07 5.005 .715
gcs resource hash kjbcropen:affinity 7 7 .07 5.005 .715
library cache lock kgllkal: child: multiinstance latch 7 7 .07 5.005 .715
enqueue hash chains ksqgtl3 acquiring session 7 7 .07 5.005 .715
library cache kglobpn: child: latch 6 6 .06 4.290 .715
library cache kgllkdl: child: no lock handle latch 6 6 .06 4.290 .715
SQL memory manager workarea list la qesmmIRegisterWorkArea 6 6 .06 4.290 .715
client/application info kqp327i 6 6 .06 4.290 .715
shared pool simulator kglsim_unpin_simhp: fast path 6 6 .06 4.290 .715
enqueues ksqgel: create enqueue parent obj 6 6 .06 4.290 .715
ges resource hash list kjrref: find matched resource 5 5 .05 3.575 .715
name-service request queue kjxgnapq: process a request 5 5 .05 3.575 .715
SQL memory manager latch qesmmIDeamonCb 5 3 .05 3.575 1.192
cache buffers chains kcbchg: kslbegin: bufs not pinned 4 4 .04 2.860 .715
KCL gc element parent latch kclnfndnewm 4 4 .04 2.860 .715
cache buffers chains kcbbxsv 4 4 .04 2.860 .715
gcs partitioned table hash 27 4 4 .04 2.860 .715
ges process parent latch kjatioc 4 4 .04 2.860 .715
row cache objects kqreqd 4 4 .04 2.860 .715
ges resource hash list kjrmas1: lookup master node 3 3 .03 2.145 .715
ges process parent latch kjlalc: lock allocation from lmon 3 3 .03 2.145 .715
session allocation ksufap: active sessions 3 3 .03 2.145 .715
@Richard
It’s 0.715 ms as that’s the max precision latchprof could achieve during that run (it depends on how fast one scan through x$ksuprlat/v$latchholder takes).
Note that if some latch is held for a lot of time it’s not necessarily a problem. Only troubleshoot latch contention of anyone is actually waiting for that latch!
Hi Tanel
we have plsql code calling java which is stored in database.
i see the plsql code many times waiting on latches such as library cache pin or ‘latch: row cache objects’
I tried LatchProf but did not get information related to query or database object in contention
Also is there any way to find what java code or java stored procedure session is executing.
When i see ADDM reports or awr reports always java executions shows in top
but can drill down to exact java stored procedure in question
Any help regarding this is appreciated
regards
KG