Advanced Oracle Troubleshooting Guide, Part 7: Sampling latch holder statistics using LatchProf

I have been too busy since getting back from vacation, thus no posts for a while. But I hope the waiting was worthwhile as I present you LatchProf, a tool for digging in to latch contention problems – using plain SQL and sqlplus!

As, I’m still busy, I make it short.

LatchProf is a script similar to WaitProf, only it samples latch holder statistics from V$LATCHHOLDER. As V$LATCHHOLDER contains a SID column (with session ID of a latch holder) it becomes possible to find who is hitting a latch the most (a way to prove that crappy monitoring tools which constantly scan through V$SQL DO cause library cache latch contention themselves).

Some day I will write longer about this (and check the script itself), but here’s the syntax:

@latchprof <what> <sid|%> <latch|%|childaddr> <#samples>

Here are the parameter definitions:

1) Columns to select actually map to V$LATCHHOLDER columns. So you can use SID,NAME,LADDR there for example.
2) SID is either a sid as number or % to see latches held by all SIDs
3) latch_name can be either % for all latches, a LIKE string, like %library% to see all latches with library in their name or the actual latch address! This allows us to monitor only specific child latches we want.
4) Number of samples to take. I normally use 100 000 samples as starting point.

As latchprof (and waitprof for that matter) never sleep, they burn as much CPU resource as they can. This is good for sampling accuracy (well, at least on multiprocessor systems with enough spare CPU capacity), but will consume one of your CPUs. So, you don’t want to run this for too long at a time ( you could implement a short wait in there with a nice trick by Jonathan Lewis ).

100k samples is usually done in around a second on my test kit (giving 100+ kHz sampling rate :) This will vary by your hardware and Oracle version and what’s your processes parameter (as V$LATCHHOLDER iterates through process state objects – the more processes you’ve got, the more time it takes).

Anyway, let say we have a performance issue and using wait interface we have determined that for a particular session, waits for cache buffers chains latches are unreasonably high.

Now, one way forward would be to identify the database blocks involved in this problem, by looking what blocks are protected by the latch (using X$BH.HLADDR = V$LATCH_CHILDREN.ADDR mapping) and trying to figure out the hottest block by X$BH.TCH touchcount. But where would that lead us? We would find out into which segment that block belongs, but it doesn’t really help us to understand who’s the troublemaker causing this issue.

Therefore we use another approach – sampling V$LATCHHOLDER to see which sessions hold the latches the most:

Just for demo purposes I start with listing general latchholder stats (not broken down by holding SID):

SQL> @latchprof name % % 100000

NAME                                 Held       Gets  Held %     Held ms Avg hold ms
------------------------------ ---------- ---------- ------- ----------- -----------
cache buffers chains                21248      17418   21.25     267.725        .015
simulator lru latch                  7192       7169    7.19      90.619        .013
simulator hash latch                   66         66     .07        .832        .013
enqueue hash chains                     4          4     .00        .050        .013
ges resource hash list                  3          3     .00        .038        .013
process allocation                      2          2     .00        .025        .013
library cache                           2          2     .00        .025        .013
name-service namespace bucket           2          2     .00        .025        .013
name-service memory objects             1          1     .00        .013        .013
ges caches resource lists               1          1     .00        .013        .013
library cache pin                       1          1     .00        .013        .013

11 rows selected.

Yes it looks that cache buffers chains latches are held the most. The Held column here means that during how many samples a latch was held. The Gets column shows roughly how many times that latch was gotten during sampling time.
As we took 100000 samples and during 21248 of them a cache buffers chains latch was held, we can conclude that this latch was busy (Held) 21.25% of the time. And as the sampling lasted about 1.3 seconds that time, we can extrapolate that of the 1.3 seconds, some cache buffers chains latches were in Held state for rougly 267 milliseconds. And as we know the number of gets, then we can compute the average latch hold time in milliseconds. This could help us detect very long latch holds like what can happen when someone is scanning through X$KSMSP or the server is CPU starved. The current estimate of 15 microseconds per latch hold is not that bad (note that with very few samples like 10000 or less, the avg hold time estimates become unreliable).

Anyway, let’s break down the individual latch holding sessions now:

SQL> @latchprof sid,name % % 100000

       SID NAME                                 Held       Gets  Held %     Held ms Avg hold ms
---------- ------------------------------ ---------- ---------- ------- ----------- -----------
        81 cache buffers chains                14916      14915   14.92     187.942        .013
        81 simulator lru latch                  6797       6745    6.80      85.642        .013
        85 cache buffers chains                 6195       2284    6.20      78.057        .034
        85 simulator lru latch                   524        506     .52       6.602        .013
        81 simulator hash latch                  121         85     .12       1.525        .018
       112 name-service namespace bucket           2          2     .00        .025        .013
       100 enqueue hash chains                     2          2     .00        .025        .013
        98 library cache                           2          2     .00        .025        .013
       112 process allocation                      2          2     .00        .025        .013
       112 ges caches resource lists               1          1     .00        .013        .013
       112 enqueues                                1          1     .00        .013        .013
       104 messages                                1          1     .00        .013        .013
       100 cache buffers chains                    1          1     .00        .013        .013
       100 active service list                     1          1     .00        .013        .013
       100 enqueues                                1          1     .00        .013        .013
        85 simulator hash latch                    1          1     .00        .013        .013
       100 library cache                           1          1     .00        .013        .013

17 rows selected.

We see that sessions 81 and 85 are hitting CBC latches the most. Let say (for demo purposes again) that the SID 85 was the one which experienced bad performance and led us to investigate this issue.

As there are tens of thousands (or even more) of CBC latches in a decent Oracle database, then of course looking only at aggregated latch name level stats is not enough. Do the sessions compete for a single child latch? Or do they just use lots of different CBC latches over time and not collide at all? This is the question where LatchProf can help.

Let’s include latch address (child latch address) into picture and query only “cache” related latches. This returns lots of rows, so I normally press CTRL+C after first pageful of results as they are most important (the results are ordered by most busy ones first):

SQL> @latchprof sid,name,laddr % cache 100000

       SID NAME                           LADDR          Held       Gets  Held %     Held ms Avg hold ms
---------- ------------------------------ -------- ---------- ---------- ------- ----------- -----------
        81 cache buffers chains           41AACEEC      15061      15017   15.06     186.756        .012
        85 cache buffers chains           41B2C37C         34         34     .03        .422        .012
        85 cache buffers chains           41B85D64         31         31     .03        .384        .012
        85 cache buffers chains           41BCF7FC         30         30     .03        .372        .012
        85 cache buffers chains           41B415EC         28         28     .03        .347        .012
        85 cache buffers chains           41BCA658         25         25     .03        .310        .012
        85 cache buffers chains           41B4E738         25         25     .03        .310        .012
        85 cache buffers chains           41AE4694         25         25     .03        .310        .012
        85 cache buffers chains           41AB02E0         25          9     .03        .310        .034
        85 cache buffers chains           427FA1F8         24         24     .02        .298        .012
        85 cache buffers chains           427F6E04         22         22     .02        .273        .012
        85 cache buffers chains           41B78D94         22         22     .02        .273        .012
        85 cache buffers chains           41BA2220         21         21     .02        .260        .012
        85 cache buffers chains           41BEDA68         21         21     .02        .260        .012
        85 cache buffers chains           41BC8D1C         21         21     .02        .260        .012
        85 cache buffers chains           41B486BC         20         20     .02        .248        .012
        85 cache buffers chains           42B99698         19         19     .02        .236        .012
        85 cache buffers chains           41AE6EA8         19         14     .02        .236        .017
        85 cache buffers chains           41B1F230         19         19     .02        .236        .012
        85 cache buffers chains           41B09E44         19         17     .02        .236        .014
        85 cache buffers chains           41AE7A88         19         19     .02        .236        .012
        85 cache buffers chains           41B69E98         18         18     .02        .223        .012
[...snip...]

And voila, we see SID 81 is hitting one single child latch in maniac fashion while SID 85 is behaving more normally, just hitting lots of different CBC latches (you see the number of gets of a single child during 100k sampling doesn’t exceed 34). Interesting.
So now that we know the child latch address (LADDR), we can refine our query to only monitor that child latch to see if any other sessions could be affected:

SQL> @latchprof sid,name,laddr % 41AACEEC 100000

       SID NAME                           LADDR          Held       Gets  Held %     Held ms Avg hold ms
---------- ------------------------------ -------- ---------- ---------- ------- ----------- -----------
        81 cache buffers chains           41AACEEC      14058      14047   14.06     177.131        .013
        85 cache buffers chains           41AACEEC         18         18     .02        .227        .013

As of now it looks like noone else except the maniac SID 81 and victim SID 85 contend for the latch (otherwise one would expect to see at least few successful holds against that latch).

So, as SID 81 is so evidently the troublemaker here, let’s check what it’s doing:

SQL> select sql_text from v$sql where hash_value = (select sql_hash_value from v$session where sid = 81);

SQL_TEXT
---------------------------------------------------------------------------------------------------------
select count(*) X from kill_cpu connect by n > prior n start with n = 1

1 row selected.

This is Jonathan Lewis’es kill_cpu script in action, which hammered the same datablock in a tight loop using connect by ( you need to set _old_connect_by_enabled=true on 9i+ for this test to work :)

I will blog about some more use cases and an extended version of LatchProf script in (hopefully) near future.

Note once more that this script only detects latch holders. A latch held the most doesn’t necessarily mean it’s the latch waited for the most. You still need to start your diagnosis from wait interface, using Snapper or WaitProf for example. Only if these tools show significant latch waits, you should drill down into latches further. And as Anjo Kolk recently (very well) said, latch contention is a symptom, not the problem. The root cause is somewhere else (in the application likely), once you fix that, the latch contention will go away itself.

Update: I have published a more detailed latch profiling tool (latchprofX) which uses X$ tables for getting the extended data and can give even more detailed info about the latch holders. Read this article plus links in it:

http://blog.tanelpoder.com/2009/08/27/latch-cache-buffers-chains-latch-contention-a-better-way-for-finding-the-hot-block/

Note that this year’s only Advanced Oracle Troubleshooting class takes place in the end of April/May 2014, so sign up now if you plan to attend this year!

This entry was posted in Cool stuff, Oracle and tagged , , . Bookmark the permalink.

15 Responses to Advanced Oracle Troubleshooting Guide, Part 7: Sampling latch holder statistics using LatchProf

  1. Pingback: Advanced Oracle Troubleshooting Guide, Part 8: Even more detailed latch troubleshooting using LatchProfX « Tanel Poder’s blog: Core IT for geeks and pros

  2. Jeet says:

    Hi Tanel,

    Your blog is just great. It has so much of unexplored content of oracle. I appreciate your idea of sharing this info with us.

    Sorry to ask a basic question about WaitProf and LatchProf script.

    Can you please explain me why do you need the sample. What is the logic of the sample which you have used in both these scripts.

    Thanks,
    Jeet

  3. Ricardo says:

    Hi Tanel,

    wonderful explanation. I have a doubt about you script latchprof. What´s the importance of the no_merge hint?. I have tested the execution with and without the no_merge hint and I am not able to get different results. Is this a truly mergeable view?

    Thank you Tanel!

  4. tanelp says:

    Hi Ricardo,

    I didn’t really think about it. I put the hint in “just in case” to get this potential issue out of my mind. If this view is not mergeable now, who knows maybe in 12g it is :)

    Tanel.

  5. Ricardo says:

    You are right ;)

  6. a reader says:

    Hi Tanel,

    a simple comment. rather than “press CTRL+C after first pageful of results”,
    I preferred to get the top 10 only as where rownum < 11

    thks&rgds.

  7. Pingback: latch: cache buffers chains latch contention – a better way for finding the hot block | Tanel Poder's blog: Core IT for Geeks and Pros

  8. Pingback: New versions of LatchProf and LatchProfX for latch contention troubleshooting and tuning | Tanel Poder's blog: Core IT for Geeks and Pros

  9. aliyar says:

    hi

    i used your latchprofile script during the performance problem in our production datawarehouse database .. tried to find any culprit session casuing more latch activity .. however script shows everytime different sessions …

    so that shows no session hold latch long time and no problem with latch part .. is that correct ?

    please correct me if i am wrong

    thanks for your script .. really helpfull ..

    Thanks
    Aliyar

  10. Tanel Poder says:

    @aliyar

    if you don’t see a single session causing the trouble (which is usually the case) then you should report/group by something else, like SQL_ID… Did you start for the wait interface to determine which exact child latch (P1 in v$session) had the contention etc?

  11. aliyar says:

    Thx Tanel ….

    will check

  12. Shamim says:

    Hello,

    Can you please send me your scripts?

    Thanks and regards
    -Shamim

  13. anil akduygu says:

    I can not reach latchprof web site. Please make it work.

  14. Pingback: Log Buffer #105, A Carnival of the Vanities for DBAs

  15. gerardo muñoz says:

    It is an excellent tool. It helped to me in order to get which one was the problem which a latch situation.
    Thanks Tanel

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>