Another LatchProfX use case

Riyaj Shamsudeen wrote an excellent article about systematic latch contention troubleshooting.

Especially when the latch contention problem is ongoing, looking at system-wide stats (like the v$latch sleep columns) is not the best idea on busy systems. This may sometimes lead you to fixing the wrong problem.

This is because sometimes the latch contention is not caused by some system-wide inefficiency, but rather by one or a few sessions.

The right approach is to measure the following things (a minimal query sketch follows the list):

  1. Which latch (and which exact child latches) the problem session is waiting for (query v$session_wait, ASH, sql_trace)
  2. Why the problem session is trying to get that latch so often (query v$sesstat counters for the problem session and check the execution plan if only a single statement or a few statements are being executed)
  3. Why this latch is busy: who’s holding it (query v$latchholder, LatchProf)
  4. Why that someone is holding the latch so much (query v$sesstat counters for that session or run LatchProfX)
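
As a quick illustration of steps 1 and 3 (just a minimal sketch; SID 159 is a made-up example session), the first checks can be as simple as this:

-- step 1: what is the problem session waiting for right now? (sample this a few times)
SELECT sid, state, event, p1text, p1, p2text, p2
FROM   v$session_wait
WHERE  sid = 159;    -- made-up problem session SID

-- step 3: who is holding latches right now? (sample repeatedly, latch holds are very short)
SELECT pid, sid, laddr, name
FROM   v$latchholder;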

Riyaj used this approach and successfully found the troublemaker causing heavy library cache latch contention. He used my LatchProfX tool for part of the diagnosis (and I’m very happy to see that my advanced Oracle troubleshooting tools find real-world use in the hands of other troubleshooters/tuners too!).

Here’s an excerpt from LatchProfX output (from a little test case I put together), showing which session is holding which library cache child latch and how much, the hold mode (shared/exclusive), and also the actual reason (function name) why the latch is held:

SQL> @latchprofx sid,name,laddr,ksllwnam,ksllwlbl,hmode 159 % 100000

-- LatchProfX 1.08 by Tanel Poder ( http://www.tanelpoder.com )

 SID NAME           LADDR            KSLLWNAM  KSLLWLBL  HMODE        Held  Gets  Held %  Held ms Avg hold ms
---- -------------- ---------------- --------- --------- ---------- ------ ----- ------- -------- -----------
 159 library cache  00000003A93513E0 kglic     child     exclusive   10044  9314   10.04  301.320        .032
 159 library cache  00000003A9351340 kglic     child     exclusive    9895  8458    9.90  296.850        .035
 159 library cache  00000003A9351660 kglic     child     exclusive    9761  8440    9.76  292.830        .035
 159 library cache  00000003A9351700 kglic     child     exclusive    9737  8924    9.74  292.110        .033
 159 library cache  00000003A9351480 kglic     child     exclusive    8999  7765    9.00  269.970        .035
 159 library cache  00000003A93515C0 kglic     child     exclusive    8553  7832    8.55  256.590        .033
 159 library cache  00000003A9351520 kglic     child     exclusive    6852  6828    6.85  205.560        .030

7 rows selected.

kglic stands for Kernel Generic Library cache Iterate Chain (AFAIK); it’s the function that is executed when you access most X$KGL tables. This is where Riyaj got the evidence that the latch contention came from inefficient scanning of the library cache, caused by a session running queries against V$SQL_PLAN views with a bad plan.
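
For illustration, the kind of statement that drives kglic hard is anything that has to iterate through the whole library cache. A sketch you could try in a test database (not Riyaj's actual workload) while sampling with LatchProfX from another session:

-- unselective scans of library-cache based views have to walk the library
-- cache chains under the library cache latches, which is where kglic shows
-- up as the latch holding location
SELECT COUNT(*) FROM v$db_object_cache;

SELECT COUNT(*) FROM v$sql_plan;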

So, check out Riyaj’s article, my LatchProfX script, and then my Advanced Oracle Troubleshooting Seminar :)

Note that this year’s only Advanced Oracle Troubleshooting class takes place at the end of April/May 2014, so sign up now if you plan to attend this year!


12 Responses to Another LatchProfX use case

  1. Muthu says:

    Hi Tanel,

    A great blog indeed. I marked it for regular reading.

    Sorry if this looks pretty long. I have a performance issue. We’re on an E-Business Suite 11.5.9 environment, running on Oracle 9.2.0.7.

    Today I came across 4 threads of a concurrent request running for over 15 hours. I checked them and found that they've been waiting on "cache buffers chains" for a very long time.

    Here’s the query that’s being executed by all the 4 sessions :

    ****

    SELECT COUNT(*)
    FROM
    APPS.CSI_ITEM_INSTANCES CSI ,APPS.CSI_IEA_VALUES CIV WHERE CSI.ATTRIBUTE5=
    TO_CHAR(:B1 ) AND CSI.INSTANCE_ID=CIV.INSTANCE_ID AND CIV.ATTRIBUTE_ID=
    10023 AND CIV.ATTRIBUTE_VALUE='ADVO Printed'

    ****

    tkprof output :

    ********************************************************************************

    SELECT COUNT(*)
    FROM
    APPS.CSI_ITEM_INSTANCES CSI ,APPS.CSI_IEA_VALUES CIV WHERE CSI.ATTRIBUTE5=
    TO_CHAR(:B1 ) AND CSI.INSTANCE_ID=CIV.INSTANCE_ID AND CIV.ATTRIBUTE_ID=
    10023 AND CIV.ATTRIBUTE_VALUE='ADVO Printed'

    call     count       cpu    elapsed       disk      query    current        rows
    ------- ------  -------- ---------- ---------- ---------- ----------  ----------
    Parse        1      0.00       0.00          0          0          0           0
    Execute   4930      0.40       0.33          0          0          0           0
    Fetch     4929  51338.85   55326.72     491641 1914573414          0        4929
    ------- ------  -------- ---------- ---------- ---------- ----------  ----------
    total     9860  51339.25   55327.05     491641 1914573414          0        4929

    Misses in library cache during parse: 0
    Optimizer goal: CHOOSE
    Parsing user id: 173 (APPS) (recursive depth: 1)

    Rows     Execution Plan
    -------  ---------------------------------------------------
          0  SELECT STATEMENT   GOAL: CHOOSE
          0   SORT (AGGREGATE)
          0    NESTED LOOPS
          0     TABLE ACCESS   GOAL: ANALYZED (BY INDEX ROWID) OF 'CSI_IEA_VALUES'
          0      INDEX   GOAL: ANALYZED (RANGE SCAN) OF 'ADVOCSI_IEA_VALUES_N1' (NON-UNIQUE)
          0     TABLE ACCESS   GOAL: ANALYZED (BY INDEX ROWID) OF 'CSI_ITEM_INSTANCES'
          0      INDEX   GOAL: ANALYZED (UNIQUE SCAN) OF 'CSI_ITEM_INSTANCES_U01' (UNIQUE)

    ********************************************************************************

    Table stats :

    Table : CSI_IEA_VALUES
    # Rows : 278,157,192
    Sample size : 25873594

    Table : CSI_ITEM_INSTANCES
    # Rows : 6,193,034
    Sample size : 579319

    Index stats :

    There are separate indexes on each of the columns in the WHERE clause of the two tables.

    Querying x$bh and dba_extents based on the latch address, I found that the same set of extents/blocks shows up as the hot blocks (with TCH between 20 and 40).
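
    A sketch of that kind of hot-block lookup (the child latch address below is just a placeholder):

    -- map buffers protected by a given cache buffers chains child latch to segments
    SELECT e.segment_name, b.dbablk, b.tch
    FROM   x$bh b, dba_extents e
    WHERE  b.hladdr = HEXTORAW('0000000012345678')   -- placeholder child latch address
    AND    e.file_id = b.file#
    AND    b.dbablk BETWEEN e.block_id AND e.block_id + e.blocks - 1
    ORDER  BY b.tch DESC;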

    Since this is production, I’m not sure if there’s any way to resolve this on the fly, other than asking the users to cancel the requests and re-submit them.

    Can you help me on this, please?

    Thanks
    Muthu

  2. Tanel Poder says:

    Hi Muthu,

    How did you determine there were long waits for that latch? Did you perhaps look at v$session_wait.event *without* realizing that the STATE column showed the session was not actually waiting? Unfortunately Oracle doesn't clear the last wait event name when the session goes back onto CPU, so this often misleads people (including me; that's why my sw.sql script uses a CASE statement to mask the wait event name when the session is not waiting anymore).
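
    The idea behind that CASE statement is roughly this (a simplified sketch, not the actual sw.sql):

    SELECT sid, state,
           CASE WHEN state != 'WAITING' THEN 'WORKING (on CPU?)'  -- last event name is stale
                ELSE event
           END AS current_activity,
           seconds_in_wait
    FROM   v$session_wait
    WHERE  sid = &sid;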

    Your tkprof output shows CPU time almost as big as elapsed time, so there should be no significant waits going on. (Did you enable sql_trace with waits=>true? In that case tkprof should also report the total time spent waiting on the different events.)

  3. SJois says:

    Hi Tanel,

    Today I had an issue of latch free waits for one of the sessions. I have a database system dump taken using oradebug dump systemstats. I believe there is a way to locate the latch-blocking session by looking at the dump file. Can you please guide me on how to do this?
    I appreciate your help.

    Thanks

  4. Tanel Poder says:

    You mean systemstate dump, right?

    How did you conclude that you had latching problem for that session?

    From the systemstate dump you can search for "latch:" or "latch free" (depending on DB version and latch type) to find the waiters and the latch address, and then you can look up the latch holder by that latch address.
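
    If the instance is still up while the session appears to be waiting, the same lookup can also be done from SQL (a sketch; the latch address below is a placeholder you would take from the waiter's dump entry or from v$session_wait.p1raw):

    -- which latch lives at that address and who is holding it right now?
    SELECT addr, name, gets, misses, sleeps
    FROM   v$latch_children
    WHERE  addr = HEXTORAW('0000000012345678');   -- placeholder latch address

    SELECT pid, sid, laddr, name
    FROM   v$latchholder
    WHERE  laddr = HEXTORAW('0000000012345678');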

    But before going there, tell me how exactly (what view, what columns) you concluded that there was a latch free problem in the system?

  5. SJois says:

    Tanel,

    Yes, it is a systemstate dump. I checked v$session_wait for that particular session and it showed the latch free event; the latches were 'cache buffers chains' and 'mostly latch-free SCN'. This session has been running for almost 16 hrs. At the system level the DB was otherwise normal. My DB version is 9.2.0.8 on SunOS 64-bit. I think I had seen an example on your site where you showed how to track back to the latch-blocking session.

    Thanks,

  6. SJois says:

    Tanel,

    Info: this is what I have in my dump file for this latch-free-waiting session. I did a grep on address 380010668 in the entire dump file and I don't see any reference to this address. So I am puzzled: did I use the wrong address?

    *******************************************

    SO: 2760886dd8, type: 4, owner: 276c962000, flag: INIT/-/-/0x00
    (session) trans: 27ef7a0c08, creator: 276c962000, flag: (100041) USR/- BSY/-/-/-/-/-
    DID: 0001-03F0-00000002, short-term DID: 0000-0000-00000000
    txn branch: 0
    oct: 2, prv: 0, sql: 27d892ab90, psql: 27d892ab90, user: 79/XYZUSER
    O/S info: user: xxxeng04, term: , ospid: 25364, machine: xxxxap4p
    program: exxxxeng@xxxxap4p (TNS V1-V3)
    application name: yyyyyy4, hash value=1554640259
    action name: Thread 25 (type T4: Object-Histo, hash value=1445968872
    last wait for 'latch free' blocking sess=0x0 seq=30610 wait_time=204
    address=380010668, number=58, tries=0
    temporary object counter: 5

    *******************************************

  7. Tanel Poder says:

    That's the reason why I asked: I suspected that the session was not actually stuck waiting for the event, but that you had misinterpreted v$session_wait instead. If STATE != WAITING in v$session_wait, then ignore whatever the event column says, as you're not waiting anymore but burning CPU! The event column just shows the LAST wait waited for, but it's not relevant anymore once you're no longer waiting.

    That’s why I use my sw.sql script which masks the last event name and replaces it with “on cpu” when not waiting.

    And the systemstate dump confirms that as well: if you were waiting you'd see "waiting for …" in there instead of "last wait …".

    I would look into the v$sesstat counters of that session (with Snapper for example) and see which counters are incrementing (especially whether the execute count is going up), and then look into the execution plan of the prevalent SQL_ID/sql_hash_value reported in v$session for that session. This may be caused just by a bad execution plan, but Snapper would confirm it.
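
    For example, on 9i the plan of the statement the session is currently executing can be pulled out roughly like this (a sketch; SID 123 is a placeholder):

    SELECT p.id, p.parent_id, p.operation, p.options, p.object_name
    FROM   v$session s, v$sql_plan p
    WHERE  s.sid        = 123                -- placeholder SID
    AND    p.address    = s.sql_address
    AND    p.hash_value = s.sql_hash_value
    ORDER  BY p.id;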

  8. SJois says:

    Tanel,

    I think this explanation answers why I don’t see the same address in the dump file.

    I do see in the dump that the session info shows 'last wait'; however, I was checking v$session.status, which was ACTIVE. From my observation, if the session is waiting on anything in the database, the v$session status shows ACTIVE (I might be wrong).

    So is it possible that while the dump was scanning this session, the session was doing some work?

    Next time this happens, if I use sw.sql and v$sesstat, I think I should get more info on this issue. However, do you foresee any other info that should be collected, just based on what you have seen? Your insight is greatly appreciated. Thanks,

  9. Tanel Poder says:

    v$session.status = ACTIVE when the session is servicing a user request. Whenever a new request comes in (through the TTC layer), one of the first things Oracle does is update the session idle bit, which marks the session as active. The session stays active regardless of CPU usage or waits for events. The session is marked inactive only when the response is sent back to the client application and the db call ends.

    What you probably meant was v$session_wait.state (STATE, not STATUS), which shows the wait state, i.e. whether you're waiting for something or not. The STATE column is also in v$session starting from 10g, but the two columns are different: STATE shows whether you're waiting (not on CPU) or working (on CPU), while STATUS shows whether the session is servicing a user request or not.
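
    A quick way to see both side by side on 9i (a sketch; SID 123 is a placeholder):

    SELECT s.sid,
           s.status,              -- servicing a user call or not
           w.state,               -- WAITING vs WAITED ... (i.e. back on CPU)
           w.event,
           w.seconds_in_wait
    FROM   v$session s, v$session_wait w
    WHERE  s.sid = w.sid
    AND    s.sid = 123;           -- placeholder SID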

    Regarding systemstate dumps: yes, it's possible to get "unlucky", so that when the dump is taken the session just happens to be doing something other than the problem operation. This is the case when you don't have a hang but just bad performance. That's why you should take a few systemstate dumps to see if anything changes over time. Note that a systemstate dump is quite expensive; better use @sw / v$session_wait if just one session is slow, and hanganalyze at level 3 if there seems to be a hang affecting multiple sessions.
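
    For reference, taking those dumps from SQL*Plus (as SYSDBA) looks roughly like this; systemstate level 10 is the common choice on 9i and, as said, it is expensive:

    SQL> oradebug setmypid
    SQL> oradebug hanganalyze 3
    SQL> oradebug dump systemstate 10
    SQL> oradebug tracefile_name

    The last command shows the trace file the dumps were written to.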

    I will post a little script in my blog for information gathering hopefully today, but what I usually gather first for a problem session is v$session_wait, v$sesstat and exec plan info from v$sql_plan.

  10. SJois says:

    Thanks Tanel, I am looking forward to it.

    Thanks,

  11. Pingback: latch: cache buffers chains latch contention - a better way for finding the hot block | Tanel Poder's blog: Core IT for Geeks and Pros
