Riyaj Shamsudeen wrote an excellent article about systematic latch contention troubleshooting.
Especially if the latch contention problem is ongoing, looking into system wide stats (like v$latch.sleep columns) is not the best idea in busy systems. This may sometimes lead you to fixing the wrong problem.
This is because sometimes the latch contention is not caused by some system wide inefficiency but rather by one or few sessions.
The right approach would be to measure the following things:
- Which latch (and exact child latches) the problem session is waiting for (query v$session_wait, ASH, sql_trace)
- Why is the problem session trying to get that latch so often (query v$sesstat counters for problem session and check execution plan if only single/few statements executed)
- Why this latch is busy: who’s holding it (query v$latchholder, LatchProf)
- Why is that someone holding the latch so much (query v$sesstat counters for that session or run LatchProfX)
Riyaj used this approach and successfully found out the troublemaker causing heavy library cache latch contention. He used my LatchProfX tool for part of the diagnosis (and I’m very happy to see that my advanced oracle troubleshooting tools find real world use in hands of other troubleshooters/tuners too!).
Here’s an excerpt from latchprofx output (from a little test case I put together), showing which session is holding which library cache latch how much, hold mode (shared/exclusive) and also the actual reason (function name) why the latch is held:
SQL> @latchprofx sid,name,laddr,ksllwnam,ksllwlbl,hmode 159 % 100000 -- LatchProfX 1.08 by Tanel Poder ( http://www.tanelpoder.com ) SID NAME LADDR KSLLWNAM KSLLWLBL HMODE Held Gets Held % Held ms Avg hold ms ---- -------------- ---------------- --------- --------- ---------- ------ ----- ------- -------- ----------- 159 library cache 00000003A93513E0 kglic child exclusive 10044 9314 10.04 301.320 .032 159 library cache 00000003A9351340 kglic child exclusive 9895 8458 9.90 296.850 .035 159 library cache 00000003A9351660 kglic child exclusive 9761 8440 9.76 292.830 .035 159 library cache 00000003A9351700 kglic child exclusive 9737 8924 9.74 292.110 .033 159 library cache 00000003A9351480 kglic child exclusive 8999 7765 9.00 269.970 .035 159 library cache 00000003A93515C0 kglic child exclusive 8553 7832 8.55 256.590 .033 159 library cache 00000003A9351520 kglic child exclusive 6852 6828 6.85 205.560 .030 7 rows selected.
kglic means Kernel Generic Library cache Iterate Chain (AFAIK), it’s the function which is executed when you access most X$KGL tables. And this is where Riyaj got the evidence that the latch contention comes from inefficient scanning of library cache, caused by a session running queries against V$SQL_PLAN views with a bad plan.
NB! If you want to move to the "New World" - and benefit from the awesomeness of Hadoop, without having to re-engineer your existing applications - check out Gluent, my new startup that will make history! ;-)