Library cache latches gone in Oracle 11g

In Oracle 11g even more library cache operations have been changed to use KGX mutexes instead of latches.

In Oracle 10.2 the library cache pin latch usage was replaced with mutexes whenever _kks_use_mutex_pin was true; a few other things, like the V$SQLSTATS arrays and parent cursor examination, were also protected by mutexes. However, traversing the library cache hash chains (the right child cursor lookup using kksfbc()) was still protected by library cache latches, which could become a problem when frequent soft parsing was combined with too small a cursor cache and long library cache hash chains (remember, the library cache latches were always taken exclusively, even for plain hash chain scanning).
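The latch pressure caused by this kind of soft parse activity shows up in V$LATCH. A minimal sketch of how you might eyeball it on a pre-11g instance (latch names as they appear in 10g):

```sql
-- Get/miss/sleep counts for the library cache latches (10g names);
-- high sleeps on "library cache" under heavy soft parsing hints at
-- contention from hash chain scanning.
SELECT name, gets, misses, sleeps
FROM   v$latch
WHERE  lower(name) LIKE 'library cache%'
ORDER  BY sleeps DESC;
```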

In 11g all library cache related latches except “library cache load lock” are gone, and the corresponding operations are protected by mutexes instead. For example, the “library cache” latches have been replaced by “Library Cache” mutexes.

Here are a couple of queries which illustrate the change.

Executed on Oracle 10g:

SQL> select name from v$latch where lower(name) like '%library%';

library cache pin allocation
library cache lock allocation
library cache hash chains
library cache lock
library cache
library cache pin
library cache load lock

7 rows selected.

SQL> select name from v$event_name where name like '%library%';

latch: library cache
latch: library cache lock
latch: library cache pin
library cache pin
library cache lock
library cache load lock
library cache revalidation
library cache shutdown

8 rows selected.

The same queries executed on Oracle 11g; most of the lines above are gone:

SQL> select name from v$latch where lower(name) like '%library%';

library cache load lock

SQL> select name from v$event_name where name like '%library%';

library cache pin
library cache lock
library cache load lock
library cache: mutex X
library cache: mutex S
OSD IPC library
library cache revalidation
library cache shutdown

8 rows selected.

It looks like the developers have thrown out the library cache latching mechanism in 11g, as the mutexes introduced in 10.2 have proven to work fine (and they have managed to implement mutexes for protecting (almost) the full set of library cache operations).
So there is no way to revert to the old behaviour using _kks_use_mutex_pin=false (and you wouldn’t need to anyway, btw). The parameter is still there though, and out of interest I checked what happens if I set it to false and bounced the instance.

As expected, I started getting error messages like the following right after startup:

ORA-03113: end-of-file on communication channel

ORA-00600: internal error code, arguments: [kglGetSessionUOL], [7], [], [], [], [], [], []

This is a good example of the dangers of undocumented parameters – they may work OK in one version (and platform) but can cause serious trouble elsewhere.

Anyway, back to mutexes. In 10g you can see there are three types of mutexes in use:

SQL> select * from v$mutex_sleep;

MUTEX_TYPE                       LOCATION                                     SLEEPS  WAIT_TIME
-------------------------------- ---------------------------------------- ---------- ----------
Cursor Stat                      kksFindCursorStat [KKSSTALOC3]                  339        775
Cursor Parent                    kkspsc0 [KKSPRTLOC26]                          1507     123969
Cursor Parent                    kksLoadChild [KKSPRTLOC5]                       170        372
Cursor Parent                    kksLoadChild [KKSPRTLOC4]                       385       3799
Cursor Parent                    kksfbc [KKSPRTLOC2]                            1649      22484
Cursor Parent                    kksfbc [KKSPRTLOC1]                             128       1599
Cursor Pin                       kksLockDelete [KKSCHLPIN6]                     3505     928387
Cursor Pin                       kkslce [KKSCHLPIN2]                           15343  160394917
Cursor Pin                       kksfbc [KKSCHLFSP2]                            3219    9065433

9 rows selected.

In 11g there are a couple of additional mutexes; one of them (and the most important) is the Library Cache mutex:

SQL> select distinct mutex_type from v$mutex_sleep_history;

Library Cache
Cursor Pin
Cursor Parent
Cursor Stat

(I had to sample v$mutex_sleep_history instead of v$mutex_sleep on 11g, as the latter was empty on 11g… it may be that the mutex get operations have been tuned further to not maintain these counters, just to save a few more CPU cycles on every get.)
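A simple way to summarize that sleep history is to aggregate it by mutex type and code location; note that v$mutex_sleep_history is a circular buffer, so this only reflects recent activity:

```sql
-- Group recent mutex sleeps by type and code location to see
-- which operation is contending the most right now.
SELECT mutex_type, location, COUNT(*) AS sleep_count
FROM   v$mutex_sleep_history
GROUP  BY mutex_type, location
ORDER  BY sleep_count DESC;
```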

So, starting from 11g, each library cache hash bucket is protected by a separate mutex (yes, all 131072 of them!).

Previously, all those buckets were hashed to, and protected by, at most 67 library cache latches, which were inevitably prone to unnecessary false contention when many cursors were executed concurrently. Now this issue should be resolved once and for all. Of course, the library cache mutexes still don’t solve all the problems in the world (especially the ones related to excessive hard parsing!); for example, there’s still a chance of a hash collision between two entirely different cursors. Also, if there are many child cursors under a parent and the application cursor management is poor (e.g. cursors are closed after every execution and no session cursor caching is done), then you could still have contention on the mutex due to continuous library cache hash chain traversal.
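On a pre-11g instance you can check for yourself how few child latches covered all those buckets (the exact number depends on CPU count, with 67 as the upper bound):

```sql
-- Count the "library cache" child latches on a pre-11g instance;
-- all 131072 hash buckets map onto just this handful of latches.
SELECT COUNT(*) AS library_cache_latches
FROM   v$latch_children
WHERE  name = 'library cache';
```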


63 Responses to Library cache latches gone in Oracle 11g

  1. maclean says:

    Hi tanel,
    The latch:library cache pin and latch:library cache lock has gone, then How does oracle allocate library cache pin/lock structure for non-cursor object?

  2. Mahesh says:

    There are in fact many bugs related to “library cache: mutex X” and some of them seem to be fixed (a list is available in the Mutex Contention section of the Metalink article 1179583.1) in the patchset which was released on 11/17. You might want to search for patch number 10098816 in Metalink. I’m going to apply this patch on Tux64 and hope that it’ll solve my probable bug 10145558 (selects on library cache V$/X$ views cause “library cache: mutex X” waits)!!

  3. Tanel Poder says:

    In fact, queries frequently walking through the entire library cache (V$SQL, V$SQLAREA, V$SQLPLAN* views) have always been a problem. Before libcache mutexes were introduced, such queries caused library cache latch contention (as walking through the complex library cache structure & linked lists requires some protection). The solution before 10.2 was to not query these views so frequently (stop that GUI “performance tuning” tool which polls V$SQL every 5 seconds etc).

    Starting from 10.2 you should query V$SQLSTATS instead where possible, that doesn’t visit the library cache, but a separate Cursor Stats array maintained by Oracle during SQL execution. If you need to access the library cache views, then you should identify which SQLIDs/hash values you want and search for these using the “=” condition (or a properly hinted IN) so that Oracle wouldn’t do “full scans” through the library cache, but a hash value based lookup instead.
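    For example, a targeted V$SQLSTATS lookup like the following avoids scanning the whole library cache (the &sql_id substitution variable here is just a placeholder for the cursor you are interested in):

    ```sql
    -- Hash-based lookup instead of a "full scan" of the library cache;
    -- replace &sql_id with the SQL_ID you actually want to examine.
    SELECT sql_id, executions, parse_calls, cpu_time, elapsed_time
    FROM   v$sqlstats
    WHERE  sql_id = '&sql_id';
    ```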

  4. Nigel Antell says:

    @Tanel Poder

    Probably worth adding 4) to this list.

    A new undocumented feature of 11gR2 is that memory resizes can still happen despite AMM and ASMM being switched off. If an ORA-4031 is encountered, an automatic and instant shared pool resize will be performed. This is controllable via the hidden parameter _MEMORY_IMM_MODE_WITHOUT_AUTOSGA.
    I had this issue regularly occurring after an upgrade to 11.2, and Oracle support didn’t even know about the feature. It took a fair while to get the information out of the Oracle devs, but it is now published on Metalink under article 1269139.1

    Obviously switching this off will result in ORA-4031 errors appearing, so care should be taken.

  5. Tanel Poder says:

    @Nigel Antell

    Thanks for the info, I didn’t know about this feature. Although I see where it comes from: the ASMM resizes are based on MMAN’s decisions, which it bases on the various memory pool advisors. The out-of-memory resize is, as the parameter name says, an immediate resize, which is done when you’re out of memory in some heap (shared pool, large pool, java pool etc.). So they’re different things internally.

    By the way, increasing the shared pool isn’t so much of a problem, especially if it’s done to avoid an ORA-4031 – I’m talking performance-wise. When you increase some heap at the expense of the buffer cache, you might see brief “buffer extermination” waits and then everything continues normally. But the trouble starts when you shrink heap-based pools (shared pool), as it’s far more complicated to flush out shared pool objects than buffers in the buffer cache. You’ll need to invalidate or clear the corresponding library cache objects, which results in library cache lookups and modifications, which means library cache latching or mutex gets, long holds – and contention. And when you hit some mutex (non-)sleeping issue, this can mean a huge performance spike. There used to be a bug which caused the whole instance to hang due to shared pool downsizes, IIRC.

    Anyway, I think this is a good thing really as it avoids some ORA-4031s, you’ll just need to keep an eye on shared pool growth, so that you wouldn’t unknowingly end up with a 16 MB buffer cache and 32GB shared pool if you have some shared pool leak or application design issue…

  6. Nigel Antell says:

    @Tanel Poder

    Well, the only reason I came across it was because of some very significant spikes on the system with the cursor: pin S wait on X event prevalent. So it did give me some very bad contention even on growth, but the database has some extremely high execution rates on single queries, so any issues like this are highlighted. Initially I was following all the threads on preemption and OS scheduling, as we did get a red-lining 88-core machine as several hundred sessions spun on the mutex. But when patches were applied, and application tuning was performed to reduce the execution rate, the resizes still remained, albeit without a big impact.
    Now I just run with a bigger shared pool, which is the best all round. I’d have thought you’d have to be _really_ tight on memory to want to change this parameter to false.

  7. Tanel Poder says:

    @Nigel Antell
    I think there’s a chance that you misinterpreted the symptoms. If Oracle is going to extend the shared pool in emergency (immediate) mode, then this is only because you’ve run out of shared pool memory, you’ve tried the freelists, *you’ve already flushed the LRU lists*, and still had no luck. And only then, to avoid an ORA-4031, did Oracle extend the shared pool.

    But the spike you saw likely happened because of, and during, the excessive LRU flushing in a desperate search for some usable memory. So the problem is memory shortage (fragmentation), and the increase of the shared pool is just another result of that shortage.

    Perhaps you had some large PL/SQL package (or other large object which can’t be broken down to pieces) loaded into shared pool just then…

  8. Nigel Antell says:

    @Tanel Poder
    Interesting. One thing I forgot to mention is that one of the steps I took was to pin all our user application code at instance startup, in an effort to make the shared pool less fragmented and to reduce the number of reloads. The success of this (not a single ORA-4031/resize since) fits in very well with your assessment, and it certainly fits my memory of the symptoms. I guess if part of the LRU flushing tried to flush any of the child cursors associated with some of these high execution rate statements (several thousand executions per second), that is what could cause the cursor: pin S wait on X event to occur in such a bad way.
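    For reference, that kind of pinning is done with DBMS_SHARED_POOL.KEEP; a minimal sketch, where the owner and package names are hypothetical (the DBMS_SHARED_POOL package itself may need to be installed via dbmspool.sql first):

    ```sql
    -- Pin a (hypothetical) application package into the shared pool so
    -- LRU flushes won't age it out; the 'P' flag marks it as a package.
    BEGIN
      dbms_shared_pool.keep('APPOWNER.SOME_PKG', 'P');
    END;
    /
    ```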

  9. shan chao says:

    Hi, tanel

    I read your ppt oraclelatchandmutexcontentiontroubleshooting-110228084020-phpapp01.pdf and tried to make use of the p1/p2/p3 values of the wait event “library cache: mutex X”.

    They’re something like 2185117608/0/85, 2185117608/0/106, 2185117608/24859270709248/82. I know how p3 works, but Parameter 1 (idn) and Parameter 2 (value) don’t help at all. There is no SQL with hash value 2185117608 (if it were cursor: pin or cursor: mutex, yes, this would be the hash value).

    We’re seeing high library cache: mutex X and library cache lock waits after we finished exchanging a partition of one partitioned table.

    Is there any simple way to simulate the mutex contention in a test env? I can get library cache lock/pin waits easily by compiling a package or adding a primary key to a table; however, I’ve found it hard to reproduce mutex contention in my test env.

  10. Doug Burns says:

    Hi Tanel,

    A question and an observation.

    Question: Does this post imply that latchprof.sql can’t really be used to diagnose library cache issues in 11g because they’re protected by mutexes and not latches?

    Observation: The Performance Page and Top Activity Page in OEM still show the old-style latch wait events

  11. Tanel Poder says:

    @Doug Burns

    Well, library cache issues are a wider topic – such as determining whether/why existing cursors aren’t reused (nonshared.sql etc.).

    The starting point of troubleshooting would be the same in 10g and 11g, regardless of latches or mutexes. You’d start by looking into P1 of the library cache related wait events – in the case of latches you can use latchprof, but with mutexes you can either use v$mutex_sleep_history (mutexprof.sql) or, even easier, use the blocking session ID (in 11g the mutex waiter resolves and populates the blocking session ID in v$session/ASH) and see what that session was doing. So the starting point – the wait interface, p1, p2 – is more important and may be enough for diagnosing the issue… I use latchprof only in more advanced cases where the wait interface and ASH don’t give enough detail…
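    A quick way to see the current mutex waiters and their blockers in one go (11g populates BLOCKING_SESSION for mutex waits):

    ```sql
    -- Who is waiting on library cache mutexes right now, and on whom;
    -- P1 is the hash value (idn) of the library cache object accessed.
    SELECT sid, event, p1, blocking_session
    FROM   v$session
    WHERE  event LIKE 'library cache: mutex%';
    ```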

    Yep, in 10g, all the library cache latches are still there and used for protecting PL/SQL and other library cache objects, while only the cursor pinning is protected by mutexes (cursor lookup, loading etc is still protected by latches).

  12. Tanel Poder says:

    @shan chao

    The P1 is the hash value of the LIBRARY CACHE OBJECT accessed – not just SQL cursors, but any library cache object. V$SQL only shows SQL child cursors, no other library cache objects. You’ll have to query X$KGLOB by KGLNAHSH to see all library cache objects.
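    A sketch of such a lookup (X$ tables need SYS access; the &p1_hash_value substitution is a placeholder for the P1 value from the wait event):

    ```sql
    -- Find all library cache objects with a given hash value (P1 / idn).
    -- KGLNAOBJ is the object name, KGLHDNSP the namespace number.
    SELECT kglnaobj, kglhdnsp, kglobtyp
    FROM   x$kglob
    WHERE  kglnahsh = &p1_hash_value;
    ```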

  13. Doug Burns says:

    @Tanel Poder
    Thanks Tanel. In the end, it was a simple case of using the blocking session info in ASH. I should probably have looked there first, to be honest!

  14. Tanel Poder says:

    @Doug Burns

    Cool! Was it 11g+? In 10g (at least up to the .4 patchset) the mutex waiters didn’t populate the blocking session.

  15. Doug Burns says:

    It was, but what was interesting was that only one of the blocked sessions had the information populated – the first blocked session in the chain, if that makes sense? All of the others that were blocked in turn by that first blocked session didn’t have the information populated. So, as usual in my experience, ASH blocked session info is limited.

  16. Jaxon says:

    I might be beating a dead horse, but thank you for posting this!
