In Oracle 11g even more library cache operations have been changed to use KGX mutexes instead of latches.
In Oracle 10.2.0.2+ the library cache pin latch usage was replaced with mutexes whenever _kks_use_mutex_pin was true, also few other things like V$SQLSTATS arrays and parent cursor examination were protected by mutexes. However the traversing of library cache hash chains (the right child cursor lookup using kksfbc()) was still protected by library cache latches which could become a problem with frequent soft parsing combined with too little cursor cache and long library cache hash chains (remember, the library cache latches were always taken exclusively even for plain hash chain scanning).
In 11g all library cache related latches except “library cache load lock” are gone and corresponding operations are protected by mutexes instead. The “library cache” latches have been replaced by “Library Cache” mutexes for example.
Here are couple queries which illustrate the change.
Executed on 10.2.0.3:
SQL> select name from v$latch where lower(name) like '%library%'; NAME -------------------------------------------------- library cache pin allocation library cache lock allocation library cache hash chains library cache lock library cache library cache pin library cache load lock 7 rows selected. SQL> select name from v$event_name where name like '%library%'; NAME ---------------------------------------------------------------- latch: library cache latch: library cache lock latch: library cache pin library cache pin library cache lock library cache load lock library cache revalidation library cache shutdown 8 rows selected.
Same queries executed on 11.1.0.6 and the bold lines above are gone:
SQL> select name from v$latch where lower(name) like '%library%'; NAME ---------------------------------------------------------------- library cache load lock SQL> select name from v$event_name where name like '%library%'; NAME ---------------------------------------------------------------- library cache pin library cache lock library cache load lock library cache: mutex X library cache: mutex S OSD IPC library library cache revalidation library cache shutdown 8 rows selected.
Looks like the developers have thrown out library cache latching mechanism in 11g as the mutexes introduced in 10.2 have proven to work fine (and they have managed to implement mutexes for protecting (almost) the full set of library cache operations).
So there is no way to revert back to old behaviour using _kks_use_mutex_pin=false (and you wouldn’t need to do this anyway, btw). The parameter is still there though and out of interest I checked what happens if I set it to false and bounced the instance.
As, expected, I started getting error messages like following right after startup:
ORA-03113: end-of-file on communication channel ORA-00600: internal error code, arguments: [kglGetSessionUOL], [7], [], [], [], [], [], []
This is a good example of dangers with undocumented parameters – they may work ok in one version (and platform) but could cause serious trouble anywhere else.
Anyway, back to mutexes. In 10g you see there are 3 types of mutexes used:
SQL> select * from v$mutex_sleep; MUTEX_TYPE LOCATION SLEEPS WAIT_TIME -------------------------------- ---------------------------------------- ---------- ---------- Cursor Stat kksFindCursorStat [KKSSTALOC3] 339 775 Cursor Parent kkspsc0 [KKSPRTLOC26] 1507 123969 Cursor Parent kksLoadChild [KKSPRTLOC5] 170 372 Cursor Parent kksLoadChild [KKSPRTLOC4] 385 3799 Cursor Parent kksfbc [KKSPRTLOC2] 1649 22484 Cursor Parent kksfbc [KKSPRTLOC1] 128 1599 Cursor Pin kksLockDelete [KKSCHLPIN6] 3505 928387 Cursor Pin kkslce [KKSCHLPIN2] 15343 160394917 Cursor Pin kksfbc [KKSCHLFSP2] 3219 9065433 9 rows selected.
In 11g there are couple additional mutexes, one (and most important) of them is Library Cache mutex:
SQL> select distinct mutex_type from v$mutex_sleep_history; MUTEX_TYPE -------------------------------- Library Cache Cursor Pin Cursor Parent Cursor Stat
(I had to sample v$mutex_sleep_history instead of v$mutex_sleep on 11g as the latter was empty on 11g… it may be that the mutex get operations have been tuned further to not maintain the counters just to save even few more CPU cycles every get)
So, starting from 11g, each library cache bucket is protected by a separate mutex (yes all 131072 of them!).
Previously we had all those buckets were hashed to and protected by max 67 library cache latches which inevitably were prone to unnecessary false contention in case of many cursors being executed concurrently. Now this issue should be resolved once and for all. Of course the library cache mutexes still don’t solve all problems in the world (especially the ones related to excessive hard parsing!), for example there’s still a chance of hash collision of two entirely different cursors. Also, if there are many child cursors under a parent and the application cursor management is poor (e.g. cursors are closed after every execution and no session cursor caching is done) then you could still have contention on the mutex due continuous library cache hash chain traversing.




@Tanel Poder
Probably worth adding 4) to this list.
A new undocumented feature of 11gR2 is that memory resizes can still happen despite AMM and ASMM being switched off. If a ORA-4031 is encountered, an automatic and instant shared pool resize will be performed. This is controllable via the hidden parameter _MEMORY_IMM_MODE_WITHOUT_AUTOSGA.
I had this issue regularly ocurring after an upgrade to 11.2, and Oracle support didn’t even know about the feature. Took a fair while to get the information out of the Oracle devs, but it is now published on Metalink under article 1269139.1
Obviously switching this off will result in 4031 errors appearing so care should be taken.
@Nigel Antell
Thanks for the info, I didn’t know about this feature. Although I see where it comes from, the ASMM resizes are based on the MMAN decisions, which it bases on from the varios memory pool advisors. The out-of-memory resize is, as the parameter says, an immediate resize, which is done when you’re out of memory in some heap (shared pool, large pool, java pool etc). So they’re different things internally.
By the way, increasing shared pool isn’t so much of a problem, especially if it’s done for avoiding an ORA-4031. I’m talking performance wise. When you increase some heap at the expense of buffer cache, you might see brief “buffer extermination” waits and then everything continues normally. But the trouble starts when you shrink heap-based pools (shared pool) as it’s way more complicated to flush out shared pool objects compared to buffers in buffer cache. You’ll need to invalidate or clear corresponding library cache objects, which results in library cache lookups and modifications, which means library cache latching or mutex gets, long holds – and contention. And when you hit some mutex (non-)sleeping issue, this can mean a huge performance spike. There used to be a bug which caused the whole instance to get hung due shared pool downsizes IIRC.
Anyway, I think this is a good thing really as it avoids some ORA-4031s, you’ll just need to keep an eye on shared pool growth, so that you wouldn’t unknowingly end up with a 16 MB buffer cache and 32GB shared pool if you have some shared pool leak or application design issue…
@Tanel Poder
Well the only reason I came across it was because of some very significant spikes on the system with the cursor: pin S wait on X event prevailent. So it did give me some very bad contention even on growth, but the database has some extremely high execution rates on single queries, so any issues like this are highlighted. Initially I was following all the threads on preemption and OS scheduling, as we did get a red lining 88 core machine as several hundred sessions spun on the mutex. But when patches were applied, and application tuning performed to remove the execution rate, the resizes still remained, albeit without a big impact.
Now I just run with a bigger shared pool, which is the best all round. I’d have thought you’d have to be _really_ tight on memory to want to change this parameter to false.
@Nigel Antell
I think there’s a chance that you did misinterpret the symptoms. If Oracle is going to extend the shared pool in emergency (immediate) mode, then this is only because you’ve run out of shared pool memory, you’ve tried freelists, *you’ve already flushed LRU lists*, but still no luck. And only then, to avoid ORA-4031, Oracle extended the shared pool.
But the spike you saw likely happened because & during the excessive LRU flushing part in desperate search for some usable memory. So, the problem is memory shortage (fragmentation) and the increasing of shared pool is just another result of that shortage.
Perhaps you had some large PL/SQL package (or other large object which can’t be broken down to pieces) loaded into shared pool just then…
@Tanel Poder
Interesting. One thing I forgot to mention is that one of the steps I took was to pin all our user application code on instance startup in an effort to make the shared pool less fragmented and to reduce the number of reloads. The success of this (not had a single 4031/resize since) would fit in very well with your assessment, and it certainly fits in well with my memory of the symptoms. I guess if part of the LRU flushing tried to flush any of the child cursors associated with some of these high execution rate statements (several k per second), this is what could cause the pin s wait on X event to occur in such a bad way.
Hi, tanel
I read your ppt oraclelatchandmutexcontentiontroubleshooting-110228084020-phpapp01.pdf and try to make use of p1/p2/p3 of wait event “library cache: mutex X”.
They’re something like 2185117608/0/85, 2185117608/0/106 ,2185117608/24859270709248/82. As i know how p3 works, Parameter1 (idn) and Parameter 2 (value) can’t help at all. There is no sql with hashvalue 2185117608(if it’s cursor: pin or cursor: mutex, yes this is hash value).
We’re seeing high library cache: mutex X and library cache lock after we finished exchanging partition of one partition table.
Is there any simple way to simulate the mutex on a test env? I can get library cache lock/pin easily by compiling package or add primary key to a table, however i found it hard to get mutex on my test env.
Hi Tanel,
A question and an observation.
Question: Does this post imply that latchprof.sql can’t really be used to diagnose library cache issues in 11g because they’re protected by mutexes and not latches?
Observation: The Performance Page and Top Activity Page in OEM 10.2.0.5 still shows the old style latch wait events
@Doug Burns
Well, library cache issues is a wider topic – such as determining whether/why existing cursors aren’t reused (nonshared.sql etc).
The starting point of troubleshooting would be the same in 10g and 11g, regardless of latches or mutexes. You’d start from looking into P1 of the library cache related wait events – and in case of latches you can use latchprof, but with mutexes you can either use v$mutex_sleep_history (mutexprof.sql) or even easier, use the blocking session ID (in 11g, the mutex waiter resolves and populates blocking session ID in v$session/ASH) and see what that session was doing. So, the starting point – wait interface, p1,p2 – is more important and may be enough for diagnosing the issue… I use latchprof only in more advanced cases where wait interface and ASH doesn’t give enough detail…
Yep, in 10g, all the library cache latches are still there and used for protecting PL/SQL and other library cache objects, while only the cursor pinning is protected by mutexes (cursor lookup, loading etc is still protected by latches).
@shan chao
The P1 is the hash value of the LIBRARY CACHE OBJECT accessed. Not just SQL cursors only, but any library cache object. V$SQL only show SQL child cursors, but no other library cache object. You’ll have to query X$KGLOB by KGLNAHSH to see all library cache objects.
@Tanel Poder
Thanks Tanel. In the end, it was a simple case of using blocking session info in ASH. Should probably have looked there first, to be honest!
@Doug Burns
Cool! Was it 11g+? In 10g (at least up to 10.2.0.3 or .4 the mutex waiters didn’t populate blocking session)
It was 11.2.0.2 but what was interesting was that only one of the blocked sessions had the information populated – the first session blocked in the chain if that makes sense? All of the others that were blocked in turn by that first blocked session didn’t have the information populated. So, as usual in my experience, ASH blocked session info is limited.
I might be beanitg a dead horse, but thank you for posting this!