cursor: pin S waits, sporadic CPU spikes and systematic troubleshooting

I recently consulted for a big telecom company and helped them solve a sporadic performance problem that had troubled them for months. It was an interesting case, as it happened at the Oracle / OS touchpoint and was a product of multiple “root causes”, not just one: an early Oracle mutex design bug and a Unix scheduling issue. That’s why it had been hard to resolve earlier, despite multiple SRs being opened, etc.

Martin Meyer, their lead DBA, posted some info about the problem and its technical details, so before going on, you should read his blog entry and then come back for my comments below:

Problem:

So, the problem was that occasionally the critical application transactions, which should have taken a very short time in the database (<1 s), took 10-15 seconds or even longer and timed out.

Symptoms:

  1. When the problem happened, the CPU usage also jumped up to 100% for the duration of the problem (from a few tens of seconds up to a few minutes).
  2. In AWR snapshots (taken every 20 minutes), “cursor: pin S” popped into the Top 5 waits (around 5-10% of total instance wait time), sometimes together with “cursor: pin S wait on X” (which is a different thing), “latch: library cache” and, interestingly, “log file sync”. These waits had much higher average wait times per wait occurrence than normal (tens or hundreds of milliseconds per wait, on average).
  3. The V$EVENT_HISTOGRAM view showed lots of cursor: pin S waits taking a very long time (over a second, some even 30+ seconds), which certainly isn’t normal (Martin has these numbers in his blog entry); see the query sketch after this list.
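
A minimal sketch of how you could eyeball that wait time distribution yourself – V$EVENT_HISTOGRAM is a standard view and its buckets are in milliseconds:

    -- how long do individual cursor: pin S waits take?
    SELECT event, wait_time_milli, wait_count
    FROM   v$event_histogram
    WHERE  event = 'cursor: pin S'
    ORDER  BY wait_time_milli;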

AWR and OS CPU usage measurement tools are system-wide tools (as opposed to session-wide tools).

Troubleshooting:

I can’t give you exact numbers or AWR data here, but will explain the flow of troubleshooting and reasoning.

  • As the symptoms involved CPU usage spikes, I first checked whether there were perhaps logon storms going on due to a bad application server configuration, where the app server suddenly decides to fire up hundreds more connections at the same time (that happens quite often, so it’s a usual suspect when troubleshooting such issues). A logon storm can consume lots of CPU, as all these new processes need to be started up in the OS, they attach to the SGA (syscalls, memory pagetable set-up operations) and eventually they need to find & allocate memory from the shared pool and initialize session structures. This all takes CPU. However, the logons cumulative statistic in AWR barely went up at all during the 20-minute snapshot, so that ruled out a logon storm (see the first query sketch after this list). As the number of sessions at the end of the AWR snapshot (compared to the beginning of it) did not go down, this ruled out a logoff storm too (which also consumes CPU, as the exiting processes need to release their resources etc).
  • It’s worth mentioning that log file sync waits also went up by over an order of magnitude (IIRC from 1-2 ms to 20-60 ms per wait) during the CPU spikes. However, as log file parallel write times didn’t go up so radically, this indicated that the log file sync wait time was spent somewhere else too – very likely in CPU scheduling latency (waiting on the CPU runqueue) when the CPUs were busy.
  • As one of the waits which popped up during the problem was cursor: pin S, I then checked V$MUTEX_SLEEP_HISTORY and it did not show any specific cursor as a significant contention point (all contention recorded in that sleep history buffer was evenly spread across many different cursors), so that indicated to me that the problem was likely not a single-cursor issue (a bug or just too heavy usage of that one cursor); see the second query sketch after this list. Note that this view was not queried during the worst problem time, so there was a chance that some symptoms were no longer in there (V$MUTEX_SLEEP_HISTORY is a circular buffer of the last few hundred mutex sleeps).
  • So, we had CPU starvation and very long cursor: pin S waits popping up at the same time. A cursor: pin S operation should happen really fast, as it’s a very simple operation (only a few tens of instructions) of marking the cursor’s mutex in-flux so its reference count can be bumped up for a shared mutex get.
  • Whenever you see CPU starvation (CPUs 100% busy and runqueues are long) and latch or mutex contention, the CPU starvation should be resolved first, as the contention may just be a symptom of the CPU starvation. The problem is that if you get unlucky and a latch or mutex holder process is preempted and taken off the CPU by the scheduler, the latch/mutex holder can’t release the latch before it gets back onto the CPU to complete its operation! But the OS doesn’t have a clue about this, as latches/mutexes are just Oracle’s memory structures in the SGA. So the latch/mutex holder is off CPU and everyone else who gets onto the CPU may want to take the same latch/mutex. They can’t get it, and they spin briefly in the hope that the holder releases it in the next few microseconds – which isn’t going to happen in this case, as the latch/mutex holder is still off CPU!
  • And now comes a big difference between latches and mutexes in Oracle 10.2: when a latch getter can’t get the latch after spinning, it will go to sleep to release the CPU. Even if there are many latch getters in the CPU runqueue ahead of the latch holder, they all spin quickly and end up sleeping again. But when a mutex getter doesn’t get the mutex after spinning, it will not go to sleep!!! It will yield() the CPU instead, which means it goes to the end of the runqueue and tries to get back onto the CPU as soon as possible. So, mutex getters in 10.2 are much less graceful – they can burn a lot of CPU when the mutex they want is held by someone else for a long time.
  • But so what if a mutex holder is preempted and taken off the CPU by the OS scheduler – it should get back onto the CPU pretty fast once it works its way through the CPU runqueue, right?
  • Well, yes IF all the processes in the system have the same priority.
  • This is where the second problem comes into play – Unix process priority decay. When a process eats a lot of CPU (and does little IO / voluntary sleeping), the OS lowers that process’s CPU scheduling priority so that other, less CPU-hungry processes still get their fair share of CPU (especially when coming back from an IO wait, for example).
  • When a mutex holder has a lower priority than most other processes and is taken off the CPU, a thing called priority inversion happens. Even though the other processes have higher priority, they cannot proceed, as the critical lock or resource they need is already held by the process with lower priority, which can’t complete its work because the “high priority” processes keep the CPUs busy.
  • In the case of latches, the problem is not that bad, as the latch getters go to sleep until they are posted when the latch is released by the holder process (I’ve written about it here). But the priority inversion takes a crazy turn in the case of mutexes – their getters don’t sleep (not even for a short time) by default, but yield the CPU, try to get back onto it immediately, and so on until they get the mutex. That can lead to huge CPU runqueue spikes, unresponsive systems and even hangs.
  • This is why, starting from Oracle 11g, the mutex getters do sleep instead of just yielding the CPU. Oracle has also backported the fix into Oracle 10.2.0.4, where a patch must be applied and where _first_spare_parameter specifies the sleep duration in centiseconds.
  • So, knowing how mutexes worked in 10.2, all the symptoms led me to suspect this priority inversion problem, greatly amplified by the fact that mutex getters never sleep by default. We checked the effective priorities of all Oracle processes on the server and hit the jackpot – there were a number of processes with significantly lower priorities than all the other processes had. And it takes only one process with low priority to cause all this trouble – just wait until it starts modifying a mutex and is preempted while doing so.
  • So, in order to fix both of the problems, which amplified each other, we had to enable the HPUX_SCHED_NOAGE Oracle parameter to prevent priority decay of the processes, and set _first_spare_parameter to 10, which meant that the default mutex sleep time would be 10 centiseconds (a pretty long time in the mutex/latching world, but better than crazily retrying without any sleeping at all). That way no process (the mutex holder) gets pushed back and kept away from the CPU for long periods of time. (Sketches of the checks and parameter changes mentioned in this list follow right below.)
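
First, a minimal sketch of the logon/logoff storm sanity check mentioned in the list above – snapshot the logons cumulative statistic and the session count, wait a while, run it again and compare (the 20-minute AWR deltas give you the same information):

    -- run this twice, some time apart, and compare the deltas
    SELECT (SELECT value    FROM v$sysstat WHERE name = 'logons cumulative') AS logons_cumulative,
           (SELECT COUNT(*) FROM v$session)                                  AS current_sessions
    FROM   dual;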
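
Second, a sketch of checking whether the recent mutex sleeps are concentrated on a single cursor or spread across many of them (remember that V$MUTEX_SLEEP_HISTORY is a circular buffer, so it only covers the last few hundred sleeps):

    -- which cursors (by library cache hash value) have recently caused mutex sleeps?
    SELECT mutex_identifier, mutex_type, location, SUM(sleeps) AS total_sleeps
    FROM   v$mutex_sleep_history
    GROUP  BY mutex_identifier, mutex_type, location
    ORDER  BY total_sleeps DESC;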
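
And finally, a sketch of the two parameter changes. The exact values are environment-specific: the hpux_sched_noage value shown is just an example (check the SCHED_NOAGE priority range in the HP-UX and Oracle documentation for your release), and _first_spare_parameter is an undocumented parameter that should only be set after applying the 10.2.0.4 mutex patch and with Oracle Support’s blessing:

    -- HP-UX only: run Oracle processes in the SCHED_NOAGE class to prevent priority decay
    ALTER SYSTEM SET hpux_sched_noage = 178 SCOPE=SPFILE;            -- example value, restart required
    -- 10.2.0.4 + mutex patch: make mutex getters sleep 10 centiseconds instead of just yielding
    ALTER SYSTEM SET "_first_spare_parameter" = 10 SCOPE=SPFILE;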

This was not a trivial problem, as it happened at the Oracle / OS touchpoint and happened not because of a single reason, but as a product of multiple separate causes amplifying each other.

There are a few interesting, non-technical points here:

  1. When troubleshooting, don’t let performance tools like AWR (or any other tool!) tell you what your problem is! Your business and your users should tell you what the problem is, and the tools should only be used for symptom drilldown (this is what Cary Millsap has been constantly telling us). Note how I described the problem and the symptoms separately at the beginning of this post – and the problem was that some business transactions (system-wide) timed out because the database response time was 5-15 seconds!
  2. The detail and scope of your performance data must match at least the detail and scope of your performance problem!
    In other words, if your problem is measured in a few seconds, then your performance data should also be sampled at least every few seconds in order to be fully systematic (see the ASH sketch after this list). The classic issue in this case was that the 20-minute AWR reports still showed IO wait times as the main DB time consumers – but that was averaged over 20 minutes. Our problem happened severely and briefly, within a few seconds of those 20 minutes, so the averaging and aggregation over a long period of time hid the extreme performance issue that happened in a very short time.
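
A minimal sketch of looking at database activity at one-second granularity instead of 20-minute averages – this assumes ASH (Diagnostics Pack licensing), but a home-grown sampler over V$SESSION would do the job too:

    -- active sessions and cursor: pin S waiters per ASH sample (roughly one per second)
    SELECT sample_id,
           MIN(sample_time) AS sample_time,
           COUNT(*)         AS active_sessions,
           SUM(CASE WHEN session_state = 'WAITING'
                     AND event LIKE 'cursor: pin S%' THEN 1 ELSE 0 END) AS pin_s_waiters
    FROM   v$active_session_history
    WHERE  sample_time > SYSDATE - 10/1440   -- last 10 minutes
    GROUP  BY sample_id
    ORDER  BY sample_id;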

Next time, when it seems impossible to diagnose a problem and the troubleshooting effort ends up going in circles, ask “what is the real problem, who is experiencing it, and how?” and then check whether your performance data’s detail and scope match that problem!

Oh, this is a good point to mention that in addition to my Advanced Oracle Troubleshooting/SQL Tuning seminars I also actually perform advanced Oracle troubleshooting consulting too! I eat mutexes for breakfast ;-)

Update: Srivenu Kadiyala has experienced the same problem and has written about it here.

 

Note that this year’s only Advanced Oracle Troubleshooting class takes place at the end of April/May 2014, so sign up now if you plan to attend this year!


43 Responses to cursor: pin S waits, sporadic CPU spikes and systematic troubleshooting

  1. Thiago Maciel says:

    Hello Tanel,

    First of all, I am a big fan of yours. I always hit your blog to learn something new every day.

    Thank you for such great information.

    This is going to solve our issue as well :)

    Cheers!

  2. Aswath Rao says:

    Tanel,

    Great blog. Learnt a lot from the above blog.

    Thanks
    Aswath

  3. Anthony Cordell says:

    Great post. It seems like this problem would not just affect HPUX, but also Solaris, Linux, etc. that have time share scheduling. So I am curious why Oracle has an HPUX parameter solution and not one for other OS schedulers. Is there something specific about HPUX that makes it more vulnerable requiring a special parameter? I primarily work with Solaris and Linux systems so I am unfamiliar with the intricacies of HPUX.

  4. Tim says:

    Appreciate you sharing your experiences, Tanel. Learned a lot.

  5. Andrey Nikolaev says:

    Great post!
    This will be a classic example of solving mutex CPU thrashing instability.
    The problem actually affects all platforms.
    I would like to bring your attention to
    Patch 7441165 “Prevent preemption while holding a mutex”.
    This patch is intended to solve the problem completely.
    And it really helps our customer.
    The patch exists only for HP-UX and Solaris.

  6. Tanel Poder says:

    @Anthony Cordell

    Yes, this problem affects other Unixes as well and Linux to some extent. It’s dependent on which scheduler is used of course and especially in Linux the scheduler algorithms have changed a lot over time. A good starting point would be to run ps on your system and see whether there are Oracle processes which *constantly* have lower priorities than other oracle processes.

  7. Tanel Poder says:

    @Andrey Nikolaev
    Thanks Andrey for this info, I wasn’t aware of this patch.

    The Oracle port on Solaris has had preemption control built in for latch gets for a while, but now apparently they have implemented it for mutexes as well. Note that preemption control won’t work in extreme situations, as the OSes can’t completely obey a userland process’s request to stay on CPU without preemption. For badly behaving processes (which don’t yield the CPU soon enough after the critical section) the preemption control will be turned off by the kernel – that’s the case in Solaris at least.

    I didn’t know that they have implemented this in HP-UX too now (with this patch). I checked around a little and the HP-UX kernel indeed supports this now – there’s a Mercury library and, more specifically, a hg_setcrit() function which allows a userland process to request a preemption delay without a system call.

  8. Pingback: Log Buffer #186, a Carnival of the Vanities for DBAs | The Pythian Blog

  9. Vijay says:

    Hi Tanel,

    This is a very good post for entire Oracle community to learn. I’ve been involved in several performance issues in the past (some related to latches) and understand how difficult and time consuming it is to get to the root of such issues.

    Appreciate your urge to benefit others with your experience. This is what makes me a big fan of yours, Cary Millsap and Tom Kyte. You guys have come to my rescue several times. I am sure you are not aware of this fact :)

    Hats off to all of you.

    Cheers !

    Vijay

  10. Ash says:

    Hi Tanel,

    we upgraded Oracle 9i to 11gR2 (9.2.0.5 –> 9.2.0.8 –> 11.2.0.1). We are experiencing a similar issue on Solaris 10. All apps used to work fine on 9i, and after the upgrade we are seeing mutex errors and library cache contention, which results in an abnormal spike in CPU usage (100%) and system hangs. Whoever already has a connection to the database can work with some delays, but new connections get refused. The current hardware has double the power of what we had for the 9i database. This has been going on since last week and we are getting nowhere with Oracle solutions (no offence to Oracle). I have asked around many DBAs and they have not experienced this yet. Any help or input will be greatly appreciated. If you like I can provide my phone number to discuss this.

    Ash
    akpatel@vonage.com

  11. Adrian says:

    We have also had a problem with cursor pin s waits following an upgrade to 11g on our test system.

    The run queue spikes with 100% CPU utilisation. The same test on 10g ran at 32% CPU utilisation.

    The fix where mutex getters sleep instead of yielding the CPU has not been included in 11g. The patch is available for 11.2.0.1 (bug 6904068).

    I had to rollback Critical Patch Update (9352237) due to a patch conflict so some further clarification is required regarding installing of CPU-9352237.

    Adrian

  12. Adrian says:

    Regarding my post above this was on Solaris 10 Sparc 64bit.

    Regards
    Adrian

  13. skodman says:

    Hi

    Case 1 on the following link describes similar behavior on AIX 5.3 (it doesn’t say it is related to ORACLE though):

    http://www.ibm.com/developerworks/aix/library/au-aix5_cpu/index.html

    Hope this helps.

    Cheers
    Skodman

  14. heather damiani says:

    Hello. You mention a patch to 10.2.0.4. Do you have a patch number? And am I understanding correctly that you need to apply the patch, set HPUX_SCHED_NOAGE and then _first_spare_parameter?

  15. Tanel Poder says:

    @heather damiani
    Hi Heather,

    This is the bug & patch:

    Bug 6904068 High CPU usage when there are “cursor: pin S” waits

    Yes, you should do all 3 things you mentioned above…

    Also, check out this bug & patch too, it’s a separate improvement, but related (only useful on Solaris and HP-UX as other OS’es don’t support userland managed pre-emption control):

    Bug 7441165 – Prevent preemption while holding a mutex

  16. @Tanel Poder
    Hi Tanel,

    according to MOS, an interim patch for bug 6904068 is available for 11.2.0.1.1/2 too; it was released in July for 11gR2, so it means unpatched 11gR2 still has the same problems.

  17. @Timur Akhmadeev
    Oh, and one more thing: 7441165 is available for Linux x86-64. Can you comment on this?

  18. Angela Schofield says:

    Hi Tanel,

    I have just read your blog entries for the cursor: pin S waits.

    I support some Oracle databases on Linux 10.2.0.4 for a client.

    On one of the databases, we have seen sporadic performance problems which have caused application issues for many months.

    I have raised multiple SRs with Oracle support, who have not been able to assist further because it is impossible to perform session tracing due to the nature of the web application, which uses a pool of connections to the database. We cannot identify the session which has the issue. The only thing Oracle have said is that the database itself seems to be under-utilized (from the AWR reports uploaded).

    The AWR reports show significant waits for cursor: pin S wait on X during periods where performance is poor. I think this may be significant. The AWR snapshots are running every 15 mins.

    The client finds that their application reports errors to the logs because some SQL statements which normally execute quickly take > 10 s to execute and may time out.

    I would like some advice on how I can investigate these waits further to find the root cause.
    Any assistance you could provide, would be much appreciated.

    Thanks,
    Angela

  19. Pingback: A million kilometers in two years… | Tanel Poder's blog: Core IT for Geeks and Pros

  20. Tanel Poder says:

    @Angela Schofield
    Angela, you should probably look into ASH reports as they give you session level detail and also the sampling is done every second, instead of 15-minute AWR snapshots. The scope and detail of your performance data must match the scope and detail of the problem!

    I’d run ASH report for as low as 1 minute time period (or write custom queries which report waits + cpu usage at less than 1 minute granularity) and see whether there are any spikes in CPU usage or waits during the application problem time and drill down from there.

  21. Tanel Poder says:

    Additionally, run something like sar or vmstat to capture CPU utilization and runqueue length *every second*. Yes, every second, not minute or 5 minutes what is usually done by monitoring tools. If your problem is measured in seconds, the performance data should be also measured every second. You might find out that while 5 minute sar reports show 20% CPU utilization, you have 100% utilization spikes with huge runqueues which last for a few seconds only…

  22. Angela Schofield says:

    Hi Tanel,

    thank you very much for your advice.

    Regards,
    Angela

  23. Pingback: Speaking engagements 2011 | Tanel Poder's blog: IT & Mobile for Geeks and Pros

  24. Robert says:

    Thank you very much for this information, Tanel. And congratulations on the blog, it is very informative, with detailed technical information.

  25. Richard says:

    Hi Tanel,

    The “select user from dual” statement causes “library cache: mutex X” contention.
    Is this normal behaviour? Do we need to apply a patch to resolve this issue?

    I appreciate your service to the Oracle community.

    Regards,
    Richard

    select * from v$version

    Oracle Database 11g Enterprise Edition Release 11.1.0.7.0 – 64bit Production
    PL/SQL Release 11.1.0.7.0 – Production
    CORE 11.1.0.7.0 Production
    TNS for IBM/AIX RISC System/6000: Version 11.1.0.7.0 – Production
    NLSRTL Version 11.1.0.7.0 – Production

    SQL:

    INST_ID ADDRESS          HASH_VALUE SQL_ID        COMMAND_TYPE PIECE SQL_TEXT
    1       070000007DD06B78 225524178  d6vwqbw6r2ffk 3            0     SELECT USER FROM DUAL
    wait event:

    P1 P1TEXT P2 P2TEXT P3 P3TEXT SPID PROCESS EVENT WAIT_CLASS

    225524178 idn 0 value 57 where 18772088 819426 library cache: mutex X Concurrency
    225524178 idn 0 value 57 where 17711190 823330 library cache: mutex X Concurrency
    225524178 idn 4.0802E+12 value 57 where 16851142 1237006 library cache: mutex X Concurrency

    mutex_sleep_history ( based on P1-idn)

    INST_ID MUTEX_IDENTIFIER SLEEP_TIMESTAMP MUTEX_TYPE GETS SLEEPS REQUESTING_SESSION BLOCKING_SID LOCATION MUTEX_VALUE P1 P1RAW P2 P3 P4 P5
    1 225524178 30-APR-11 06.25.28.323644 PM Cursor Pin 895106 38 981 950 kkslce [KKSCHLPIN2] 000003B600000000 0 00 0 0 0
    1 225524178 30-APR-11 06.25.05.143394 PM Cursor Pin 861746 45 969 981 kkslce [KKSCHLPIN2] 000003D500000000 0 00 0 0 0
    1 225524178 30-APR-11 05.26.44.973728 PM Cursor Pin 839251 33 950 980 kkslce [KKSCHLPIN2] 000003D400000001 0 00 0 0 0
    1 225524178 30-APR-11 04.50.46.454100 AM Cursor Pin 135508 4 988 971 kkslce [KKSCHLPIN2] 000003CB00000000 0 00 0 0 0
    1 225524178 30-APR-11 06.20.57.143668 PM Cursor Pin 864272 52 969 970 kkslce [KKSCHLPIN2] 000003CA00000000 0 00 0 0 0
    1 225524178 30-APR-11 06.25.51.204080 PM Cursor Pin 863854 39 950 969 kkslce [KKSCHLPIN2] 000003C900000000 0 00 0 0 0
    1 225524178 30-APR-11 04.54.08.571772 PM Cursor Pin 809666 22 950 928 kkslce [KKSCHLPIN2] 000003A000000000 0 00 0 0 0
    1 225524178 30-APR-11 06.25.49.394176 PM Cursor Pin 864774 29 969 950 kkslce [KKSCHLPIN2] 000003B600000000 0 00 0 0 0
    1 225524178 30-APR-11 06.25.21.093950 PM Cursor Pin 869850 39 950 981 kkslce [KKSCHLPIN2] 000003D500000000 0 00 0 0 0
    1 225524178 30-APR-11 05.27.24.592958 PM Cursor Pin 810896 35 969 980 kkslce [KKSCHLPIN2] 000003D400000000 0 00 0 0 0
    1 225524178 30-APR-11 04.50.01.705574 AM Cursor Pin 116153 12 969 971 kksLockDelete [KKSCHLPIN6] 0000000000000001 0 00 0 0 0
    1 225524178 30-APR-11 06.20.58.644283 PM Cursor Pin 866494 94 981 970 kkslce [KKSCHLPIN2] 00 0 00 0 0 0
    1 225524178 30-APR-11 06.25.23.269811 PM Cursor Pin 892968 28 981 969 kkslce [KKSCHLPIN2] 000003C900000000 0 00 0 0 0
    1 225524178 30-APR-11 04.54.29.441458 PM Cursor Pin 807424 29 950 928 kkslce [KKSCHLPIN2] 000003A000000000 0 00 0 0 0
    1 225524178 30-APR-11 06.25.49.421368 PM Cursor Pin 869598 39 969 950 kkslce [KKSCHLPIN2] 000003B600000000 0 00 0 0 0
    1 225524178 30-APR-11 06.25.33.891107 PM Cursor Pin 891374 36 969 981 kkslce [KKSCHLPIN2] 000003D500000000 0 00 0 0 0
    1 225524178 30-APR-11 05.26.01.979042 PM Cursor Pin 837950 25 970 980 kkslce [KKSCHLPIN2] 000003D400000000 0 00 0 0 0
    2 225524178 29-APR-11 11.30.24.124146 PM Cursor Pin 18857 134 1027 1002 kkslce [KKSCHLPIN2] 000003EA00000000 0 00 0 0 0
    2 225524178 29-APR-11 11.36.33.932762 PM Cursor Pin 2128672 3562 959 1010 kkslce [KKSCHLPIN2] 000003F200000001 0 00 0 0 0
    2 225524178 29-APR-11 11.35.30.576728 PM Cursor Pin 2161583 5145 1010 994 kkslce [KKSCHLPIN2] 000003E200000001 0 00 0 0 0
    2 225524178 29-APR-11 11.30.17.443374 PM Cursor Pin 37186 3 944 946 kkslce [KKSCHLPIN2] 000003B200000000 0 00 0 0 0
    1 225524178 30-APR-11 06.21.09.475763 PM Cursor Pin 864712 29 950 970 kkslce [KKSCHLPIN2] 000003CA00000000 0 00 0 0 0
    1 225524178 30-APR-11 06.25.25.829483 PM Cursor Pin 895066 38 950 969 kkslce [KKSCHLPIN2] 000003C900000000 0 00 0 0 0
    1 225524178 30-APR-11 05.47.15.303972 PM Cursor Pin 858702 34 950 0 kkslce [KKSCHLPIN2] 00 0 00 0 0 0
    1 225524178 30-APR-11 06.25.55.444597 PM Cursor Pin 865178 32 981 950 kkslce [KKSCHLPIN2] 000003B600000000 0 00 0 0 0
    1 225524178 30-APR-11 06.25.51.559740 PM Cursor Pin 864136 39 969 981 kkslce [KKSCHLPIN2] 00 0 00 0 0 0
    1 225524178 30-APR-11 05.27.34.686745 PM Cursor Pin 829040 41 969 980 kkslce [KKSCHLPIN2] 000003D400000000 0 00 0 0 0
    1 225524178 30-APR-11 04.50.18.400633 AM Cursor Pin 145131 9 981 971 kkslce [KKSCHLPIN2] 000003CB00000001 0 00 0 0 0
    1 225524178 30-APR-11 06.19.32.917632 PM Cursor Pin 890128 26 969 970 kkslce [KKSCHLPIN2] 000003CA00000000 0 00 0 0 0
    1 225524178 30-APR-11 06.25.37.957065 PM Cursor Pin 895320 31 950 969 kkslce [KKSCHLPIN2] 000003C900000000 0 00 0 0 0
    1 225524178 30-APR-11 04.54.50.437123 PM Cursor Pin 811647 29 980 928 kksLockDelete [KKSCHLPIN6] 000003A000000002 0 00 0 0 0
    1 225524178 30-APR-11 06.25.55.273357 PM Cursor Pin 895588 31 981 950 kkslce [KKSCHLPIN2] 000003B600000000 0 00 0 0 0
    1 225524178 30-APR-11 06.25.44.996870 PM Cursor Pin 895490 21 950 981 kkslce [KKSCHLPIN2] 000003D500000000 0 00 0 0 0
    1 225524178 30-APR-11 05.26.37.849250 PM Cursor Pin 806046 54 969 980 kkslce [KKSCHLPIN2] 000003D400000000 0 00 0 0 0
    1 225524178 30-APR-11 04.50.44.632959 AM Cursor Pin 131970 8 1026 971 kksLockDelete [KKSCHLPIN6] 000003CB00000002 0 00 0 0 0
    1 225524178 30-APR-11 06.21.09.033790 PM Cursor Pin 863064 82 969 970 kkslce [KKSCHLPIN2] 000003CA00000000 0 00 0 0 0
    1 225524178 30-APR-11 06.25.52.360874 PM Cursor Pin 892624 30 981 969 kkslce [KKSCHLPIN2] 000003C900000000 0 00 0 0 0
    1 225524178 30-APR-11 04.54.42.089257 PM Cursor Pin 810719 33 950 928 kkslce [KKSCHLPIN2] 000003A000000001 0 00 0 0 0
    1 225524178 30-APR-11 06.25.49.124653 PM Cursor Pin 880220 41 969 950 kkslce [KKSCHLPIN2] 000003B600000000 0 00 0 0 0
    1 225524178 30-APR-11 06.25.55.200082 PM Cursor Pin 892892 25 950 981 kkslce [KKSCHLPIN2] 000003D500000000 0 00 0 0 0
    1 225524178 30-APR-11 05.26.12.258306 PM Cursor Pin 807380 46 969 980 kkslce [KKSCHLPIN2] 000003D400000000 0 00 0 0 0
    1 225524178 30-APR-11 04.50.35.378955 AM Cursor Pin 148328 5 1026 971 kkslce [KKSCHLPIN2] 000003CB00000000 0 00 0 0 0
    1 225524178 30-APR-11 06.21.02.330949 PM Cursor Pin 890826 22 969 970 kkslce [KKSCHLPIN2] 000003CA00000000 0 00 0 0 0
    1 225524178 30-APR-11 06.25.54.174086 PM Cursor Pin 894814 40 981 969 kkslce [KKSCHLPIN2] 000003C900000000 0 00 0 0 0
    1 225524178 30-APR-11 05.04.04.939530 PM Cursor Pin 815446 39 970 0 kkslce [KKSCHLPIN2] 00 0 00 0 0 0
    1 225524178 30-APR-11 06.25.44.907603 PM Cursor Pin 880618 30 981 950 kkslce [KKSCHLPIN2] 00 0 00 0 0 0
    1 225524178 30-APR-11 06.25.55.372206 PM Cursor Pin 895540 29 969 981 kksLockDelete [KKSCHLPIN6] 000003D500000002 0 00 0 0 0
    1 225524178 30-APR-11 05.25.55.586330 PM Cursor Pin 835508 34 950 980 kkslce [KKSCHLPIN2] 000003D400000000 0 00 0 0 0
    1 225524178 30-APR-11 06.20.30.675408 PM Cursor Pin 891070 34 950 970 kkslce [KKSCHLPIN2] 000003CA00000000 0 00 0 0 0
    1 225524178 30-APR-11 06.25.38.179841 PM Cursor Pin 871418 22 981 969 kkslce [KKSCHLPIN2] 000003C900000000 0 00 0 0 0
    1 225524178 30-APR-11 04.54.49.932627 PM Cursor Pin 810910 37 950 928 kkslce [KKSCHLPIN2] 000003A000000000 0 00 0 0 0
    1 225524178 30-APR-11 08.37.35.814288 AM Cursor Pin 321836 18 980 935 kkslce [KKSCHLPIN2] 000003A700000000 0 00 0 0 0
    1 225524178 30-APR-11 06.25.55.064435 PM Cursor Pin 862588 39 969 950 kkslce [KKSCHLPIN2] 000003B600000000 0 00 0 0 0
    1 225524178 30-APR-11 06.25.47.679395 PM Cursor Pin 869738 24 969 981 kkslce [KKSCHLPIN2] 000003D500000000 0 00 0 0 0
    1 225524178 30-APR-11 06.19.38.696734 PM Cursor Pin 863078 62 950 970 kkslce [KKSCHLPIN2] 00 0 00 0 0 0
    1 225524178 30-APR-11 06.25.53.295681 PM Cursor Pin 862882 39 981 969 kkslce [KKSCHLPIN2] 000003C900000000 0 00 0 0 0
    1 225524178 30-APR-11 04.54.53.425947 PM Cursor Pin 810954 40 980 928 kkslce [KKSCHLPIN2] 000003A000000000 0 00 0 0 0
    1 225524178 30-APR-11 06.25.35.785176 PM Cursor Pin 879221 35 981 950 kkslce [KKSCHLPIN2] 000003B600000001 0 00 0 0 0
    1 225524178 30-APR-11 06.25.02.448874 PM Cursor Pin 869070 29 969 981 kkslce [KKSCHLPIN2] 000003D500000000 0 00 0 0 0
    1 225524178 30-APR-11 05.25.45.707055 PM Cursor Pin 838490 24 970 980 kksl
    ..

  26. Tanel Poder says:

    @Richard

    How frequently (and by how many sessions) is this query executed?

    I would also check how many child cursors (versions) of that query you have… maybe you’re hitting a bug which causes a lot of child cursors to be created and that contributes to the contention.

    Otherwise check the “library cache mutex X” wait event note in metalink, there are a lot of bugs reported there…

  27. Chris French says:

    Very interesting article and insights. We’ve experienced random performance problems and application server hangs. Around these times, we’ve noticed the pin s wait on cursor X wait events topping out on our awr reports. We’ve also seen stats running at the same time too, but not sure if it’s coincidence or not. I don’t understand why the exclusive lock mutex holder would be kicked off the cpu though if it’s just parsing a simple query. I’m just not understanding why a mutex holder would get stuck on a parse. Of course, I don’t know why the parse is happening either, maybe due to the stats job invalidating sql. Seems like if I’m seeing this event on mass scale then something is seriously wrong, such as a deadlock, perhaps a bug. I mean we’re seeing these waits for like 20 minutes. Just wondering if this thread priority issue seems possible. We are on AIX and I did ps -efl and noticed the oracle processes have different priorities, so I think it’s a possibility. Thoughts? Thanks.

  28. Pingback: Mutex waits. Part 1. “Cursor: Pin S” in Oracle 10.2-11.1. Invisible and aggressive. « Latch, mutex and beyond

  29. Guilherme says:

    Hey Tanel,

    Would you consider a program run by cron to set Oracle processes’ priorities as a workaround for this problem?

    I believe I am seeing it on a different OS.

    [user@host /home/user/]$ uname -a
    Linux jbscmp01 2.6.18-194.32.1.0.1.el5xen #1 SMP Tue Jan 4 16:56:20 EST 2011 x86_64 x86_64 x86_64 GNU/Linux

  30. Pingback: Mutex waits. Part II. “Cursor: Pin S” in Oracle 11.2 _mutex_wait_scheme=0. Steps out of shadow. « Latch, mutex and beyond

  31. Pingback: SCHED_NOAGE and latch contention « Oracle

  32. Joe Cornell says:

    I remember seeing a very similar problem in 9i that we solved (in HP-UX Risc) by using a kernel setting implemented via a parameter – hpux_sched_noage.
    After that episode, I have always insisted we set this parameter on all HP-UX databases.

  33. Richard Rankin says:

    Tanel, For an interesting tale of a priority inversion problem, do a search on the following:

    What really happened on Mars?

    Then follow the trail.

  34. Pingback: AWR reports: interpreting CPU usage « Oracle Diagnostician

  35. Pingback: The case of a stuck query (mutexes in 10g) « Oracle Diagnostician

  36. Pingback: AWR report: load profile « Oracle Diagnostician

  37. Madhu says:

    Hi – I experienced the same issue as below… and our Siebel application is suffering from severe performance issues these days.
    Active% | SQL_ID | EVENT | WAIT_CLASS
    ———————————————————————–
    2300% | 3xn2uw6m87unz | cursor: pin S wait on X | Concurrency

    • Tanel Poder says:

      Your issue may be different from the one I described, as my issue was “cursor: pin S”, while yours is “cursor: pin S wait on X”. Someone holds some frequently used cursor pin in X mode and others can’t even use that cursor. It could be many things, but usually I take a look into V$SGA_RESIZE_OPS then, to make sure that SGA_TARGET / ASMM hasn’t shrunk the shared pool around the time the spike happened. Another “usual suspect” would be some library cache child cursor “leak” where new child cursors are constantly created under a parent – search V$SQL for parent cursors with thousands of children under them. But it could be a number of other things / bugs; see the child cursor query sketch below.
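
      For the child cursor check, a minimal sketch counting children per parent in V$SQL, as mentioned above:

        -- parent cursors with an unusually high number of child cursors
        SELECT sql_id, COUNT(*) AS child_cursors
        FROM   v$sql
        GROUP  BY sql_id
        ORDER  BY child_cursors DESC;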

      • raza says:

        Hi Tanel,

        Very nice and informative blog. Currently I am also facing such issues. Could you please let me know where exactly I need to set this parameter, hpux_sched_noage?
        Thanks

        • Tanel Poder says:

          Hi Raza, just search for the hpux_sched_noage in Oracle documentation, it’s all there. And read a few of the top hits when you search for the same in MOS too, that’s how I approach new (and unknown to me) parameter changes…

  38. Pingback: Logon Storm | louis liu 的镜像blog

  39. Pingback: VMCD.ORG » Blog Archive » Logon Storm

  40. Pingback: Is CPU usage 100% really okay? | Oracle Diagnostician
