cursor: pin S waits, sporadic CPU spikes and systematic troubleshooting

Tanel Poder

2010/04/21

I recently consulted for a big telecom and helped solve a sporadic performance problem that had troubled them for some months. It was an interesting case, as it happened at the Oracle / OS touchpoint and was a product of multiple “root causes”, not just one: an early Oracle mutex design bug and a Unix scheduling issue. That’s why it had been hard to resolve earlier, despite multiple SRs opened etc.

Martin Meyer, their lead DBA, has posted some info about the problem and its technical details, so before going on, you should read his blog entry first and then my comments below:

Problem:

So, the problem was that occasionally the critical application transactions, which should have taken a very short time in the database (<1 s), took 10-15 seconds or even longer and timed out.

Symptoms:

  1. When the problem happened, the CPU usage also jumped up to 100% for the duration of the problem (from a few tens of seconds up to a few minutes).
  2. In AWR snapshots (taken every 20 minutes), “cursor: pin S” popped into the Top 5 waits (around 5-10% of total instance wait time), sometimes together with “cursor: pin S wait on X” (which is a different thing), “latch: library cache” and, interestingly, “log file sync”. These waits then had much higher average wait times per wait occurrence than normal (tens or hundreds of milliseconds per wait, on average).
  3. The V$EVENT_HISTOGRAM view showed lots of “cursor: pin S” waits taking a very long time (over a second, some even 30+ seconds), which certainly isn’t normal (Martin has these numbers in his blog entry).
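The histogram symptom is worth dwelling on: even when the abnormally long waits are a tiny fraction of all wait occurrences, they stand out immediately in the per-bucket counts, whereas an overall average can bury them. Here is a minimal Python sketch of that reasoning, using made-up V$EVENT_HISTOGRAM-style bucket counts (not the real system’s numbers):

```python
# Illustrative V$EVENT_HISTOGRAM-style buckets for "cursor: pin S".
# WAIT_TIME_MILLI is the bucket's upper bound in milliseconds; WAIT_COUNT
# is how many waits fell into that bucket. All counts are made up.
buckets = [
    (1,     500_000),  # sub-millisecond waits: normal fast mutex gets
    (2,      20_000),
    (4,       5_000),
    (1024,       40),  # ~0.5-1 s waits: already suspicious for a mutex
    (2048,       25),  # ~1-2 s waits: clearly abnormal
    (32768,       5),  # ~16-32 s waits: matches the observed timeouts
]

total = sum(count for _, count in buckets)
# Buckets with an upper bound of 2048 ms and above hold waits of ~1 s or more.
long_waits = sum(count for ms, count in buckets if ms >= 2048)

print(f"waits of ~1s or longer: {long_waits} of {total} "
      f"({100.0 * long_waits / total:.4f}%)")
```

Even though the long waits here are well under 0.01% of all occurrences, the non-empty high buckets are the smoking gun: a shared mutex get should never take seconds, let alone tens of seconds.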

Note that AWR and OS CPU usage measurement tools are system-wide tools (as opposed to session-level tools).

Troubleshooting:

I can’t give you exact numbers or AWR data here, but will explain the flow of troubleshooting and reasoning.

This was not a trivial problem, as it happened at the Oracle / OS touchpoint and occurred not for a single reason, but as a product of multiple separate causes amplifying each other.

There are a few interesting, non-technical points here:

  1. When troubleshooting, don’t let performance tools like AWR (or any other tool!) tell you what your problem is! Your business and your users should tell you what the problem is, and the tools should only be used for symptom drilldown (this is what Cary Millsap has been constantly telling us). Note how I described the problem and the symptoms separately at the beginning of this post – and the problem was that some business transactions (system-wide) timed out because the database response time was 5-15 seconds!
  2. The detail and scope of your performance data must match at least the detail and scope of your performance problem!

    In other words, if your problem is measured in a few seconds, then your performance data should also be sampled at least every few seconds in order to be fully systematic. The classic issue in this case was that the 20-minute AWR reports still showed IO wait times as the main DB time consumers, but that was averaged over 20 minutes. Our problem happened severely and briefly, within a few seconds of that 20-minute window, so averaging and aggregating over a long period of time hid the extreme performance issue that occurred in a very short time.
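This averaging effect is easy to demonstrate numerically. The following Python sketch uses hypothetical figures (not the actual AWR data from this case) to show how a short, severe burst of mutex contention can look like a minor contributor when averaged over a whole 20-minute snapshot:

```python
# Hypothetical 20-minute (1200 s) AWR interval: IO waits accrue steadily
# throughout, while severe "cursor: pin S" contention happens in one
# 30-second burst. All numbers are made up for illustration.
interval_s = 1200
burst_s = 30

io_db_time = 2.0 * interval_s   # 2 s of IO wait per wall-clock second -> 2400 s
mutex_db_time = 200.0           # 200 s of mutex waits, all within the burst

# Averaged over the whole AWR interval, mutex waits look like a minor issue...
mutex_share = mutex_db_time / (io_db_time + mutex_db_time)
print(f"mutex wait share over 20 min: {mutex_share:.1%}")   # ~7.7%

# ...but during the 30 s burst itself, they dominate the instance wait time.
burst_io_db_time = 2.0 * burst_s    # only 60 s of IO waits fall in the burst
burst_share = mutex_db_time / (burst_io_db_time + mutex_db_time)
print(f"mutex wait share during burst: {burst_share:.1%}")  # ~76.9%
```

With these (invented) numbers, the same 200 seconds of mutex waits show up as under 8% of wait time in the 20-minute report, yet account for over three quarters of the wait time during the seconds when the users were actually suffering. That is why the sampling interval must match the time scale of the problem.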

Next time, when it seems impossible to diagnose a problem and the troubleshooting effort ends up going in circles, you should ask “what’s the real problem, and who is experiencing it, and how?” and see whether your performance data’s detail and scope match that problem!

Oh, this is a good point to mention that in addition to my Advanced Oracle Troubleshooting/SQL Tuning seminars I also actually perform advanced Oracle troubleshooting consulting too! I eat mutexes for breakfast ;-)

Update: Srivenu Kadiyala has experienced the same problem and has written about it here.