Advanced Oracle Troubleshooting Guide – Part 12: control file reads causing enq: SQ – contention waits?

Vishal Desai systematically troubleshot an interesting case where the initial symptoms of the problem showed a spike of enq: SQ – contention waits, but he dug deeper – and found the root cause to be quite different. He followed the blockers of the waiting sessions manually to reach the root cause – and also used my @ash/ash_wait_chains.sql and @ash/event_hist.sql scripts to extract the same information more conveniently (note that he had modified the scripts to take AWR snap_ids as time range parameters instead of the usual date/timestamp).

Definitely worth a read if you’re into troubleshooting non-trivial performance problems :)
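
For reference, the core idea behind the wait chains approach can be sketched with a plain hierarchical query against AWR's ASH data. This is only a generic illustration (with hypothetical :begin_snap / :end_snap bind variables), not the actual ash_wait_chains.sql script, which also matches instance numbers and session serial#s:

    -- A generic sketch of the wait-chain idea (NOT the actual ash_wait_chains.sql):
    -- follow blocking_session links within each ASH sample and count the resulting
    -- event chains over an AWR snap_id range.
    SELECT wait_chain, COUNT(*) AS samples
    FROM (
        SELECT SYS_CONNECT_BY_PATH(NVL(event, 'ON CPU'), ' -> ') AS wait_chain
        FROM (
            SELECT sample_time, session_id, blocking_session, event
            FROM   dba_hist_active_sess_history
            WHERE  snap_id BETWEEN :begin_snap AND :end_snap
        )
        START WITH blocking_session IS NOT NULL
        CONNECT BY NOCYCLE PRIOR blocking_session = session_id
                       AND PRIOR sample_time      = sample_time
    )
    GROUP  BY wait_chain
    ORDER  BY samples DESC;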

NB! If you want to move to the "New World" - offload your data and workloads to Hadoop, without having to re-write your existing applications - check out Gluent. We are making history! ;-)


8 Responses to Advanced Oracle Troubleshooting Guide – Part 12: control file reads causing enq: SQ – contention waits?

  1. sudhir says:

    One moral of the story – old is still gold:

    Alert log errors should be taken seriously!

    ORA-19815: WARNING: db_recovery_file_dest_size of 858993459200 bytes is 100.00% used, and has 0 remaining bytes available.
    ARC1: Error 19809 Creating archive log file to '+RECO1'
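
    A quick way to confirm how full the recovery area actually is (a generic sketch using the standard V$ views):

    -- Overall fast recovery area limit and usage
    SELECT name,
           ROUND(space_limit/1024/1024/1024, 1)       AS limit_gb,
           ROUND(space_used/1024/1024/1024, 1)        AS used_gb,
           ROUND(space_reclaimable/1024/1024/1024, 1) AS reclaimable_gb
    FROM   v$recovery_file_dest;

    -- Breakdown by file type (archived logs, backups, flashback logs, ...)
    SELECT file_type, percent_space_used, percent_space_reclaimable, number_of_files
    FROM   v$recovery_area_usage;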

    I am thankful to this blog for reminding me of that

    Regards
    Sudhir

    • Laimis N says:

      Agree 100%. I am continuously repeating this to myself and my fellow DBAs: the first step is always to check the alert log. Then check the other log and trace files, OS included.
      Here is yet another moral in this story: never ever hit resource limits, for God's sake. Can you realistically expect developers to worry about database behavior in edge cases? Bring some order to your house, and then the database will run for years and years without major failures.

      • Tanel Poder says:

        Also, whether I check for a usual suspect depends on how easy (fast) it is to check. For example, if I do have Oracle database level access but don't have alert log access (either at the OS level, or the V$/X$ and other hacks are not feasible), then I will “get systematic” right away (which on the other hand means asking for OS level metrics anyway :-)
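
        One of those V$/X$ hacks, as a rough sketch (V$DIAG_ALERT_EXT is the documented view in 12.2+, with X$DBGALERTEXT underneath in 11g):

        -- Read recent ORA- errors from the alert log without OS access
        SELECT originating_timestamp, message_text
        FROM   v$diag_alert_ext
        WHERE  originating_timestamp > SYSDATE - 1
        AND    message_text LIKE '%ORA-%'
        ORDER  BY originating_timestamp;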

        The reason why I don’t *always* look into the alert.log is that often you have some immediately obvious issue popping up from wait event or ASH data (a long-running blocking session where the user hasn’t committed and is drinking coffee… or an execution plan flip, etc). The wait interface / ASH approach is more universal for narrowing down where to look.

  2. Olivier Bernhard says:

    Nice demonstration of why ash_wait_chains.sql is a real must-have tool.
    But from a systematic performance troubleshooting approach, maybe focusing on the very first AWR wait event “log file switch (archiving needed)” would have been a bit faster :).

  3. Tanel Poder says:

    Olivier, thanks for pointing this out, this deserves some clarification about what I think systematic (Oracle) troubleshooting is about.

    Yep, in this case the “Top 5 Events” approach would have worked, although the top event was still CPU. With a little background knowledge (that it’s not normal to wait for a log file switch for so long), one could focus their attention on the wait event instead of the actual top consumer – CPU. I call this checking the “usual suspects”, thanks to background knowledge about Oracle and your particular system.

    However, there are other cases where a single problematic blocker can cause many others to wait on some other, higher level wait event. For example, a slow “log file parallel write” increases “log file sync” waits, which increase the commit durations of recursive data dictionary operations (like the AUD$ audit record insert or SEQ$ update). This kind of stuff can show up as “enq: SQ – contention” or “buffer busy waits” at a higher level when many sessions are trying to do the same operations. In such a case you might not see the “log file switch (archiving needed)” wait event, nor anything in the alert log at all.
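
    As a rough illustration (a generic sketch, not what was used in this particular case), such a cascade can be confirmed in ASH by joining each sample waiting on the high-level event to its blocker's sample:

    -- For each ASH sample waiting on "enq: SQ - contention", show what the
    -- blocking session was doing in the same sample (single-instance sketch)
    SELECT w.event                 AS waiter_event,
           NVL(b.event, 'ON CPU')  AS blocker_event,
           COUNT(*)                AS samples
    FROM   v$active_session_history w
           LEFT JOIN v$active_session_history b
                  ON  b.sample_id       = w.sample_id
                  AND b.session_id      = w.blocking_session
                  AND b.session_serial# = w.blocking_session_serial#
    WHERE  w.event = 'enq: SQ - contention'
    GROUP  BY w.event, NVL(b.event, 'ON CPU')
    ORDER  BY samples DESC;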

    Yes, nowadays Oracle alert log (or LGWR trace) can dump out warnings when a log file parallel write takes more than X milliseconds, but again, this is just one of the many scenarios where a low level blocker indirectly causes many higher level waits to show up.

    So, a fully systematic approach would not rely on special cases where Oracle happens to have instrumented a single problem that may show up as a special wait event or a line in the alert.log. Instead it would measure every session's response time independently (and aggregate if needed), break it down by wait events, and whenever a session isn't doing useful work because it's blocked by some other session, recursively drill down to the blocker and its blocker, until reaching the “leaf node”, the ultimate blocker that causes all this.
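
    For the real-time case, Oracle (11gR2+) even pre-computes the end of such a chain for you; a minimal sketch using V$SESSION's FINAL_BLOCKING_SESSION columns:

    -- The "leaf node" blocker for each currently waiting, blocked session
    SELECT sid, serial#, event, blocking_session,
           final_blocking_session, final_blocking_session_status
    FROM   v$session
    WHERE  state = 'WAITING'
    AND    final_blocking_session IS NOT NULL;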

    Nevertheless, it’s ok to *start* from a quick check of any usual suspects, especially if they are immediately visible like in this particular “Top 5 Timed Events” report.

  4. Olivier Bernhard says:

    Hi Tanel,
    Fair enough :) Thanks a lot for this detailed clarification!

  5. Prasad Narvekar says:

    Hi Tanel,
    We also had a strange issue in our production environment related to “control file sequential read” waits. We had a query which used temp table transformation. We had taken one of the tempfiles in one of the temp tablespaces offline due to a block corruption issue. The query's session used to spend time on the “control file sequential read” wait event. After a lot of debugging, we realised that the query only took that long when its session used the temp tablespace which had the offline tempfile. The session kept checking the status of that tempfile to see if it had come back online. We removed that temp tablespace from the temp tablespace group and that fixed the issue.
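
    For anyone hitting something similar, a generic sketch of the checks and the fix described above (TEMP1 is just a hypothetical tablespace name):

    -- Tempfile status: an OFFLINE tempfile is what triggered the repeated checks
    SELECT file#, name, status FROM v$tempfile;

    -- Which tablespaces belong to which temporary tablespace group
    SELECT group_name, tablespace_name FROM dba_tablespace_groups;

    -- Remove the affected tablespace (hypothetical name TEMP1) from its group
    ALTER TABLESPACE temp1 TABLESPACE GROUP '';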
