Debugger dangers

Whenever I deliver training or conference presentations on advanced troubleshooting topics, I usually spend some time demonstrating how to get and interpret Oracle server process stack traces.
As I’ve mentioned before, stack traces are the ultimate indicators of where in the Oracle kernel (or any other application) code the execution currently is, or where it was when a crash occurred. This is why Oracle Support asks for stack traces whenever a crash or non-trivial hang is involved, and why the Oracle database dumps errorstacks when ORA-600s and other exceptions occur.

There are multiple ways of getting stack traces for Oracle, but not all of them are equal. Some give you more contextual info, some less, but what I’m blogging about today is that some are less safe than others.

I was using pstack on Linux to diagnose an IO-related performance issue. I executed a create table as select (CTAS) statement and ran pstack in a loop to get stack traces from the running process.
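For readers who want to picture the sampling loop, here is a minimal sketch of the idea. It is not the script used in this test; the function name sample_stacks and its parameters are made up for illustration, and the only assumption about pstack is that it takes a PID as its argument. Keep in mind that on Linux every pstack call attaches a debugger to the target process, which is the very risk this post is about.

```python
import subprocess
import time


def sample_stacks(cmd, count, interval_s):
    """Run `cmd` `count` times, pausing `interval_s` seconds between runs.

    Returns a list with the captured stdout of each run.
    `cmd` is an argument list such as ["pstack", "12345"] (hypothetical PID).
    """
    samples = []
    for i in range(count):
        # check=False: a failed sample is kept (possibly empty) instead of raising
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
        samples.append(result.stdout)
        if i < count - 1:
            time.sleep(interval_s)
    return samples


# Example call (commented out on purpose, as it attaches to a live process):
# traces = sample_stacks(["pstack", "12345"], count=10, interval_s=1.0)
```

Taking the command as a list keeps the loop independent of the tracing tool, so the same sketch works with a safer tool on platforms that have one.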

However, in one of the test runs I got the following error in my Oracle session:

SQL> create table t as select * from dba_source;
create table t as select * from dba_source
                                *
ERROR at line 1:
ORA-01115: IO error reading block from file 1 (block # 11161)
ORA-01110: data file 1: '/u01/oradata/LIN10G/system01.dbf'
ORA-27091: unable to queue I/O
ORA-27072: File I/O error
Additional information: 3
Additional information: 11145
Additional information: 32768

I suspected that this issue was due to Linux pstack, so I stopped the pstack script and ran my CTAS from the same Oracle session again:

SQL> create table t as select * from dba_source;

Table created.

This time the command succeeded.

I tried to reproduce the failure again, but during the few minutes I spent testing, it didn’t occur again.

The failure most likely happened because pstack on Linux is actually just a wrapper script around gdb. GDB in turn suspends the process under investigation and attaches to it through the ptrace() syscall. And the ptrace() syscall (and debuggers in general) has historically caused issues for the traced process when it communicates with the kernel and other processes. For example, a debugger can block some signals or interrupts from being propagated back up to the “host” process.
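One way to see whether something is currently attached to a process via ptrace() on Linux is the TracerPid field in /proc/<pid>/status, which is 0 when no tracer is attached. The sketch below is only an illustration of that check; the helper names tracer_pid and read_tracer_pid are my own, and it assumes a Linux /proc filesystem.

```python
def tracer_pid(status_text):
    """Return the TracerPid value found in /proc/<pid>/status text, or None if absent."""
    for line in status_text.splitlines():
        if line.startswith("TracerPid:"):
            return int(line.split(":", 1)[1].strip())
    return None


def read_tracer_pid(pid):
    """Read /proc/<pid>/status (Linux only) and return the tracer's PID; 0 means no tracer."""
    with open(f"/proc/{pid}/status") as f:
        return tracer_pid(f.read())
```

A non-zero result while a stack sampling script is running would show that a debugger is attached to the process at that moment.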

I have long warned people about debugger-based stack tracing for exactly those reasons, and now I have managed to capture some nice evidence.
I recommend keeping debuggers away from critical background processes on production systems, unless things have already collapsed (that ought to be common sense anyway). So it’s good to know that Linux pstack is actually just a script with the GDB backtrace command in it.
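If you are unsure what the pstack on a given box really is, a rough check is to see whether the file is a script that mentions gdb. The sketch below is a heuristic only, under the assumption that a wrapper script starts with a "#!" line and contains the string gdb somewhere in its first 64 KB; the function names are hypothetical.

```python
import shutil


def is_gdb_wrapper_script(path):
    """Return True if the file starts with '#!' and mentions gdb in its first 64 KB."""
    with open(path, "rb") as f:
        head = f.read(65536)
    return head.startswith(b"#!") and b"gdb" in head


def check_pstack():
    """Locate pstack on PATH; return None if not found, else the heuristic result."""
    path = shutil.which("pstack")
    if path is None:
        return None
    return is_gdb_wrapper_script(path)
```

A True result suggests the tool should be treated like any other debugger. A False result proves nothing by itself, since a compiled binary could still use ptrace() internally.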

So, what are the safer and less safe stack tracing options?

  • oradebug dump errorstack – unsafe for production (as dump errorstack actually diverts the process under investigation from its original codepath)
  • debugger-based stack traces (gdb, dbx, mdb and Linux pstack) – can be unsafe for production due to missed signals and interrupts if you get unlucky. Therefore you should keep such tools away from at least the critical background processes
  • pstack on Solaris (and procstack on AIX) – safe, as they don’t use the ptrace() interface but just read the required info from the /proc filesystem
  • DTrace – safe by design

I haven’t checked how HP-UX pstack works, so can’t advise on that.



5 Responses to Debugger dangers

  2. Ganesh says:

    Hello:

    I suspect that the File I/O error during your pstack trace could be a mere coincidence.

    I have seen this in the past at client locations. Though there are published bugs for this if running 10.2.0.3 or lower on AIX, I have seen it in version 10.2.0.3 with the OS fully patched.

    The only explanation I can come up with is that the disk was too busy w/ many PX operations going on. (What? Why? How? .. I don’t have an answer).

    As this was very difficult to reproduce I left it at that.

    HTH,
    Ganesh

  3. tanelp says:

    Hi Ganesh,

    I don’t think this is a coincidence. I’ve used this kit for numerous long-running stress tests and have never hit this issue before.

    If it’s a coincidence, then a pretty big one! Decent (and well configured) OSes do not just reject IOs unless there’s some fundamental problem.

    Anyway, I’ll try to reproduce this some time in the future and will post an update if I get any results.

