Advanced Oracle Troubleshooting Guide, Part 3: More adventures in process stack

…or rather thread stack as nowadays decent operating systems execute threads (or tasks as they’re called in Linux kernel).

Anyway, stack trace gives you the ultimate truth on what your program is doing, exactly right now. There are couple of but’s like stack corruptions and missing symbol information which may make the traces less useful for us, but for detailed hang & performance troubleshooting the stack traces are a goldmine.

So, I present another case study – how to diagnose a complete database hang when you can’t even log on to the database.

[Read more...]

Operating systems are lazy allocating memory

There was a discussion about whether Oracle really allocates all memory for SGA immediately on instance startup or not. And further, whether Oracle allocates memory beyond the SGA_TARET if SGA_MAX_SIZE is larger than it.
It’s worth reading this thread first: http://forums.oracle.com/forums/thread.jspa?threadID=535400&tstart=0

I will paste an edited version of my reply to here as well:

[Read more...]

Advanced Oracle Troubleshooting Guide, Part 2: No magic is needed, systematic approach will do

There are two ways for diagnosing problems:

  1. Checking for usual suspects and hoping to find a matching one
  2. Following a systematic approach

Checking for usual suspects and hoping to find a matching one

The first approach relies on previous experience (both in particular subject area/technology and about the context/environment the problem occurs). For example if a patient comes to doctor complaining about pain in chest, then for doctor (and also for the patient) it would definitely be beneficial to know more relevant info about the patient – the context. If the patient had just fell off a 10-foot ladder, then it’d be more suitable to look for broken ribs. On the other hand, if the patient has been a long-time smoker and was watching TV on a couch when the pain started, then perhaps it’d be more suitable to start with an EKG (note that I’m not an expert on how human body works so should anybody complain about any pain in their chest to you, send them to real doctor immediately!)

Anyway, if you’ve been administering a database full time for last 5 years, you will probably know where to look immediately when a specific problem occurs. Note that I wrote “where to look immediately” here, not “what to change immediately”. Using previous experience to identify root causes of problems is obviously a perfectly valid approach which may get you to the solution very fast (again and again) – but it can work well only if you do have lots of previous experience in solving problems for that technology and that particular environment (in other words, you know the context). The big risk here is that if a “new” problem expresses itself in similar symptoms like the “old-and-well-known-problem”, we could easily end up looking for and fixing the wrong issue. And if that doesn’t work, continuing to try out a solution which helped with another problem last year. And if that doesn’t help then the cycle continues, we dig up even more unlikely fixes which have been useful once in past and we apply them. And then we resort to googling and trying out whatever solutions anyone has suggested for problems others have experienced.

We have ended up in Desperate Switch Flipping state. Usually this leads to flipping even larger switches, starting from Oracle session/instance parameters and adding random SQL hints up to restarting servers, upgrading databases, operating systems, hardware – usually without any luck. This wastes time, doesn’t solve our problem and may cause even more trouble. This is not good. Checking out the usual suspects may help solving common recurring problems (hey, shouldn’t good specialists avoid recurring problems rather than fixing them again and again?) . However it is very important to draw the line between checking for usual suspects and falling into DSF state, as looks like happened to someone at Oracle-L.

Quoting:

“I ran Statspack reports at the highest level of detail until I was blue in the face. I ran traces. I set events. But I also am by nature intuitive and tend often to use intuition to solve a problem with facts to back up my intuitive conclusion. So after providing all of this stuff to Oracle Support, they were at a loss, well, they were very eager to look at corruption as a cause, because they didn’t have another solution.”

(Btw, I think the DSF state would be a suitable addition to pathological DBA problems list which Gaja once started with CTD)

The other way to diagnose problems is to follow a systematic approach, an appropriate methodology based on knowledge how computers work.

[Read more...]

Advanced Oracle Troubleshooting Guide: When the wait interface is not enough [part 1]

Welcome to read my first real post on this blog!

If I ever manage to post any more entries, the type and style of content will be pretty much as this one: some Oracle problem diagnosis and troubleshooting techniques with some OS and hardware touch in it. Mmm… internals ;-)

Nevertheless I am also a fan of systematic approaches and methods so I plan to propose some less known OS and Oracle techniques for reducing guesswork in advanced troubleshooting even further.

Ok, to the topic.
Troubleshooting. Troubleshooting = finding out what is going on.

This post covers one unexplained issue I once had with Oracle external tables – which eventually turned out to be a problem with Oracle wait interface instrumentation. I used some of these “what’s going on” techniques to find out… what’s going on. Solaris 10 x64 / Oracle 10.2.0.2.


I worked on a project for which I needed to read data through an external table from an Unix pipe ( ever wanted to load compressed flat file contents to Oracle on-the-fly? ;-)

I created a Unix pipe:

$ mknod /tmp/tmp_pipe p

I created an Oracle external table, reading from that pipe:


Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.2.0 - Production
With the Partitioning, OLAP and Data Mining options


USERNAME     INSTANCE_NAME    HOST_NAME                 VER        STARTED      SID SERIAL# SPID
------------ ---------------- ------------------------- ---------- -------- ------- ------- -------
TANEL        SOL01            solaris01                 10.2.0.2.0 20070618     470      14 724

Tanel@Sol01> CREATE DIRECTORY dir AS '/tmp';

Directory created.

Tanel@Sol01> CREATE TABLE ext (
  2     value number
  3  )
  4  ORGANIZATION EXTERNAL (
  5    TYPE oracle_loader
  6    DEFAULT DIRECTORY dir
  7      ACCESS PARAMETERS (
  8      FIELDS TERMINATED BY ';'
  9      MISSING FIELD VALUES ARE NULL
 10      (value)
 11      )
 12      LOCATION ('tmp_pipe')
 13    )
 14  ;

Table created.

Tanel@Sol01> select * from ext;

So far so good… unfortunately this select statement never returned any results. As it turned out later, the gunzip over remote ssh link which should have fed the Unix pipe with flat file data, had got stuck.
Without realizing that, I approached this potential session hang condition with first obvious check – a select from V$SESSION_WAIT:

Tanel@Sol01> select sid, event, state, seq#, seconds_in_wait, p1,p2,p3
  2  from v$session_wait
  3  where sid = 470;

    SID EVENT                          STATE                     SEQ# SECONDS_IN_WAIT         P1         P2         P3
------- ------------------------------ ------------------- ---------- --------------- ---------- ---------- ----------
    470 db file sequential read        WAITED KNOWN TIME          164            7338          1       1892          1

Tanel@Sol01> /

    SID EVENT                          STATE                     SEQ# SECONDS_IN_WAIT         P1         P2         P3
------- ------------------------------ ------------------- ---------- --------------- ---------- ---------- ----------
    470 db file sequential read        WAITED KNOWN TIME          164            7353          1       1892          1

Tanel@Sol01> /

    SID EVENT                          STATE                     SEQ# SECONDS_IN_WAIT         P1         P2         P3
------- ------------------------------ ------------------- ---------- --------------- ---------- ---------- ----------
    470 db file sequential read        WAITED KNOWN TIME          164            7374          1       1892          1

Tanel@Sol01>

The STATE and SECONDS_IN_WAIT columns in V$SESSION_WAIT say we have been crunching the CPU for last two hours, right? (as WAITED… means NOT waiting on any event, in this case the EVENT just shows the last event on which we waited before getting on CPU)

Hmm.. let’s check it out:

$ prstat -p 724
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
   724 oracle    621M  533M sleep   59    0   0:00:00 0.0% oracle/1

prstat reports that this process is currently in sleep state, is not using CPU and has used virtually no CPU during its 2-hour “run” time!

Let’s check with ps (which is actually a quite powerful tool):

$ ps -o user,pid,s,pcpu,time,etime,wchan,comm -p 724
    USER   PID S %CPU        TIME     ELAPSED            WCHAN COMMAND
  oracle   724 S  0.0       00:01    02:18:08 ffffffff8135cadc oracleSOL01

ps also confirms that the process 724 has existed for over 2 hours 18 minutes (ELAPSED), but has only used roughly 1 second of CPU time (TIME). The state column “S” also indicates the sleeping status.

So, either Oracle V$SESSION_WAIT or standard Unix tools are lying to us. From above evidence it is pretty clear that it’s Oracle who’s lying (also, in cases like that, lower-level instrumentation always has a better chance to know what’s really going on at the upper level than vice versa).

So, let’s use truss (or strace on Linux, tusc on HP-UX) to see if our code is making any system calls or is sleeping within a system call…

$ truss -p 724
read(14, 0xFFFFFD7FFD6FDE0F, 524273) (sleeping...)

Hmm, as no followup is printed to this line, it looks like the process is waiting for a read operation on a file descriptor 14 to complete.
Which file is this fd 14 about?

$ pfiles 724
724:    oracleSOL01 (LOCAL=NO)

...snip...

  14: S_IFIFO mode:0644 dev:274,2 ino:4036320452 uid:100 gid:300 size:0
      O_RDONLY|O_LARGEFILE
      /tmp/tmp_pipe
...snip...

So from here it’s already pretty obvious where the problem is. There is no data coming from the tmp_pipe. This led me to check what was my gunzip doing on the other end of the pipe and it was stuck, in turn waiting for ssh to feed more data into it. And ssh had got stuck due some network transport issue.

The baseline is that you can rely on low-level (OS) tools to identify what’s really going on when higher level tools (like Oracle wait interface) provide weird or contradicting information, in this case the Oracle wait interface was not recording external table read wait events. I reported this info to Oracle people and I think it has been filed as a bug by now.


This was only a simple demo, identifying a pretty clear case of a session hang, however with use of a pretty intrusive tool ( I would not attach truss to a busy production instance process without thinking twice ).

However there are other options. In the next part of this guide ( when I manage to write it ) I will deal with more complex problems like what to do when the session is not reporting significant waits and is spinning heavily on CPU. Using Oracle and Unix tools it is quite easy to figure out the execution profile of a spinning server process, even without connecting to Oracle at all ( do I hear pstack, mdb and stack tracing? ;-)

As I’ve just started blogging, I would appreciate any feedback, including about things like blog layout, font sizes, readability, understandability etc. Also I think it will take few days before I manage to post the Part 2 of this troubleshooting guide.

Thank you for your patience reading through this :-)