Another use case for WaitProf – diagnosing “events in waitclass Other”

I recently diagnosed a performance issue where the “events in waitclass Other” occasionally took significant part of the session’s response time. For example Snapper (which reads wait event data from V$SESSION_EVENT) reported that during measuring 39.9% of the response time was spent on “events in waitclass Other”.

SQL> @sn 1 119

-- Session Snapper v1.07 by Tanel Poder ( http://www.tanelpoder.com )

---------------------------------------------------------------------------------------------------------------------------------------------
HEAD,     SID, SNAPSHOT START   ,  SECONDS, TYPE, STATISTIC                               ,         DELTA,  DELTA/SEC,     HDELTA, HDELTA/SEC
---------------------------------------------------------------------------------------------------------------------------------------------
DATA,     119, 20080621 05:22:05,        1, STAT, session logical reads                   ,         18284,      18284,     18.28k,     18.28k
DATA,     119, 20080621 05:22:05,        1, STAT, consistent gets                         ,         15301,      15301,      15.3k,      15.3k
DATA,     119, 20080621 05:22:05,        1, STAT, consistent gets from cache              ,         15228,      15228,     15.23k,     15.23k
DATA,     119, 20080621 05:22:05,        1, STAT, consistent gets from cache (fastpath)   ,         15136,      15136,     15.14k,     15.14k
DATA,     119, 20080621 05:22:05,        1, STAT, calls to get snapshot scn: kcmgss       ,            89,         89,         89,         89
DATA,     119, 20080621 05:22:05,        1, STAT, no work - consistent read gets          ,         14883,      14883,     14.88k,     14.88k
DATA,     119, 20080621 05:22:05,        1, STAT, table scans (short tables)              ,            21,         21,         21,         21
DATA,     119, 20080621 05:22:05,        1, STAT, table scan rows gotten                  ,       1429227,    1429227,      1.43M,      1.43M
DATA,     119, 20080621 05:22:05,        1, STAT, table scan blocks gotten                ,         17440,      17440,     17.44k,     17.44k
DATA,     119, 20080621 05:22:05,        1, WAIT, events in waitclass Other               ,        399831,     399831,   399.83ms,   399.83ms
--  End of snap 1

From Oracle 10g Oracle has consolidated lots of the events into “events in waitclass Other”. This is because saving all 900+ wait event stats for every session (in V$SESSION_EVENT array) would waste too much memory with giving little benefit (normally there’s only a handful of troublemaking events anyway). Therefore makes sense to aggregate the least likely happening events under some common category. Looks like Oracle kernel coders have set a threshold in event number above which all events are grouped under the “other” waitclass.

See below, this is from 11g:

[Read more...]

Killing an Oracle process from inside Oracle

I had a following situation few days ago – I was running a CREATE TABLE AS SELECT over a heterogenous services dblink. However I cancelled this command via pressing CTRL+C twice in Windows sqlplus (this actually just kills the client sqlplus and not the call).

Anyway, when I wanted to drop that table involved, this happened:

SQL> drop table MYTABLE;
drop table MYTABLE
           *
ERROR at line 1:
ORA-00054: resource busy and acquire with NOWAIT specified

I can’t drop a table as someone is holding a lock on it. Fair enough, this was a dev environment used only by me, so I used DBA_OBJECTS.OBJECT_ID to find out the object ID of that table:

SQL> @o MYTABLE

owner                     object_name                    object_type        CREATED           LAST_DDL_TIME     status           OID      D_OID
------------------------- ------------------------------ ------------------ ----------------- ----------------- --------- ---------- ----------
XYZ_DEV01_OWNER           MYTABLE                        TABLE              20080616 11:08:44 20080616 11:08:44 VALID          63764      63764

…and then I queried what enqueue locks were held on that object:

SQL> select * from v$lock where id1=63764;

ADDR     KADDR           SID TY        ID1        ID2      LMODE    REQUEST      CTIME      BLOCK
-------- -------- ---------- -- ---------- ---------- ---------- ---------- ---------- ----------
40034278 40034290        130 TM      63764          0          6          0       2662          0

Ok, I see session 130 holding a TM lock on that table. I queried the corresponding SERIAL# from v$session as well and killed the session:

SQL> alter system kill session '130,8764';
alter system kill session '130,8764'
*
ERROR at line 1:
ORA-00031: session marked for kill

SQL> select * from v$lock where id1=63764;

ADDR     KADDR           SID TY        ID1        ID2      LMODE    REQUEST      CTIME      BLOCK
-------- -------- ---------- -- ---------- ---------- ---------- ---------- ---------- ----------
40034278 40034290        130 TM      63764          0          6          0       2668          0

After hanging for 60 seconds, my kill command gave up (and marked my session for kill), but my lock was still not released… Now what?

[Read more...]

cursor_space_for_time To Be Deprecated

If you haven’t seen the Meatlink note 565424.1 in the news yet, cursor_space_for_time parameter will be deprecated in Oracle 10.2.0.5 and 11.1.0.7.

That’s kind of good news, I hope this will eventually reduce the number of expert DBAs who set this parameter to true whenever they see any kind of shared pool / library cache latch contention.

On the other hand, spin_count was made an undocumented parameter long time ago, but is still heavily abused worldwide so I wouldn’t be surprised if the same happens to future _cursor_space_for_time…

Advanced Oracle Troubleshooting Guide, Part 6: Understanding Oracle execution plans with os_explain

Get ready for some more adventures in Oracle process stack!

Before proceeding though, please read this post about safety of different stack sampling approaches.

I have had few non-trivial Oracle troubleshooting cases, related to query hangs and bad performance, where I’ve wanted to know where exactly in execution plan the current execution is.
Remember, Oracle is just another program executing instructions clustered in functions on your server, so stack sampling can help out here as well.

So, I was looking into the following stack trace taken from an Oracle 10.1 database on Solaris SPARC, running a SQL with this execution plan.

[Read more...]

Debugger dangers

Whenever I deliver training or conference presentations on advanced troubleshooting topics, I usually spend some time demonstrating how to get and interpret Oracle server process stack traces.
As I’ve mentioned before, stack traces are the ultimate indicators showing where in Oracle kernel (or whatever application) code the execution currently is (or where it was when a crash occurred). This is the reason Oracle Support asks for stack traces whenever there’s a crash or non-trivial hang involved, that’s why Oracle database dumps errorstacks when ORA-600′s and other exceptions occur.

There are multiple ways for getting stack traces for Oracle, but not all ways are equal. Some give you more contextual info, some less, but what I’m blogging about today is that some ways are less safe than others.

I was using pstack on Linux for diagnosing an IO related performance issue. I executed a create table as select statement and ran pstack in a loop for getting stack traces from the running process.

However in one of the test runs I got following error in my Oracle session:

SQL> create table t as select * from dba_source;
create table t as select * from dba_source
                                *
ERROR at line 1:
ORA-01115: IO error reading block from file 1 (block # 11161)
ORA-01110: data file 1: '/u01/oradata/LIN10G/system01.dbf'
ORA-27091: unable to queue I/O
ORA-27072: File I/O error
Additional information: 3
Additional information: 11145
Additional information: 32768

I suspected that this issue was due Linux pstack, stopped the pstack script and ran my CTAS from the same Oracle session again:

[Read more...]

Short note on KGX Mutexes

I received a question on what’s the point of the use of Mutexes for Oracle cursors in library cache. For short intro, I’m pasting one of my fairly recent answers in Oracle forums about Oracle mutexes here:

In Oracle, latches and mutexes are different things and managed using different modules. KSL* modules for latches and KGX* for mutexes.

General mutex operatins require less CPU instructions than latch operations (as they aren’t as sophisticated as latches and don’t maintain get/miss counts as latches do).

But the main scalability benefit comes from that there’s a mutex structure in each child cursor handle and the mutex itself acts as cursor pin structure. So if you have a cursor open (or cached in session cursor cache) you don’t need to get the library cache latch (which was previously needed for changing cursor pin status), but you can modify the cursor’s mutex refcount directly (with help of pointers in open cursor state area in sessions UGA).

Therefore you have much higher scalability when pinning/unpinning cursors (no library cache/library cache pin latching needed, virtually no false contention) and no separate pin structures need to be allocated/maintained.

Few notes:

  1. library cache latching is still needed for parsing etc, the mutexes address only the pinning issue in library cache
  2. mutexes are currently used for library cache cursors (not other objects like PL/SQL stored procs, table defs etc)
  3. As mutexes are a generic mechanism (not library cache specific) they’re used in V$SQLSTATS underlying structures too
  4. When mutexes are enabled, you won’t see cursor pins from X$KGLPN anymore (as X$KGLPN is a fixed table based on the KGL pin array – which wouldn’t be used for cursors anymore)

Generating lots of rows using connect by – safely!

Every now and then I need to generate a large number of rows into some table.

Not that I have any personal urge to do so, but you know, this is needed for some test cases and so on ;-)

It’s quite well known that it’s possible to generate lots of rows using CONNECT BY LEVEL <= #rows syntax.

However as the CONNECT BY LEVEL syntax is recursive by nature, it will consume more memory the more rows you query in one shot ( I think it was Mark Bobak who once posted this observation to Oracle-L, but I didn’t manage to find that thread anymore ).

So, here’s a test case:

SQL> select count(r)
  2  from (
  3     select rownum r from dual connect by rownum <= 100000000
  4  )
  5  /
        select rownum r from dual connect by rownum <= 100000000
                             *
ERROR at line 3:
ORA-04030: out of process memory when trying to allocate 44 bytes (kxs-heap-w,cursor work heap)

After running for a while the server process run out of private memory, used for the CONNECT BY cursor work heap.

Let’s investigate:

[Read more...]

Advanced Oracle Troubleshooting Guide, Part 5: Sampling V$ stuff with WaitProf. Really fast. Using SQL!

I bet you thought I’ll be writing about direct SGA access?! ;)

Nope!

Direct SGA access has excellent troubleshooting potential (as long as you know the shared memory data structures), but it has one major drawback – very few companies have such tools already in place in their production systems.

I have occasionally been called in to solve an urgent performance problem, happening right now and it needs solving immediately! And did I mention, these are critical production systems. Where you can’t just install binary executables freshly downloaded off internet. In fact you would want to diagnose the issue with minimal impact and changes required to those production environments (and that leaves sql tracing out the first round troubleshooting tools for me as well!)

So, I’ve developed myself a toolset for such purpose, Snapper and sw.sql and some process stack reading techniques I already have introduced in my blog.

Next in line is waitprof.sql which is a high-frequency V$SESSION_WAIT sampler – implemented in plain SQL (not PL/SQL).

Waitprof is basically a sampling session wait profiler. It’s like running a select against V$SESSION_WAIT in a very tight loop and aggregating results – but I have used a trick to do all this in plain SQL, which gives me performance advantage over PL/SQL based loops. Waitprof is able to sample V$SESSION_WAIT for a session up to 100 000 times per second!

This depends on your hardware of course and Oracle version too, but normally you’ll get 50-70kHz sampling rate with it.

Ok, you want to see an example? ;-)

[Read more...]

Performance Tools Quick Reference Guide

There’s a nice Metalink Note 438452.1 about various less known Oracle performance tuning utilities.

If you haven’t heard about things like StackX, LTOM, HangFG, SQLTXPLAIN, OS_Watcher or OPDG then it’s time to check this note out! :)

Advanced Oracle Troubleshooting Guide, Part 4: Diagnosing a long parsing issue

There was a recent thread in Oracle Forums about a session getting stuck somewhere when a specific SQL was issued. The SQL executed did not return at all unless ORDERED hint was used. Even the EXPLAIN PLAN command (which only parses the statement, doesn’t execute it) did never return.

Classic tracing + tkprof techniques didn’t show much (just some recursive queries consuming insignificant amounts of time).

The proven V$SESSION_WAIT sampling technique didn’t reveal anything as it showed the session being constantly on CPU (the wait state = ‘WAITED KNOWN TIME’ which means session is on CPU) and SEQ# didn’t increase (which means that wait state did not change over time).

Due the symptoms described above I was well prepared to troubleshoot this issue. This looks exactly like one of the troubleshooting use cases I demonstrate in of my Advanced Oracle Troubleshooting class (nice embedded advertisment, huh? ;)

In such a case where tracing and V$ views don’t provide any useful information about what the session is doing, I normally look into few stack traces of the server process. In this case I asked the poster to do this and here is the result:

[Read more...]