Search Results for: snapper

Advanced Oracle Troubleshooting Guide, Part 8: Even more detailed latch troubleshooting using LatchProfX

In my last AOT post I published my LatchProf script which is able to sample detailed latchholder data from V$LATCHHOLDER.

Latchprof allows you to drill down into your latching problems at session level (which V$LATCH, V$LATCH_PARENT and V$LATCH_CHILDREN can’t do). It allows you to get valuable details about individual sessions who are holding a latch the most, therefore likely contributing to the latch contention problem the most.

However after you have discovered the troublemaking session, then what next? One way forward is looking into V$SESSTAT counters using Snapper tool. Depending on what latch is the problematic one, you would look for different stats like various buffer get stats for cache buffers chains latches and parsing/executing stats when looking into library cache latches. However if those stats look “normal”, is there any other way do drill down further?

Yeah, there is and lets look into it!

[Read more...]

Advanced Oracle Troubleshooting Guide, Part 7: Sampling latch holder statistics using LatchProf

I have been too busy since getting back from vacation, thus no posts for a while. But I hope the waiting was worthwhile as I present you LatchProf, a tool for digging in to latch contention problems – using plain SQL and sqlplus!

As, I’m still busy, I make it short.

LatchProf is a script similar to WaitProf, only it samples latch holder statistics from V$LATCHHOLDER. As V$LATCHHOLDER contains a SID column (with session ID of a latch holder) it becomes possible to find who is hitting a latch the most (a way to prove that crappy monitoring tools which constantly scan through V$SQL DO cause library cache latch contention themselves).

[Read more...]

Another use case for WaitProf – diagnosing “events in waitclass Other”

I recently diagnosed a performance issue where the “events in waitclass Other” occasionally took significant part of the session’s response time. For example Snapper (which reads wait event data from V$SESSION_EVENT) reported that during measuring 39.9% of the response time was spent on “events in waitclass Other”.

SQL> @sn 1 119

-- Session Snapper v1.07 by Tanel Poder ( http://www.tanelpoder.com )

---------------------------------------------------------------------------------------------------------------------------------------------
HEAD,     SID, SNAPSHOT START   ,  SECONDS, TYPE, STATISTIC                               ,         DELTA,  DELTA/SEC,     HDELTA, HDELTA/SEC
---------------------------------------------------------------------------------------------------------------------------------------------
DATA,     119, 20080621 05:22:05,        1, STAT, session logical reads                   ,         18284,      18284,     18.28k,     18.28k
DATA,     119, 20080621 05:22:05,        1, STAT, consistent gets                         ,         15301,      15301,      15.3k,      15.3k
DATA,     119, 20080621 05:22:05,        1, STAT, consistent gets from cache              ,         15228,      15228,     15.23k,     15.23k
DATA,     119, 20080621 05:22:05,        1, STAT, consistent gets from cache (fastpath)   ,         15136,      15136,     15.14k,     15.14k
DATA,     119, 20080621 05:22:05,        1, STAT, calls to get snapshot scn: kcmgss       ,            89,         89,         89,         89
DATA,     119, 20080621 05:22:05,        1, STAT, no work - consistent read gets          ,         14883,      14883,     14.88k,     14.88k
DATA,     119, 20080621 05:22:05,        1, STAT, table scans (short tables)              ,            21,         21,         21,         21
DATA,     119, 20080621 05:22:05,        1, STAT, table scan rows gotten                  ,       1429227,    1429227,      1.43M,      1.43M
DATA,     119, 20080621 05:22:05,        1, STAT, table scan blocks gotten                ,         17440,      17440,     17.44k,     17.44k
DATA,     119, 20080621 05:22:05,        1, WAIT, events in waitclass Other               ,        399831,     399831,   399.83ms,   399.83ms
--  End of snap 1

From Oracle 10g Oracle has consolidated lots of the events into “events in waitclass Other”. This is because saving all 900+ wait event stats for every session (in V$SESSION_EVENT array) would waste too much memory with giving little benefit (normally there’s only a handful of troublemaking events anyway). Therefore makes sense to aggregate the least likely happening events under some common category. Looks like Oracle kernel coders have set a threshold in event number above which all events are grouped under the “other” waitclass.

See below, this is from 11g:

[Read more...]

Generating lots of rows using connect by – safely!

Every now and then I need to generate a large number of rows into some table.

Not that I have any personal urge to do so, but you know, this is needed for some test cases and so on ;-)

It’s quite well known that it’s possible to generate lots of rows using CONNECT BY LEVEL <= #rows syntax.

However as the CONNECT BY LEVEL syntax is recursive by nature, it will consume more memory the more rows you query in one shot ( I think it was Mark Bobak who once posted this observation to Oracle-L, but I didn’t manage to find that thread anymore ).

So, here’s a test case:

SQL> select count(r)
  2  from (
  3     select rownum r from dual connect by rownum <= 100000000
  4  )
  5  /
        select rownum r from dual connect by rownum <= 100000000
                             *
ERROR at line 3:
ORA-04030: out of process memory when trying to allocate 44 bytes (kxs-heap-w,cursor work heap)

After running for a while the server process run out of private memory, used for the CONNECT BY cursor work heap.

Let’s investigate:

[Read more...]

Advanced Oracle Troubleshooting Guide, Part 5: Sampling V$ stuff with WaitProf. Really fast. Using SQL!

I bet you thought I’ll be writing about direct SGA access?! ;)

Nope!

Direct SGA access has excellent troubleshooting potential (as long as you know the shared memory data structures), but it has one major drawback – very few companies have such tools already in place in their production systems.

I have occasionally been called in to solve an urgent performance problem, happening right now and it needs solving immediately! And did I mention, these are critical production systems. Where you can’t just install binary executables freshly downloaded off internet. In fact you would want to diagnose the issue with minimal impact and changes required to those production environments (and that leaves sql tracing out the first round troubleshooting tools for me as well!)

So, I’ve developed myself a toolset for such purpose, Snapper and sw.sql and some process stack reading techniques I already have introduced in my blog.

Next in line is waitprof.sql which is a high-frequency V$SESSION_WAIT sampler – implemented in plain SQL (not PL/SQL).

Waitprof is basically a sampling session wait profiler. It’s like running a select against V$SESSION_WAIT in a very tight loop and aggregating results – but I have used a trick to do all this in plain SQL, which gives me performance advantage over PL/SQL based loops. Waitprof is able to sample V$SESSION_WAIT for a session up to 100 000 times per second!

This depends on your hardware of course and Oracle version too, but normally you’ll get 50-70kHz sampling rate with it.

Ok, you want to see an example? ;-)

[Read more...]

Advanced Oracle Troubleshooting Guide, Part 4: Diagnosing a long parsing issue

There was a recent thread in Oracle Forums about a session getting stuck somewhere when a specific SQL was issued. The SQL executed did not return at all unless ORDERED hint was used. Even the EXPLAIN PLAN command (which only parses the statement, doesn’t execute it) did never return.

Classic tracing + tkprof techniques didn’t show much (just some recursive queries consuming insignificant amounts of time).

The proven V$SESSION_WAIT sampling technique didn’t reveal anything as it showed the session being constantly on CPU (the wait state = ‘WAITED KNOWN TIME’ which means session is on CPU) and SEQ# didn’t increase (which means that wait state did not change over time).

Due the symptoms described above I was well prepared to troubleshoot this issue. This looks exactly like one of the troubleshooting use cases I demonstrate in of my Advanced Oracle Troubleshooting class (nice embedded advertisment, huh? ;)

In such a case where tracing and V$ views don’t provide any useful information about what the session is doing, I normally look into few stack traces of the server process. In this case I asked the poster to do this and here is the result:

[Read more...]

Snapper shortcut

I have a (very) small script called sn.sql which I use as a wrapper around snapper (maybe I should’ve called it Snapper Wrapper but it’s too long name for the purpose :)

The idea is to have to type less when running Snapper with default options (take 1 snapshot, output to screen and display Session tats,Wait events and Time model stats).

Whenever there’s a performance issue with a session I first quickly run @sn <seconds> <sid>, for example:

SQL> @sn 10 106

This takes a 10 second snapshot of session 106 and displays output:

-- Session Snapper v1.07 by Tanel Poder ( http://www.tanelpoder.com )

---------------------------------------------------------------------------------------------------------------------------------------------
HEAD,     SID, SNAPSHOT START   ,  SECONDS, TYPE, STATISTIC                               ,         DELTA,  DELTA/SEC,     HDELTA, HDELTA/SEC
---------------------------------------------------------------------------------------------------------------------------------------------
DATA,     106, 20080603 06:31:17,       10, TIME, hard parse elapsed time                 ,      10124018,    1012402,     10.12s,      1.01s
DATA,     106, 20080603 06:31:17,       10, TIME, parse time elapsed                      ,      10124018,    1012402,     10.12s,      1.01s
DATA,     106, 20080603 06:31:17,       10, TIME, DB CPU                                  ,      10113462,    1011346,     10.11s,      1.01s
DATA,     106, 20080603 06:31:17,       10, TIME, DB time                                 ,      10124018,    1012402,     10.12s,      1.01s
DATA,     106, 20080603 06:31:17,       10, WAIT, events in waitclass Other               ,            51,          5,       51us,      5.1us
--  End of snap 1

PL/SQL procedure successfully completed.

This might help you to save few seconds of typing every time you troubleshoot :)

 

Oracle Troubleshooting with Snapper – detecting who’s causing excessive redo generation

My friend asked today a question that how to identify why his Oracle 9.2 database has suddenly started generating loads more redo than usual.

So obviously I recommended him Snapper as first thing, it’s perfect for ad-hoc analysis like that! ( I know I sound biased but if you haven’t used Snapper yet, then now is the time! :)

So, I asked him to run Snapper on all sessions of the instance with 10 second interval and find the session with highest “redo size” delta figure.
However as a normal production database has loads of sessions in it and as each session can return tens of statistic rows, then you might need to do some post-processing of output data to get only the “redo size” numbers.

So I figured that why not do the filtering in Snapper itself, so I made the small additions and released Snapper v1.07.

Usage is simple:

SQL> @snapper out,gather=s,sinclude=%redo_size% 10 1 "select sid from v$session"

-- Session Snapper v1.07 by Tanel Poder ( http://www.tanelpoder.com )

---------------------------------------------------------------------------------------------------------------------------------------------
HEAD,     SID, SNAPSHOT START   ,  SECONDS, TYPE, STATISTIC                               ,         DELTA,  DELTA/SEC,     HDELTA, HDELTA/SEC
---------------------------------------------------------------------------------------------------------------------------------------------
DATA,     129, 20080530 09:07:20,       10, STAT, redo size                               ,     180595100,   18059510,     180.6M,     18.06M
DATA,     166, 20080530 09:07:20,       10, STAT, redo size                               ,        790912,      79091,    790.91k,     79.09k
DATA,     167, 20080530 09:07:20,       10, STAT, redo size                               ,        889796,      88980,     889.8k,     88.98k
--  End of snap 1

PL/SQL procedure successfully completed.

And in the result you already see that there is a single session which generates significantly more redo than others (btw this is an artificially generated example). In my friend’s case there also happened to be a single session , scheduled through DBMS_JOB, contributing to most of the redo generation. Once the troublemaker was identified, the problem could be fixed.

There are few things to note:

[Read more...]

Using autonomous transactions for sleeping

There was a question in a recent Oracle-L thread about various uses of autonomous transactions.

Autonomous transactions can be very useful for a PL/SQL application logging, but sometimes they are also abused to cope with bad application design (like avoiding mutating table errors in triggers etc).

I’m not going to start on that topic here though, but instead presenting another case where autonomous transactions have helped me to work around a problem. It’s more a hack than a real solution though, but may be useful for someone else too.

My Snapper tool requires execute rights on DBMS_LOCK in order to sleep between snapshots. Sometimes when troubleshooting an urgent performance issue, I have had access only to some kind of application support account, without permissions to execute DBMS_LOCK.SLEEP. And sometimes it takes too long to get those rights granted by corporate DBAs.

So one workaround I’ve used so far is creating a fake DBMS_LOCK.SLEEP proc in the local support schema along with one dummy table and use a combination of SELECT FOR UPDATE and autonomous transactions to sleep for short time.

The code is very simple:

[Read more...]

SQL*Net message to client vs SQL*Net more data to client

After my last post about SQL*Net message to client wait event I had a follow-up question about what’s the difference between SQL*Net message to client and SQL*Net more data to client wait events. I’ll post the answer here:

The first session data unit (SDU) bufferful of return data is written to TCP socket buffer under “SQL*Net message to client” wait event.

If Oracle needs to write more data for a call than fits into the “first” SDU buffer, then further writes for that call are done under SQL*Net more data to client event.

So, whether and how much of the “SQL*Net more data to client” vs “SQL*Net message to client” waits you see depends on two things:

  • Amount of data returned to client per call
  • Oracle Net SDU size

A quick test with my Snapper performance tool illustrates this:

In one session run following:

SQL> set arraysize 1
SQL>
SQL> select rownum from dba_source;

    ROWNUM
----------
         1
         2
         3
         4
         5
         6
...many rows returned...

And I measure the waits in the other session:

SQL> @snapper out,gather=w 5 1 141

-- Session Snapper v1.06 by Tanel Poder ( http://www.tanelpoder.com )

---------------------------------------------------------------------------------------------------------------------------------------------
HEAD,     SID, SNAPSHOT START   ,  SECONDS, TYPE, STATISTIC                               ,         DELTA,      D/SEC,     HDELTA,     HD/SEC
---------------------------------------------------------------------------------------------------------------------------------------------
DATA,     141, 20080210 14:59:55,        5, WAIT, SQL*Net message to client               ,          6750,       1350,     6.75ms,     1.35ms
DATA,     141, 20080210 14:59:55,        5, WAIT, SQL*Net message from client             ,       4668258,     933652,      4.67s,   933.65ms
--  End of snap 1

No SQL*Net more data to client is evidenced above as due the arraysize = 1 setting Oracle fetches only 2 rows at a time. Therefore the returned resultset per fetch call always fits into the first SDU sized packet. Note that the reason why Oracle fetches 2 rows when arraysize is 1 is due an OCI optimization which tries to be clever and proactively detect end-of-resultset conditions when fetching a single row – that way there’s a chance to automatically cancel a cursor and release its pins without an explicit cursor cancel or close call. This should have positive effect reducing shared pool fragmentation in some conditions.

Anyway, now I set arraysize to 5000 and run the same SQL again:

[Read more...]