Advanced Oracle Troubleshooting Guide, Part 5: Sampling V$ stuff with WaitProf. Really fast. Using SQL!

Tanel Poder

2008/06/05

I bet you thought I’ll be writing about direct SGA access?! ;)

Nope!

Direct SGA access has excellent troubleshooting potential (as long as you know the shared memory data structures), but it has one major drawback – very few companies have such tools already in place in their production systems.

I have occasionally been called in to solve an urgent performance problem, happening right now and it needs solving immediately! And did I mention, these are critical production systems. Where you can’t just install binary executables freshly downloaded off internet. In fact you would want to diagnose the issue with minimal impact and changes required to those production environments (and that leaves sql tracing out the first round troubleshooting tools for me as well!)

So, I’ve developed myself a toolset for such purpose, Snapper and sw.sql and some process stack reading techniques I already have introduced in my blog.

Next in line is waitprof.sql which is a high-frequency V$SESSION_WAIT sampler – implemented in plain SQL (not PL/SQL).

Waitprof is basically a sampling session wait profiler. It’s like running a select against V$SESSION_WAIT in a very tight loop and aggregating results – but I have used a trick to do all this in plain SQL, which gives me performance advantage over PL/SQL based loops. Waitprof is able to sample V$SESSION_WAIT for a session up to 100 000 times per second!

This depends on your hardware of course and Oracle version too, but normally you’ll get 50-70kHz sampling rate with it.

Ok, you want to see an example? ;-)

I will run my lotspios.sql script in one session (SID=142), which, as the name says, generates lots of physical IOs.

SQL> @lotspios 100

Now, let say I’m interested what this session is doing over time (is it on CPU or waiting and if waiting, on what). For that I run waitprof.sql on that session. The syntax is following:

@waitprof    

And lets run waitprof now, on SID 142, gather only events waited on and take 100000 samples:

SQL> @waitprof noprint 142 e 100000

-- WaitProf 1.03 by Tanel Poder (  )

                                                                               % Total  Total Event   Distinct   Avg time
    SID STATE   EVENT                               P1                            Time      Time ms     Events   ms/Event
------- ------- ----------------------------------- -------------------------- ------- ------------ ---------- ----------
    142 WAITING db file scattered read                                           39.75      580.306        721       .805
    142 WAITING gc buffer busy                                                   30.90      451.198        485       .930
    142 WORKING On CPU / runqueue                                                17.42      254.347       1786       .142
    142 WAITING read by other session                                            10.35      151.081        313       .483
    142 WAITING db file sequential read                                           1.56       22.834        244       .094
    142 WAITING latch: cache buffers chains                                        .02         .234          9       .026

6 rows selected.

Elapsed: 00:00:01.48

From the %Total Time column you see that during sampling that session stats, most of the time (39.75%) it was waiting for db file scattered read event. This roughly accounts for 580 milliseconds.

The Distinct Events column shows the number of distinct occurrences of that event, so during the 1.48 second sampling period 721 waits on db file scattered read was done. This column is based on V$SESSION_WAIT.SEQ# so it’s reasonably accurate over short periods of time. And the Avg time ms/Event column shows average duration for a single event, so an db file scattered read in this example took 805 microseconds.

This is the simplest usage of waitprof, however this is not why I wrote it. I want to know the details of those waits too!

The what_to_sample parameter can accept any combination of following values:

So, I can run waitprof with e1 to get all wait event parameter details:

SQL> @waitprof noprint 142 e1 100000

-- WaitProf 1.03 by Tanel Poder (  )

                                                                               % Total  Total Event   Distinct   Avg time
    SID STATE   EVENT                               P1                            Time      Time ms     Events   ms/Event
------- ------- ----------------------------------- -------------------------- ------- ------------ ---------- ----------
    142 WAITING db file scattered read              file#= 6                     49.07      814.496       1057       .771
    142 WAITING gc buffer busy                      file#= 6                     23.99      398.151        437       .911
    142 WORKING On CPU / runqueue                                                20.38      338.358       1895       .179
    142 WAITING read by other session               file#= 6                      5.72       94.985        210       .452
    142 WAITING db file sequential read             file#= 6                       .83       13.712        159       .086
    142 WAITING latch: cache buffers chains         address= 422E843C              .01         .083          1       .083
    142 WAITING latch: cache buffers chains         address= 41F8E900              .00         .066          1       .066
    142 WAITING latch: cache buffers chains         address= 437FCEC4              .00         .017          1       .017
    142 WAITING latch: cache buffers chains         address= 4238EE2C              .00         .017          1       .017
    142 WAITING latch: cache buffers chains         address= 42342298              .00         .017          1       .017
    142 WAITING latch: cache buffers chains         address= 41FBF44C              .00         .017          1       .017
    142 WAITING latch: cache buffers chains         address= 41FB4998              .00         .017          1       .017
    142 WAITING latch: cache buffers chains         address= 41FACB6C              .00         .017          1       .017
    142 WAITING latch: cache buffers chains         address= 41F7D074              .00         .017          1       .017
    142 WAITING latch: cache buffers chains         address= 00000006              .00         .017          1       .017
    142 WAITING latch: cache buffers chains         address= 41FA98F4              .00         .017          1       .017

16 rows selected.

Elapsed: 00:00:01.70

So now I have the wait profile breakdown by event and its P1 value. For buffer related events we see that all IOs happened against datafile 6.

For latch waits, the P1 value is the latch address. This is especially helpful in pre-10g databases as it helps us to translate this address back to the actual latch name (and even to individual child latch) using V$LATCH_PARENT and V$LATCH_CHILDREN.

I have written a script la.sql for this:

SQL> @la 422E843C

ADDR         LATCH#   CHLD NAME                                           GETS      IGETS     MISSES    IMISSES     SLEEPS  WAIT_TIME
-------- ---------- ------ ---------------------------------------- ---------- ---------- ---------- ---------- ---------- ----------
422E843C        122   1735 cache buffers chains                         164637      64387        329          0          6        167

Waitprof allows you to use the what_to_sample parameters in any combination, for example, if I only want to know the breakdown of scattered reads by number of blocks read at a time (P3), I can use “3” in the parameter as seen below. Note that as P2, P3 and SEQ# are not printed with noprint option (for shorter line width), then now we need to use print option:

SQL> @waitprof print 142 e3 100000

-- WaitProf 1.03 by Tanel Poder (  )

                                                                                                                          % Total  Total Event   Distinct   Avg time
    SID STATE   EVENT                               P1                         P2              P3                    SEQ#    Time      Time ms     Events   ms/Event
------- ------- ----------------------------------- -------------------------- --------------- --------------- ---------- ------- ------------ ---------- ----------
    142 WORKING On CPU / runqueue                                                                                           36.03      544.053       1211       .449
    142 WAITING gc buffer busy                                                                 id#= 65537                   28.73      433.838        393      1.104
    142 WAITING db file scattered read                                                         blocks= 16                   22.39      338.149        270      1.252
    142 WAITING read by other session                                                          class#= 1                     7.61      114.866        340       .338
    142 WAITING db file scattered read                                                         blocks= 13                    1.16       17.546         20       .877
    142 WAITING db file scattered read                                                         blocks= 12                    1.03       15.568         16       .973
    142 WAITING db file scattered read                                                         blocks= 14                     .49        7.459          7      1.066
    142 WAITING db file scattered read                                                         blocks= 11                     .46        6.976          9       .775
    142 WAITING db file scattered read                                                         blocks= 10                     .36        5.361          7       .766
    142 WAITING db file scattered read                                                         blocks= 8                      .31        4.696          7       .671
    142 WAITING db file sequential read                                                        blocks= 1                      .27        4.077         47       .087
    142 WAITING db file scattered read                                                         blocks= 9                      .27        4.062          6       .677
    142 WAITING db file scattered read                                                         blocks= 3                      .16        2.341         11       .213
    142 WAITING db file scattered read                                                         blocks= 2                      .15        2.235         14       .160
    142 WAITING db file scattered read                                                         blocks= 15                     .15        2.235          2      1.117
    142 WAITING db file scattered read                                                         blocks= 4                      .13        1.933          6       .322
    142 WAITING db file scattered read                                                         blocks= 6                      .12        1.737          5       .347
    142 WAITING db file scattered read                                                         blocks= 5                      .10        1.495          4       .374
    142 WAITING db file scattered read                                                         blocks= 7                      .07        1.027          2       .513
    142 WAITING read by other session                                                          class#= 65537                  .01         .136          9       .015
    142 WAITING gc buffer busy                                                                 id#= 1                         .01         .106          7       .015
    142 WAITING latch: cache buffers chains                                                    tries= 0                       .00         .045          2       .023
    142 WAITING read by other session                                                          class#= 0                      .00         .015          1       .015
    142 WAITING db file sequential read                                                        blocks= 2                      .00         .015          1       .015
    142 WAITING gc buffer busy                                                                 id#= 5                         .00         .015          1       .015
    142 WAITING db file scattered read                                                         blocks= 1                      .00         .015          1       .015

26 rows selected.

Elapsed: 00:00:01.89

From above we see that as far as IO operations are concerned, most of the response time during sampling was spent during 16/13/12 block reads. Other read sizes took less time.

Of course we can also make Waitprof show all P1,P2,P3 colums, but this may display lots of lines for an active session:

SQL> @waitprof print 142 e123 100000

-- WaitProf 1.03 by Tanel Poder (  )

                                                                                                                          % Total  Total Event   Distinct   Avg time
    SID STATE   EVENT                               P1                         P2              P3                    SEQ#    Time      Time ms     Events   ms/Event
------- ------- ----------------------------------- -------------------------- --------------- --------------- ---------- ------- ------------ ---------- ----------
    142 WORKING On CPU / runqueue                                                                                           21.63      424.007       1970       .215
    142 WAITING db file scattered read              file#= 1                   block#= 43525   blocks= 16                     .16        3.175          1      3.175
    142 WAITING db file scattered read              file#= 1                   block#= 43541   blocks= 12                     .13        2.587          1      2.587
    142 WAITING db file scattered read              file#= 1                   block#= 43493   blocks= 16                     .11        2.234          1      2.234
    142 WAITING db file scattered read              file#= 1                   block#= 44167   blocks= 16                     .10        2.019          1      2.019
    142 WAITING db file scattered read              file#= 1                   block#= 55466   blocks= 16                     .10        1.901          2       .951
    142 WAITING db file scattered read              file#= 1                   block#= 44041   blocks= 16                     .10        1.882          1      1.882
    142 WAITING db file scattered read              file#= 1                   block#= 43877   blocks= 16                     .09        1.705          1      1.705
    142 WAITING db file scattered read              file#= 1                   block#= 44067   blocks= 16                     .09        1.686          1      1.686
    142 WAITING db file scattered read              file#= 1                   block#= 44103   blocks= 16                     .08        1.646          1      1.646
    142 WAITING db file scattered read              file#= 1                   block#= 44135   blocks= 16                     .08        1.568          1      1.568
    142 WAITING db file scattered read              file#= 1                   block#= 43597   blocks= 16                     .08        1.509          1      1.509
    142 WAITING db file scattered read              file#= 1                   block#= 8696    blocks= 16                     .08        1.490          1      1.490
[...lots of lines snipped...]
    142 WAITING db file sequential read             file#= 1                   block#= 35630   blocks= 1                      .00         .020          1       .020
    142 WAITING db file sequential read             file#= 1                   block#= 35634   blocks= 1                      .00         .020          1       .020
    142 WAITING db file sequential read             file#= 1                   block#= 35636   blocks= 1                      .00         .020          1       .020
    142 WAITING db file sequential read             file#= 1                   block#= 35638   blocks= 1                      .00         .020          1       .020
    142 WAITING db file sequential read             file#= 1                   block#= 35640   blocks= 1                      .00         .020          1       .020
    142 WAITING db file sequential read             file#= 1                   block#= 35644   blocks= 1                      .00         .020          1       .020
    142 WAITING db file sequential read             file#= 1                   block#= 38097   blocks= 16                     .00         .020          1       .020
    142 WAITING db file sequential read             file#= 1                   block#= 43562   blocks= 1                      .00         .020          1       .020
    142 WAITING db file scattered read              file#= 1                   block#= 35529   blocks= 16                     .00         .020          1       .020

1979 rows selected.

Even though the CPU usage is in the top of the list, it has only used 21.63% of total response time, so the rest 78% of the IO time is the “problem” here. The reason is that all CPU time is aggregated together (the time between waits), while most events are brought out separately due differences in P1,P2,P3 values.

If you want to see events in chronological order, then you can include SEQ# (which is the wait state sequence identifier) also in the output:

SQL> @waitprof print 142 e123s 1000

-- WaitProf 1.03 by Tanel Poder (  )

                                                                                                                          % Total  Total Event   Distinct   Avg time
    SID STATE   EVENT                               P1                         P2              P3                    SEQ#    Time      Time ms     Events   ms/Event
------- ------- ----------------------------------- -------------------------- --------------- --------------- ---------- ------- ------------ ---------- ----------
    142 WAITING db file scattered read              file#= 6                   block#= 43897   blocks= 13           26036    1.80         .360          1       .360
    142 WORKING On CPU / runqueue                                                                                   26036    2.40         .480          1       .480
    142 WAITING db file scattered read              file#= 6                   block#= 44105   blocks= 13           26037    5.30        1.060          1      1.060
    142 WORKING On CPU / runqueue                                                                                   26037    1.40         .280          1       .280
    142 WAITING db file scattered read              file#= 6                   block#= 44314   blocks= 12           26038    5.90        1.180          1      1.180
    142 WORKING On CPU / runqueue                                                                                   26038    1.70         .340          1       .340
    142 WORKING On CPU / runqueue                                                                                   26039    1.80         .360          1       .360
    142 WAITING db file scattered read              file#= 6                   block#= 44534   blocks= 13           26039    6.40        1.280          1      1.280
    142 WORKING On CPU / runqueue                                                                                   26040    1.80         .360          1       .360
    142 WAITING db file scattered read              file#= 6                   block#= 44742   blocks= 13           26040    6.20        1.240          1      1.240
    142 WORKING On CPU / runqueue                                                                                   26041    1.70         .340          1       .340
    142 WAITING db file scattered read              file#= 6                   block#= 44950   blocks= 13           26041    6.00        1.200          1      1.200
    142 WAITING db file scattered read              file#= 6                   block#= 45159   blocks= 12           26042    5.60        1.120          1      1.120
    142 WORKING On CPU / runqueue                                                                                   26042    2.10         .420          1       .420
    142 WAITING db file scattered read              file#= 6                   block#= 45366   blocks= 13           26043    6.00        1.200          1      1.200
    142 WORKING On CPU / runqueue                                                                                   26043    1.70         .340          1       .340
    142 WAITING db file scattered read              file#= 6                   block#= 45574   blocks= 13           26044    5.90        1.180          1      1.180
    142 WORKING On CPU / runqueue                                                                                   26044    1.80         .360          1       .360
    142 WAITING db file scattered read              file#= 6                   block#= 45769   blocks= 13           26045    5.90        1.180          1      1.180
    142 WORKING On CPU / runqueue                                                                                   26045    1.60         .320          1       .320
    142 WORKING On CPU / runqueue                                                                                   26046    1.60         .320          1       .320
    142 WAITING db file scattered read              file#= 6                   block#= 45978   blocks= 12           26046    5.50        1.100          1      1.100
    142 WAITING db file scattered read              file#= 6                   block#= 46185   blocks= 13           26047    6.00        1.200          1      1.200
    142 WORKING On CPU / runqueue                                                                                   26047    1.80         .360          1       .360
    142 WAITING db file scattered read              file#= 6                   block#= 46380   blocks= 13           26048    6.10        1.220          1      1.220
    142 WORKING On CPU / runqueue                                                                                   26048    1.70         .340          1       .340
    142 WAITING db file scattered read              file#= 6                   block#= 46601   blocks= 13           26049    4.30         .860          1       .860

27 rows selected.

Here you see a detailed chronological view of events, ordered by their SEQ# number. There are few gotchas like SEQ# wrapping to zero after reaching 65535 and potentially missed events due sampling granularity, but more about this in an upcoming post. Btw, you can get more details about script usage from the script header and by reading the single SQL statement in there :)

One might ask why use such sampling script when you can enable extended SQL trace which also gives you chronological ordering and all wait parameters. One answer is that it’s more convenient to use a script for first-round diagnosis of performance problems. Another answer is that sometimes, in change-controlled production environments it may take quite some time to get the change for enabling trace through.

In an upcoming post I’ll show more (very cool) uses of this powerful sampling technique, a hint is that the script is called Latchprof ;-)

If you haven’t read my other Advanced Oracle Troubleshooting posts, here they are:

/2007/06/18/advanced-oracle-troubleshooting-guide-when-the-wait-interface-is-not-enough-part-1/

/2007/08/27/advanced-oracle-troubleshooting-guide-part-2-no-magic-is-needed-systematic-approach-will-do/

/2007/09/06/advanced-oracle-troubleshooting-guide-part-3-more-adventures-in-process-stack/

/2008/06/03/advanced-oracle-troubleshooting-guide-part-4-diagnosing-a-long-parsing-issue/

Meanwhile, can someone explain how my script can sample the same table (V$SESSION_WAIT) hundreds of thousands of times by running a single SQL statement only once? ;-)

Comments powered by Talkyard.