Session-level statspack

Statspack is a useful tool for easily gathering and reporting some Oracle’s historical workload statistics. However it has its limitations and problems:

One of them that in past it used to record only Oracle statistics, measured from inside Oracle. This made Oracle and people using Oracle ignore other crucial statistics like OS workload below Oracle. For example statspack may show you heavy latch contention as the performance problem, while looking at vmstat output one would see that the server just has been heavily overloaded with numerous other jobs (like multiple backup, export and compress jobs overrunning their run-window) and the latch contention is just a symptom of CPU starvation.

This has been somewhat addressed by adding V$OSSTAT to Oracle 10g and statspack now gathers this info as well. Now we need people to start looking into those stats.

Another problem with statspack is which is described in Dan Fink’s blog, about cursor-level CPU accounting limitations before 10gR2.

However neither of those problems are the main limitations of statspack. The main problem is that statspack samples it’s statistic snapshots from system level views like V$SYSSTAT and V$SYSTEM_EVENT. If every single session in your database is performing exactly the same (kind) of operation, then system level aggregates might be quite OK to get an overview what’s going on in the database. But from the moment you start having different kind of sessions (e.g. OLTP vs batch vs data feed vs report) in your database, then system level aggregates used by statspack do heavily distort what the reality looks like for specific sessions.

There’s another gotcha – by default statspack excludes the “SQL*Net message from client” wait event from main report section, stating that it is not important to look into idle events. However by that we will throw away an important piece of information when diagnosing end-to-end performance, the end user experience. The bad thing is that even if we did include the SQL*Net wait event in our reports (by deleting the corresponding row from STATS$IDLE_EVENT), then we would still have no idea how many of those gazillions of system-wide SQL*Net client wait seconds do belong to our specific session or user.
Statspack doesn’t measure session-experience, it just gives you a system-wide aggregate, which cannot be translated back to individual session statistics (just as you can not convert a hash value back to original value – most of the information is just lost!).

How to get session-level detailed overview of database performance, with historical reporting capability then?

Oracle 10g addresses this partially with ASH and AWR and DBMS_MONITOR’s selective statistic sampling (Search for V$SERV_MOD_ACT_STATS).

While ASH and AWR are both awesome products, they are not usable in Standard Edition nor before 10g ( you may want to check out Kyle Hailey’s ASH simulator instead ) and for using them you need to pay a separate license fee.

So let me introduce my attempt to solve those problems ( NB! Free stuff!!! ;):

  • The poor-man’s version of Automatic Workload Repository: It’s called… um… Semi-Automatic Workload Repository :)
  • The clever-man’s addition to statspack, which takes session-level snapshots: In other words, Sesspack.

The idea is very simple, just take snapshots of V$SESSION_EVENT and some V$SESSTAT (and V$SESS_TIME_MODEL in 10g) statistics and store those in a repository just like statspack does. Snapshots are taken using sesspack.snap_xyz procedures. As a parameter to these procedures I can pass a specific SID, a group of SIDs, a Oracle db username a OS-username or whatever filtering fields can be found from V$SESSION.

Once you have taken your snapshots when running the workload, you can run a report which calculates statistic and wait event deltas between snapshots and that it.

A demo?

The simplest one is to just snap my own session details:

SQL> exec sesspack.snap_me;

PL/SQL procedure successfully completed.

SQL> select avg(length(text)) from dba_source;

AVG(LENGTH(TEXT))
-----------------
        114.01304

SQL> exec sesspack.snap_me;

PL/SQL procedure successfully completed.

SQL> @list

                                                   Snapshot                                  Snapped
 Snap ID Snapshot Time        Taken By             Mode                                     Sessions
-------- -------------------- -------------------- ---------------------------------------- --------
       1 2007-06-24 14:36:47  PERFSTAT             SNAP_ME: PERFSTAT                               1
       2 2007-06-24 14:36:52  PERFSTAT             SNAP_ME: PERFSTAT                               1

The list.sql has shown that I have two performance snapshots, with ID 1 and 2, so I’ll run a delta report between them with script srs.sql (SAWR Report by SessionID):

SQL> set tab off
SQL> @srs 1 2

SNAPSHOT_BEGIN    SNAPSHOT_END         DUR_SEC    DUR_MIN
----------------- ----------------- ---------- ----------
20070624 14:36:47 20070624 14:36:52          5 .083333333

                                                                                          ms/       ms in  Waits in Avg Wait
       SID     AUDSID      Event Name                                         % Total     sec    snapshot  snapshot       ms
---------- ---------- ---- --------------------------------------------- ------------ ------- ----------- --------- --------
       146     130096      db file scattered read                        |#######   |     676        3378       338       10
                           CPU Usage                                     |###       |     260        1300         0     1300
                           db file sequential read                       |#         |      44         220         6       37
                           SQL*Net message from client                   |          |       1           3         3        1
                           log file sync                                 |          |       0           1         1        1
                           SQL*Net message to client                     |          |       0           0         3        0

6 rows selected.

Another example which should bring the usefulness of sesspack out better:

I dirtied some buffer cache buffers using the delete command, then ran a alter system checkpoint between two snapshots of all background process activity (sesspack.snap_bg).

Tanel@Prod01> delete t;

50705 rows deleted.

Tanel@Prod01> exec sesspack.snap_bg;

PL/SQL procedure successfully completed.

Tanel@Prod01> alter system checkpoint;

System altered.

Tanel@Prod01> exec sesspack.snap_bg;

PL/SQL procedure successfully completed.

Tanel@Prod01> @list

                                                   Snapshot                                  Snapped
 Snap ID Snapshot Time        Taken By             Mode                                     Sessions
-------- -------------------- -------------------- ---------------------------------------- --------
       1 2007-06-24 14:36:47  PERFSTAT             SNAP_ME: PERFSTAT                               1
       2 2007-06-24 14:36:52  PERFSTAT             SNAP_ME: PERFSTAT                               1
       5 2007-06-24 14:57:49  TANEL                SNAP_BG:                                       14
       6 2007-06-24 14:57:55  TANEL                SNAP_BG:                                       14

Tanel@Prod01> @ srs 5 6

SNAPSHOT_BEGIN    SNAPSHOT_END         DUR_SEC    DUR_MIN
----------------- ----------------- ---------- ----------
20070624 14:57:49 20070624 14:57:55          6         .1

                                                                                       ms/       ms in  Waits in Avg Wait
    SID     AUDSID      Event Name                                         % Total     sec    snapshot  snapshot       ms
------- ---------- ---- --------------------------------------------- ------------ ------- ----------- --------- --------
    160          0      rdbms ipc message                             |##########|    1000        5999         6     1000

    161          0      rdbms ipc message                             |#####     |     500        3000         1     3000

    162          0      rdbms ipc message                             |######### |     833        5000         1     5000

    165          0      rdbms ipc message                             |#####     |     496        2974         4      744
                        control file parallel write                   |##        |     159         953        11       87
                        control file sequential read                  |#         |      57         341        26       13
                        direct path write                             |          |       0           2        12        0
                        direct path read                              |          |       0           0        12        0

    166          0      rdbms ipc message                             |##########|     998        5988         3     1996
                        log file parallel write                       |          |       1           4         2        2

    167          0      rdbms ipc message                             |######### |     818        4911         4     1228
                        db file parallel write                        |##        |     123         737        93        8

    168          0      rdbms ipc message                             |##########|    1002        6010         2     3005

    169          0      rdbms ipc message                             |##########|    1000        6000         2     3000

    170          0      pmon timer                                    |##########|    1000        6000         2     3000

15 rows selected.

An overview of what all background processes were doing at that time :)
If you ever had to diagnose performance issues in a large-scale variable-workload environment with lots of completely different activity going on, you already see the value of session-level statistic snapshots.

When you read the readme, you see that there are various options for taking selective snapshots, like running:

  • sesspack.snap_sid(57);
  • sesspack.snap_sid(’150,165,177,202′);
  • sesspack.snap_orauser(‘REPORT_USER’);
  • sesspack.snap_program(‘sqlplus.exe’);
  • sesspack.snap_machine(‘nycprd0005118′);

Also it is possible to snap based on completely custom conditions, using whatever SQL you write to return list of SID’s:

  • sesspack.snap_sidlist_internal(‘select sid from v$session where state=”ACTIVE” and last_call_et > 300′);
  • sesspack.snap_sidlist_internal(‘select sid from v$session where logon_time > sysdate – 1/48′);

Note that there are few crucial issues which need resolving in the snapshot engine (for example, waits are not recorded in V$SESSION_EVENT before event ends or times out, this is a problem with long waits exceeding snapshot interval).

The major work yet to be done is building easy reporting and visualizing capability as this is critical for any useful performance product.
I have experimented with Oracle Application Express for providing a convenient GUI and charting mechanism. Even though APEX charting is still not flexible enough for complex performance data visualization, nevertheless it looks promising! I will release it in version 0.06 perhaps, once rest of the crucial base features have been implemented. I’ll keep you posted.

It all is plain PL/SQL, you can download the package from here: http://blog.tanelpoder.com/files/scripts/tools/sesspack_0.05_release.zip

All feedback is appreciated, enjoy!

Note that this year’s only Advanced Oracle Troubleshooting class takes place in the end of April/May 2014, so sign up now if you plan to attend this year!

This entry was posted in Uncategorized and tagged . Bookmark the permalink.

17 Responses to Session-level statspack

  1. Pingback: OraTransplant » Log Buffer #51: a Carnival of the Vanities for DBAs

  2. lscheng says:

    Hi Tanel

    I think Statspack itself offers an option to snap session statistics. As

    execute statspack.snap(i_session_id=>3);

    However I have not used it for a while (like 3 years :-) ) so I am not sure if it does the samething as your package.

    Great informations in your blog

    Thanks


    LSC

  3. Paul says:

    Very nice. I use similar techniques but you’re ahead of me with the variety of snapshot types.
    I wrote one the other day to snapshot sessions for a particular osuser, but I surely don’t have the one that accepts a select statement. It’s the logical solution to the need to be able to specify sessions in many ways.

    From what I read in Tom Kyte’s book just now, you need to be careful, someone will misuse it it
    with the old “and sid = fire_emp(1234)”. Just kidding. This is great stuff.

    I keep forgetting to run some sessions that just take snapshots so I can get an idea about resource consumption by the snapshot process. Simple enough to do.

    I was looking through my notes from a well-known optimization course yesterday, there was a warning about don’t be misled by v$session_event aggregation… I haven’t tried to think very hard of examples. Maybe a good example would be a very small unit of work that involves both a very busy disk and a very idle one, the latter’s inflated atpr would distort the avg atpr. Just crank up the number of iterations and I suppose that distortion would subside.

    I’ll have to look into the statspack i_session_id thing, I don’t recall seeing that.

  4. Paul says:

    Also, I like to add CPU time from v$sesstat into the session_event output, union all, value/1000000, dummy values for the other session_event columns, 0, 1 whatever works.

  5. Paul says:

    I should have read the article more carefully, I see you’re already including CPU in your resource profile. “waits in snapshot = 0″, that’s familiar.

    I really like the delta math, the differencing of the snapshots. I have a side-by-side report that compares two snapshots, good for comparing two different runs. I haven’t focused so much on using successive snapshots to get a picture of what happened between time a and time b. That will change fairly soon.

    Re the crucial issues in the snapshot engine – will you go to v$session_wait for the incomplete waits? Where status = ‘WAITING’? What about a session that’s using CPU or is starved for cpu? I have too many types of unaccounted for time in my reports, I know, but still they are useful more often than not. I’d guess that as things become more complicated, as data comes from more tables and more rows, you will start getting into the slewed data that Mr. Millsap mentions with regard to the v$views. I will too, since I”m headed in that direction. I find the v$session_event resource profiles very useful, but I don’t do a very good job say with interactive sessions of distinguishing between neutral SQLnet waits ( really waiting for user input ) and SQLnet waits that really matter.

    In spite of those difficulties I think this type of reporting is very useful and I compliment you on what you have provided us.

  6. tanelp says:

    Thanks Paul and LSC.

    I have to admit that I either don’t remember or haven’t ever known about Statspack’s ability to take session level snapshots. That will be a good addition into the performance analyst’s toolset (well, not anymore, as we’ve got the Session Snapper ;)

    I checked Statspack’s session snap out though, it seems to be quite limited however, allowing to snap only one session by sid and has few other (common) problems like relying only on v$session_event for getting wait data. v$session_event is updated only at the end of the wait, so for getting full picture, one would need to add seconds_in_wait from v$session to corresponding wait event.

    Paul, I am using the v$session_wait in conjuction with v$session_event in both Snapper 1.03 and Sesspack 0.05 (which needs tidying up a little before I release it)

  7. lscheng says:

    Btw which one is better:
    Oracle Session Snapper
    or sesspack :-P

    Cheers


    LSC

  8. peter says:

    Hi Tanel,

    Not quite the same topic but I thought you might have some ideas.

    I am trying assemble a report about recent performance problems in our system.

    I am looking at V$SQL to get a breakdown of wait time contributions to ELAPSED_TIME for some SQL statements using the

    User_IO_Wait_Time,
    Application_wait_time,
    Concurrency_Wait_time,
    PLSQL_Execution_Wait_time,
    CPU_Time

    columns. But there is always a percentage not accounted for by these columns. Could you suggest possible reasons for this unaccounted time?

    I realise using V$ views is not ideal for this kind of task but its all I have (in the absence of properly scoped trace file data) now (the performance problem was fixed)

  9. tanelp says:

    Hi Peter,

    The unaccounted time is likely one of the following:

    1) Time spent waiting in CPU runqueue (Oracle thinks the process is on CPU and has stopped measuring wait time – but really the process is waiting in CPU runqueue, thus OS is not accounting any CPU time for the process either)

    2) CPU Time not accounted due quantization error (as normally the CPU usage is measured once every tick and attributed to the process currently happening to be on the CPU)

    3) Wait time not accounted a bug

  10. peter says:

    ok. This is a bit cheeky to ask another question so feel free not to reply !!

    But, I now have some raw trace file data related to this issue, that I’m unsure of.

    First of all, 2 of the WAIT lines (emitted before an insert) report enq:TX index contention on a different table/index to that being inserted. That index is in a different tablespace and segment to the table being inserted into. Therefore, Im confused why this session has to wait on it.

    EXEC #15:c=0,e=503,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=1,tim=3273173078085
    FETCH #15:c=0,e=1384,p=0,cr=19,cu=0,mis=0,r=1,dep=1,og=1,tim=3273173079686
    WAIT #25: nam=’enq: TX – index contention’ ela= 6 name|mode=1415053316 usn<<16 | slot=2555913 sequence=228627 obj#=315994 tim=3273173081540
    WAIT #25: nam=’enq: TX – index contention’ ela= 77446 name|mode=1415053316 usn<<16 | slot=2555913 sequence=228627 obj#=315994 tim=3273173159130
    WAIT #25: nam=’buffer busy waits’ ela= 1109 file#=198 block#=742607 class#=1 obj#=315242 tim=3273173160444
    WAIT #25: nam=’enq: TX – index contention’ ela= 8415 name|mode=1415053316 usn<<16 | slot=7602197 sequence=227627 obj#=315242 tim=3273173169242
    WAIT #25: nam=’buffer busy waits’ ela= 3668 file#=198 block#=742607 class#=1 obj#=315242 tim=3273173173017
    =====================
    PARSING IN CURSOR #25 len=250 dep=1 uid=47 oct=2 lid=47 tim=3273173173513 hv=3689280581 ad=’57c2b178′
    INSERT INTO SOME_TABLE

    (I’m assuming that the obj# reported in the WAIT line refers to the object_id in dba_objects)

    br
    peter

  11. peter says:

    I guess my assumption about obj# must be wrong (?)

    sorry to bother you !

  12. tanelp says:

    Nope, the cursor #25 you see above is some other cursor, not the insert. The new insert statement was then parsed and reused cursor slot #25 (as apparently the previous slot #25 occupier was closed).

    So search upwards in your tracefile for PARSING IN CURSOR #25 and the match you hit first is the statement that caused those waits.

  13. peter says:

    The trace file does not contain a cursor #25 prior to the one shown (and I’m pretty sure that it doesnt come from anywhere else). My interpretation was based on Cary Millsap’s advice that :

    “WAIT lines appear in the trace data before the database call that motivated
    them. This occurs because the Oracle kernel emits lines into the trace file as events
    complete.”

    Therefore, my interpretation is that EXECUTE database call begins for the INSERT (Cursor#25) and results ine WAITS on enq:TX and buffer busy waits. Only then is the data about cursor#25 output to the trace file.

    Here is a more complete extract from the file(i’ve changed the names of the database objects):

    WAIT #0: nam=’SQL*Net message to client’ ela= 2 driver id=1952673792 #bytes=1 p3=0 obj#=315994 tim=3273173063439
    WAIT #0: nam=’SQL*Net message from client’ ela= 1566 driver id=1952673792 #bytes=1 p3=0 obj#=315994 tim=3273173065090
    WAIT #0: nam=’SQL*Net message to client’ ela= 2 driver id=1952673792 #bytes=1 p3=0 obj#=315994 tim=3273173065421
    WAIT #0: nam=’SQL*Net message from client’ ela= 833 driver id=1952673792 #bytes=1 p3=0 obj#=315994 tim=3273173066319
    =====================
    PARSING IN CURSOR #24 len=637 dep=1 uid=47 oct=3 lid=47 tim=3273173073507 hv=1331493982 ad=’57c2a718′
    SELECT * FROM TABLE1 WHERE SOME_ID= :B1 FOR UPDATE
    END OF STMT
    EXEC #24:c=10000,e=4731,p=0,cr=31,cu=4,mis=0,r=0,dep=1,og=1,tim=3273173073489
    FETCH #24:c=0,e=1553,p=0,cr=8,cu=0,mis=0,r=1,dep=1,og=1,tim=3273173077198
    =====================
    PARSING IN CURSOR #15 len=431 dep=1 uid=47 oct=3 lid=47 tim=3273173078097 hv=604810361 ad=’6b4deb48′
    SELECT COUNT(*) FROM ( SELECT SOME_STUFF_FROM SOME_TABLES)
    END OF STMT
    EXEC #15:c=0,e=503,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=1,tim=3273173078085
    FETCH #15:c=0,e=1384,p=0,cr=19,cu=0,mis=0,r=1,dep=1,og=1,tim=3273173079686
    WAIT #25: nam=’enq: TX – index contention’ ela= 6 name|mode=1415053316 usn<<16 | slot=2555913 sequence=228627 obj#=315994 tim=3273173081540
    WAIT #25: nam=’enq: TX – index contention’ ela= 77446 name|mode=1415053316 usn<<16 | slot=2555913 sequence=228627 obj#=315994 tim=3273173159130
    WAIT #25: nam=’buffer busy waits’ ela= 1109 file#=198 block#=742607 class#=1 obj#=315242 tim=3273173160444
    WAIT #25: nam=’enq: TX – index contention’ ela= 8415 name|mode=1415053316 usn<<16 | slot=7602197 sequence=227627 obj#=315242 tim=3273173169242
    WAIT #25: nam=’buffer busy waits’ ela= 3668 file#=198 block#=742607 class#=1 obj#=315242 tim=3273173173017
    =====================
    PARSING IN CURSOR #25 len=250 dep=1 uid=47 oct=2 lid=47 tim=3273173173513 hv=3689280581 ad=’57c2b178′
    INSERT INTO TABLE_2 RETURNING A_COLUMN INTO :O0
    END OF STMT
    EXEC #25:c=0,e=93457,p=0,cr=11,cu=39,mis=0,r=1,dep=1,og=1,tim=3273173173484
    FETCH #24:c=0,e=101,p=0,cr=2,cu=0,mis=0,r=1,dep=1,og=1,tim=3273173173976
    FETCH #24:c=0,e=31,p=0,cr=1,cu=0,mis=0,r=0,dep=1,og=1,tim=3273173174119
    =====================

  14. tanelp says:

    Hi Peter,

    Yes, you’re right. Had the waits been for a previous cursor, we should have seen an EXEC# 25 or FETCH# 25 line after those waits, before PARSING IN.

    My first hunch was that this may be due auditing (insert into AUD$ or FGA_LOG$), but audit inserts should have been executed through a separate, recursive cursor.

    So what were those obj#’s seein in waits referring to? Were those normal tables/indexes?

    Can you reproduce this issue at will? If yes and if you happen to be on Solaris 10, I can give you a DTrace script which dumps the process stack when it waits for buffer busy waits or enq TX: index contention.

    This would give definitive proof on what caused those waits.

  15. peter says:

    hi Tanel,

    I’m trying to install Sesspack 0.05 – I got it from the Hotsos Symposium 2008 download back in March.

    It includes a synonym for sawr$sess_event_delta. But this view does not seem to be present in the install_sesspack_schema script in version 0.05 (it was there in 0.04). Do I need that view for 0.05?

    br
    peter

  16. tanelp says:

    Hi Peter,

    Yep the sawr$sess_event_delta should not be there anymore. This synonyms is a bug (and fixed in latest sesspack version). I’ll blog it soon.

  17. Pingback: Confluence: Productie

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>