Short note on KGX Mutexes

I received a question on what’s the point of the use of Mutexes for Oracle cursors in library cache. For short intro, I’m pasting one of my fairly recent answers in Oracle forums about Oracle mutexes here:

In Oracle, latches and mutexes are different things and managed using different modules. KSL* modules for latches and KGX* for mutexes.

General mutex operatins require less CPU instructions than latch operations (as they aren’t as sophisticated as latches and don’t maintain get/miss counts as latches do).

But the main scalability benefit comes from that there’s a mutex structure in each child cursor handle and the mutex itself acts as cursor pin structure. So if you have a cursor open (or cached in session cursor cache) you don’t need to get the library cache latch (which was previously needed for changing cursor pin status), but you can modify the cursor’s mutex refcount directly (with help of pointers in open cursor state area in sessions UGA).

Therefore you have much higher scalability when pinning/unpinning cursors (no library cache/library cache pin latching needed, virtually no false contention) and no separate pin structures need to be allocated/maintained.

Few notes:

  1. library cache latching is still needed for parsing etc, the mutexes address only the pinning issue in library cache
  2. mutexes are currently used for library cache cursors (not other objects like PL/SQL stored procs, table defs etc)
  3. As mutexes are a generic mechanism (not library cache specific) they’re used in V$SQLSTATS underlying structures too
  4. When mutexes are enabled, you won’t see cursor pins from X$KGLPN anymore (as X$KGLPN is a fixed table based on the KGL pin array – which wouldn’t be used for cursors anymore)

Performance Tools Quick Reference Guide

There’s a nice Metalink Note 438452.1 about various less known Oracle performance tuning utilities.

If you haven’t heard about things like StackX, LTOM, HangFG, SQLTXPLAIN, OS_Watcher or OPDG then it’s time to check this note out! :)

Future appearances

In addition to Hotsos Symposium this week, I will deliver my Advanced Oracle Troubleshooting presentation and demos at NYOUG DBA SIG this Thursday (06. March 2008) in Oracle’s Park Avenue office in Manhattan, NYC. If you are interested, see the details at http://www.nyoug.org/upcoming_events.htm#dbaname

Also, I will be speaking at Trivadis Performance Days in Zurich  23.-24. April 2008. I will deliver the following presentations/demos:

  • Advanced Oracle Troubleshooting
  • Performance and Scalability Improvements in Oracle 10g and 11g

 If you are interested, visit this link:

https://www.trivadis.com/shop/Kursdetail.aspx?KategorieID=2&SubkategorieID=78&KurseID=608

Cary Millsap is blogging

If you work with Oracle databases and are interested in system performance, then you probably want to know what Cary Millsap has to say. So, make sure you bookmark his blog: http://carymillsap.blogspot.com/

SQL*Net message to client wait isn’t really what it’s thought to be

In a recent Oracle Forum thread a question came up how to use SQL*Net message to client wait events for measuring network latency between server and client. The answer is that you can’t use it for network latency measurements at all, due how TCP stack works and how Oracle uses it.

I’ll paste my answer here too, for people who don’t follow Oracle Forums:

As I wrote in that reply, “SQL*Net message to client” does NOT measure network latency! It merely measures how long it took to put the response message into TCP send buffer on the server!

Once the response packet is put into TCP send buffer, Oracle server process continues on and starts waiting for “SQL*Net message FROM client” again. It’s up to TCP stack to deliver this packet from this point and Oracle server process has no way for measuring directly when did the packet arrive (it would have to start intercepting TCP ACK packets at kernel level for that).

This behaviour also explains, why the “SQL*Net message TO client” waits are usually unbelievably short – like 1 microsecond etc. The light in vacuum travels only 300 meters in one microsecond, the signal in a cable travels much less – yet when sending a single packet from London to Hong-Kong, we see few microsecond “response times”…

Example:

solaris02$ fgrep “SQL*Net message to client” sol10g_ora_837.trc | head
WAIT #8: nam=’SQL*Net message to client’ ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=16418611294
WAIT #7: nam=’SQL*Net message to client’ ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=16418644515
WAIT #5: nam=’SQL*Net message to client’ ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=16428803213
WAIT #5: nam=’SQL*Net message to client’ ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=16428880945
WAIT #5: nam=’SQL*Net message to client’ ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=16428927443
WAIT #5: nam=’SQL*Net message to client’ ela= 0 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=16428973661
WAIT #5: nam=’SQL*Net message to client’ ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=16429019250
WAIT #5: nam=’SQL*Net message to client’ ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=16429066742
WAIT #5: nam=’SQL*Net message to client’ ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=16429114761
WAIT #5: nam=’SQL*Net message to client’ ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=16429162471

Note that you can see longer times spent waiting for “SQL*Net message to client” when sending large amounts of data. This happens when your TCP send buffer gets full, thus TCP stack can not accept further packets. Sending will be blocked until the remote site sends back further ACK packets which state up to which byte in the TCP transmission stream it has received the data.

So, if you’re sending loads of data over a slow link or misconfigured TCP connection, the “SQL*Net message to client” wait time can be used as a low-confidence indicator of your SQL*Net throughput (in conjuction with “bytes sent via SQL*Net to client”), but never a measure of network latency!

Note that this is not OS platform specific, this is just how TCP and SQL*Net work.

Oracle hidden costs revealed, part 1 – Does a batch job run faster when executed locally?

This series is about revealing some Oracle’s internal execution costs and inefficiencies. I will analyze few situations and special cases where you can experience a performance hit where you normally wouldn’t expect to.

The first topic is about a question I saw in a recent Oracle Forum thread.

The question goes like this: “Is there any benefit if I run long sql queries from the server (by using telnet,etc) or from the remote by sql client.”

In order to leave out the network transfer cost of resultset for simplicity, I will rephrase the question like that: “Do I get better performance when I execute my server-side batch jobs (which don’t return any data to client) locally from the database server versus a remote application server or workstation?”

The obvious answer would be “NO, it does not matter where from you execute your batch job, as Oracle is a client server database system. All execution is done locally regardless of the client’s location, thus the performance is the same”.

While this sounds plausible in theory, there is (at least) one practical issue which can affect Oracle server performance depending on the clients platform and client libaries version.

It is caused by regular in-band break checking in client server communication channel where out of band break signalling is not available. A test case is below:

[Read more...]

Excellent article on Oracle 11g PL/SQL function result cache

I have so far avoided writing such pointer blog posts which only refer you to another article, but I have to do it with this one. Adrian Billington has written an excellent article on performance of Oracle 11g PL/SQL function result cache. His article is a good example of a thorough, well organized and well written technical content. I really enjoyed reading it and thanks to his thoroughness, he has just saved me some precious time doing that research on my own.

Get the article here: http://www.oracle-developer.net/display.php?id=504

Updated Session Wait script

Few days ago I wrote the post about systematic Unix troubleshooting, I mentioned my sw.sql script for quick session wait monitoring from sqlplus.
I had forgot to upload the latest version of my script, which is able to decode the enqueue and latch names from P1 values of V$SESSION_WAIT.

I have uploaded it now and you can get it from here: http://www.tanelpoder.com/files/scripts/sw.sql

This script allows you to easily see the lock or latch session is waiting on even on before Oracle 10g.

Also, not all latches and enqueues are externalized into separate wait events in 10g+, but only the most usual ones.

The below output is from Oracle 11.1.0.6 where only 27 latch types of total 496 do have their own wait event:

SQL> select count(*) from v$latch;

  COUNT(*)
----------
       496

SQL> select name from v$event_name where name like 'latch:%';

NAME
----------------------------------------------------------------
latch: cache buffers chains
latch: redo writing
latch: redo copy
latch: Undo Hint Latch
latch: In memory undo latch
latch: MQL Tracking Latch
latch: row cache objects
latch: shared pool
latch: session allocation
latch: messages
latch: enqueue hash chains
latch: ges resource hash list
latch: gcs resource hash
latch: cache buffers lru chain
latch: checkpoint queue latch
latch: cache buffer handles
latch: object queue header operation
latch: redo allocation
latch: gc element
latch: undo global data
latch: Change Notification Hash table latch
latch: change notification client cache latch
latch: lob segment hash table latch
latch: lob segment query latch
latch: lob segment dispenser latch
latch: virtual circuit queues
latch: parallel query alloc buffer

27 rows selected.

Not all enqueue types have their own wait either:

SQL> select count(*) from v$lock_type;

  COUNT(*)
----------
       187

SQL> select count(distinct(substr(name,6,2))) locks from v$event_name where name like 'enq:%';

     LOCKS
----------
       161

The SUBSTR trick was needed because some enqueues show different wait events depending for what purpose the enqueue is required.

Anyway, here is the output from my script. Note the syntax how I pass a SQL statement as a parameter (enclosed in double quotes) to dynamically generate a list of SIDs of interest (you may need to make your browsing window wider to see all output. Also note that the blogging engine keeps replacing normal quotes & double-quotes with “nicer” looking ones, so directly pasting these commands to sqlplus may not work).

SQL> @sw "select sid from v$session where type != 'BACKGROUND'"

    SID STATE   EVENT                                          SEQ# SEC_IN_WAIT         P1         P2         P3 P1TRANSL
------- ------- ---------------------------------------- ---------- ----------- ---------- ---------- ---------- ------------------------------------------
    144 WAITING SQL*Net message from client                   27328           3 1650815232          1          0
    127 WAITING SQL*Net message from client                    7008           4 1650815232          1          0
    170 WAITING SQL*Net message from client                    1279         221 1650815232          1          0
    118 WAITING enq: TX - row lock contention                    35        2053 1415053318     458777       1454 0x54580006: TX mode 6
    116 WAITING jobq slave wait                                   3           2          0          0          0
    114 WAITING kksfbc child completion                        2694           0          0          0          0
    111 WAITING latch: cache buffers chains                   10673           0 1586048564        141          0 0x5E892E34: cache buffers chains[c3482]
    119 WORKING On CPU / runqueue                              5703           0 1413697536          1          0
    117 WORKING On CPU / runqueue                             51328           0 1586048564        141          0
    132 WORKING On CPU / runqueue                              7457           0 2314369474    7667712    6263293

10 rows selected.

SQL> @sw "select sid from v$session where type != 'BACKGROUND'"

    SID STATE   EVENT                                          SEQ# SEC_IN_WAIT         P1         P2         P3 P1TRANSL
------- ------- ---------------------------------------- ---------- ----------- ---------- ---------- ---------- ------------------------------------------
    144 WAITING SQL*Net message from client                   27328           3 1650815232          1          0
    127 WAITING SQL*Net message from client                    7008           4 1650815232          1          0
    170 WAITING SQL*Net message from client                    1279         222 1650815232          1          0
    114 WAITING cursor: pin S wait on X                        2769           0 3157682392    8650752     327681
    118 WAITING enq: TX - row lock contention                    35        2054 1415053318     458777       1454 0x54580006: TX mode 6
    116 WAITING jobq slave wait                                   3           2          0          0          0
    111 WAITING latch: shared pool                            10765           0  537529180        277          0 0x200A0B5C: shared pool[c1]
    117 WORKING On CPU / runqueue                             51423           0  537529180        277          0
    119 WORKING On CPU / runqueue                              5709           0 1413697536          1          0
    132 WORKING On CPU / runqueue                              7577           0 1586048564        141          0

10 rows selected.

In addition to latch name, this script also shows the latch address in SGA and the child latch number, which allow identifying if several sessions are waiting on the same actual latch or just another child latch with same name.

Enjoy ;)

Systematic application troubleshooting in Unix

How many times have you seen a following case, where a user or developer complains that their Oracle session is stuck or running very slowly and the person who starts investigating the issue does following:

  1. Checks the database for locks
  2. Checks free disk space
  3. Checks alert log
  4. Goes back to the client saying “we did a healthcheck and everything looks ok” and closes the case or asks the user/developer to contact application support team or tune their SQL

The point here is that what the heck do the database locks, alert log or disk space have to do with first round session troubleshooting, when Oracle provides just about everything you need in one simple view?

Yes, I am talking about sampling V$SESSION_WAIT here. Database locks, free space and potential errors in alert log may have something to do with your users problems, but not necessarily. As there are many more causes, like network issues etc which could affect your user (and the whole database), it doesn’t make sense to go through all those random “healthchecks” every time you receive a user phone call. Moreover, even if you identify that there is shortage of disk space or there are many database locks – so what? They may not have anything to do with the users problem.

The issue here is that still many people do not know about V$SESSION_WAIT which in most cases shows your problem immediately or at least points you to right direction (e.g. there’s no need to check for locks if your session is waiting on “log file switch (archiving needed)” wait – and vice versa). Even if “these people” have heard of V$SESSION_WAIT and may be able to drop this in during their job interview, they may not know how to use it in systematic troubleshooting context. Many hours of service downtime and user frustration would be saved if all DBAs knew this extremely simple concept of looking at V$SESSION_WAIT.

This blog entry is not about Oracle though, so I will leave this rant for a future blog post.

This post is about a similar problem in Unix world. Having been involved with resolving some serious production issues lately I have been surprised quite many times by the corporate Unix support people who seem to do behave in similar manner. For example, there is a user calling in saying that their scheduled Unix job, which normally takes 5 minutes, has been running for hours now. The “senior unix support analyst” will do following:

[Read more...]

How to resolve SQL object and column names all the way to base tables and columns in Oracle?

If you have been involved in tuning SQL code which you have never seen before, you are probably familiar with the challenges of understanding what the code is trying to do. This can be especially time consuming when the SQL references lots of views, which reference views, which reference more views etc. So there may be a large information gap between the SQL statement (like select * from some_crazy_10_level_view) and the actual execution plan (referencing 10s of tables, with evidence of query transformations).

So unless you see something really obvious from the execution plan, you need to start mapping the SQL query and view texts back to the physical base tables which Oracle eventually has to access. This can be a tedious and boring (!) process.

The good news is that in Oracle 10.2+ there’s a hidden parameter that can do this mapping task for us.

Update: It looks like 9.2.0.8 patchset also has this parameter (so I guess later patchsets of 10.1.0.x have it as well now)

Let’s see an example:

I create a view on a view to illustrate the point:

SQL> create view myview as select * from all_users;

View created.

Now let’s set that parameter _dump_qbc_tree to 1 and run a query against the view:

SQL> alter session set "_dump_qbc_tree"=1;

Session altered.

SQL> select count(*) from myview;

  COUNT(*)
----------
        31

Now let’s look into the server process tracefile:

*** ACTION NAME:() 2007-09-16 12:19:57.500
*** MODULE NAME:(SQL*Plus) 2007-09-16 12:19:57.500
*** SERVICE NAME:(SYS$USERS) 2007-09-16 12:19:57.500
*** SESSION ID:(146.1984) 2007-09-16 12:19:57.500
QCSDMP: -------------------------------------------------------
QCSDMP:  SELECT: (qbc=2B8D1C28)
QCSDMP:    . (COUNT(*)) (opntyp=2 opndty=0)
QCSDMP:  FROM:
QCSDMP:    .MYVIEW
QCSDMP:      VQB:
QCSDMP:        SELECT: (qbc=2B8D163C)
QCSDMP:          .USERNAME
QCSDMP:        FROM:
QCSDMP:          .ALL_USERS
QCSDMP:            VQB:
QCSDMP:              SELECT: (qbc=2B8CAF78)
QCSDMP:                U.NAME (USERNAME)
QCSDMP:              FROM:
QCSDMP:                SYS.TS$ (TTS)
QCSDMP:                SYS.TS$ (DTS)
QCSDMP:                SYS.USER$ (U)

Here it is, the query text generated directly from parse tree, showing the base tables regardless that they had been hidden behind multiple views.

Also there’s few interesting things to note:

[Read more...]