Advanced Oracle Troubleshooting Guide – Part 10: Index unique scan doing multiblock reads?!

When you troubleshoot an Oracle (performance) problem you usually want to find out what are the problem sessions doing (wrong) and then why are they doing it.

The “what” question is usually answered using some wait interface based method – like ASH or SQL*Trace, which both add plenty of extra details to the plain wait event data.

My normal session troubleshooting data collection sequence consists of only three steps:

  1. Interpret session-level top wait and top SQL data (ASH, SQL*Trace or custom tools sampling session-level V$views)
  2. Interpret session-level performance counter data (V$SESSTAT)
  3. Sample the problem process stack and list the top stack branches where execution spent the most time

Only looking into #1 is enough in 90% of problem cases. But for advanced troubleshooting of complex and exotic problems I regularly have to dive into the #2 – V$SESSTAT performance counters (That’s the reason why I wrote Snapper for example).

Maybe only once out of every 40-50 troubleshooting cases I dive down into #3 – process stack sampling. While process stack sampling doesn’t lie – it helps especially in these cases where the Oracle’s instrumentation itself doesn’t have a clue what a session/process is doing – it’s like a microscope. It gives you very detailed overview of one part of the problem, but not the big picture. That’s why I never start from #3 when troubleshooting (and if you do, you are doing it wrong – with an exception of a completely hung database ;) I always start from #1 and proceed to #2 only when needed.

Snapper lists both #1 and #2 data for a process by the way.

Anyway, the reason why I’m writing this post is that a few days ago I received a question about an issue which nicely illustrates a case where step #1 shows that there’s an issue and step #2 explains why it’s happening.

There was an execution plan, which ran slightly slower on a new server compared to the old one – and it did read 36 blocks per execution as opposed to only 5:

select ...
from TABLE_X
where ...
 
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.02       0.02          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        2      0.00       0.07         36          5          0           1
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        4      0.02       0.09         36          5          0           1

The “disk” column above shows 36 blocks being read in via physical IO – while only 5 buffers had actually been accessed after that (logical IOs are shown in the “query” column).

Now that’s weird – as logical IOs are the ones which cause physical IOs, so the LIO count should normally be equal or greater than the physical IO count.

What’s more stranger, the execution plan is an index unique scan:

 
 
Rows     Row Source Operation
-------  ---------------------------------------------------
      1  TABLE ACCESS BY INDEX ROWID TABLE_X (cr=5 pr=36 pw=0 time=0 us cost=4 size=26 card=1)
      1   INDEX UNIQUE SCAN TABLE_X_IDX (cr=4 pr=35 pw=0 time=0 us cost=3 size=0 card=1)(object id 51063)
 

… but if you look into the “pr=” column, which means “blocks read via physical read”, then you see that 35 out of the 36 blocks were read in by the INDEX UNIQUE SCAN row source (pr=35). These counters are cumulative, so the pr=36 in the parent TABLE ACCESS BY INDEX ROWID row source means that there was only one block read by physical IO at the table access level. So why does the index access read 35 blocks?

Is it index height? Can’t be – as the max index height is 24 levels (including the leaf level). And such an index would be ridiculous anyway :)

Let’s look into the wait profile:

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                       2        0.00          0.00
  Disk file operations I/O                        4        0.00          0.00
  db file sequential read                         4        0.01          0.04
  db file scattered read                          1        0.03          0.03
  SQL*Net message from client                     2        0.15          0.15
********************************************************************************

Wow, an index unique scan which is supposed to traverse the index tree with precise single block reads, has done 4 single block reads and one multiblock read!

Now this is where the wait interface and ASH-style data would not help us any further. We know what’s happening – a multiblock read happening where “it shouldn’t”. We still don’t know why this is happening. We would have to guess from here. And this is where V$SESSTAT comes in – there’s over 600 performance counters maintained for each session and many of them show valuable information about what kind of operations (and how many) the RDBMS kernel has done. So I asked the client to run Snapper on a test session and here’s the output (note that the below output is from a different test run where the queries were executed many times in a loop):


----------------------------------------------------------------------------------------------------------------------------------------
    SID, USERNAME  , TYPE, STATISTIC                                                 ,         DELTA, HDELTA/SEC,    %TIME, GRAPH
----------------------------------------------------------------------------------------------------------------------------------------
   1612, ORACLE    , STAT, opened cursors cumulative                                 ,          6617,      1.32k,
   1612, ORACLE    , STAT, user calls                                                ,         13234,      2.65k,
   1612, ORACLE    , STAT, pinned cursors current                                    ,            -1,        -.2,
   1612, ORACLE    , STAT, session logical reads                                     ,          5293,      1.06k,
   1612, ORACLE    , STAT, CPU used when call started                                ,             7,        1.4,
   1612, ORACLE    , STAT, CPU used by this session                                  ,             7,        1.4,
   1612, ORACLE    , STAT, DB time                                                   ,           306,       61.2,
   1612, ORACLE    , STAT, user I/O wait time                                        ,           223,       44.6,
   1612, ORACLE    , STAT, non-idle wait time                                        ,           224,       44.8,
   1612, ORACLE    , STAT, non-idle wait count                                       ,          6775,      1.36k,
   1612, ORACLE    , STAT, process last non-idle time                                ,             5,          1,
   1612, ORACLE    , STAT, physical read total IO requests                           ,           158,       31.6,
   1612, ORACLE    , STAT, physical read total multi block requests                  ,            31,        6.2,
   1612, ORACLE    , STAT, physical read total bytes                                 ,       8265728,      1.65M,
   1612, ORACLE    , STAT, cell physical IO interconnect bytes                       ,       8265728,      1.65M,
   1612, ORACLE    , STAT, consistent gets                                           ,          5293,      1.06k,
   1612, ORACLE    , STAT, consistent gets from cache                                ,          5293,      1.06k,
   1612, ORACLE    , STAT, consistent gets - examination                             ,          5293,      1.06k,
   1612, ORACLE    , STAT, physical reads                                            ,          1009,      201.8,
   1612, ORACLE    , STAT, physical reads cache                                      ,          1009,      201.8,
   1612, ORACLE    , STAT, physical read IO requests                                 ,           158,       31.6,
   1612, ORACLE    , STAT, physical read bytes                                       ,       8265728,      1.65M,
   1612, ORACLE    , STAT, free buffer requested                                     ,          1008,      201.6,
   1612, ORACLE    , STAT, physical reads cache prefetch                             ,           851,      170.2,
   1612, ORACLE    , STAT, physical reads prefetch warmup                            ,           851,      170.2,
   1612, ORACLE    , STAT, shared hash latch upgrades - no wait                      ,           188,       37.6,
   1612, ORACLE    , STAT, calls to get snapshot scn: kcmgss                         ,          1319,      263.8,
   1612, ORACLE    , STAT, file io service time                                      ,            25,          5,
   1612, ORACLE    , STAT, file io wait time                                         ,       2219641,    443.93k,
   1612, ORACLE    , STAT, min active SCN optimization applied on CR                 ,             1,         .2,
   1612, ORACLE    , STAT, index fetch by key                                        ,          1292,      258.4,
   1612, ORACLE    , STAT, session cursor cache hits                                 ,          6440,      1.29k,
   1612, ORACLE    , STAT, buffer is not pinned count                                ,          1277,      255.4,
   1612, ORACLE    , STAT, parse time elapsed                                        ,             6,        1.2,
   1612, ORACLE    , STAT, parse count (total)                                       ,          6380,      1.28k,
   1612, ORACLE    , STAT, execute count                                             ,          6378,      1.28k,
   1612, ORACLE    , STAT, bytes sent via SQL*Net to client                          ,       2195633,    439.13k,
   1612, ORACLE    , STAT, bytes received via SQL*Net from client                    ,       2721339,    544.27k,
   1612, ORACLE    , STAT, SQL*Net roundtrips to/from client                         ,          6374,      1.27k,
   1612, ORACLE    , TIME, parse time elapsed                                        ,        114215,    22.84ms,     2.3%, |@         |
   1612, ORACLE    , TIME, PL/SQL execution elapsed time                             ,         67004,     13.4ms,     1.3%, |@         |
   1612, ORACLE    , TIME, DB CPU                                                    ,        630000,      126ms,    12.6%, |@@        |
   1612, ORACLE    , TIME, sql execute elapsed time                                  ,       2595410,   519.08ms,    51.9%, |@@@@@@    |
   1612, ORACLE    , TIME, DB time                                                   ,       3190919,   638.18ms,    63.8%, |@@@@@@@   |
   1612, ORACLE    , WAIT, db file sequential read                                   ,        978601,   195.72ms,    19.6%, |@@        |
   1612, ORACLE    , WAIT, db file scattered read                                    ,       1213550,   242.71ms,    24.3%, |@@@       |
   1612, ORACLE    , WAIT, SQL*Net message to client                                 ,          9212,     1.84ms,      .2%, |          |
   1612, ORACLE    , WAIT, SQL*Net message from client                               ,       1667368,   333.47ms,    33.3%, |@@@@      |
--  End of Stats snap 1, end=2012-03-09 14:47:02, seconds=5

See the highlighted lines above and scroll right if you don’t see the metric values…

  • The physical read total multi block requests statistic confirms that indeed there were some multiblock reads done by the monitored session.
  • The physical reads cache prefetch metric indicates that there was some data block prefetching happening! This explains why we had multiblock reads done where only single block reads were really needed.
  • And the physical reads prefetch warmup metric explains why did we do any prefetching at all – it’s for buffer cache warmup, where after a restart or a significant buffer cache size increase Oracle decides to populate the empty buffer cache with any data as fast as possible. Oracle assumes that a random multiblock read doesn’t take that much more time than a random single block read anyway, so why not just read a bunch of additional blocks from around that single block that the index access wanted to read. The read operation will be slightly slower, but it might help to avoid multiple future single block reads as they are already in cache. This should affect things only when you restart the instance or significantly increase the buffer cache size. So on a steady state production system you shouldn’t see it (and it’s for a good cause anyway), but if you do performance tests on some test database, change something, bounce the instance and immediately run a test – this may affect your test results. Btw, this is controlled by a hidden parameter _db_cache_pre_warm, but I’m mentioning this just for educational purposes here, I have never fixed a problem by changing this parameter so you should touch it only in your sandbox hacking database :)

And now back to the original topic – this post wasn’t really about this specific prewarm feature, but rather a yet another example of how the V$SESSTAT metrics tell you much more than the usual wait interface and TOP-sql based approach. And you do not need to enable any tracing or undocumented events for getting these detailed metrics out of Oracle, just query the V$SESSION views at the right time. As I said in the beginning, I don’t jump to the microscopic view unless the higher level views aren’t helpful enough, so I recommend you to follow the 3-step “Advanced Oracle Troubleshooting” method when trying to diagnose what the heck some session is doing and why ;^)

Where is LOB data stored?

There was a question in Oracle-L about where is the LOB data actually stored (in the row or the LOB segments) and what are the exact conditions when a switch from one to another may happen. The documentation isn’t fully clear about this and the “4000 bytes” number may mislead people to think that you can store 4000 bytes of your data in a LOB item before it must move out-of-line.

I clarified this in my post in Oracle-L, but will post it here too. First check this diagram:
 

 
If you create the LOB column with DISABLE STORAGE IN ROW, then the LOB data is always stored out-of-line. LOB index is always used. Only the LOB ID is stored inline, and the ID is looked up in LOB index, where you’ll get the pointers to actual LOB chunk blocks.
 
If you create the LOB column with ENABLE STORAGE IN ROW, then the LOB data may be stored in-line or out-of-line.
 
If the total LOB data + overhead <= 4000 bytes, then the LOB item will be stored in-line. No LOB index is used, even if you modify the LOB later on as everything is stored in-line with the row and versioning/rollback is achieved with undo data.
 
If the total LOB data + overhead > 4000 bytes, then the LOB item will be stored out-of-line. If the LOB fits into 12 x LOB_chunk_size, then no LOB index entries are created, because the in-line LOB locator can store up to 12 pointers to the LOB chunk blocks for each lob item. So if your LOB chunk size is 8kB, you can store LOB items up to 96kB in size without inserting anything to LOB index. However if the LOB item is bigger, then no pointers are stored in-row and all pointers will be put to the LOB index.
 
Note that once you modify an existing LOB item (which is bigger than 4000 bytes with its overhead), but smaller than 12 x chunk_size, then LOB index will still be used after the first LOB change operation as pointers to the old LOB chunk versions have to be stored in it (LOB segments don’t rely on undo for rollback & consistency, but just use LOB chunk versioning managed by LOB index).
 
The “overhead” of an in-line LOB item is 36 bytes, so the actual LOB data must be 4000 – 36 = 3964 bytes or less in order to fully fit in-row. And note that we are talking about bytes here, not characters. With multibyte character sets a character in a CLOB may take multiple bytes.
 
I wrote a presentation about LOB internals quite some time ago, back when Oracle 10.1 came out I think – check it here:
 

 
Enjoy! :-)
 
P.S. The next Advanced Oracle Troubleshooting v2.0 class will start in 3 weeks! (I just added the online payment option back too for convenience:)
 

Exadata Smart Scan predicate offloading and sequence.NEXTVAL

There was a question in the twitter-sphere about whether using sequences (sequence.NEXTVAL) in your select query’s projection list would somehow disable smart scans happening?

The answer is no, sequence use with smart scans works just fine. The smart scan offloading applies to data retrieval row sources (and filtering) only and nothing else. So, what you have in the query’s projection list (the sequence use for example), does not directly affect the smart scan decision. Just like any other operations like sorting, grouping etc, do not have anything to do with smart scans and don’t disable their use. Smart scans are only related to data retrieval and any other operations do not affect them.

In other words, sequence.NEXTVAL can be used in a query and you can still get smart scans. Here’s simple proof:

 


SQL Monitoring Report

SQL Text
------------------------------
SELECT /*+ MONITOR */ t.object_name, s.NEXTVAL n FROM t WHERE owner LIKE 'TANEL%'

Global Information
------------------------------
 Status              :  DONE (ALL ROWS)
 Instance ID         :  1
 Session             :  TANEL (204:2707)
 SQL ID              :  5kzs2ucxwf4wk
 SQL Execution ID    :  16777216
 Execution Started   :  04/13/2012 09:06:46
 First Refresh Time  :  04/13/2012 09:06:46
 Last Refresh Time   :  04/13/2012 09:06:48
 Duration            :  2s
 Module/Action       :  SQL*Plus/-
 Service             :  dbm
 Program             :  sqlplus@mac02.local (TNS V1-V3)
 Fetch Calls         :  4

Global Stats
========================================================================================
| Elapsed |   Cpu   |    IO    | Concurrency | Fetch | Buffer | Read | Read  |  Cell   |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   | Calls |  Gets  | Reqs | Bytes | Offload |
========================================================================================
|    1.50 |    0.36 |     1.06 |        0.08 |     4 |   323K | 2503 |   2GB |  99.97% |
========================================================================================

SQL Plan Monitoring Details (Plan Hash Value=1589210007)
======================================================================================================================================================================
| Id |          Operation           | Name |  Rows   | Cost  |   Time    | Start  | Execs |   Rows   | Read | Read  |  Cell   | Activity |      Activity Detail      |
|    |                              |      | (Estim) |       | Active(s) | Active |       | (Actual) | Reqs | Bytes | Offload |   (%)    |        (# samples)        |
======================================================================================================================================================================
|  0 | SELECT STATEMENT             |      |         |       |         3 |     +0 |     1 |     1408 |      |       |         |          |                           |
|  1 |   SEQUENCE                   | S    |         |       |         3 |     +0 |     1 |     1408 |      |       |         |          |                           |
|  2 |    TABLE ACCESS STORAGE FULL | T    |    439K | 20738 |         4 |     +0 |     1 |     1408 | 2496 |   2GB |  99.97% |   100.00 | cell smart table scan (1) |
======================================================================================================================================================================

I have highlighted the query text (which includes sequence.NEXTVAL) and the TABLE ACCESS STORAGE FULL row source, if you scroll right, you’ll see the “cell smart table scan” wait event and also the Offload % column populated, which both indicate that smart scan did happen for that monitored query execution.

Note that often I hear a saying, something like “xyz disables smart scan”, or “using a non-offloadable function disables the smart scan”. This can be misleading and give the impression that if you use a non-Exadata supported function/feature in one part of your query, then smart scanning is disabled for the whole query. This is incorrect.

First of all, the smart scanning is not a query level thing, it’s an access path level thing. So, you can easily run a two table join on one table which is on NFS storage (no smart scanning) and another table, which is on Exadata storage (which allows smart scanning). So, even though smart scanning is “disabled” for the first table, the other table still can be accessed using it! By the way, it’s even possible to have some partitions of your table on NFS and other partitions of the same table on Exadata storage cells and the same table scan can use smart scanning for the partitions stored on the cells and regular buffered or direct path read access for the partitions on NFS. So, smart scanning is not some “all or nothing” query level thing, this decision is made separately for every single segment full scanned in the query. More details in the book.

Secondly, using a non-offloadable function in a filter predicate when retrieving rows does not prevent Oracle to push other, offloadable, predicate functions to be pushed to the storage cells. So again, it’s not an all-or-nothing decision. If a predicate function is not offloadable, it will just not be offloaded (and will show up as just a filter() predicate as opposed to both storage() and filter() predicate), but all the other supported predicates can still be offloaded without a problem. Here’s an example:


SELECT * FROM t 
WHERE 
     line = 9999
AND  mod(line,10) = 0 
AND  f1(line) =9999

Plan hash value: 1601196873

----------------------------------------------------------------
| Id  | Operation                 | Name | E-Rows | Cost (%CPU)|
----------------------------------------------------------------
|   0 | SELECT STATEMENT          |      |        |  2887 (100)|
|*  1 |  TABLE ACCESS STORAGE FULL| T    |      1 |  2887   (1)|
----------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - storage(("LINE"=9999 AND MOD("LINE",10)=0))
       filter(("LINE"=9999 AND MOD("LINE",10)=0 AND "F1"("LINE")=9999))

 

Note that the F1() function is a PL/SQL function and filtering based on PL/SQL functions can not be offloaded (as Exadata cells don’t run PL/SQL). This is visible in the predicate section where two predicates are shown in the storage() predicate list but the third one (F1) only shows up in the filter() predicate list.

So, as I explained in the beginning, the smart scan is not an all-or-nothing thing, this decision is made for every segment (partition) separately during query execution phase, again for every execution, of course assuming that we are doing a full table scan (of fast full index scan) on them – that part is fixed in the execution plan.

And as shown above, predicate offload is not an all-or-nothing thing either. If one predicate can’t be offloaded, then other predicates still can.

However, there’s a catch. If you combine such predicates with ANDs, Oracle can push the offloadable (storage()) predicates to the storage cells, where they will do the early filtering – and the last, non-offloadable (filter()) predicate is applied then on the retrieved dataset in the database layer.

But when you combine such predicates with an OR clause, then there’s a logical problem here. OR has to return rows which match ANY of the predicates (not ALL the predicates), therefore you can’t only return a subset of rows from the storage cells which match the predicate A and hope that this set contains all the rows which would match the (non-offloadable) predicate B as well. To evaluate B, you would still need to go through all the rows, but as B is non-offloadable, this has to be done in the database layer. All rows have to be fetched into the database layer for this.

This explains why the storage() predicate disappears when I chain my query predicates together with an OR instead of the AND:


SELECT * FROM t 
WHERE 
     line = 9999
AND  mod(line,10) = 0 
OR   f1(line) =9999

Plan hash value: 1601196873

----------------------------------------------------------------
| Id  | Operation                 | Name | E-Rows | Cost (%CPU)|
----------------------------------------------------------------
|   0 | SELECT STATEMENT          |      |        |  2963 (100)|
|*  1 |  TABLE ACCESS STORAGE FULL| T    |   6353 |  2963   (3)|
----------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(("F1"("LINE")=9999 OR ("LINE"=9999 AND MOD("LINE",10)=0)))

The storage() predicate section above is gone – as Oracle has to retrieve all the rows to the database layer anyway (due to the non-offloadable F1 function used in an OR predicate), it doesn’t make sense to offload the other filters to the storage (and then return the row anyway as it’s needed in the db layer). Makes sense? ;-)

Note that even though the storage() predicates disappeared from the plan, this doesn’t mean that the entire smart scanning has been disabled, the smart scanning can still happen, it’s only the predicate offloading sub-component of the smart scan what is not used. The storage cells can still do the asynchronous prefetching, decompression offloading and projection offloading in these cases.

By the way, it’s all explained in the book!

Create a database link with the new host:port/service syntax

I just noticed that (finally) in 11.2 this syntax is supported:

SQL> CREATE DATABASE LINK demo_x2 
  2  CONNECT TO tanel IDENTIFIED BY password 
  3  USING 'exadb03:1521/DEMO';
Database link created.

This just makes life a bit easier as there’s no need to use the long TNS format entry (or a tnsnames.ora/LDAP alias). It might work in 11.1 too (haven’t tested) but it didn’t work on 10.2.0.4 …

Update: This feature works for dblinks in 10.2 onwards – when I tested it on my 10.2, I got an error initially, but it was because the hostname I used didn’t resolve to an IP. Thanks to Randolf Geist for pointing this out.

In case you didn’t know, the sqlplus supports such an easy connect method since 10g:

tanel@mac02:~$ sqlplus tanel/password@exadb03/DEMO
SQL*Plus: Release 10.2.0.4.0 - Production on Mon Mar 5 09:51:27 2012
Copyright (c) 1982, 2007, Oracle. All Rights Reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
SQL>

If you omit the port from the syntax, it defaults to 1521.

Oracle Core: Essential Internals for DBAs and Developers book by Jonathan Lewis

In case you didn’t know, Jonathan Lewis’es new Oracle Core: Essential Internals for DBAs and Developers book is out (for a few weeks already).

I was the technical reviewer for that book and I can say it’s awesome! It will likely be the best Oracle internals book out there for the coming 10 years, just like Steve Adams’es Oracle Internal Services book was in the last decade :) 

Jonathan does a very good job explaining complex things in a simple enough way – and the book is not just dry listing of how things work inside Oracle database, but also why they work like they do and what are the benefits, limitations and side effects of the behavior.

So check out Jonathan’s book page for more details (and I think you can still order a heavily discounted ebook from Apress too):

Enjoy :)

Profiling trace files with preprocessor external tables in 11g and some parallel execution hacking

If you work with SQL Trace files (and profile them) then you should check out the awesome novel use of the “external table preprocessor” feature explained by Adrian Billington here:

Ironically just a day after writing my “Evil things” article, I noticed a note in MOS about how to enable an event 10384 at level 16384 to get a parallel plan to be executed in serial:

  • How to force that a Parallel Query runs in serial with the Parallel Execution Plan [ID 1114405.1]

This way you can still see the A-rows and other feedback like buffer gets and PIOs by row source even if the plan is a parallel plan. The problem is that even with GATHER_PLAN_STATISTICS enabled (or the equivalent parameter(s)) the PX slaves don’t pass their actual time, rows and buffer gets/PIOs stats back to the QC to be displayed in V$SQL_PLAN_STATISTICS / DBMS_XPLAN output. With parallel slaves, all you’d see would be the QC-generated numbers and not the PX slave stuff.

So if you set that magic event (at your own risk) then even the parallel plans would be executed by the QC only (basically a parallel plan executed entirely ins serial) and you’ll still see all the A-rows and buffer gets/physical read numbers as with serial plans. But remember my yesterday’s article ;-)

Alternative options for getting such runtime stats for a parallel query would be:

  1. Use the Real-Time SQL Monitoring feature (only on 11g and with diagnostics+tuning pack licenses)
  2. Enable SQL Trace for the QC, run your parallel query (you’ll have to wait until it finishes or cancel the query with CTRL+C) and consolidate the STAT# lines from all the PX slave sessions – each PX slave dumps the STAT# lines with row-source level stats like regular serial queries
Update: Coskan commented over twitter that why not use ALLSTATS ALL in DBMS_XPLAN.DISPLAY_CURSOR instead of ALLSTATS LAST – and indeed, in this case you would see the A-rows, buffer gets and other stats of both QC and the PX slaves aggregated together. Although I had used ALLSTATS ALL to compare average execution stats to the LAST stats, I never thought of using it for aggregating the PX slave stats together with QC like that.
 
Basically what happens (in 10g+) with PX is that the QC and all PX slaves all execute the same SQL_ID (but sometimes a different child version though, especially when you’re running cross-instance PX, in which case SQL Monitoring is a better option). And ALLSTATS ALL would aggregate the stats of all executions of the particular SQL_ID (and child number). So if you run something with 4 PX slaves with GATHER_PLAN_STATISTICS enabled, they all add their execution stats to the “ALL” columns. It’s just that the QC is always the last one to finish executing the query (that’s when the exec stats get updated in the V$SQL_PLAN_STATISTICS views), that’s why the ALLSTATS LAST always shows the limited QC stats and not the PX slave ones. But Allstats ALL solves that problem.
 
However the catch is that ALLSTATS ALL would accumulate all executions of this query (including your previous runs of it), so if you want to see fresh stats of only your latest query execution, you should adjust the SQL_ID so that a new cursor would be created. And you can adjust the SQL_ID by simply adding a comment like /* test 2 */ somewhere into your query text…
Thanks Coskan! :)

Evil things are happening in Oracle

Relax, I’m talking about the Oracle Database kernel here, not the corporation ;-)

Here’s a couple of more reasons why not to play around with undocumented debug events unless you’re really sure why and how would they help to solve your specific problem (and you’ve gotten a blessing in some form from Oracle support too):

$ oerr ora 10665
 10665, 00000, "Inject Evil Literals"
 // *Cause:  Event 10665 is set to some number &gt; 0, causing 1/(value-1) of all
 //          literals to be replaced by 2000 letter 'A's.  A value of 1 does
 //          not corrupt anything.
 // *Action: never set this event

$ oerr ora 10668

10668, 00000, "Inject Evil Identifiers"
 // *Cause:  event 10668 is set to some number &gt; 0, causing 1/(value-1) of all
 //          identifiers to be replaced by a maximum amount of x's.  It is
 //          common for an identifier to be parsed once with a max of 30 bytes,
 //          then reparsed later with a max of 4000, so it may not be possible
 //          to inject such an identifier without the aid of this event.  A
 //          value of 1 causes no identifiers to be corrupted.
 // *Action: never set this event

 

Some events are meant to be left alone. You don’t want to wake up the evil sleeping deep in the core of the Oracle Kernel-land!

 

By the way, Karl Arao once managed to capture what this evil creature looks like:

 

http://t.co/GVbxwfnJ

:)

What the heck is the SQL Execution ID – SQL_EXEC_ID?

Ok, I think it’s time to write another blog entry. I’ve been traveling and dealing with jetlag from 10-hour time difference, then traveling some more, spoken at conferences, drank beer, had fun, then traveled some more, trained customers, hacked some Exadatas and now I’m back home.

Anyway, do you know what is the SQL_EXEC_ID in V$SESSION and ASH views?

Oh yeah, it’s the “SQL Execution ID” just like the documentation says … all clear. Um … is it? I’d like to know more about it – what does it actually stand for?! Is it session level, instance level or a RAC-global counter? And why does it start from 16 million, not 1?

 

SQL> SELECT sql_exec_id FROM v$session WHERE sid = USERENV('SID');

SQL_EXEC_ID
-----------
   16777216

 

This number 16777216 looks strangely familiar – indeed, it’s 2^24.

When I run the same query again (incrementing the SQL_EXEC_ID counter for the same SQL), I see the counter going up by 1:

 

SQL> SELECT sql_exec_id FROM v$session WHERE sid = USERENV('SID');

SQL_EXEC_ID
-----------
   16777217

SQL> SELECT sql_exec_id FROM v$session WHERE sid = USERENV('SID');

SQL_EXEC_ID
-----------
   16777218

SQL> SELECT sql_exec_id FROM v$session WHERE sid = USERENV('SID');

SQL_EXEC_ID
-----------
   16777219

SQL> SELECT sql_exec_id FROM v$session WHERE sid = USERENV('SID');

SQL_EXEC_ID
-----------
   16777220

 

Further executions of the same query keep incrementing this counter, one by one – even if I run this same SQL from another session in the same instance. So, this SQL_EXEC_ID is not a session-scope value for each SQL_ID, it’s at least instance-wide. It looks like the counting starts from 2^24 (the bit representing 2^24 is set) and ignoring that bit for now, the counting works normally, one by one, starting from zero.

 

Note that changing even a single character in the SQL text (see the extra space in the end before the semi-colon) causes the SQL_ID to change and a different SQL_EXEC_ID counter to be reported (which starts from “zero” again). A separate SQL_EXEC_ID counter is maintained in shared pool for each SQL_ID:

SQL> SELECT sql_exec_id FROM v$session WHERE sid = USERENV('SID') ;

SQL_EXEC_ID
-----------
   16777216

 

So, obviously, when I have just restarted my instance and still see 16777216 as the starting SQL_EXEC_ID for any SQL I execute, it must mean that the full SQL_EXEC_ID value contains something else than just the execution number of this SQL_ID. Whenever I see such familiar values (like powers of 2), then I like to look into the values in hex format to see whether some higher order bits are used for some special purpose. Let’s run a new SQL statement:

 

SQL> SELECT sql_exec_id, TO_CHAR(sql_exec_id,'XXXXXXXX') hex FROM v$session WHERE sid = USERENV('SID') ;

SQL_EXEC_ID HEX
----------- ---------
   16777216   1000000

SQL> SELECT sql_exec_id, TO_CHAR(sql_exec_id,'XXXXXXXX') hex FROM v$session WHERE sid = USERENV('SID') ;

SQL_EXEC_ID HEX
----------- ---------
   16777217   1000001

SQL> SELECT sql_exec_id, TO_CHAR(sql_exec_id,'XXXXXXXX') hex FROM v$session WHERE sid = USERENV('SID') ;

SQL_EXEC_ID HEX
----------- ---------
   16777218   1000002

 

Indeed, it looks like the 25th bit (2^24) is always pre-set to 1, while the least significant 24 bits represent how many times this SQL ID has been executed in an instance (I have tested this with a loop – the 24 least significant bits do get used fully for representing the SQL ID’s execution count in the instance and once it reaches 0xFFFFFF – or 0x1FFFFFF with that pre-set 25th bit, it wraps to 0×1000000 – the 25th bit still remaining set!). So the SQL_EXEC_ID can reliably only track 2^24 – 1 SQL executions in an instance and then the counter wraps to beginning. This is why you should include SQL_EXEC_START (date datatype with 1 sec precision) column in your performance monitoring queries as well, to distinguish between SQL executions with a colliding SQL_EXEC_ID. As long as you’re executing your SQL statement less than 16.7 million times per second per instance, this should be fine :-)

 

Anyway, so what’s the magic 25th bit then? Well, in RAC it would be very hard to somehow coordinate the incrementing of a single counter globally (that’s why you want to keep your sequences cached in RAC), I figure that there are different counters for the same SQL ID in different RAC instances. Let’s check – I will log in to another RAC node (node 2) and run this:

 

SQL> SELECT sql_exec_id, TO_CHAR(sql_exec_id,'XXXXXXXX') hex FROM v$session WHERE sid = USERENV('SID') ;

SQL_EXEC_ID HEX
----------- ---------
   33554433   2000001

SQL> SELECT sql_exec_id, TO_CHAR(sql_exec_id,'XXXXXXXX') hex FROM v$session WHERE sid = USERENV('SID') ;

SQL_EXEC_ID HEX
----------- ---------
   33554434   2000002

SQL> SELECT sql_exec_id, TO_CHAR(sql_exec_id,'XXXXXXXX') hex FROM v$session WHERE sid = USERENV('SID') ;

SQL_EXEC_ID HEX
----------- ---------
   33554435   2000003

 

Whoa – the SQL Execution ID in the 2nd instance starts from 33 Million! And when you convert the value to hex, you’ll see that now the 26th bit is set – showing that this SQL was executed in instance #2!

So, it very much looks like that while the 24 least significant bits are used for the SQL execution ID counter, the more significant bits are used for showing which instance_id ran that SQL. Assuming that 32 bits are used for the whole SQL_EXEC_ID value, then up to 8 higher order bits could be used for storing the instance_id – supporting up to 256-node RAC clusters. This is very useful when analyzing past ASH data as you can aggregate data (count min/max exec ID difference to get the execution counts in a time range) either in each separate instance or globally – by stripping out the instance_id part from the value.

I haven’t tested the instance_id part with 256-node RAC clusters (as Santa Claus is cutting back due to poor economy), but at least on an 8-node full rack Exadata all 8 instance_ids were reported properly. Note that for serial queries, the SQL_EXEC_ID shows you the instance_id of the instance where the session is logged on to, but for inter-instance parallel query, you will see the instance_id of the query coordinator for all PX slaves, regardless of in which instances they run. Here’s a little script from a 8-node Exadata cluster to show it. I’ll leave it up to you to fully figure it out what, how and why it’s doing, but basically what it shows is that the SQL_EXEC_ID consists of the query coordinator’s instance_id value and the execution number for a SQL_ID in the instance where the query coordinator session was logged in:

 

SQL> SELECT qc_instance_id, MIN(TO_CHAR(sql_exec_id,'XXXXXXXX'))
  2  , MAX(TO_CHAR(sql_exec_id,'XXXXXXXX'))
  3* FROM gv$active_session_history GROUP BY qc_instance_id order by 1
SQL> /

QC_INSTANCE_ID MIN(TO_CH MAX(TO_CH
-------------- --------- ---------
             1   1000000   100540F
             2   2000000   20009BF
             3   3000000   300541E
             4   4000000   40000DD
             5   5000000   50C5035
             6   6000000   600018C
             7   7000000   700023D
             8   8000000   8000755
                 1000000   803DF3B

9 rows selected.

 

That’s all for today – more cool stuff is coming, I promise :-)

 

And oh, next week I’ll start another run of my Advanced Oracle Troubleshooting seminar, so check it out! ;-)

V8 Bundled Exec call – and Oracle Program Interface (OPI) calls

So, what he hell is that V8 Bundled Exec call which shows up in various Oracle 11g monitoring reports?!

It’s yet another piece of instrumentation which can be useful for diagnosing non-trivial performance problems. Oracle ASH has allowed us to measure what is the top wait event or top SQLID for a long time, but now it’s also possible to take a step back and see what type of operation the database session is servicing. 

I am talking about Oracle Program Interface (OPI) calls. Basically for each OCI call in the client side (like , OCIStmtExecute, OCIStmtFetch, etc) there’s a corresponding server side OPI function (like opiexe(), opifch2() etc). 

It has been possible to trace all the OPI calls with event 10051 as I’ve explained here, but since Oracle 11g this data is also conveniently accessible from ASH views (the various monitoring reports, including SQL Monitoring report also use ASH data for some features).

So, I can write a simple query against ASH, which doesn’t group the samples by wait event or SQLID, but just by the general OPI call type (TOPLEVELCALLNAME column) and also by the SQL command type (using V$SQLCOMMAND in 11.2):  

SQL> SELECT
  2      a.top_level_call#
  3    , a.top_level_call_name
  4    , a.top_level_sql_opcode
  5    , s.command_name
  6    , COUNT(*)
  7  FROM
  8      v$active_session_history a
  9    , v$sqlcommand s
 10  WHERE
 11      a.top_level_sql_opcode = s.command_type
 12  GROUP BY
 13      a.top_level_call#
 14    , a.top_level_call_name
 15    , a.top_level_sql_opcode
 16    , s.command_name
 17  ORDER BY
 18*     COUNT(*) DESC
 19  /

TOP_LEVEL_CALL# TOP_LEVEL_CALL_NAME     top_op# COMMAND_NAME                     COUNT(*)
--------------- -------------------- ---------- ------------------------------ ----------
             94 V8 Bundled Exec               7 DELETE                              10505
              0                               0                                      4041
             59 VERSION2                      0                                       579
             59 VERSION2                     47 PL/SQL EXECUTE                        377
             59 VERSION2                      3 SELECT                                191
             96 LOB/FILE operations         170 CALL METHOD                            67
             59 VERSION2                    170 CALL METHOD                            66
             94 V8 Bundled Exec               6 UPDATE                                 52
             59 VERSION2                      6 UPDATE                                 41
             59 VERSION2                      7 DELETE                                 36
             94 V8 Bundled Exec               3 SELECT                                 24
             96 LOB/FILE operations          47 PL/SQL EXECUTE                         18
             59 VERSION2                      2 INSERT                                  8
             94 V8 Bundled Exec               1 CREATE TABLE                            5
              0                               3 SELECT                                  3
             59 VERSION2                     15 ALTER TABLE                             1
             96 LOB/FILE operations           0                                         1
             59 VERSION2                     12 DROP TABLE                              1
              5 FETCH                         3 SELECT                                  1
             94 V8 Bundled Exec              12 DROP TABLE                              1

20 rows selected.

 

Aas you see above, most of the ASH samples in my test database have been created by a DELETE type SQL statement, executed via V8 Bundled Exec type of an OPI call.

So, what is this call about? Let’s explain its name. Look into other call types in the above output. In the bottom you see a FETCH call (fetching from a SELECT type statement). Also there are a few LOB/FILE operations calls, which are used exclusively for accessing LOB data (via LOB locator, bypassing the usual SQL processing layer).

In Oracle 7 you would also see PARSE and EXECUTE calls, but starting from Oracle 8 not anymore. This is because starting from Oracle 8, the OPI layer in database side can accept bundled OCI calls from the client – to reduce the number of network roundtrips. So, basically instead of sending the PARSE and EXEC requests in separate SQLNet roundtrips (increasing the latency), the OCI client libraries can bundle these requests together and send them to database in one SQLNet payload. The database server side understands it and is able to extract these separate OPI requests from the bundled packet (in right order) and execute the corresponding OPI function for each separate call. 

Note that this is why you frequently see the kpoal8() function close to the beginning in Oracle server process stack traces (and where I usually start reading them from), this is the function which processes all the OPI requests sent to it in a bundled package. So, whenever there’s a OCIStmtExecute() call extracted from the bundle, the opiexe() function is called in Oracle kernel with appropriate arguments extracted from the same bundle. Whenever we extract an OCIStmtFetch2() call from the bundle, the corresponding opifch2() function is called in the kernel.

Hopefully this explains why is there such a call “V8 Bundled Exec” in Oracle. It just allows to reduce client – server communication latency by allowing to bundle multiple database requests together into a single SQL*Net payload. In other words, it’s just how Oracle works and it’s perfectly normal to see V8 Bundled Exec as the top OPI call type in performnace reports. If you see this OPI call as the top one, then you’d need to drill down into what’s the actual SQLID which consumes the most of the response time (and further breakdown like which wait event and execution plan step takes the most time).

But the ability of breaking down database response time by OPI call becomes much more useful when troubleshooting somewhat more exotic performance problems like LOB access times (where there’s no SQL statement associated with the database call) or other direct OPI calls which are executed without parsing and running a SQL cursor.

For example, have you noticed that the behavior of ROLLBACK command in sqlplus is different from the shorter ROLL command?

When you issue a ROLLBACK, then Oracle will actually send the string “ROLLBACK” to the database as a regular SQL statement (using V8 Bundled Exec), it will be parsed there (with all the latching and shared pool overhead) as regular SQL – and then Oracle realizes that the command in it is a rollback. Then rollback is performed.

But if you issue a ROLL command, then sqlplus understands it and doesn’t send it to the database for parsing like a regular SQL statement. Instead it will send an OCITransRollback() call, which will call the corresponding OPI function directly, bypassing the SQL processing layer completely. Instead of the usual “V8 Bundled Exec” bundle call you would see a “Transaction Commit/Rollback” OPI call as it was called directly, without any SQL statement processing involved. This is why you sometimes see WAIT#0 lines in SQLTrace, where the waits seem to be associated with some non-existent cursor #0. Whenever the wait happens when the database session is servicing an OPI call which bypasses the SQL processing layer (kks/kkx modules) then the SQLTrace just shows cursor#0 as the wait’s “owner”.

The same happens when using things like connection.commit() in JDBC, the client does not send a SQL statement with text “commit” into the datbase, but rather will call out the OPI commit function out directly.

So, how many different OPI calls are there? Well, a lot, as you can see from v$toplevelcall (or its underlying x$orafn) in Oracle 11.2:

 

SQL> SELECT * FROM v$toplevelcall;

TOP_LEVEL_CALL# TOP_LEVEL_CALL_NAME
--------------- ----------------------------------------------------------------
              0
              2 OPEN
              3 PARSE
              4 EXECUTE
              5 FETCH
              8 CLOSE
              9 LOGOFF
             10 DESCRIBE
             11 DEFINE
             12 COMMIT ON
             13 COMMIT OFF
             14 COMMIT
             15 ROLLBACK
             16 SET OPTIONS
             17 RESUME
             18 VERSION
             20 CANCEL
             21 GET ERR MSG
             23 SPECIAL
             24 ABORT
             25 DEQ ROW
             26 FETCH LONG
             31 HOW MANY
             32 INIT
             33 CHANGE USER
             34 BIND REF POS
             35 GET BIND VAR
             36 GET INTO VAR
             37 BINDV REF
             38 BINDN REF
             39 PARSE EXE
             40 PARSE SYNTAX
             41 PARSE SYNSDI
             42 CONTINUE
             43 ARRAY DESC
             44 INIT PARS
             45 FIN PARS
             46 PUT PAR
             48 START ORACLE
             49 STOP ORACLE
             50 RUN IND PROC
             52 ARCHIVE OP
             53 MED REC STRT
             54 MED REC TABS
             55 MED REC GETS
             56 MED REC RECL
             57 MED REC CANC
             58 LOGON
             59 VERSION2
             60 INIT
             62 EVERYTHING
             65 DIRECT LOAD
             66 UL BUFFER XMIT
             67 DISTRIB XACTION
             68 DESCRIBE INDEXES
             69 SESSION OPS
             70 EXEC w/SCN
             71 FAST UPI
             72 FETCH LONG
             74 V7 PARSE
             76 PL/SQL RPC
             78 EXEC & FCH
             79 XA OPS
             80 KGL OP
             81 LOGON
             82 LOGON
             83 Streaming op
             84 SES OPS (71)
             85 XA OPS (71)
             86 DEBUG
             87 DEBUGS
             88 XA XA Start
             89 XA XA Commit
             90 XA XA Prepare
             91 x/import
             92 KOD OP
             93 RPI Callback with ctxdef
             94 V8 Bundled Exec
             95 Streaming op
             96 LOB/FILE operations
             97 FILE Create
             98 V8 Describe Query
             99 Connect
            100 OPEN Recursive
            101 Bundled KPR
            102 Bundled PL/SQL
            103 Transaction Start/End
            104 Transaction Commit/Rollback
            105 Cursor close all
            106 Failover session info
            107 SES OPS (80)
            108 Do Dummy Defines
            109 INIT V8 PARS
            110 FIN V8 PARS
            111 PUT V8 PAR
            112 TERM V8 PARS
            114 INIT UNTR CB
            115 OAUTH
            116 Failover get info
            117 Commit Remote Sites
            118 OSESSKEY
            119 V8 Describe Any
            120 Cancel All
            121 Enqueue
            122 Dequeue pre 8.1
            123 Object Transfer
            124 RFS op
            125 Notification
            126 Listen
            127 Commit Remote Sites >= V813
            128 DirPathPrepare
            129 DirPathLoadStream
            130 DirPathMiscOps
            131 MEMORY STATS
            132 AQ Prop Status
            134 remote Fetch Archive Log (FAL)
            135 Client ID propagation
            136 DR Server CNX Process
            138 SPFILE parameter put
            139 KPFC exchange
            140 V82 Object Transfer
            141 Push transaction
            142 Pop transaction
            143 KFN Operation
            144 DirPathUnloadStream
            145 AQ batch enqueue/dequeue
            146 File transfer
            147 PING
            148 TSM
            150 Begin TSM
            151 End TSM
            152 Set schema
            153 Fetch from suspended result-set
            154 Key value pair
            155 XS Create Session Op
            156 XS Session RoundtripOp
            157 XS Piggyback Oper.
            158 KSRPC Execution
            159 Streams combined capture/apply
            160 AQ replay information
            161 SSCR
            162 OSESSGET
            163 OSESSRLS
            165 workload replay data
            166 replay statistic data
            167 Query Cache Stats
            168 Query Cache IDs
            169 RPC Test Stream
            170 replay plsql rpc
            171 XStream Out
            172 Golden Gate RPC

151 rows selected.

 

A lot of calls … special stuff (like DESCRIBE) which will bypass the SQL layer completely (but may in turn invoke further recursive SQL statements through the Recursive Program Interface – RPI).

Ok, time to stop – if you want to learn more, enable SQL trace with waits & binds and event 10051 at level 1 in your test database and try to describe a table or read some LOB columns for example!

 

Full scans, direct path reads and ORA-8103 error hacking session video here (plus iTunes podcast address!)

I have uploaded the latest hacking session video to blip.tv. I have edited it a little, I cut out the part where I spilled an entire Red Bull onto my desk, with some onto my laptop (some keys are still sticky:)

Also, I do upload all these sessins into iTunes – so you can subscribe to my podcast! That way you can download the videos into your computer, phone or iPad. I have deliberately used 1024×768 resolution so it would look awesome on iPad screen! (so hopefully your commute time gets a bit more fun now ;-)

 

Enjoy!

iTunes video-podcast: