V8 Bundled Exec call – and Oracle Program Interface (OPI) calls

So, what he hell is that V8 Bundled Exec call which shows up in various Oracle 11g monitoring reports?!

It’s yet another piece of instrumentation which can be useful for diagnosing non-trivial performance problems. Oracle ASH has allowed us to measure what is the top wait event or top SQLID for a long time, but now it’s also possible to take a step back and see what type of operation the database session is servicing. 

I am talking about Oracle Program Interface (OPI) calls. Basically for each OCI call in the client side (like , OCIStmtExecute, OCIStmtFetch, etc) there’s a corresponding server side OPI function (like opiexe(), opifch2() etc). 

It has been possible to trace all the OPI calls with event 10051 as I’ve explained here, but since Oracle 11g this data is also conveniently accessible from ASH views (the various monitoring reports, including SQL Monitoring report also use ASH data for some features).

So, I can write a simple query against ASH, which doesn’t group the samples by wait event or SQLID, but just by the general OPI call type (TOPLEVELCALLNAME column) and also by the SQL command type (using V$SQLCOMMAND in 11.2):  

SQL> SELECT
  2      a.top_level_call#
  3    , a.top_level_call_name
  4    , a.top_level_sql_opcode
  5    , s.command_name
  6    , COUNT(*)
  7  FROM
  8      v$active_session_history a
  9    , v$sqlcommand s
 10  WHERE
 11      a.top_level_sql_opcode = s.command_type
 12  GROUP BY
 13      a.top_level_call#
 14    , a.top_level_call_name
 15    , a.top_level_sql_opcode
 16    , s.command_name
 17  ORDER BY
 18*     COUNT(*) DESC
 19  /

TOP_LEVEL_CALL# TOP_LEVEL_CALL_NAME     top_op# COMMAND_NAME                     COUNT(*)
--------------- -------------------- ---------- ------------------------------ ----------
             94 V8 Bundled Exec               7 DELETE                              10505
              0                               0                                      4041
             59 VERSION2                      0                                       579
             59 VERSION2                     47 PL/SQL EXECUTE                        377
             59 VERSION2                      3 SELECT                                191
             96 LOB/FILE operations         170 CALL METHOD                            67
             59 VERSION2                    170 CALL METHOD                            66
             94 V8 Bundled Exec               6 UPDATE                                 52
             59 VERSION2                      6 UPDATE                                 41
             59 VERSION2                      7 DELETE                                 36
             94 V8 Bundled Exec               3 SELECT                                 24
             96 LOB/FILE operations          47 PL/SQL EXECUTE                         18
             59 VERSION2                      2 INSERT                                  8
             94 V8 Bundled Exec               1 CREATE TABLE                            5
              0                               3 SELECT                                  3
             59 VERSION2                     15 ALTER TABLE                             1
             96 LOB/FILE operations           0                                         1
             59 VERSION2                     12 DROP TABLE                              1
              5 FETCH                         3 SELECT                                  1
             94 V8 Bundled Exec              12 DROP TABLE                              1

20 rows selected.

 

Aas you see above, most of the ASH samples in my test database have been created by a DELETE type SQL statement, executed via V8 Bundled Exec type of an OPI call.

So, what is this call about? Let’s explain its name. Look into other call types in the above output. In the bottom you see a FETCH call (fetching from a SELECT type statement). Also there are a few LOB/FILE operations calls, which are used exclusively for accessing LOB data (via LOB locator, bypassing the usual SQL processing layer).

In Oracle 7 you would also see PARSE and EXECUTE calls, but starting from Oracle 8 not anymore. This is because starting from Oracle 8, the OPI layer in database side can accept bundled OCI calls from the client – to reduce the number of network roundtrips. So, basically instead of sending the PARSE and EXEC requests in separate SQLNet roundtrips (increasing the latency), the OCI client libraries can bundle these requests together and send them to database in one SQLNet payload. The database server side understands it and is able to extract these separate OPI requests from the bundled packet (in right order) and execute the corresponding OPI function for each separate call. 

Note that this is why you frequently see the kpoal8() function close to the beginning in Oracle server process stack traces (and where I usually start reading them from), this is the function which processes all the OPI requests sent to it in a bundled package. So, whenever there’s a OCIStmtExecute() call extracted from the bundle, the opiexe() function is called in Oracle kernel with appropriate arguments extracted from the same bundle. Whenever we extract an OCIStmtFetch2() call from the bundle, the corresponding opifch2() function is called in the kernel.

Hopefully this explains why is there such a call “V8 Bundled Exec” in Oracle. It just allows to reduce client – server communication latency by allowing to bundle multiple database requests together into a single SQL*Net payload. In other words, it’s just how Oracle works and it’s perfectly normal to see V8 Bundled Exec as the top OPI call type in performnace reports. If you see this OPI call as the top one, then you’d need to drill down into what’s the actual SQLID which consumes the most of the response time (and further breakdown like which wait event and execution plan step takes the most time).

But the ability of breaking down database response time by OPI call becomes much more useful when troubleshooting somewhat more exotic performance problems like LOB access times (where there’s no SQL statement associated with the database call) or other direct OPI calls which are executed without parsing and running a SQL cursor.

For example, have you noticed that the behavior of ROLLBACK command in sqlplus is different from the shorter ROLL command?

When you issue a ROLLBACK, then Oracle will actually send the string “ROLLBACK” to the database as a regular SQL statement (using V8 Bundled Exec), it will be parsed there (with all the latching and shared pool overhead) as regular SQL – and then Oracle realizes that the command in it is a rollback. Then rollback is performed.

But if you issue a ROLL command, then sqlplus understands it and doesn’t send it to the database for parsing like a regular SQL statement. Instead it will send an OCITransRollback() call, which will call the corresponding OPI function directly, bypassing the SQL processing layer completely. Instead of the usual “V8 Bundled Exec” bundle call you would see a “Transaction Commit/Rollback” OPI call as it was called directly, without any SQL statement processing involved. This is why you sometimes see WAIT#0 lines in SQLTrace, where the waits seem to be associated with some non-existent cursor #0. Whenever the wait happens when the database session is servicing an OPI call which bypasses the SQL processing layer (kks/kkx modules) then the SQLTrace just shows cursor#0 as the wait’s “owner”.

The same happens when using things like connection.commit() in JDBC, the client does not send a SQL statement with text “commit” into the datbase, but rather will call out the OPI commit function out directly.

So, how many different OPI calls are there? Well, a lot, as you can see from v$toplevelcall (or its underlying x$orafn) in Oracle 11.2:

 

SQL> SELECT * FROM v$toplevelcall;

TOP_LEVEL_CALL# TOP_LEVEL_CALL_NAME
--------------- ----------------------------------------------------------------
              0
              2 OPEN
              3 PARSE
              4 EXECUTE
              5 FETCH
              8 CLOSE
              9 LOGOFF
             10 DESCRIBE
             11 DEFINE
             12 COMMIT ON
             13 COMMIT OFF
             14 COMMIT
             15 ROLLBACK
             16 SET OPTIONS
             17 RESUME
             18 VERSION
             20 CANCEL
             21 GET ERR MSG
             23 SPECIAL
             24 ABORT
             25 DEQ ROW
             26 FETCH LONG
             31 HOW MANY
             32 INIT
             33 CHANGE USER
             34 BIND REF POS
             35 GET BIND VAR
             36 GET INTO VAR
             37 BINDV REF
             38 BINDN REF
             39 PARSE EXE
             40 PARSE SYNTAX
             41 PARSE SYNSDI
             42 CONTINUE
             43 ARRAY DESC
             44 INIT PARS
             45 FIN PARS
             46 PUT PAR
             48 START ORACLE
             49 STOP ORACLE
             50 RUN IND PROC
             52 ARCHIVE OP
             53 MED REC STRT
             54 MED REC TABS
             55 MED REC GETS
             56 MED REC RECL
             57 MED REC CANC
             58 LOGON
             59 VERSION2
             60 INIT
             62 EVERYTHING
             65 DIRECT LOAD
             66 UL BUFFER XMIT
             67 DISTRIB XACTION
             68 DESCRIBE INDEXES
             69 SESSION OPS
             70 EXEC w/SCN
             71 FAST UPI
             72 FETCH LONG
             74 V7 PARSE
             76 PL/SQL RPC
             78 EXEC & FCH
             79 XA OPS
             80 KGL OP
             81 LOGON
             82 LOGON
             83 Streaming op
             84 SES OPS (71)
             85 XA OPS (71)
             86 DEBUG
             87 DEBUGS
             88 XA XA Start
             89 XA XA Commit
             90 XA XA Prepare
             91 x/import
             92 KOD OP
             93 RPI Callback with ctxdef
             94 V8 Bundled Exec
             95 Streaming op
             96 LOB/FILE operations
             97 FILE Create
             98 V8 Describe Query
             99 Connect
            100 OPEN Recursive
            101 Bundled KPR
            102 Bundled PL/SQL
            103 Transaction Start/End
            104 Transaction Commit/Rollback
            105 Cursor close all
            106 Failover session info
            107 SES OPS (80)
            108 Do Dummy Defines
            109 INIT V8 PARS
            110 FIN V8 PARS
            111 PUT V8 PAR
            112 TERM V8 PARS
            114 INIT UNTR CB
            115 OAUTH
            116 Failover get info
            117 Commit Remote Sites
            118 OSESSKEY
            119 V8 Describe Any
            120 Cancel All
            121 Enqueue
            122 Dequeue pre 8.1
            123 Object Transfer
            124 RFS op
            125 Notification
            126 Listen
            127 Commit Remote Sites >= V813
            128 DirPathPrepare
            129 DirPathLoadStream
            130 DirPathMiscOps
            131 MEMORY STATS
            132 AQ Prop Status
            134 remote Fetch Archive Log (FAL)
            135 Client ID propagation
            136 DR Server CNX Process
            138 SPFILE parameter put
            139 KPFC exchange
            140 V82 Object Transfer
            141 Push transaction
            142 Pop transaction
            143 KFN Operation
            144 DirPathUnloadStream
            145 AQ batch enqueue/dequeue
            146 File transfer
            147 PING
            148 TSM
            150 Begin TSM
            151 End TSM
            152 Set schema
            153 Fetch from suspended result-set
            154 Key value pair
            155 XS Create Session Op
            156 XS Session RoundtripOp
            157 XS Piggyback Oper.
            158 KSRPC Execution
            159 Streams combined capture/apply
            160 AQ replay information
            161 SSCR
            162 OSESSGET
            163 OSESSRLS
            165 workload replay data
            166 replay statistic data
            167 Query Cache Stats
            168 Query Cache IDs
            169 RPC Test Stream
            170 replay plsql rpc
            171 XStream Out
            172 Golden Gate RPC

151 rows selected.

 

A lot of calls … special stuff (like DESCRIBE) which will bypass the SQL layer completely (but may in turn invoke further recursive SQL statements through the Recursive Program Interface – RPI).

Ok, time to stop – if you want to learn more, enable SQL trace with waits & binds and event 10051 at level 1 in your test database and try to describe a table or read some LOB columns for example!

 

LOBREAD SQL Trace entry in Oracle 11.2 (and tracing OPI calls with event 10051)

A few days ago I looked into a SQL Tracefile of some LOB access code and saw a LOBREAD entry there. This is a really welcome improvement (or should I say, bugfix of a lacking feature) for understanding resource consumption by LOB access OPI calls. Check the bottom of the output below:

*** 2011-03-17 14:34:37.242
WAIT #47112801352808: nam='SQL*Net message from client' ela= 189021 driver id=1413697536 #bytes=1 p3=0 obj#=99584 tim=1300390477242725
WAIT #0: nam='gc cr multi block request' ela= 309 file#=10 block#=20447903 class#=1 obj#=99585 tim=1300390477243368
WAIT #0: nam='cell multiblock physical read' ela= 283 cellhash#=379339958 diskhash#=787888372 bytes=32768 obj#=99585 tim=1300390477243790
WAIT #0: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390477243865
[...snipped...]
WAIT #0: nam='SQL*Net more data to client' ela= 2 driver id=1413697536 #bytes=2048 p3=0 obj#=99585 tim=1300390477244205
WAIT #0: nam='SQL*Net more data to client' ela= 4 driver id=1413697536 #bytes=2048 p3=0 obj#=99585 tim=1300390477244221
WAIT #0: nam='gc cr multi block request' ela= 232 file#=10 block#=20447911 class#=1 obj#=99585 tim=1300390477244560
WAIT #0: nam='cell multiblock physical read' ela= 882 cellhash#=379339958 diskhash#=787888372 bytes=32768 obj#=99585 tim=1300390477245579
WAIT #0: nam='SQL*Net more data to client' ela= 16 driver id=1413697536 #bytes=2020 p3=0 obj#=99585 tim=1300390477245685
WAIT #0: nam='SQL*Net more data to client' ela= 6 driver id=1413697536 #bytes=2048 p3=0 obj#=99585 tim=1300390477245706
WAIT #0: nam='SQL*Net more data to client' ela= 5 driver id=1413697536 #bytes=1792 p3=0 obj#=99585 tim=1300390477245720
LOBREAD: c=1000,e=2915,p=8,cr=5,cu=0,tim=1300390477245735

In past versions of Oracle the CPU (c=) usage figures and other stats like number of physical/logical reads of the LOB chunk read OPI call were just lost – they were never reported in the tracefile. In past only the most common OPI calls, like PARSE, EXEC, BIND, FETCH (and recently CLOSE cursor) were instrumented with SQL Tracing. But since 11.2(.0.2?) the LOBREAD’s are printed out too. This is good, as it reduces the amount of guesswork needed to figure out what are those WAITs for cursor #0 – which is really a pseudocursor.

Why cursor#0? It’s because normally, with PARSE/EXEC/BIND/FETCH, you always had to specify a cursor slot number you operated on (if you fetch from cursor #5, it means that Oracle process went to slot #5 in the open cursor array in your session’s UGA and followed the pointers to shared cursor’s executable parts in library cache from there). But LOB interface works differently – if you select a LOB column using your query (cursor), then all your application gets is a LOB LOCATOR (sort of a pointer with LOB item ID and consistent read/version SCN). Then it’s your application which must issue another OPI call (LOBREAD) to read the chunks of that LOB out from the database. And the LOB locator is independent from any cursors, it doesn’t follow the same cursor API as regular SQL statements (as it requires way different functionality compared to a regular select or update statement).

So, whenever a wait happened in your session due to an access using a LOB locator, then there’s no specific cursor responsible for it (as far as Oracle sees internally) and that’s why a fake, pseudocursor #0 is used.

Note that on versions earlier than 11.2(.0.2?) when the LOBREAD wasn’t printed out to trace – you can use OPI call tracing (OPI stands for Oracle Program Interface and is the server-side counterpart to OCI API in the client side) using event 10051. First enable SQL Trace and then the event 10051 (or the other way around if you like):

SQL> @oerr 10051

ORA-10051: trace OPI calls

SQL> alter session set events '10051 trace name context forever, level 1';

Session altered.

Now run some LOB access code and check the tracefile:

*** 2011-03-17 14:37:07.178
WAIT #47112806168696: nam='SQL*Net message from client' ela= 6491763 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627178602
OPI CALL: type=105 argc= 2 cursor=  0 name=Cursor close all
CLOSE #47112806168696:c=0,e=45,dep=0,type=1,tim=1300390627178731
OPI CALL: type=94 argc=28 cursor=  0 name=V8 Bundled Exec
=====================
PARSING IN CURSOR #47112802701552 len=19 dep=0 uid=93 oct=3 lid=93 tim=1300390627179807 hv=1918872834 ad='271cc1480' sqlid='3wg0udjt5zb82'
select * from t_lob
END OF STMT
PARSE #47112802701552:c=1000,e=1027,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=1,plh=3547887701,tim=1300390627179805
EXEC #47112802701552:c=0,e=29,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3547887701,tim=1300390627179884
WAIT #47112802701552: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627179939
WAIT #47112802701552: nam='SQL*Net message from client' ela= 238812 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627418785
OPI CALL: type= 5 argc= 2 cursor= 26 name=FETCH
WAIT #47112802701552: nam='SQL*Net message to client' ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627418945
FETCH #47112802701552:c=0,e=93,p=0,cr=5,cu=0,mis=0,r=1,dep=0,og=1,plh=3547887701,tim=1300390627418963
WAIT #47112802701552: nam='SQL*Net message from client' ela= 257633 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627676629
OPI CALL: type=96 argc=21 cursor=  0 name=LOB/FILE operations
WAIT #0: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=99585 tim=1300390627676788
[...snip...]
WAIT #0: nam='SQL*Net more data to client' ela= 2 driver id=1413697536 #bytes=1792 p3=0 obj#=99585 tim=1300390627677054
LOBREAD: c=0,e=321,p=0,cr=5,cu=0,tim=1300390627677064

Check the bold and especially the red string above.  Tracing OPI calls gives you some extra details of what kind of tasks are executed in the session. The “LOB/FILE operations” call indicates that whatever lines come after it (unlike SQL trace call lines where all the activity happens before a call line is printed (with some exceptions of course)) are done for this OPI call (until a next OPI call is printed out). OPI call tracing should work even on ancient database versions…

By the way, if you are wondering, what’s the cursor number 47112801352808 in the “WAIT #47112801352808″ above? Shouldn’t the cursor numbers be small numbers?

Well, in 11.2.0.2 this was also changed. Before that, the X in CURSOR #X (and PARSE #X, BIND #X, EXEC #X, FETCH #X) represented the slot number in your open cursor array (controlled by open_cursors) in your session’s UGA. Now, the tracefile dumps out the actual address of that cursor. 47112801352808 in HEX is 2AD94DC9FC68 and it happens to reside in the UGA of my session.

Naturally I asked Cary Millsap about whether he had spotted this LOBREAD already and yes, Cary’s way ahead of me – he said that Method-R’s mrskew tool v2.0, which will be out soon, will support it too.

It’s hard to not end up talking about Cary’s work when talking about performance profiling and especially Oracle SQL trace, so here are a few very useful bits which you should know about:

If you want to understand the SQL trace & profiling stuff more, then the absolute must document is Cary’s paper on the subject – Mastering Performance with Extended SQL Trace:

Also, if you like to optimize your work like me (in other words: you’re proactively lazy ;-) and you want to avoid some boring “where-the-heck-is-this-tracefile-now” and “scp-copy-it-over-to-my-pc-for-analysis” work then check out Cary’s MrTrace plugin (costs ~50 bucks and has a 30-day trial) for SQL Developer. I’ve ended up using it myself regularly although I still tend to avoid GUIs:

ORA-4031 errors, contention, cursor management issues and shared pool fragmentation – free secret seminar!

Free stuff! Free stuff! Free stuff! :-)

The awesome dudes at E2SN have done it again! (and yes, Tom, this time the “we at E2SN Ltd” doesn’t mean only me alone ;-)

On Tuesday 22nd March I’ll hold two (yes two) Secret Oracle Hacking Sessions – about ORA-04031: unable to allocate x bytes of shared memory errors, cursor management issues and other shared pool related problems (like fragmentation). This event is free for all! You’ll just need to be fast enough to register, both events have 100 attendee limit (due to my GotoWebinar accont limitations).

I am going to run this online event twice, so total 200 people can attend (don’t register for both events, please). One event is in the morning (my time) to cater for APAC/EMEA region and the other session is for EMEA/US/Americas audience.

The content will be the same in both sessions. There will be no slides (you cant fix your shared pool problems with slides!) but there will be demos, scripts, live examples and fun (for the geeks among us anyway – others go and read some slides instead ;-)!

Exadata CAN do smart scans on bitmap indexes

As I’m finishing up a performance chapter for the Exadata book (a lot of work!), I thought to take a quick break and write a blog entry.

This is not really worth putting into my Oracle Exadata Performance series (which so far has only 1 article in it anyway) .. so this is a little stand-alone article …

Everybody knows that the Exadata smart scan can be used when scanning tables (and table partitions). You should also know that smart scan can be used with fast full scan on Oracle B-tree indexes (a fast full scan on an index segment is just like a full table scan, only on the index segment (and ignoring branch blocks)).

For some reason there’s a (little) myth circulating that smart scans aren’t used for scanning bitmap indexes.

So, here’s evidence, that smart scan can be used when scanning bitmap indexes:

SQL> select /*+ tanel3 */ count(*) from t1 where owner like '%XYZXYZ%';

...

Plan hash value: 39555139

-----------------------------------------------------------------------------------
| Id  | Operation                             | Name        | E-Rows | Cost (%CPU)|
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                      |             |        |   505 (100)|
|   1 |  SORT AGGREGATE                       |             |      1 |            |
|   2 |   BITMAP CONVERSION COUNT             |             |    400K|   505   (0)|
|*  3 |    BITMAP INDEX STORAGE FAST FULL SCAN| BI_T1_OWNER |        |            |
-----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - storage(("OWNER" LIKE '%XYZXYZ%' AND "OWNER" IS NOT NULL))
       filter(("OWNER" LIKE '%XYZXYZ%' AND "OWNER" IS NOT NULL))

So, as you see the execution plan sure shows a FAST FULL SCAN on a BITMAP INDEX segment, which happens to be on Exadata STORAGE.

Also, you see a storage() predicate applied on the line 3 of the execution plan, which means that Oracle will attempt to use a smart scan predicate offload – but this can’t always be done!

So, you can’t really determine whether a smart scan happened during execution just by looking into the execution plan, you should really check some V$SESSION statistics too. That’s where my Snapper script becomes handy.

I started Snapper on my session just before running the above query. The “smart table scan” and “smart index scan” performance counters are updated right after Oracle has opened the segment header and determines, from the number of blocks in the segment, whether to call the smart scan codepath or not. In other words, the smart scan counters are inremented in the beginning of the segment scan.

The output is following (some irrelevant counters are stripped for brevity):


@snapper all 5 1 "301"
Sampling SID 301 with interval 5 seconds, taking 1 snapshots...
setting stats to all due to option = all

-- Session Snapper v3.52 by Tanel Poder @ E2SN ( http://tech.e2sn.com )

-------------------------------------------------------------------------------------------------------------------------------------
    SID, USERNAME  , TYPE, STATISTIC                                                 ,     HDELTA, HDELTA/SEC,    %TIME, GRAPH
-------------------------------------------------------------------------------------------------------------------------------------
    301, TANEL     , STAT, physical read total IO requests                           ,         13,        2.6,
    301, TANEL     , STAT, physical read total multi block requests                  ,          4,         .8,
    301, TANEL     , STAT, physical read requests optimized                          ,          1,         .2,
    301, TANEL     , STAT, physical read total bytes optimized                       ,      8.19k,      1.64k,
    301, TANEL     , STAT, physical read total bytes                                 ,      4.63M,     925.7k,
    301, TANEL     , STAT, cell physical IO interconnect bytes                       ,     10.02k,         2k,
    301, TANEL     , STAT, physical reads                                            ,        565,        113,
    301, TANEL     , STAT, physical reads cache                                      ,          1,         .2,
    301, TANEL     , STAT, physical reads direct                                     ,        564,      112.8,
    301, TANEL     , STAT, physical read IO requests                                 ,         13,        2.6,
    301, TANEL     , STAT, physical read bytes                                       ,      4.63M,     925.7k,
    301, TANEL     , STAT, db block changes                                          ,          1,         .2,
    301, TANEL     , STAT, cell physical IO bytes eligible for predicate offload     ,      4.62M,    924.06k,
    301, TANEL     , STAT, cell physical IO interconnect bytes returned by smart scan,      1.82k,      364.8,
    301, TANEL     , STAT, cell blocks processed by cache layer                      ,        564,      112.8,
    301, TANEL     , STAT, cell blocks processed by txn layer                        ,        564,      112.8,
    301, TANEL     , STAT, cell blocks processed by index layer                      ,        564,      112.8,
    301, TANEL     , STAT, cell blocks helped by minscn optimization                 ,        564,      112.8,
    301, TANEL     , STAT, cell index scans                                          ,          1,         .2,
    301, TANEL     , STAT, index fast full scans (full)                              ,          1,         .2,
    301, TANEL     , STAT, index fast full scans (direct read)                       ,          1,         .2,
    301, TANEL     , STAT, bytes sent via SQL*Net to client                          ,        334,       66.8,
    301, TANEL     , STAT, bytes received via SQL*Net from client                    ,        298,       59.6,
    301, TANEL     , STAT, SQL*Net roundtrips to/from client                         ,          2,         .4,
    301, TANEL     , STAT, cell flash cache read hits                                ,          1,         .2,
    301, TANEL     , TIME, hard parse elapsed time                                   ,     1.17ms,    233.8us,      .0%, |          |
    301, TANEL     , TIME, parse time elapsed                                        ,      1.5ms,    300.2us,      .0%, |          |
    301, TANEL     , TIME, DB CPU                                                    ,       11ms,      2.2ms,      .2%, |          |
    301, TANEL     , TIME, sql execute elapsed time                                  ,     82.2ms,    16.44ms,     1.6%, |@         |
    301, TANEL     , TIME, DB time                                                   ,    84.36ms,    16.87ms,     1.7%, |@         |
    301, TANEL     , WAIT, enq: KO - fast object checkpoint                          ,    16.18ms,     3.24ms,      .3%, |          |
    301, TANEL     , WAIT, gc cr grant 2-way                                         ,      223us,     44.6us,      .0%, |          |
    301, TANEL     , WAIT, gc current grant 2-way                                    ,      136us,     27.2us,      .0%, |          |
    301, TANEL     , WAIT, cell smart index scan                                     ,    56.04ms,    11.21ms,     1.1%, |@         |
    301, TANEL     , WAIT, SQL*Net message to client                                 ,        7us,      1.4us,      .0%, |          |
    301, TANEL     , WAIT, SQL*Net message from client                               ,      4.42s,   884.47ms,    88.4%, |@@@@@@@@@ |
    301, TANEL     , WAIT, cell single block physical read                           ,      541us,    108.2us,      .0%, |          |
    301, TANEL     , WAIT, events in waitclass Other                                 ,     2.22ms,    443.2us,      .0%, |          |
--  End of Stats snap 1, end=2011-03-13 19:36:31, seconds=5

As you see from the above “cell index scans” statistic – indeed one index segment was scanned using the cell smart scan method.

So, I would rather call this feature “smart segment scan” to reflect that smart scan can scan more than just tables…

I guess one of the reasons why few people have seen smart bitmap index scans in action is that (single-column) bitmap indexes tend to be small. Smaller than corresponding table segments and B-tree index segments. On partitioned tables they’re much more likely going to be under the “_small_table_threshold” calculation which is used for determining whether to do a direct path full segment scan or not (yes, the _small_table_threshold applies to fast full index scan and fast full bitmap index scan too, not just table scans). So, it’s likely that Oracle chooses to do a regular, buffered full bitmap segment scan and thus won’t even consider using smart scan (as smart scans require direct path reads).

By the way – the direct path read (or not) decision is done per segment – not per object (like a table or index). So if you have 10 partitions in a table (or index), half of them are large, half are smaller, then Oracle may end up using direct path reads (and smart scan) on 5 of them and buffered (dumb) scan on the other 5. If you run something like Snapper on the session, then you’d see the smart scan counters go up by 5 only. As written above, Oracle decides whether to do direct path reads (and smart scan) right after opening the header block of a segment (partition) and reading out how many blocks this partition’s segment has below HWM.

The above applied to serial direct path reads – the Parallel Execution slaves should always read using direct path mode, right? …. Wrong :)

Well, partially wrong… In 11.2.0.2, if the parallel_degree_policy = manual, then yes, PX slaves behave like usual and always force a direct path read (and try to use a smart scan). However, with parallel_degree_policy = AUTO, which is the future of PX auto-management, Oracle can decide to do a buffered parallel scan instead, again disabling the use of smart scan…

One more note – I didn’t say anything about whether you should or should not use (bitmap) indexes on Exadata, it’s an entirely different discussion. I just brought out that the smart scan is used for scanning table segments, B-tree index segments and bitmap index segments if conditions are right.

And in the end I have to say…. that even with this evidence you can’t be fully sure that a smart scan was used throughout the entire segment, but more about this in the book and perhaps in a later blog article. We have interesting times ahead ;-)

Oracle Exadata Performance series – Part 1: Should I use Hugepages on Linux Database Nodes?

There was a question in LinkedIn forum about whether Linux Hugepages should be used in Oracle Exadata Database layer, as they aren’t enabled by default during ACS install. I’m putting my answer into this blog entry – apparently LinkedIn forums have a limit of 4000 characters per reply… (interestingly familiar number, by the way…:)

So, I thought that it’s time to start writing my Oracle Exadata Performance series articles what I’ve planned for a while… with some war stories from the field, some stuff what I’ve overcome when researching for writing the Expert Oracle Exadata book etc.

I’ve previously published an article about Troubleshooting Exadata Smart Scan performance and some slides from my experience with VLDB Data Warehouse migrations to Exadata.

Here’s the first article (initially planned as a short response in LinkedIn, but it turned out much longer though):

As far as I’ve heard, the initial decision to not enable hugepages by default was that the hugepages aren’t flexible & dynamic enough – you’ll have to always configure the hugepages at OS level to match your desired SGA size (to avoid wastage). So, different shops may want radically different SGA sizes (larger SGA for single-block read oriented databases like transactional/OLTP or OLAP cubes), but smaller SGA for smart scan/parallel scan oriented DWs. If you configure 40GB of hugepages on a node, but only use 1GB of SGA, then 39GB memory is just reserved, not used, wasted – as hugepages are pre-allocated. AMM, using regular pages, will only use the pages what it touches, so there’s no memory wastage due to any pre-allocation issues…

So, Oracle chose to use an approach which is more universal and doesn’t require extra OS level configuration (which isn’t hard at all though if you pay attention, but not all people do). So, less people will end up in trouble with their first deployments although they might not be getting the most out of their hardware.

However, before enabling hugepages “because it makes things faster” you should ask yourself what exact benefit would they bring you?

There are 3 main reasons why hugepages may be useful in Linux:

1) Smaller kernel memory usage thanks to less PTEs thanks to larger pagesizes

This means less pagetable entries (PTEs) and less kernel memory usage. The bigger your SGA and the more processes you have logged on, the bigger the memory usage.

You can measure this in your case – just “grep Page /proc/meminfo” and see how big portion of your RAM has been used by “PageTables”. Many people have blogged about this, but Kevin Closson’s blog is probably the best source to read about this:

2) Lower CPU usage due to less TLB misses in CPU and soft page-fault processing when accessing SGA.

It’s harder to measure this on Linux with standard tools, although it is sure possible (on Solaris you can just run prstat -m to get microstate accounting and look into TFL,DFL,TRP stats).

Anyway, the catch here is that if you are running parallel scans and smart scans, then you don’t access that much of buffer cache in SGA at all, all IOs or smart scan result-sets are read directly to PGAs of server processes – which don’t use large pages at all, regardless of whether hugepages for SGA have been configured or not. There are some special cases, like when a block clone has to be rolled back for read consistency, you’ll have to access some undo blocks via buffer cache… but again this should be a small part of total workload.

So, in a DW, which using mostly smarts scans or direct path reads, there won’t be much CPU efficiency win from large pages as you bypass buffer cache anyway and use small pages of private process memory. All the sorting, hashing etc all happens using small pages anyway. Again I have to mention that on (my favorite OS) Solaris it is possible to configure even PGAs to use large pages (via _realfree_heap_pagesize_hint parameter) … so it’ll be interesting to see how this would help DW workloads on the Exadata X2-8 monsters which can run Solaris 11.

3) Lock SGA pages into RAM so they won’t be paged out when memory shortage happens (for whatever reason).

Hugepages are pre-allocated and never paged out. So, when you have extreme memory shortage, your SGAs won’t be paged out “by accident”. Of course it’s better to ensure that such memory shortages won’t happen – configure the SGA/PGA_AGGREGATE_TARGET sizes properly and don’t allow third party programs consume crazy amounts of memory etc. Of course there’s the lock_sga parameter in Oracle which should allow to do this on Linux with small pages too, but first I have never used it on Linux so I don’t know whether it works ok at all and also in 11g AMM perhaps the mlock() calls aren’t supported on the /dev/shm files at all (haven’t checked and don’t care – it’s better to stay away from extreme memory shortages). Read more about how the AMM MEMORY_TARGET (/dev/shm) works from my article written back in 2007 when 11g came out ( Oracle 11g internals – Automatic Memory Management ).

So, the only realistic win (for DW workload) would be the reduction of kernel pagetables structure size – and you can measure this using PageTables statistic in /proc/meminfo. Kevin demonistrated in his article that 500 connections to an instance with ~8 GB SGA consisting of small pages resulted in 7 GB of kernel pagetables usage, while the usage with large pages (still 500 connections, 8 GB SGA) was about 265 MB. So you could win over 6 GB of RAM, which you can then give to PGA_AGGREGATE_TARGET or to further inrease SGA. The more processes you have connected to Oracle, the more pagetable space is used… Similarly, the bigger the SGA is, the more pagetable space is used…

This is great, but the tradeoff here is manageability and some extra effort you have to put in to always check whether the large pages actually got used or not. After starting up your instance, you should really check whether the HugePages_Free in /proc/meminfo shrank and HugePages_Rsvd increased (when instance has just started up and Oracle hasn’t touched all the SGA pages yet, some pages will show up as Rsvd – reserved).

With a single instance per node this is trivial – you know how much SGA you want and pre-allocate the amount of hugepages for that. If you want to increase the SGA, you’ll have to shut down the instance and increase the Linux hugepages setting too. This can be done dynamically by issuing a command like echo N > /proc/sys/vm/nr_hugepages (where N is the number of huge pages), BUT in real life this may not work out well as if Linux kernel can’t free enough small pages from right physical RAM locations to consolidate 2 or 4 MB contiguous pages, the above command may fail to create the requested amount of new hugepages.

And this means you should restart the whole node to do the change. Note that if you increase your SGA larger to the number of hugepages (or you forget to increase the memlock setting in /etc/security/limits.conf accordingly) then your instance will silently just use the small pages, while all the memory pre-allocated for hugepages stays reserved for hugepages and is not usable for anything else!).

So, this may become more of a problem when you have multiple database instances per cluster node or you expect to start up and shut down instances on different nodes based on demand (or when some cluster nodes fail).

Long story short – I do configure hugepages in “static” production environments, to save kernel memory (and some CPU time for OLTP type environments using buffer cache heavily), also on Exadata. However for various test and development environments with lots of instances per server and constant action, I don’t bother myself (and the client) with hugepages and make everyone’s life easier… Small instances with small number of connections won’t use that many PTEs anyway…

For production environments with multiple database instances per node (and where failovers are expected) I would take the extra effort to ensure that whatever hugepages I have preallocated, won’t get silently wasted because an instance wants more SGA than the available hugepages can accommodate. You can do this by monitoring /proc/meminfo’s HugePage entries as explained above. And remember, the ASM instance (which is started before DB instances) will also grab itself some hugepages when it starts!

New cursor_bind_capture_destination parameter in Oracle 11.2.0.2

I just noticed that there’s a new cursor_bind_capture_destination parameter in Oracle 11.2.0.2 (which is really more like Oracle 11gR3 version because of the large amount of new features in it, as opposed to just bugfixes).

This parameter allows you to save some SYSAUX tablespace disk space – if the occasionally captured bind variable values (from V$SQL_BIND_DATA) take too much space. Normally these bind values (in a packed RAW form) are visible in DBA_HIST_SQLSTAT.BIND_DATA column, which can take up to 2kB per statement in a snapshot – it’s stored as RAW(2000). Of course a more convenient way to query the actual bind values is to use DBA_HIST_SQLBIND (you can also use DBMS_SQLTUNE.EXTRACT_BIND function for translating the raw payload to meaningful values).

So, if you choose to capture a lot of SQL statements per AWR snapshot (it’s configurable) and don’t really care about the sampled bind variable values and want to save the disk space, then you can set cursor_bind_capture_destination = MEMORY or OFF (if you don’t want to capture bind variable values at all).

I’m using my pvalid.sql script for checking its valid values (it’s based on the X$ table underlying V$PARAMETER_VALID_VALUES view, so I could see undocumented parameter valid values too):

SQL> @pvalid cursor_bind_capture_destination
Display valid values for multioption parameters matching “cursor_bind_capture_destination”…

  PAR# PARAMETER                                                 ORD VALUE         
—— ————————————————– ———- —————
  2062 cursor_bind_capture_destination                             2 MEMORY
       cursor_bind_capture_destination                             3 MEMORY+DISK
       cursor_bind_capture_destination                             1 OFF

The default is MEMORY+DISK (this is essentially what you get before 11.2.0.2 and you can’t turn it off unless you turn off the AWR flushing of the whole SQLSTATS metrics).

Asynch descriptor resize wait event in Oracle

A lot of people have started seeing “asynch descriptor resize” wait event in Oracle 11gR2. Here’s my understanding of what it is. Note that I didn’t spend too much time researching it, so some details may be not completely accurate, but my explanation will at least give you an idea of why the heck you suddenly see this event in your database.

FYI, there’s a short, but incomplete explanation of this wait event also documented in MOS Note 1081977.1

Update: There’s a bug and a patch related to this wait event too.

The “direct path loader” (KCBL) module is used for performing direct path IO in Oracle, such as direct path segment scans and reading/writing spilled over workareas in temporary tablespace. Direct path IO is used whenever you see “direct path read/write*” wait events reported in your session. This means that IOs aren’t done from/to buffer cache, but from/to PGA directly, bypassing the buffer cache.

 

This KCBL module tries to dynamically scale up the number of asynch IO descriptors (AIO descriptors are the OS kernel structures, which keep track of asynch IO requests) to match the number of direct path IO slots a process uses. In other words, if the PGA workarea and/or spilled-over hash area in temp tablespace gets larger, Oracle also scales up the number of direct IO slots. Direct IO slots are PGA memory structures helping to do direct IO between files and PGA.

 

In order to be able to perform this direct IO asynchronously, Oracle also dynamically scales up the number of OS asynch IO descriptors, one for each slot (up to 4096 descriptors per process). When Oracle doesn’t need the direct IO slots anymore (when the direct path table scan has ended or a workarea/tempseg gets cancelled) then it scales down the number of direct IO slots and asynch IO descriptors. Scaling asynch IO descriptors up/down requires issuing syscalls to OS (as the AIO descriptors are OS kernel structures).

 

I guess this is supposed to be an optimization, to avoid running out of OS AIO descriptors, by releasing them when not they’re not needed, but as that Metalink note mentioned, the resize apparently sucks on Linux. Perhaps that’s why other ports also suffer and have seen the same wait event.

 

The “asynch descriptor resize” event itself is really an IO wait event (recorded in the wait class Other though), waiting for reaping outstanding IOs. Once this wait is over, then the OS call to change the amount of asynch IO descriptors (allocated to that process) is made. There’s no wait event recorded for the actual “resize” OS call as it shouldn’t block.

 

So, the more direct IO you do, especially when sorting/hashing to temp with frequent workarea closing/opening, the more of this event you’ll see (and it’s probably the same for regular tablespace direct path IO too).

 

This problem wouldn’t be noticeable if Oracle kept async io descriptors cached and wouldn’t constantly allocated/free them. Of course then you may end up running out of aio descriptors in the whole server easier. Also I don’t know whether there would be some OS issues with reusing cached aio descriptors, perhaps there is a good reason why such caching isn’t done.

 

Nevertheless, what’s causing this wait event is too frequent aio descriptor resize due to changes in direct IO slot count (due to changes in PGA workarea/temp segment and perhaps when doing frequent direct path scans through lots of tables/partitions too).

 

So, the obvious question here is what to do about this wait event? Well, first you should check how big part of your total response time this event takes at all?

 

  1. If it’s someting like 1% of your response time, then this is not your problem anyway and troubleshooting this further would be not practical – it’s just how Oracle works :)
  2. If it’s something like 20% or more of your response time, then it’s clearly a problem and you’d need to talk to Oracle Support to sort out the bug
  3. If it’s anything in between, make sure you don’t have an IO problem first, before telling that this is a bug. In one recent example I saw direct path reads take over a second on average when this problem popped up. The asynch descriptor resize wait event may well disappear from the radar once you fix the root cause – slow IO (or SQL doing too much IO). Remember, the asynch descriptor resize wait event, at least on Linux, is actually an IO wait event, the process is waiting for outstanding IO completion before the descriptor count increase/decrease can take place.

Read currently running SQL statement’s bind variable values using V$SQL_MONITOR.BIND_XML in Oracle 11.2

The title pretty much says it. In Oracle 11.2, if you have the Diag+Tuning Pack licenses and the SQL monitoring kicks in for your SQL statement, then instead of the old fashioned ERRORSTACK dump reading you can just query the V$SQL_MONITOR.BIND_XML to find the values and metadata of your SQL statement’s bind variables.

I’ve written an example here:

And a related comment – V$SQL_BIND_CAPTURE is not a reliable way for identifying the current bind variable values in use. Oracle’s bind capture mechanism does not capture every single bind variable into SGA (it would slow down apps which run lots of short statements with bind variables). The bind capture only selectively samples bind values, during the first execution of a new cursor and then every 15 minutes from there (controlled by _cursor_bind_capture_interval parameter), assuming that new executions of that same cursor are still being started (the capture happens only when execution starts, not later during the execution).

Exadata v2 Smart Scan Performance Troubleshooting article

I finally finished my first Exadata performance troubleshooting article.

This explains one bug I did hit when stress testing an Exadata v2 box, which caused smart scan to go very slow – and how I troubleshooted it:

Thanks to my secret startup company I’ve been way too busy to write anything serious lately, but apparently staying up until 6am helped this time! :-) Anyway, maybe next weekend I can repeat this and write Part 2 in the Exadata troubleshooting series ;-)

Enjoy! Comments are welcome to this blog entry as I haven’t figured out a good way to enable comments in the google sites page I’m using…

Dropping and creating tables in read only tablespaces?!

You probably already know that it’s possible to drop tables in Oracle read only tablespaces… (You did know that already, right? ;-) Here’s a little example:

SQL> create tablespace ronly datafile '/u03/oradata/LIN112/ronly.01.dbf' size 10m;

Tablespace created.

SQL> create table test tablespace ronly as select * from all_users;

Table created.

SQL> alter tablespace ronly READ ONLY;

Tablespace altered.

SQL> drop table test;

Table dropped.

I just dropped a table from a read only tablespace! Well, perhaps it’s because that instead of dropping the table was put into recyclebin instead (which is a data dictionary update)? Let’s check which segments remain in the RONLY tablespace:

SQL> select owner,segment_name,segment_type from dba_segments where tablespace_name = 'RONLY';

OWNER   SEGMENT_NAME                    SEGMENT_TYPE
------- ------------------------------- ------------------
TANEL   BIN$ix7rAUXZfB3gQKjAgS4LXg==$0  TABLE

Indeed, it seems that the table segment wasn’t actually dropped. Well, let’s purge the recycle bin to try to actually drop the table segment:

SQL> purge recyclebin;

Recyclebin purged.

SQL> select owner,segment_name,segment_type from dba_segments where tablespace_name = 'RONLY';

OWNER    SEGMENT_NAME                   SEGMENT_TYPE
-------- ------------------------------ ------------------
TANEL    9.130                          TEMPORARY

Wow, Oracle has converted the table segment into a temporary segment instead (see segment_type)! Bur our tablespace is read only, how can it do that?! The answer is that neither the regular DROP nor DROP PURGE need to write anything into the tablespace where the segment resides! The initial DROP operation just updated data dictionary, like renaming the table to BIN$… in OBJ$ and so on. The second DROP PURGE operation just ran a bunch of deletes against data dictionary to indicate that the table object is gone. But why is the TEMPORARY segment left behind? This has to do with locally managed tablespaces. Before LMT days, when you dropped a segment, then the segment space was released and acquired back to tablespace through inserts/updates to UET$/FET$ (used/free extents) base tables, which resided in system tablespace like all other data dictionary base tables. But with LMTs, the free space information is kept in bitmaps in the tablespace files themselves! Thus, if you drop a table in a read only LMT tablespace, the table will be gone, but the space will not be physically released (as you can’t update the LMT bitmaps in read only tablespace files). However, Oracle doesn’t want to lose that space should someone make the tablespace read write later on, so the table segment is updated to be a TEMPORARY segment instead of completely deleting it from data dictionary. That’s how the SMON can clean it up later on should that tablespace become read-write again. The 9.130 in SEGMENT_NAME column means relative file# 9 and starting block# 130, that’s a segment’s unique identifier in a tablespace. Let’s move on. This example is executed on Oracle 11.2, while logged on to a non-SYS/SYSTEM user:

SQL> select status from dba_tablespaces where tablespace_name = 'RONLY';

STATUS
---------
READ ONLY

The tablespace is still in read only status. Let’s try to CREATE a table into that tablespace:

SQL> create table test(a int) tablespace ronly;

Table created.

What? I can also CREATE a table into a read only tablespace?! Well, this is the behavior you get starting from 11gR2 onwards, it’s called deferred segment creation, Oracle doesn’t need to create any segments for a table until you actually insert rows into it! So, Oracle creates all needed metadata about the new table in data dictionary, but doesn’t actually allocate any space from the tablespace. This applies to other segment types like index and table/index partitions as well. There’s new syntax which controls the deferred segment creation:

SQL> drop table test purge;

Table dropped.

SQL> create table test (a int) segment creation IMMEDIATE tablespace ronly;
create table test (a int) segment creation IMMEDIATE tablespace ronly
*
ERROR at line 1:
ORA-01647: tablespace 'RONLY' is read-only, cannot allocate space in it

In the above case I disabled the deferred segment creation and got an error immediately as Oracle tried to allocate space from the read only tablespace. Let’s try the other way:

SQL> create table test (a int) segment creation DEFERRED tablespace ronly;

Table created.

Now, with deferred segment creation enabled, Oracle didn’t try to allocate space from the read only tablespace. Let’s look into the segments in that tablespace again:

SQL> select owner,segment_name,segment_type from dba_segments where tablespace_name = 'RONLY';

OWNER        SEGMENT_NAME    SEGMENT_TYPE
------------ --------------- ---------------
TANEL        9.130           TEMPORARY

We still have that old segment from an earlier DROP TABLE operation waiting to be cleaned up (when the tablespace goes into read-write mode again), but no segment for our latest test table created. Let’s make the tablespace read write and check again:

SQL> alter tablespace ronly read write;

Tablespace altered.

SQL> select owner,segment_name,segment_type from dba_segments where tablespace_name = 'RONLY';

no rows selected

Apparently SMON has kicked in already and cleaned up the table segment which had been marked temporary earlier when I dropped the table in a read only tablespace.

Note that tables in SYS and SYSTEM schema can not use deferred segment creation:

SQL> create table sys.test (a int) segment creation deferred;
create table sys.test (a int) segment creation deferred
*
ERROR at line 1:
ORA-14223: Deferred segment creation is not supported for this table

SQL> create table system.test (a int) segment creation deferred;
create table system.test (a int) segment creation deferred
*
ERROR at line 1:
ORA-14223: Deferred segment creation is not supported for this table

SQL> create table tanel.test (a int) segment creation deferred;

Table created.

There’s also a parameter, deferred_segment_creation which controls the default behavior:

SQL> show parameter deferred

NAME_COL_PLUS_SHOW_PARAM     TYPE        VALUE
---------------------------- ----------- --------
deferred_segment_creation    boolean     TRUE

SQL>