The limitations of CURSOR_SHARING = FORCE and FORCE_MATCHING_SIGNATURE for SQL plan stability

With cursor_sharing set to FORCE, Oracle replaces any literals in the SQL text with system-generated bind variable placeholders before calculating the SQL hash value used for looking up an existing cursor in the library cache. This should reduce the amount of hard parsing and shared pool garbage.

The same mechanism (replacing literal values with bind variable placeholders before calculating the SQL hash value for library cache lookup) can also be used for enforcing a SQL profile for SQL statements which differ only by the literal values in them. You just accept a SQL profile using DBMS_SQLTUNE.ACCEPT_SQL_PROFILE( …, FORCE_MATCH=>TRUE).

However, this “force matching” has one limitation, which can make it almost useless for achieving plan stability in databases where the applications use dynamically generated SQL.

Here’s an example – I’m just setting up my sqlplus environment and setting cursor_sharing = FORCE (which is not needed for DBMS_SQLTUNE’s FORCE_MATCH, by the way; I’m just doing it to display the “new” SQL text used for hash value computation):


SQL> COL force_matching_signature FOR 99999999999999999999999999
SQL> COL sql_text FOR A100
SQL>
SQL> ALTER SESSION SET cursor_sharing = FORCE;

Session altered.

SQL>
SQL> SELECT * FROM dual WHERE rownum IN (1,2,3);

D
-
X

And now I’ll query V$SQL to see what the actual SQL text looks like (and its related hash values):


SQL> SELECT sql_id,hash_value,force_matching_signature,sql_text FROM v$sql WHERE sql_id = (SELECT prev_sql_id FROM v$session WHERE sid = USERENV('SID'));

SQL_ID HASH_VALUE FORCE_MATCHING_SIGNATURE SQL_TEXT
------------- ---------- --------------------------- ----------------------------------------------------------------------------------------------------
fqcq5k1wd4d4h 2027041936 1308158718700150644 SELECT * FROM dual WHERE rownum IN (:"SYS_B_0",:"SYS_B_1",:"SYS_B_2")

Scroll right to see how the literal values 1,2,3 have been replaced with :SYS_B_x bind variables in SQL text. The SQL hash value (and SQL_ID and force_matching_signature) are calculated after this replacement was done.

So, now I’ll run a similar statement with just different literal values:


SQL> SELECT * FROM dual WHERE rownum IN (999,888,777);

no rows selected

SQL> SELECT sql_id,hash_value,force_matching_signature,sql_text FROM v$sql WHERE sql_id = (SELECT prev_sql_id FROM v$session WHERE sid = USERENV('SID'));

SQL_ID HASH_VALUE FORCE_MATCHING_SIGNATURE SQL_TEXT
------------- ---------- --------------------------- ----------------------------------------------------------------------------------------------------
fqcq5k1wd4d4h 2027041936 1308158718700150644 SELECT * FROM dual WHERE rownum IN (:"SYS_B_0",:"SYS_B_1",:"SYS_B_2")

See, different literal values (999,888,777), but as they got replaced with exactly the same number of bind variables (in the same order), the resulting SQL text is exactly the same – therefore the SQL text hash values are also exactly the same as you see above.

So this is how these features work – the literal values are replaced in SQL text string just before calculating the hash value for library cache lookup.

However, let’s run one more query, this time with 4 literal values instead of 3:


SQL> SELECT * FROM dual WHERE rownum IN (999,888,777,666);

no rows selected

SQL> SELECT sql_id,hash_value,force_matching_signature,sql_text FROM v$sql WHERE sql_id = (SELECT prev_sql_id FROM v$session WHERE sid = USERENV('SID'));

SQL_ID HASH_VALUE FORCE_MATCHING_SIGNATURE SQL_TEXT
------------- ---------- --------------------------- ----------------------------------------------------------------------------------------------------
8btxq8q6avt6b 2360206539 15602626316910109322 SELECT * FROM dual WHERE rownum IN (:"SYS_B_0",:"SYS_B_1",:"SYS_B_2",:"SYS_B_3")

Now as we used 4 literals (999,888,777,666), they got replaced by 4 bind variables in the SQL text (as opposed to 3 in the earlier queries), so the resulting SQL text is different! Therefore the hash values are also different, and anything that matches or looks up cursors based on the SQL text hash value won’t find the previous query plan anymore. Note that the force_matching_signature is also different from the previous one for the same reason. So a new hard parse and library cache load is done for this cursor, and any SQL profile used for the first 2 queries with FORCE_MATCH = TRUE is not applicable to this new query, simply because Oracle can’t find a profile matching the new force_matching_signature.
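The effect is easy to reproduce outside the database with a toy model. The sketch below is only an illustration: the regular expression, the function names and the use of SHA-256 are my assumptions, and Oracle's real literal replacement and signature hashing are internal and differ in detail. The point is just that the hash is computed over the rewritten text, so the number of placeholders matters and the literal values do not.

```python
import hashlib
import itertools
import re

# Toy model only: matches quoted strings and plain numbers. It ignores
# comments, hints, quoted identifiers and everything else a real SQL
# lexer would have to handle.
_LITERAL = re.compile(r"'(?:[^']|'')*'|\b\d+(?:\.\d+)?\b")


def force_literals_to_binds(sql_text):
    """Replace each literal with a numbered :"SYS_B_n" placeholder."""
    counter = itertools.count()
    return _LITERAL.sub(lambda m: ':"SYS_B_%d"' % next(counter), sql_text)


def toy_signature(sql_text):
    """Hash the rewritten text (stand-in for hash_value / force_matching_signature)."""
    rewritten = force_literals_to_binds(sql_text)
    return hashlib.sha256(rewritten.encode("utf-8")).hexdigest()


q3a = "SELECT * FROM dual WHERE rownum IN (1,2,3)"
q3b = "SELECT * FROM dual WHERE rownum IN (999,888,777)"
q4 = "SELECT * FROM dual WHERE rownum IN (999,888,777,666)"

print(force_literals_to_binds(q3a))
print("3 vs 3 literals, same signature:", toy_signature(q3a) == toy_signature(q3b))
print("3 vs 4 literals, same signature:", toy_signature(q3a) == toy_signature(q4))
```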

Now you might say: who cares, I’ll just create a profile for this query with 4 literal values in the inlist and we are done. True, but now you’d only have profiles for the queries with 3 and 4 inlist values. Often in data warehouses, reporting environments and also in some OLTP systems, the developers use dynamically generated in-lists extensively. By now I have seen queries with tens of thousands of literal values in inlists (using OR concatenation to work around the ORA-01795: maximum number of expressions in a list is 1000 error). There’s also a bug in MOS about someone’s query failing with 60 000 bind variables in it! :)

So if you have a dynamically generated query with multiple IN-lists used for filtering different tables and the number of items in each inlist fluctuates anywhere between 1 and tens of thousands, you may end up with a very large number of possible combinations – and each such combination of inlist item counts would have its own force_matching_signature. Thus even if you think you have fixed your plans with a force matching profile, some day you will hit yet another combination and have to reactively fix that “new” query again.
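To put a number on “very large”, here is a back-of-the-envelope sketch. It assumes that each in-list varies independently between 1 and some maximum number of items, and that every distinct tuple of in-list lengths gives a distinct rewritten SQL text (the function name and the example limits are made up for illustration):

```python
def signature_count(inlist_count, max_items_per_inlist):
    """Number of distinct in-list length combinations for one query shape."""
    # Each in-list independently has 1..max_items_per_inlist placeholders.
    return max_items_per_inlist ** inlist_count


for inlists in (1, 2, 3):
    print(inlists, "in-list(s), up to 1000 items each:", signature_count(inlists, 1000))
```

Even with the in-lists capped at 1000 items, a query with three of them already has a billion possible signatures, so fixing them one profile at a time is not going to converge.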

So what can you do in such a case (other than making sure that your schema, code, stats and configuration are good enough that the CBO comes up with good plans for new queries by itself)? Well, don’t use such dynamically generated inlist strings!

Instead of dynamically generating an inlist string with 10000 literal values (or binds), you can just create an array of these values on the application side and bind that entire array to your query as a single bind variable. So regardless of how many values you have to pass in to this in-list, they are all stored in a single bind variable, leaving the SQL hash value and force_matching_signature the same. PL/SQL supports it, OCI-based tools support it, Oracle’s JDBC supports it (and probably ODP.NET too). So there’s no excuse not to use it :)
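As a rough sketch of what this looks like from a client, here is one way to do it with the python-oracledb (or cx_Oracle) driver. The assumptions are mine: `connection` is an already open connection from a driver version with object type support, the built-in SYS.ODCINUMBERLIST collection type is used so that no custom type has to be created, and the `orders` table and `customer_id` column are hypothetical names.

```python
# The SQL text contains exactly one bind variable, no matter how many
# values are passed in, so its hash value and force_matching_signature
# do not depend on the length of the list.
ARRAY_INLIST_SQL = """
    SELECT o.*
    FROM   orders o
    WHERE  o.customer_id IN (SELECT column_value FROM TABLE(:ids))
"""


def fetch_orders(connection, customer_ids):
    """Run the same SQL text for any number of ids by binding one collection."""
    # SYS.ODCINUMBERLIST is a ready-made VARRAY of NUMBER.
    list_type = connection.gettype("SYS.ODCINUMBERLIST")
    ids = list_type.newobject()
    ids.extend(customer_ids)  # all values travel in a single bind
    with connection.cursor() as cursor:
        cursor.execute(ARRAY_INLIST_SQL, ids=ids)
        return cursor.fetchall()
```

Calling `fetch_orders(connection, [1, 2, 3])` and `fetch_orders(connection, list(range(5000)))` would send the identical statement text. Note that SYS.ODCINUMBERLIST is a VARRAY(32767), so for bigger lists you would need your own nested table type or one of the other options.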

I didn’t find my own demo code anymore, so I googled for a quick example in Java - http://blogs.itemis.de/kloss/2009/03/05/arrays-preparedstatements-jdbc-and-oracle/. If you do have your own code examples, feel free to post links to these into comments.

There are other working, but less clean, options, like passing in a comma-separated string as a single bind variable (either as a VARCHAR2 or even a CLOB) and then using a string tokenizer function in the inlist to break it into individual values, or even always inserting the inlist items into a GTT and then joining it to the main query block (this could be useful if you have millions of inlist values and you don’t want to use too much private memory for storing bind values and PL/SQL arrays).

There’s possibly one challenge though – the CBO’s cardinality estimates for such array-based inlists may be incorrect, so some adjustment might be needed either via the cardinality hint or a SQL profile or a variety of other means :)

Update: Karen Morton has written about another limitation of force_matching_signature. IIRC this used to be a bug, but apparently now it’s documented, so it’s a new feature :)

MOATS-like sqlplus “top” utility for RAC

 

So, you think MOATS was cool?! Check this out by Jagjeet Singh :)

Oradebug hanganalyze with a prelim connection and “ERROR: Can not perform hang analysis dump without a process state object and a session state object.”

Back in the (really) old days, systemstate dumps had to be used for diagnosing hangs and finding blockers of hung databases. Basically you just identified which resource your waiting sessions were waiting for and then scanned through the rest of the system state dump to see which session held that particular resource (note that over time the instrumentation and systemstate dumps have evolved to resolve and dump the blocker information right at the waiting session section in the dump).

Diagnosing complex hangs was a tedious manual process (or required some scripting), so hanganalyze was a really welcome addition to hang diagnosis. Hanganalyze basically walks through the blocker-waiter chains and finds which session is the LEAF in the chain. The LEAF is the ultimate/final blocker in the wait chain, which everybody else is either directly or indirectly waiting for.

Additionally the hanganalyze output allows you to draw the topology of the hang too, which may be useful in complex hang scenarios. In fact Enterprise Manager’s hang topology page also just calls ORADEBUG HANGANALYZE internally and formats the output into a visual hang graph (this is why the hang page requires OS credentials or sysdba privileges).

Starting from 11g, there is a welcome change though – there’s a new V$WAIT_CHAINS view, which lists all the hangs it has detected, it’s RAC-aware and is accessible via plain SQL. No need to run oradebug anymore, assuming that you are able to log on to query that V$ view :-)

And this is where we have a little problem – in rare circumstances an instance can get hung so that even a SYSDBA privilege holder can not log on, so how would you run the diagnostic queries & commands? This is why Oracle has introduced the preliminary connection option in sqlplus for diagnosing hung databases. With a prelim connection you will have a server process started for you and it attaches to the SGA shared memory segments, but it skips the last step of setting up the session/process-related structures in the SGA. These operations on shared SGA structures have to be protected by latches/mutexes and would potentially get blocked if the related latches/mutexes are already held by some other process that is part of the hang.

I have already written about How to log on to Oracle when even sysdba cannot do so, but here’s an example:

$ sqlplus -prelim "/as sysdba"

SQL*Plus: Release 11.2.0.3.0 Production on Sun Jan 29 12:48:03 2012

Copyright (c) 1982, 2011, Oracle.  All rights reserved.

SYS:dbm1> ORADEBUG HANGANALYZE 3

Statement processed.

The “problem” though is that prelim connections only allow you to run ORADEBUG commands, as running regular SQL does require that all the session/process structures are set up properly, plus the parsing, library cache access etc. would get latches/mutexes, which could again block.

But we just wanted to dump hang analysis and ORADEBUG HANGANALYZE above seems to have succeeded. However when looking into the resulting tracefile, we see this:

*** 2012-01-29 12:48:11.041
Processing Oradebug command 'HANGANALYZE 3'
===============================================================================
HANG ANALYSIS:

ERROR: Can not perform hang analysis dump without a process
       state object and a session state object.
  ( process=(nil), sess=(nil) )

===============================================================================

It looks like hanganalyze isn’t able to work anymore without a fully logged on session where the process & session state objects (in v$process and v$session respectively) have been set up. This is a bit weird as one of the benefits of hanganalyze (and systemstate dumps) was that it did not require allocating any SGA structures nor take any latches & locks. 

This behavior change seems to have appeared since Oracle 11.2.0.2 and it’s documented in MOS note 452358.1, but that note doesn’t offer a good alternative for diagnosing hangs in such situations.

The alternative is actually simple – you should attach to an existing fully-initialized & logged on process with oradebug and let that process do the hanganalyze for you. Whatever process you attach to with oradebug is the process which will be performing the work for you.

So if you ever see the error above when diagnosing extreme hangs, just identify some least critical process logged in to the database (ideally not a background process and definitely not a critical background process such as DBWR or LGWR) and attach to that with oradebug (as opposed to “oradebug setmypid”), and the hang analysis should work ok.

However, the catch is that the above ORADEBUG HANGANALYZE command is a shortcut, which always tries to run hanganalyze in your (preliminary) connection, by calling the hanganalyze function directly in the process. So you have to use ORADEBUG DUMP HANGANALYZE 3 instead (the 3 is the detail level where I usually start from) as this syntax will send the “dump hanganalyze” command to the target (fully-initialized) process where you’ve attached to.

Update: I added this to clarify which commands you should use when you hit this problem:

  1. sqlplus -prelim "/as sysdba"
  2. ORADEBUG SETOSPID OS_PID
  3. ORADEBUG DUMP HANGANALYZE 3
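Since you can’t query V$PROCESS from a prelim connection, the OS_PID candidate has to come from the OS process list. Here’s a small sketch for shortlisting candidates. It assumes the usual Unix process naming, where dedicated server (foreground) processes show up as oracle<SID> followed by a parenthesized connect descriptor and background processes as ora_<name>_<SID>; the function name and the sample PIDs are made up for illustration.

```python
import re


def foreground_server_pids(ps_output, oracle_sid):
    """Return OS PIDs of dedicated server (foreground) processes of one instance.

    Expects `ps -eo pid,args` style lines. Background processes such as
    ora_dbw0_<SID> or ora_lgwr_<SID> never match the pattern.
    """
    pattern = re.compile(r"^\s*(\d+)\s+oracle" + re.escape(oracle_sid) + r"\s+\(")
    pids = []
    for line in ps_output.splitlines():
        match = pattern.match(line)
        if match:
            pids.append(int(match.group(1)))
    return pids


# Made-up sample of `ps -eo pid,args` output for an instance called dbm1.
sample = """\
  PID COMMAND
 4101 ora_pmon_dbm1
 4155 ora_dbw0_dbm1
 4157 ora_lgwr_dbm1
 9321 oracledbm1 (LOCAL=NO)
 9444 oracledbm1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
"""

print(foreground_server_pids(sample, "dbm1"))
```

Keep in mind that your own prelim connection’s server process matches this pattern too, and that one is exactly the process without the state objects, so pick a different PID. Which of the remaining sessions is “least critical” is still a judgment call.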

Enjoy :)

Advanced Oracle Troubleshooting Guide – Part 10: Index unique scan doing multiblock reads?!

When you troubleshoot an Oracle (performance) problem, you usually want to find out what the problem sessions are doing (wrong) and then why they are doing it.

The “what” question is usually answered using some wait interface based method – like ASH or SQL*Trace, which both add plenty of extra details to the plain wait event data.

My normal session troubleshooting data collection sequence consists of only three steps:

  1. Interpret session-level top wait and top SQL data (ASH, SQL*Trace or custom tools sampling session-level V$ views)
  2. Interpret session-level performance counter data (V$SESSTAT)
  3. Sample the problem process stack and list the top stack branches where execution spent the most time

Looking only into #1 is enough in 90% of problem cases. But for advanced troubleshooting of complex and exotic problems I regularly have to dive into #2 – the V$SESSTAT performance counters (that’s the reason why I wrote Snapper, for example).

Maybe only once out of every 40-50 troubleshooting cases do I dive down into #3 – process stack sampling. Process stack sampling doesn’t lie, and it helps especially in those cases where Oracle’s own instrumentation doesn’t have a clue what a session/process is doing, but it’s like a microscope: it gives you a very detailed view of one part of the problem, not the big picture. That’s why I never start from #3 when troubleshooting (and if you do, you are doing it wrong, with the exception of a completely hung database ;). I always start from #1 and proceed to #2 only when needed.

Snapper lists both #1 and #2 data for a process by the way.

Anyway, the reason why I’m writing this post is that a few days ago I received a question about an issue which nicely illustrates a case where step #1 shows that there’s an issue and step #2 explains why it’s happening.

There was an execution plan, which ran slightly slower on a new server compared to the old one – and it did read 36 blocks per execution as opposed to only 5:

select ...
from TABLE_X
where ...
 
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.02       0.02          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        2      0.00       0.07         36          5          0           1
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        4      0.02       0.09         36          5          0           1

The “disk” column above shows 36 blocks being read in via physical IO – while only 5 buffers had actually been accessed after that (logical IOs are shown in the “query” column).

Now that’s weird – as logical IOs are the ones which cause physical IOs, the LIO count should normally be equal to or greater than the physical IO count.

What’s even stranger, the execution plan is an index unique scan:

 
 
Rows     Row Source Operation
-------  ---------------------------------------------------
      1  TABLE ACCESS BY INDEX ROWID TABLE_X (cr=5 pr=36 pw=0 time=0 us cost=4 size=26 card=1)
      1   INDEX UNIQUE SCAN TABLE_X_IDX (cr=4 pr=35 pw=0 time=0 us cost=3 size=0 card=1)(object id 51063)
 

… but if you look at the “pr=” values, which mean “blocks read via physical read”, then you see that 35 out of the 36 blocks were read in by the INDEX UNIQUE SCAN row source (pr=35). These counters are cumulative, so the pr=36 in the parent TABLE ACCESS BY INDEX ROWID row source means that there was only one block read by physical IO at the table access level. So why does the index access read 35 blocks?
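Because the counters are cumulative, a row source’s own work is its reported value minus whatever its children reported. A minimal sketch of that subtraction, using the cr= and pr= numbers from the plan above (the helper name is mine):

```python
def self_stats(row_source, children):
    """Subtract the children's cumulative counters from the parent's."""
    return {
        name: value - sum(child[name] for child in children)
        for name, value in row_source.items()
    }


# cr= and pr= values copied from the row source operations above.
index_unique_scan = {"cr": 4, "pr": 35}
table_access_by_rowid = {"cr": 5, "pr": 36}

print("INDEX UNIQUE SCAN:", self_stats(index_unique_scan, []))
print("TABLE ACCESS BY INDEX ROWID:", self_stats(table_access_by_rowid, [index_unique_scan]))
```

So the table access itself did one logical and one physical read, and everything unusual happened inside the index access.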

Is it index height? Can’t be – as the max index height is 24 levels (including the leaf level). And such an index would be ridiculous anyway :)

Let’s look into the wait profile:

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                       2        0.00          0.00
  Disk file operations I/O                        4        0.00          0.00
  db file sequential read                         4        0.01          0.04
  db file scattered read                          1        0.03          0.03
  SQL*Net message from client                     2        0.15          0.15
********************************************************************************

Wow, an index unique scan which is supposed to traverse the index tree with precise single block reads, has done 4 single block reads and one multiblock read!

Now this is where the wait interface and ASH-style data would not help us any further. We know what’s happening – a multiblock read happening where “it shouldn’t”. We still don’t know why this is happening. We would have to guess from here. And this is where V$SESSTAT comes in – there are over 600 performance counters maintained for each session and many of them show valuable information about what kind of operations (and how many) the RDBMS kernel has done. So I asked the client to run Snapper on a test session and here’s the output (note that the output below is from a different test run where the queries were executed many times in a loop):


----------------------------------------------------------------------------------------------------------------------------------------
    SID, USERNAME  , TYPE, STATISTIC                                                 ,         DELTA, HDELTA/SEC,    %TIME, GRAPH
----------------------------------------------------------------------------------------------------------------------------------------
   1612, ORACLE    , STAT, opened cursors cumulative                                 ,          6617,      1.32k,
   1612, ORACLE    , STAT, user calls                                                ,         13234,      2.65k,
   1612, ORACLE    , STAT, pinned cursors current                                    ,            -1,        -.2,
   1612, ORACLE    , STAT, session logical reads                                     ,          5293,      1.06k,
   1612, ORACLE    , STAT, CPU used when call started                                ,             7,        1.4,
   1612, ORACLE    , STAT, CPU used by this session                                  ,             7,        1.4,
   1612, ORACLE    , STAT, DB time                                                   ,           306,       61.2,
   1612, ORACLE    , STAT, user I/O wait time                                        ,           223,       44.6,
   1612, ORACLE    , STAT, non-idle wait time                                        ,           224,       44.8,
   1612, ORACLE    , STAT, non-idle wait count                                       ,          6775,      1.36k,
   1612, ORACLE    , STAT, process last non-idle time                                ,             5,          1,
   1612, ORACLE    , STAT, physical read total IO requests                           ,           158,       31.6,
   1612, ORACLE    , STAT, physical read total multi block requests                  ,            31,        6.2,
   1612, ORACLE    , STAT, physical read total bytes                                 ,       8265728,      1.65M,
   1612, ORACLE    , STAT, cell physical IO interconnect bytes                       ,       8265728,      1.65M,
   1612, ORACLE    , STAT, consistent gets                                           ,          5293,      1.06k,
   1612, ORACLE    , STAT, consistent gets from cache                                ,          5293,      1.06k,
   1612, ORACLE    , STAT, consistent gets - examination                             ,          5293,      1.06k,
   1612, ORACLE    , STAT, physical reads                                            ,          1009,      201.8,
   1612, ORACLE    , STAT, physical reads cache                                      ,          1009,      201.8,
   1612, ORACLE    , STAT, physical read IO requests                                 ,           158,       31.6,
   1612, ORACLE    , STAT, physical read bytes                                       ,       8265728,      1.65M,
   1612, ORACLE    , STAT, free buffer requested                                     ,          1008,      201.6,
   1612, ORACLE    , STAT, physical reads cache prefetch                             ,           851,      170.2,
   1612, ORACLE    , STAT, physical reads prefetch warmup                            ,           851,      170.2,
   1612, ORACLE    , STAT, shared hash latch upgrades - no wait                      ,           188,       37.6,
   1612, ORACLE    , STAT, calls to get snapshot scn: kcmgss                         ,          1319,      263.8,
   1612, ORACLE    , STAT, file io service time                                      ,            25,          5,
   1612, ORACLE    , STAT, file io wait time                                         ,       2219641,    443.93k,
   1612, ORACLE    , STAT, min active SCN optimization applied on CR                 ,             1,         .2,
   1612, ORACLE    , STAT, index fetch by key                                        ,          1292,      258.4,
   1612, ORACLE    , STAT, session cursor cache hits                                 ,          6440,      1.29k,
   1612, ORACLE    , STAT, buffer is not pinned count                                ,          1277,      255.4,
   1612, ORACLE    , STAT, parse time elapsed                                        ,             6,        1.2,
   1612, ORACLE    , STAT, parse count (total)                                       ,          6380,      1.28k,
   1612, ORACLE    , STAT, execute count                                             ,          6378,      1.28k,
   1612, ORACLE    , STAT, bytes sent via SQL*Net to client                          ,       2195633,    439.13k,
   1612, ORACLE    , STAT, bytes received via SQL*Net from client                    ,       2721339,    544.27k,
   1612, ORACLE    , STAT, SQL*Net roundtrips to/from client                         ,          6374,      1.27k,
   1612, ORACLE    , TIME, parse time elapsed                                        ,        114215,    22.84ms,     2.3%, |@         |
   1612, ORACLE    , TIME, PL/SQL execution elapsed time                             ,         67004,     13.4ms,     1.3%, |@         |
   1612, ORACLE    , TIME, DB CPU                                                    ,        630000,      126ms,    12.6%, |@@        |
   1612, ORACLE    , TIME, sql execute elapsed time                                  ,       2595410,   519.08ms,    51.9%, |@@@@@@    |
   1612, ORACLE    , TIME, DB time                                                   ,       3190919,   638.18ms,    63.8%, |@@@@@@@   |
   1612, ORACLE    , WAIT, db file sequential read                                   ,        978601,   195.72ms,    19.6%, |@@        |
   1612, ORACLE    , WAIT, db file scattered read                                    ,       1213550,   242.71ms,    24.3%, |@@@       |
   1612, ORACLE    , WAIT, SQL*Net message to client                                 ,          9212,     1.84ms,      .2%, |          |
   1612, ORACLE    , WAIT, SQL*Net message from client                               ,       1667368,   333.47ms,    33.3%, |@@@@      |
--  End of Stats snap 1, end=2012-03-09 14:47:02, seconds=5

Look for the following three statistics in the output above (and scroll right if you don’t see the metric values):

  • The physical read total multi block requests statistic confirms that indeed there were some multiblock reads done by the monitored session.
  • The physical reads cache prefetch metric indicates that there was some data block prefetching happening! This explains why we had multiblock reads done where only single block reads were really needed.
  • And the physical reads prefetch warmup metric explains why we did any prefetching at all – it’s for buffer cache warmup, where after a restart or a significant buffer cache size increase Oracle decides to populate the empty buffer cache with any data as fast as possible. Oracle assumes that a random multiblock read doesn’t take that much more time than a random single block read anyway, so why not just read a bunch of additional blocks from around the single block that the index access wanted to read. The read operation will be slightly slower, but it might help to avoid multiple future single block reads, as those blocks are already in cache. This should affect things only when you restart the instance or significantly increase the buffer cache size. So on a steady state production system you shouldn’t see it (and it’s for a good cause anyway), but if you do performance tests on some test database, change something, bounce the instance and immediately run a test – this may affect your test results. Btw, this is controlled by a hidden parameter _db_cache_pre_warm, but I’m mentioning this just for educational purposes here. I have never fixed a problem by changing this parameter, so you should touch it only in your sandbox hacking database :)
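The deltas in the Snapper output also tie together arithmetically. The sketch below just does that arithmetic on the numbers copied from the output above; the one assumption I’m making is that every read request which wasn’t a multiblock request read exactly one block.

```python
# Deltas copied from the Snapper output above.
physical_reads = 1009            # blocks read ("physical reads")
physical_read_bytes = 8265728    # "physical read bytes"
io_requests = 158                # "physical read IO requests"
multiblock_requests = 31         # "physical read total multi block requests"
prefetched_blocks = 851          # "physical reads cache prefetch"

block_size = physical_read_bytes // physical_reads
single_block_requests = io_requests - multiblock_requests
# Assumption: each single block request read exactly one block.
multiblock_blocks = physical_reads - single_block_requests
avg_blocks_per_multiblock = multiblock_blocks / multiblock_requests
prefetch_share = prefetched_blocks / physical_reads

print("block size (bytes):", block_size)
print("single block requests:", single_block_requests)
print("blocks read by multiblock requests:", multiblock_blocks)
print("avg blocks per multiblock request: %.1f" % avg_blocks_per_multiblock)
print("share of blocks that were prefetched: %.0f%%" % (prefetch_share * 100))
# One block per multiblock request was the block actually wanted;
# the rest should be the prefetched ones.
print("multiblock blocks minus wanted blocks:", multiblock_blocks - multiblock_requests)
```

The last number matches the physical reads cache prefetch delta exactly, which fits the explanation: each multiblock read fetched the one block the index access needed plus a bunch of neighbouring blocks for warmup.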

And now back to the original topic – this post wasn’t really about this specific prewarm feature, but rather yet another example of how the V$SESSTAT metrics tell you much more than the usual wait interface and top-SQL based approach. And you do not need to enable any tracing or undocumented events to get these detailed metrics out of Oracle; just query the session-level V$ views (such as V$SESSTAT) at the right time. As I said in the beginning, I don’t jump to the microscopic view unless the higher-level views aren’t helpful enough, so I recommend that you follow the 3-step “Advanced Oracle Troubleshooting” method when trying to diagnose what the heck some session is doing and why ;^)

Where is LOB data stored?

There was a question in Oracle-L about where the LOB data is actually stored (in the row or in the LOB segment) and under exactly what conditions a switch from one to the other may happen. The documentation isn’t fully clear about this and the “4000 bytes” number may mislead people into thinking that you can store 4000 bytes of your data in a LOB item before it must move out-of-line.

I clarified this in my post in Oracle-L, but will post it here too. First check this diagram:
 

 
If you create the LOB column with DISABLE STORAGE IN ROW, then the LOB data is always stored out-of-line. LOB index is always used. Only the LOB ID is stored inline, and the ID is looked up in LOB index, where you’ll get the pointers to actual LOB chunk blocks.
 
If you create the LOB column with ENABLE STORAGE IN ROW, then the LOB data may be stored in-line or out-of-line.
 
If the total LOB data + overhead <= 4000 bytes, then the LOB item will be stored in-line. No LOB index is used, even if you modify the LOB later on as everything is stored in-line with the row and versioning/rollback is achieved with undo data.
 
If the total LOB data + overhead > 4000 bytes, then the LOB item will be stored out-of-line. If the LOB fits into 12 x LOB_chunk_size, then no LOB index entries are created, because the in-line LOB locator can store up to 12 pointers to the LOB chunk blocks for each lob item. So if your LOB chunk size is 8kB, you can store LOB items up to 96kB in size without inserting anything to LOB index. However if the LOB item is bigger, then no pointers are stored in-row and all pointers will be put to the LOB index.
 
Note that once you modify an existing LOB item which is bigger than 4000 bytes (with its overhead) but smaller than 12 x chunk_size, the LOB index will still be used after the first LOB change operation, as pointers to the old LOB chunk versions have to be stored in it (LOB segments don’t rely on undo for rollback & consistency, but just use LOB chunk versioning managed by the LOB index).
 
The “overhead” of an in-line LOB item is 36 bytes, so the actual LOB data must be 4000 – 36 = 3964 bytes or less in order to fully fit in-row. And note that we are talking about bytes here, not characters. With multibyte character sets a character in a CLOB may take multiple bytes.
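The rules above can be condensed into a small decision function. This is only a sketch of the rules as described in this post (the function and constant names are mine); it models the initial placement of a LOB item and ignores the later-modification case, where the LOB index comes into play for chunk versioning.

```python
INLINE_LIMIT_BYTES = 4000      # in-row limit, including overhead
INLINE_OVERHEAD_BYTES = 36     # overhead of an in-line LOB item
MAX_LOCATOR_POINTERS = 12      # chunk pointers that fit in the in-row locator


def lob_placement(data_bytes, chunk_size_bytes, storage_in_row=True):
    """Where a newly created LOB item of data_bytes bytes ends up."""
    if not storage_in_row:
        # DISABLE STORAGE IN ROW: always out-of-line via the LOB index.
        return "out-of-line, pointers in LOB index"
    if data_bytes + INLINE_OVERHEAD_BYTES <= INLINE_LIMIT_BYTES:
        return "in-line"
    if data_bytes <= MAX_LOCATOR_POINTERS * chunk_size_bytes:
        return "out-of-line, pointers in in-row locator"
    return "out-of-line, pointers in LOB index"


chunk = 8 * 1024
for size in (3964, 3965, 12 * chunk, 12 * chunk + 1):
    print(size, "bytes ->", lob_placement(size, chunk))
print(100, "bytes, DISABLE STORAGE IN ROW ->", lob_placement(100, chunk, storage_in_row=False))
```

Remember that data_bytes is bytes, not characters, so for a CLOB the character count at which the switch happens depends on the character set.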
 
I wrote a presentation about LOB internals quite some time ago, back when Oracle 10.1 came out I think – check it here:
 

 
Enjoy! :-)
 
 

Exadata Smart Scan predicate offloading and sequence.NEXTVAL

There was a question in the twitter-sphere about whether using sequences (sequence.NEXTVAL) in your select query’s projection list would somehow prevent smart scans from happening.

The answer is no, sequence use with smart scans works just fine. Smart scan offloading applies only to data retrieval row sources (and their filtering). So what you have in the query’s projection list (the sequence use, for example) does not directly affect the smart scan decision. Likewise, other operations such as sorting and grouping have nothing to do with smart scans and don’t disable their use.

In other words, sequence.NEXTVAL can be used in a query and you can still get smart scans. Here’s simple proof:

 


SQL Monitoring Report

SQL Text
------------------------------
SELECT /*+ MONITOR */ t.object_name, s.NEXTVAL n FROM t WHERE owner LIKE 'TANEL%'

Global Information
------------------------------
 Status              :  DONE (ALL ROWS)
 Instance ID         :  1
 Session             :  TANEL (204:2707)
 SQL ID              :  5kzs2ucxwf4wk
 SQL Execution ID    :  16777216
 Execution Started   :  04/13/2012 09:06:46
 First Refresh Time  :  04/13/2012 09:06:46
 Last Refresh Time   :  04/13/2012 09:06:48
 Duration            :  2s
 Module/Action       :  SQL*Plus/-
 Service             :  dbm
 Program             :  sqlplus@mac02.local (TNS V1-V3)
 Fetch Calls         :  4

Global Stats
========================================================================================
| Elapsed |   Cpu   |    IO    | Concurrency | Fetch | Buffer | Read | Read  |  Cell   |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   | Calls |  Gets  | Reqs | Bytes | Offload |
========================================================================================
|    1.50 |    0.36 |     1.06 |        0.08 |     4 |   323K | 2503 |   2GB |  99.97% |
========================================================================================

SQL Plan Monitoring Details (Plan Hash Value=1589210007)
======================================================================================================================================================================
| Id |          Operation           | Name |  Rows   | Cost  |   Time    | Start  | Execs |   Rows   | Read | Read  |  Cell   | Activity |      Activity Detail      |
|    |                              |      | (Estim) |       | Active(s) | Active |       | (Actual) | Reqs | Bytes | Offload |   (%)    |        (# samples)        |
======================================================================================================================================================================
|  0 | SELECT STATEMENT             |      |         |       |         3 |     +0 |     1 |     1408 |      |       |         |          |                           |
|  1 |   SEQUENCE                   | S    |         |       |         3 |     +0 |     1 |     1408 |      |       |         |          |                           |
|  2 |    TABLE ACCESS STORAGE FULL | T    |    439K | 20738 |         4 |     +0 |     1 |     1408 | 2496 |   2GB |  99.97% |   100.00 | cell smart table scan (1) |
======================================================================================================================================================================

Look at the query text (which includes sequence.NEXTVAL) and the TABLE ACCESS STORAGE FULL row source. If you scroll right, you’ll see the “cell smart table scan” wait event and also the Offload % column populated, which both indicate that a smart scan did happen for that monitored query execution.
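For reference, a text report like the one above can be pulled with DBMS_SQLTUNE.REPORT_SQL_MONITOR. The following is only a sketch: it assumes the table T and sequence S from the example already exist, that you have the Diagnostics and Tuning Pack licenses that SQL Monitoring requires, and that you substitute the SQL_ID of your own run for the one shown here.

```sql
-- Sketch only: T and S are the table and sequence from the example above.
SET LONG 1000000 LONGCHUNKSIZE 1000000 LINESIZE 250 PAGESIZE 0

-- Run the monitored query.
SELECT /*+ MONITOR */ t.object_name, s.NEXTVAL n FROM t WHERE owner LIKE 'TANEL%';

-- Fetch the text report (replace the SQL_ID with the one from your own execution).
SELECT DBMS_SQLTUNE.REPORT_SQL_MONITOR(
         sql_id => '5kzs2ucxwf4wk',
         type   => 'TEXT') AS report
FROM   dual;
```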

I often hear statements like “xyz disables smart scan” or “using a non-offloadable function disables the smart scan”. These can be misleading and give the impression that if you use a function or feature that Exadata doesn’t support in one part of your query, then smart scanning is disabled for the whole query. This is incorrect.

First of all, smart scanning is not a query level thing, it’s an access path level thing. So you can easily run a two-table join where one table is on NFS storage (no smart scanning) and the other table is on Exadata storage (which allows smart scanning). Even though smart scanning is “disabled” for the first table, the other table can still be accessed using it! By the way, it’s even possible to have some partitions of a table on NFS and other partitions of the same table on Exadata storage cells; the same table scan can then use smart scanning for the partitions stored on the cells and regular buffered or direct path read access for the partitions on NFS. So smart scanning is not some “all or nothing” query level thing: the decision is made separately for every single segment that is full scanned in the query. More details in the book.
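One way to check whether a scan in your own session actually used smart scanning is to look at the session-level cell statistics before and after running the query. This is a minimal sketch that assumes it is run in the same session as the test query, on a database connected to Exadata storage; if the values do not grow between the two runs, no smart scan took place for that execution.

```sql
-- Run once before and once after the test query, then compare the values.
SELECT sn.name, ms.value
FROM   v$mystat   ms
JOIN   v$statname sn ON sn.statistic# = ms.statistic#
WHERE  sn.name IN (
         'cell physical IO bytes eligible for predicate offload',
         'cell physical IO interconnect bytes returned by smart scan'
       )
ORDER  BY sn.name;
```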

Secondly, using a non-offloadable function in a filter predicate when retrieving rows does not prevent Oracle from pushing the other, offloadable predicates to the storage cells. So again, it’s not an all-or-nothing decision. If a predicate function is not offloadable, it will just not be offloaded (and will show up only as a filter() predicate, as opposed to both a storage() and a filter() predicate), but all the other supported predicates can still be offloaded without a problem. Here’s an example:


SELECT * FROM t 
WHERE 
     line = 9999
AND  mod(line,10) = 0 
AND  f1(line) =9999

Plan hash value: 1601196873

----------------------------------------------------------------
| Id  | Operation                 | Name | E-Rows | Cost (%CPU)|
----------------------------------------------------------------
|   0 | SELECT STATEMENT          |      |        |  2887 (100)|
|*  1 |  TABLE ACCESS STORAGE FULL| T    |      1 |  2887   (1)|
----------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - storage(("LINE"=9999 AND MOD("LINE",10)=0))
       filter(("LINE"=9999 AND MOD("LINE",10)=0 AND "F1"("LINE")=9999))

 

Note that the F1() function is a PL/SQL function and filtering based on PL/SQL functions cannot be offloaded (as Exadata cells don’t run PL/SQL). This is visible in the predicate section, where two predicates are shown in the storage() predicate list but the third one (F1) only shows up in the filter() predicate list.
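The source of F1 isn’t shown in this post, so the definition below is purely a hypothetical stand-in to make the example reproducible; any PL/SQL function should behave the same way as far as offloading is concerned. The second query is a sketch of how to check which built-in SQL functions are marked offloadable, using the OFFLOADABLE column of V$SQLFN_METADATA in 11.2. The function names in the IN list are just arbitrary picks.

```sql
-- Hypothetical stand-in for F1: the original function body is not shown.
CREATE OR REPLACE FUNCTION f1 (p_line IN NUMBER) RETURN NUMBER IS
BEGIN
  RETURN p_line;
END;
/

-- Built-in functions are listed here with their offloadability.
-- User-defined PL/SQL functions like F1 are not built-ins and will not appear.
SELECT DISTINCT name, offloadable
FROM   v$sqlfn_metadata
WHERE  name IN ('MOD', 'UPPER', 'TO_CHAR')
ORDER  BY name;
```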

So, as I explained in the beginning, the smart scan is not an all-or-nothing thing. The decision is made separately for every segment (partition) during the query execution phase, and again for every execution, assuming of course that we are doing a full table scan (or fast full index scan) on them, as that part is fixed in the execution plan.

And as shown above, predicate offload is not an all-or-nothing thing either. If one predicate can’t be offloaded, then other predicates still can.

However, there’s a catch. If you combine such predicates with ANDs, Oracle can push the offloadable (storage()) predicates to the storage cells, where they will do the early filtering, and the last, non-offloadable (filter()) predicate is then applied to the retrieved dataset in the database layer.

But when you combine such predicates with an OR clause, there’s a logical problem. OR has to return rows which match ANY of the predicates (not ALL the predicates), so you can’t return only the subset of rows from the storage cells which match predicate A and hope that this set contains all the rows which would match the (non-offloadable) predicate B as well. To evaluate B, you would still need to go through all the rows, and as B is non-offloadable, this has to be done in the database layer. All rows have to be fetched into the database layer for this.

This explains why the storage() predicate disappears when I chain my query predicates together with an OR instead of the AND:


SELECT * FROM t 
WHERE 
     line = 9999
AND  mod(line,10) = 0 
OR   f1(line) =9999

Plan hash value: 1601196873

----------------------------------------------------------------
| Id  | Operation                 | Name | E-Rows | Cost (%CPU)|
----------------------------------------------------------------
|   0 | SELECT STATEMENT          |      |        |  2963 (100)|
|*  1 |  TABLE ACCESS STORAGE FULL| T    |   6353 |  2963   (3)|
----------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(("F1"("LINE")=9999 OR ("LINE"=9999 AND MOD("LINE",10)=0)))

The storage() predicate section above is gone. As Oracle has to retrieve all the rows to the database layer anyway (due to the non-offloadable F1 function used in an OR predicate), it doesn’t make sense to offload the other filters to the storage (and then return the rows anyway as they’re needed in the db layer). Makes sense? ;-)

Note that even though the storage() predicates disappeared from the plan, this doesn’t mean that smart scanning as a whole has been disabled. The smart scan can still happen; it’s only the predicate offloading sub-component of the smart scan that is not used. The storage cells can still do the asynchronous prefetching, decompression offloading and projection offloading in these cases.
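If you want to verify this for a specific cursor, the offload-related columns in V$SQL are one place to look. This is a sketch that assumes an 11.2 database on Exadata; &sql_id is a SQL*Plus substitution variable you fill in with the SQL_ID of the OR query. A non-zero IO_CELL_OFFLOAD_ELIGIBLE_BYTES value would suggest that the scan was still eligible for offloading, even with no storage() predicate in the plan.

```sql
-- Offload-related I/O figures for one cursor (cumulative over its executions).
SELECT sql_id,
       child_number,
       io_cell_offload_eligible_bytes,
       io_cell_offload_returned_bytes,
       io_interconnect_bytes
FROM   v$sql
WHERE  sql_id = '&sql_id';
```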

By the way, it’s all explained in the book!

Create a database link with the new host:port/service syntax

I just noticed that (finally) in 11.2 this syntax is supported:

SQL> CREATE DATABASE LINK demo_x2 
  2  CONNECT TO tanel IDENTIFIED BY password 
  3  USING 'exadb03:1521/DEMO';
Database link created.

This just makes life a bit easier as there’s no need to use the long TNS format entry (or a tnsnames.ora/LDAP alias). It might work in 11.1 too (haven’t tested) but it didn’t work on 10.2.0.4 …

Update: This feature works for dblinks in 10.2 onwards – when I tested it on my 10.2, I got an error initially, but it was because the hostname I used didn’t resolve to an IP. Thanks to Randolf Geist for pointing this out.
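For comparison, here is a sketch of what the equivalent link looks like with the long TNS descriptor. The link name demo_x2_long is made up for this illustration, and the descriptor assumes the same host, port and service name as in the example above.

```sql
-- Same target as demo_x2, spelled out as a full connect descriptor.
CREATE DATABASE LINK demo_x2_long
CONNECT TO tanel IDENTIFIED BY password
USING '(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=exadb03)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=DEMO)))';

-- Quick sanity check that a link resolves and connects.
SELECT * FROM dual@demo_x2;
```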

In case you didn’t know, sqlplus has supported this easy connect method since 10g:

tanel@mac02:~$ sqlplus tanel/password@exadb03/DEMO
SQL*Plus: Release 10.2.0.4.0 - Production on Mon Mar 5 09:51:27 2012
Copyright (c) 1982, 2007, Oracle. All Rights Reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
SQL>

If you omit the port from the syntax, it defaults to 1521.

Oracle Core: Essential Internals for DBAs and Developers book by Jonathan Lewis

In case you didn’t know, Jonathan Lewis’s new Oracle Core: Essential Internals for DBAs and Developers book is out (and has been for a few weeks already).

I was the technical reviewer for that book and I can say it’s awesome! It will likely be the best Oracle internals book out there for the coming 10 years, just like Steve Adams’s Oracle Internal Services book was in the last decade :)

Jonathan does a very good job of explaining complex things in a simple enough way. The book is not just a dry listing of how things work inside the Oracle database, but also explains why they work like they do and what the benefits, limitations and side effects of the behavior are.

So check out Jonathan’s book page for more details (and I think you can still order a heavily discounted ebook from Apress too):

Enjoy :)

Profiling trace files with preprocessor external tables in 11g and some parallel execution hacking

If you work with SQL Trace files (and profile them) then you should check out the awesome novel use of the “external table preprocessor” feature explained by Adrian Billington here:

Ironically, just a day after writing my “Evil things” article, I noticed a note in MOS about how to enable event 10384 at level 16384 to get a parallel plan executed in serial:

  • How to force that a Parallel Query runs in serial with the Parallel Execution Plan [ID 1114405.1]

This way you can still see the A-rows and other feedback like buffer gets and PIOs by row source even if the plan is a parallel plan. The problem is that even with GATHER_PLAN_STATISTICS enabled (or the equivalent parameter(s)) the PX slaves don’t pass their actual time, rows and buffer gets/PIOs stats back to the QC to be displayed in V$SQL_PLAN_STATISTICS / DBMS_XPLAN output. With parallel slaves, all you’d see would be the QC-generated numbers and not the PX slave stuff.

So if you set that magic event (at your own risk) then even the parallel plans would be executed by the QC only (basically a parallel plan executed entirely in serial) and you’ll still see all the A-rows and buffer gets/physical read numbers as with serial plans. But remember yesterday’s article ;-)
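A sketch of what such a test session could look like, on a test system only. The event syntax below is the generic ALTER SESSION SET EVENTS form combined with the event number and level mentioned above; the table T and the PARALLEL hint are assumptions made for illustration, and the MOS note remains the authority on the exact usage.

```sql
-- Keep SQL*Plus from issuing its own calls between the query and DISPLAY_CURSOR.
SET SERVEROUTPUT OFF

-- At your own risk, test systems only.
ALTER SESSION SET EVENTS '10384 trace name context forever, level 16384';

-- Hypothetical test query: T and the degree of 4 are placeholders.
SELECT /*+ GATHER_PLAN_STATISTICS PARALLEL(t 4) */ COUNT(*) FROM t;

-- Row source stats of the last execution in this session.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'ALLSTATS LAST'));

-- Switch the event off again.
ALTER SESSION SET EVENTS '10384 trace name context off';
```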

Alternative options for getting such runtime stats for a parallel query would be:

  1. Use the Real-Time SQL Monitoring feature (only on 11g and with diagnostics+tuning pack licenses)
  2. Enable SQL Trace for the QC, run your parallel query (you’ll have to wait until it finishes or cancel the query with CTRL+C) and consolidate the STAT# lines from all the PX slave sessions – each PX slave dumps the STAT# lines with row-source level stats like regular serial queries
Update: Coskan asked over Twitter why not use ALLSTATS ALL in DBMS_XPLAN.DISPLAY_CURSOR instead of ALLSTATS LAST. Indeed, in this case you would see the A-rows, buffer gets and other stats of both the QC and the PX slaves aggregated together. Although I had used ALLSTATS ALL to compare average execution stats to the LAST stats, I never thought of using it for aggregating the PX slave stats together with the QC like that.
 
Basically what happens (in 10g+) with PX is that the QC and all PX slaves execute the same SQL_ID (though sometimes a different child version, especially when you’re running cross-instance PX, in which case SQL Monitoring is a better option). ALLSTATS ALL aggregates the stats of all executions of the particular SQL_ID (and child number). So if you run something with 4 PX slaves with GATHER_PLAN_STATISTICS enabled, they all add their execution stats to the “ALL” columns. It’s just that the QC is always the last one to finish executing the query (that’s when the exec stats get updated in the V$SQL_PLAN_STATISTICS views), which is why ALLSTATS LAST always shows the limited QC stats and not the PX slave ones. ALLSTATS ALL solves that problem.
 
However, the catch is that ALLSTATS ALL accumulates all executions of this query (including your previous runs of it), so if you want to see fresh stats of only your latest query execution, you should change the SQL_ID so that a new cursor is created. You can change the SQL_ID by simply adding a comment like /* test 2 */ somewhere in your query text…
Thanks Coskan! :)
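Put together, the approach above could look something like this. It’s only a sketch: the table T and the degree of parallelism are placeholders, and if the PX slaves end up using a different child cursor you may need to pass the SQL_ID explicitly with a NULL child number so that all children are displayed.

```sql
-- Keep SQL*Plus from issuing its own calls between the query and DISPLAY_CURSOR.
SET SERVEROUTPUT OFF

-- The /* test 2 */ comment changes the SQL text, so a new SQL_ID and cursor are created.
SELECT /*+ GATHER_PLAN_STATISTICS PARALLEL(t 4) */ /* test 2 */ COUNT(*) FROM t;

-- Without LAST, the stats are cumulative over all executions of the cursor,
-- which includes the executions done by the PX slaves.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'ALLSTATS ALL'));
```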

Evil things are happening in Oracle

Relax, I’m talking about the Oracle Database kernel here, not the corporation ;-)

Here are a couple more reasons not to play around with undocumented debug events unless you’re really sure why and how they would help to solve your specific problem (and you’ve gotten a blessing in some form from Oracle support too):

$ oerr ora 10665
 10665, 00000, "Inject Evil Literals"
 // *Cause:  Event 10665 is set to some number > 0, causing 1/(value-1) of all
 //          literals to be replaced by 2000 letter 'A's.  A value of 1 does
 //          not corrupt anything.
 // *Action: never set this event

$ oerr ora 10668
 10668, 00000, "Inject Evil Identifiers"
 // *Cause:  event 10668 is set to some number > 0, causing 1/(value-1) of all
 //          identifiers to be replaced by a maximum amount of x's.  It is
 //          common for an identifier to be parsed once with a max of 30 bytes,
 //          then reparsed later with a max of 4000, so it may not be possible
 //          to inject such an identifier without the aid of this event.  A
 //          value of 1 causes no identifiers to be corrupted.
 // *Action: never set this event

 

Some events are meant to be left alone. You don’t want to wake up the evil sleeping deep in the core of the Oracle Kernel-land!

 

By the way, Karl Arao once managed to capture what this evil creature looks like:

 

http://t.co/GVbxwfnJ

:)