Forcing Smart Scans on Exadata – is the _serial_direct_read parameter safe to use in production?

One of the most common Exadata performance problems I see is that the direct path reads (and thus also Smart Scans) don’t sometimes kick in when running full scans in serial sessions. This is because in Oracle 11g onwards, the serial full segment scan IO path decision is done dynamically, at runtime, for every SQL execution – and for every segment (partition) separately. Whether you get a direct path read & smart scan, depends on the current buffer cache size, how big segment you’re about to scan and how much of that segment is actually cached at the moment. Note that the automatic IO path decision for index fast full scans is slightly different from table scans.

This dynamic decision unfortunately can cause unexpected surprises and variance in your report/batch job runtimes. Additionally, it looks like the SELECT part of your UPDATE/DELETE statements (the select part finds the rows to update/delete) does not ever automatically get direct path read/smart scan chosen – by design! So, when your SELECT statement may use smart scan and be really fast, the same select operation in an INSERT SELECT (or UPDATE/DELETE) context will not end up using smart scans by default. There’s even a bug explaining that – closed as “not a bug” (Exadata Smartscan Is Not Being Used On Insert As Select[Article ID 1348116.1]).

To work around these problems and force a direct path read/smart scan, you can either:

  1. Run your query in parallel as parallel full segment scans will use direct path reads, unless your parallel_degree_policy = AUTO, then you may still get buffered reads thanks to the dynamic in-memory parallel execution decision of Oracle 11.2
  2. Run your query in serial, but force the serial direct path reads by setting _serial_direct_read = TRUE (or ALWAYS in 11.2.0.2+)

Here are the valid options for this parameter in 11.2.0.2+

[Read more...]

List Exadata Storage Cell disk summary with cellpd.sql and cellpdx.sql scripts

In the previous post I explained how to list Exadata disk layout and topology details with the exadisktopo scripts, in this post I’ll introduce one celldisk overview script, which I use to quickly see the celldisk configuration, specs and error statuses. The cellpd.sql script (Cell Physical Disk) will show the following output:

SQL> @exadata/cellpd.sql
Show Exadata cell versions from V$CELL_CONFIG....

DISKTYPE             CELLNAME             STATUS                 TOTAL_GB     AVG_GB  NUM_DISKS   PREDFAIL   POORPERF WTCACHEPROB   PEERFAIL   CRITICAL
-------------------- -------------------- -------------------- ---------- ---------- ---------- ---------- ---------- ----------- ---------- ----------
FlashDisk            192.168.12.3         normal                      183         23          8
FlashDisk            192.168.12.3         not present                 183         23          8                     3
FlashDisk            192.168.12.4         normal                      366         23         16
FlashDisk            192.168.12.5         normal                      366         23         16
HardDisk             192.168.12.3         normal                    22352       1863         12
HardDisk             192.168.12.4         normal                    22352       1863         12
HardDisk             192.168.12.5         normal                    22352       1863         12

Above you see that some of my Flash Disks are missing and before they went completely missing (because the Oracle Ninja removed them :) these disks indicated the “poor performance” status.

The cellpdx.sql script will display extended info for each celldisk, including disk model, firmware version, serial number and various error metrics. Warning, lots of wide output coming again:

[Read more...]

List Exadata Disk Layout and Topology with the exadisktopo scripts

Here are two more Exadata scripts for listing the end-to-end ASM<->Exadata disk topology from V$ASM_ views and from V$CELL_CONFIG. These scripts see both the ASM level layout and the storage cell-level disk topology.

The exadisktopo.sql script shows all disks starting from the ASM diskgroup layer, going deeper and deeper all the way to the OS disk device level in the storage cells. It uses outer joins, so will show celldisks even if there are no corresponding grid disks allocated on them (or if there are no ASM disks using them). It also shows the Flash cards used as flash cache, thus there are no ASM disks on them usually.

The output should be self-explanatory, you just read the data from left to right (from ASM diskgroup to ASM disk to Grid Disk level etc), the more rightwards you scroll, the “deeper” you go in the Exadata IO stack layers. Warning, lots of wide output coming :)

[Read more...]

V$CELL_THREAD_HISTORY – “ASH” for Exadata Storage Cells

Did you know that there’s something like Active Session History also in the Exadata storage cells? ;-)

The V$CELL_THREAD_HISTORY view is somewhat like V$ACTIVE_SESSION_HISTORY, but it’s measuring thread activity in the Exadata Storage Cells:

SQL> @desc v$cell_thread_history
           Name                            Null?    Type
           ------------------------------- -------- ----------------------------
    1      CELL_NAME                                VARCHAR2(1024)
    2      SNAPSHOT_ID                              NUMBER
    3      SNAPSHOT_TIME                            DATE
    4      THREAD_ID                                NUMBER
    5      JOB_TYPE                                 VARCHAR2(32)
    6      WAIT_STATE                               VARCHAR2(32)
    7      WAIT_OBJECT_NAME                         VARCHAR2(32)
    8      SQL_ID                                   VARCHAR2(13)
    9      DATABASE_ID                              NUMBER
   10      INSTANCE_ID                              NUMBER
   11      SESSION_ID                               NUMBER
   12      SESSION_SERIAL_NUM                       NUMBER

It keeps about 10 minutes worth of samples of Exadata Storage Cell thread activity:

SQL> @minmax snapshot_time v$cell_thread_history
Show min/max (low/high) values in column "snapshot_time" of table v$cell_thread_history...

MIN(SNAPSHOT_TIME MAX(SNAPSHOT_TIME
----------------- -----------------
20130419 14:42:15 20130419 14:52:54

Note that it’s not the V$ view or the database instance which stores this array – it’s the storage cells themselves. If you query the v$cell_thread_history view, your Oracle database session is going to “gather” this instrumentation data from all the required cells and present it to you, that’s why the “cell statistics gather” wait event shows up:

PARSING IN CURSOR #140596385017248 len=42 dep=0 uid=0 oct=3 lid=0 tim=1366404896817011 hv=4063158547 ad='19e452578' sqlid='63awy1gg
t2xs8m'
select count(*) from v$cell_thread_history
END OF STMT
PARSE #140596385017248:c=0,e=115,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=4272803188,tim=1366404896817010
EXEC #140596385017248:c=0,e=27,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=4272803188,tim=1366404896817127
WAIT #140596385017248: nam='SQL*Net message to client' ela= 1 driver id=1413697536 #bytes=1 p3=0 obj#=40 tim=1366404896817152
WAIT #140596385017248: nam='cell statistics gather' ela= 283 cellhash#=0 p2=0 p3=0 obj#=40 tim=1366404896818846
WAIT #140596385017248: nam='cell statistics gather' ela= 352 cellhash#=0 p2=0 p3=0 obj#=40 tim=1366404896819317
WAIT #140596385017248: nam='cell statistics gather' ela= 376 cellhash#=0 p2=0 p3=0 obj#=40 tim=1366404896820929
WAIT #140596385017248: nam='cell statistics gather' ela= 326 cellhash#=0 p2=0 p3=0 obj#=40 tim=1366404896822198
WAIT #140596385017248: nam='cell statistics gather' ela= 580 cellhash#=0 p2=0 p3=0 obj#=40 tim=1366404896823620
...

Now, how frequently do the cells sample their “ASH” data:

SQL> SELECT DISTINCT (snapshot_time - LAG(snapshot_time) 
                      OVER (ORDER BY snapshot_time)) * 86400 lag_seconds
     FROM (SELECT distinct snapshot_time 
           FROM v$cell_thread_history WHERE cell_name = '192.168.12.3');

LAG_SECONDS
-----------

          1

Looks like the sampling is done exactly once per second!

So, great, what can we do with this data?

Here’s an Exadata performance script (cth.sql) which uses V$CELL_THREAD_HISTORY to dig into cell activity from the database layer.

[Read more...]

Public Appearances and Exadata Performance Training

I will be doing a lot of (Exadata) talking and teaching in the coming months. Here’s a list of events where you’ll see me speaking, teaching, hacking, learning and hopefully also drinking beer:

  • 16 April 2013UKOUG Engineered Systems Summit
    • London, UK
    • I will talk about the common Exadata performance lessons learned in my “4 years of Exadata performance” talk
    • Andy Colvin and Martin Bach of Enkitec will also speak there (yes, Martin has joined Enkitec in Europe!!! :-)
  • 30 April 2013NYOUG Training Day
    • Manhattan, NYC
    • This is the first ever public delivery of my new Getting the Most Out of Oracle’s Active Session History, Time Model and Wait events seminar!
  • 2-3 May 2013Advanced Exadata Performance seminar – in-classroom!
    • First public delivery of this class – and we’ll be going very deep into Exadata internals and performance ;-)
    • Irving, TX (in Enkitec HQ)
  • 13-16 May 2013Advanced Exadata Performance seminar – online
    • Same as previous, but online.
  • 10-11 June 2013 – My new ASH seminar – online
    • Getting the Most Out of Oracle’s Active Session History, Time Model and Wait events
    • More details will appear in the training & seminar page soon
  • 13-14 June 2013 – Oracle Data Warehouse Global Leaders forum (by invitation-only event run by Oracle)
    • Amsterdam, Netherlands
    • I will speak about my “4 years of Exadata performance” experience and probably learn from others’ experience too
  • 5-6 Aug 2013Enkitec Extreme Exadata Expo (E4) conference
    • Irving, TX
    • I will be speaking there, haven’t set the exact topic yet, but it will include demos and hacking something I suspect ;-)
  • 22-26 September 2013Oracle OpenWorld conference
    • San Francisco, CA
    • I will attend the ACE Directors briefing before the conference, likely speak, hang out at Enkitec booth, probably hack something at Enkitec’s Conference HQ for fun and definitely drink beer there. Maybe I’ll even attend the Wednesday’s party for a change!

As you see, my great plans to not travel much are not going to work out well :)

Actually it is better, this year I’ve managed to only travel twice so far (and one of the trips was for vacation!) and I haven’t had to do too many of the crazy around-the-world in 5 days trips I did when living in back Singapore… but looks like I’ll get to 2 million km nevertheless this year:

Tanel's Tripit stats April 2013

Well, see you in some corner of the world some day!

Listing Exadata storage cells and their configuration info from V$CELL_CONFIG

If you attended my Exadata hacking session today, you saw me using the cellver.sql script which lists some basic configuration info about the currently connected storage cells:

SQL> @exadata/cellver
Show Exadata cell versions from V$CELL_CONFIG....

CELLNAME             CELLSRV_VERSION      FLASH_CACHE_MODE     CPU_COUNT  UPTIME               KERNEL_VERSION                 MAKE_MODEL
-------------------- -------------------- -------------------- ---------- -------------------- ------------------------------ --------------------------------------------------
192.168.12.10        11.2.3.2.1           WriteBack            24         8 days, 2:07         2.6.32-400.11.1.el5uek         Oracle Corporation SUN FIRE X4270 M2 SERVER SAS
192.168.12.11        11.2.3.2.1           WriteBack            24         8 days, 2:06         2.6.32-400.11.1.el5uek         Oracle Corporation SUN FIRE X4270 M2 SERVER SAS
192.168.12.8         11.2.3.2.1           WriteBack            24         8 days, 2:06         2.6.32-400.11.1.el5uek         Oracle Corporation SUN FIRE X4270 M2 SERVER SAS
192.168.12.9         11.2.3.2.1           WriteBack            24         8 days, 2:06         2.6.32-400.11.1.el5uek         Oracle Corporation SUN FIRE X4270 M2 SERVER SAS

The output is pretty self-explanatory. One thing to note is that the CPU_COUNT is not the number of CPU cores, but just the number of “virtual” CPU threads presented to the OS, in this case 24 threads over 12 cores.

The script itself is simple, just extracting some XML values from the “CELL” type config records:

SQL> l
  1  SELECT
  2      cellname cv_cellname
  3    , CAST(extract(xmltype(confval), '/cli-output/cell/releaseVersion/text()') AS VARCHAR2(20))  cv_cellVersion
  4    , CAST(extract(xmltype(confval), '/cli-output/cell/flashCacheMode/text()') AS VARCHAR2(20))  cv_flashcachemode
  5    , CAST(extract(xmltype(confval), '/cli-output/cell/cpuCount/text()')       AS VARCHAR2(10))  cpu_count
  6    , CAST(extract(xmltype(confval), '/cli-output/cell/upTime/text()')         AS VARCHAR2(20))  uptime
  7    , CAST(extract(xmltype(confval), '/cli-output/cell/kernelVersion/text()')  AS VARCHAR2(30))  kernel_version
  8    , CAST(extract(xmltype(confval), '/cli-output/cell/makeModel/text()')      AS VARCHAR2(50))  make_model
  9  FROM
 10      v$cell_config  -- gv$ isn't needed, all cells should be visible in all instances
 11  WHERE
 12      conftype = 'CELL'
 13  ORDER BY
 14*     cv_cellname

I will add some more scripts to the exadata directory over the coming days.

Enjoy! :)

ExaSnapper 0.7 beta download and the hacking session videos

Thank you for attending the Exadata Snapper (ExaSnapper) hacking session!

I have split the recording of this session into 3 pieces and uploaded to enkitec.tv. The ExaSnapper v0.7 beta that I demoed is also available now at enkitec.com (registration needed). See the links below.

For quick reference, here’s the syntax of running ExaSnapper – there are two modes, one is the before/after capture (think Tom Kyte’s runstats, but for exadata metrics) and the other is more like a DBA-monitoring mode, where you can just measure a few seconds worth of a long-running query runtime and get the IO and efficiency figures from there.

Here’s an excerpt from the install script documentation section:

-- Usage:       Take a snapshot of a running session (use QC SID if PX):
--
--                a) Monitor a running query - "DBA mode" 
--
--                   SELECT * FROM TABLE(exasnap.display_sid(, [snap_seconds], [detail_level]));
--
--                   The SID argument can be just a number (SID in local instance) or a remote SID with
--                   @instance after it (like '123@4')
--
--                   SELECT * FROM TABLE(exasnap.display_sid(123));
--                   SELECT * FROM TABLE(exasnap.display_sid('123@4', p_detail=>'%');
--
--                b) Take Before & After snapshots of a query execution - "Developer Mode"
--
--                   1) SELECT exasnap.begin_snap(123) FROM dual;
--                        or
--                      EXEC :begin_snap_id := exasnap.begin_snap(123);
-- 
--                   2) Run your query, wait until it finishes (or CTRL+C)
--
--                   3) SELECT exasnap.end_snap(123) FROM dual;
--                        or
--                      EXEC :end_snap_id := exasnap.end_snap(123);
--
--                   4) SELECT * FROM TABLE(exasnap.display_snap(:begin_snap_id, :end_snap_id, '%'));
--

One slide illustrating the idea of ExaSnapper:

[Read more...]

Drilling Deep Into Exadata Performance with ASH, SQL Monitoring and Exadata Snapper – slides and a hacking session!

The promised hacking session about Exadata performance troubleshooting will happen on Thursday 21st February 9am-11am PST (2 hours). I will show a number of examples and demos where the Exadata Snapper shows what was going on in storage cells when running Smart Scans or just doing IO.

Register here (free stuff!):

Slides:

I will post the scripts on the day of this session.

Note that you should read through the slides before the hacking session – as I’m not going to spend much time on presenting slides, as the name says – it’s a hacking session, so 95% demos. I will show a couple of important slides which explain the essence of Exadata Snapper (check slide 17, 21, 22), but otherwise it’s all demos.

This hacking session will also give a little preview of what’s going to happen in my new Advanced Exadata Performance seminars – see below ;-) As far as I know, there is no other seminar in the world (well at least available to public) where you can learn so deep knowledge about Exadata performance troubleshooting, optimization and internals – all arranged around a systematic approach for dealing with Exadata performance issues.

New Seminar – Advanced Exadata Performance: Troubleshooting and Optimization

This advanced seminar is meant for DBAs and developers who already have at least the basic experience of working with Exadata and have seen some of the common performance problems that may occur in production or performance tests. In this class we will dive deep into details of how Exadata works internally and then how to systematically troubleshoot performance problems caused either by application, database, storage cell software or Exadata hardware issues. The demos and troubleshooting examples are shown on Enkitec’s in-house Exadata X2-2 and V2 machines and we will use both standard Oracle tools and advanced Exadata troubleshooting scripts written by Tanel.

What students will learn:

  • Understand how Exadata features, the storage cell servers and the DB-cell communication work internally
  • How to use database, OS and cell metrics for systematically troubleshooting Exadata performance issues, from basic to complex problems involving multiple components, bugs and hardware issues
  • How to identify which workloads and SQL statements are not Exadata-friendly and what to do about it
  • How to design for optimal Exadata performance and avoid design and coding pitfalls

What this class is not:

  • This class is not an introductory “what is Exadata” class, but an advanced seminar about getting the most out of your Exadatas and resolving even complex performance problems systematically and fast
  • This class is not an introductory SQL or database tuning class, but focuses on troubleshooting only Exadata-related performance issues of your workloads
  • This class is not a general Exadata infrastructure troubleshooting class (like hardware, networking and Grid Infrastructure configuration problems), instead we focus entirely on performance

 

Day 1 – Exadata Internals, Systematic Troubleshooting and Smart Scans

    • Understanding the low-level internals of Exadata data flow all the way from cell disks to the database nodes – and the physics behind its performance
    • Using basic performance tools like V$SQLSTATS, ASH and SQL Monitoring views on Exadata
    • Troubleshooting why Smart Scans (and direct path reads) do not get used
    • Measuring the efficiency of Smart Scans and how much work gets offloaded to the storage cells
    • Troubleshooting advanced Smart Scan performance problems, like passthrough mode and selective fallback to block IO – with Exadata Snapper
    • Using CELLCLI and cell-level metrics where database & session metrics aren’t enough
Day 2 – Parallel Execution, Hybrid Columnar Compression and OLTP

    • PX issues, including the In-Memory Parallel Execution feature
    • Optimizing hash joins for efficient bloom filtering in the cells
    • Troubleshooting Exadata Hybrid Columnar Compression issues
    • Troubleshooting long commit and disk write latency issues on Exadata
    • Troubleshooting flash caching issues and optimizing for tiered caching
    • Learn to use the Exadata scripts, tools and OS commands Tanel regularly uses for Exadata performance troubleshooting
    • Q&A and free-form demo session and/or reviewing Exadata performance case studies

If you like this – check out my Oracle training and seminars page, the first sessions happen already in May! (And there’s even one on-site in Enkitec office in Irving, TX).

Select statement generating redo – and lost write detection

Gwen Shapira has written a nice summary of a problem case where the classic wait interface based troubleshooting method is not always enough for troubleshooting low-level issues.

The top SQL + top wait approach should usually be used as the starting point of troubleshooting a session, workload etc, but often the troubleshooting does not stop there. So, when the wait interface and related tools don’t explain the problem well enough, then you either start guessing from there or dig deeper into performance data. And Gwen used the V$SESSTAT metrics (using my Snapper tool) to understand why was a select statement generating redo this time (there are multiple reasons for that – and V$SESSTAT tells you why).

By the way, I’ve repeated this many times, here’s the data and the order of it what I normally use when I want to understand what’s going on in a session or workload:

  1. TOP Waits / TOP SQL
  2. V$SESSTAT counters
  3. Process stack samples

Sometimes looking into #1 is enough, often I need to look into #1 and #2 and rarely, but regularly I need to look into #3 too. But I always start from #1, then #2 and then, if needed, #3. I call this the “Advanced Oracle Troubleshooting” method ;-)

Read Gwen’s article here and another recent blog entry of mine about how V$SESSTAT helps you to measure, not guess.

Optimizer statistics-driven direct path read decision for full table scans (_direct_read_decision_statistics_driven)

Hello all fellow Oracle geeks and technology enthusiasts! Long time no see ;-)

In the hacking session about Oracle full table scans and direct path reads I explained how the direct path read decision is not done by the optimizer, but instead during every execution, separately for every single segment (partition) scanned in the query. I also explained how the _small_table_threshold parameter and the X$KCBOQH.NUM_BUF(which keeps track of how many buffers of any segment are currently cached) are used for determining whether to scan using direct path reads or not.

If you don’t know what the _small_table_threshold parameter is and how it relates to the direct path read decision, then read this post by Alex Fatkulin first.

In addition to the abovementioned values, Oracle needs to know how many blocks there are to scan (under the HWM) in a segment. This was traditionally done by reading in the segment header block first (using a regular buffered single block read – “db file sequential read”).

Starting from Oracle 11.2.0.2, things have changed a little. Instead of making the direct path read decision based on the actual segment block count extracted from the segment header, Oracle actually takes this number from TAB$.BLKCNT (dba_tables.blocks) or TABPART$.BLKCNT, IND$.LEAFCNT etc.

There’s a new parameter _direct_read_decision_statistics_driven which controls this:

 SQL> @pd direct_read_decision
 Show all parameters and session values from x$ksppi/x$ksppcv...

NAME                                               VALUE    DESCRIPTION
 -------------------------------------------------- -------- ----------------------------------------------------------
 _direct_read_decision_statistics_driven            TRUE     enable direct read decision based on optimizer statistics
  • When this parameter is FALSE, the direct path read decision is done based on the segment header’s block count (actual block count).
  • When TRUE (default in 11.2.0.2+), the direct path read decision is done based on the block count stored in one of the base tables (TAB$, IND$) – the optimizer statistics

 
Note that even though the block counts are taken from the optimizer statistics in data dictionary, it’s not the optimizer who does the direct path read decision in the traditional sense (it’s not a cost-based decision).

Here’s an example from my test database:

SQL> CREATE TABLE t AS SELECT * FROM dba_source;

Table created.

SQL> @gts t
Gather Table Statistics for table t...

PL/SQL procedure successfully completed.

SQL>
SQL> SELECT blocks FROM user_tables WHERE table_name = 'T';

BLOCKS
----------
10704

SQL>

The table uses 10704 blocks (up HWM).

SQL> SELECT name,block_size,buffers FROM v$buffer_pool;

NAME                 BLOCK_SIZE    BUFFERS
-------------------- ---------- ----------
DEFAULT                    8192       9424

The table (10704 blocks) is bigger than the entire buffer cache (9424 buffers). And this is way bigger than the _small_table_threshold value of 168 in my instance (watch the full table scans and direct path reads) hacking session for more about this parameter.

So whenever I run a SELECT COUNT(*) FROM t, I see direct path reads show up:

SQL> @snapper ash,stats=sw,sinclude=table.scan|gets 5 1 99
Sampling SID 99 with interval 5 seconds, taking 1 snapshots...

-- Session Snapper v3.54 by Tanel Poder ( http://blog.tanelpoder.com )

----------------------------------------------------------------------------------------------------------------------------------------
    SID, USERNAME  , TYPE, STATISTIC                                                 ,         DELTA, HDELTA/SEC,    %TIME, GRAPH
----------------------------------------------------------------------------------------------------------------------------------------
     99, SYS       , STAT, db block gets                                             ,             1,         .2,
     99, SYS       , STAT, db block gets from cache                                  ,             1,         .2,
     99, SYS       , STAT, consistent gets                                           ,         11867,      2.37k,
     99, SYS       , STAT, consistent gets from cache                                ,             7,        1.4,
     99, SYS       , STAT, consistent gets from cache (fastpath)                     ,             6,        1.2,
     99, SYS       , STAT, consistent gets - examination                             ,             1,         .2,
     99, SYS       , STAT, consistent gets direct                                    ,         11860,      2.37k,
     99, SYS       , STAT, no work - consistent read gets                            ,         11859,      2.37k,
     99, SYS       , STAT, cleanouts only - consistent read gets                     ,             1,         .2,
     99, SYS       , STAT, table scans (long tables)                                 ,             1,         .2,
     99, SYS       , STAT, table scans (direct read)                                 ,             1,         .2,
     99, SYS       , STAT, table scan rows gotten                                    ,        739834,    147.97k,
     99, SYS       , STAT, table scan blocks gotten                                  ,         11860,      2.37k,
     99, SYS       , TIME, parse time elapsed                                        ,            46,      9.2us,      .0%, |          |
     99, SYS       , TIME, DB CPU                                                    ,         79988,       16ms,     1.6%, |@         |
     99, SYS       , TIME, sql execute elapsed time                                  ,        254990,       51ms,     5.1%, |@         |
     99, SYS       , TIME, DB time                                                   ,        255375,    51.08ms,     5.1%, |@         |
     99, SYS       , WAIT, enq: KO - fast object checkpoint                          ,        174947,    34.99ms,     3.5%, |@         |
     99, SYS       , WAIT, direct path read                                          ,          1280,      256us,      .0%, |          |
     99, SYS       , WAIT, SQL*Net message to client                                 ,             9,      1.8us,      .0%, |          |
     99, SYS       , WAIT, SQL*Net message from client                               ,       4672912,   934.58ms,    93.5%, |@@@@@@@@@@|
     99, SYS       , WAIT, events in waitclass Other                                 ,             6,      1.2us,      .0%, |          |
--  End of Stats snap 1, end=2012-09-02 20:03:55, seconds=5

---------------------------------------------------------------------------------
Active% | SQL_ID          | EVENT                               | WAIT_CLASS
---------------------------------------------------------------------------------
     2% | 88r4qn9mwhcf5   | enq: KO - fast object checkpoint    | Application
     2% | 88r4qn9mwhcf5   | ON CPU                              | ON CPU

--  End of ASH snap 1, end=2012-09-02 20:03:55, seconds=5, samples_taken=43

Let’s now fake the table stats so it looks like that there’s only 5 blocks in it – way below the _small_table_threshold value

SQL> EXEC DBMS_STATS.SET_TABLE_STATS(user,'T',numblks=>5);

PL/SQL procedure successfully completed.

SQL> SELECT COUNT(*) FROM t /* attempt 2 */;

  COUNT(*)
----------
      1000

The direct path reads are gone – we are doing regular buffered reads now!

SQL> @snapper ash,stats=sw,sinclude=table.scan|gets 5 1 99

Sampling SID 99 with interval 5 seconds, taking 1 snapshots...

-- Session Snapper v3.54 by Tanel Poder ( http://blog.tanelpoder.com )

----------------------------------------------------------------------------------------------------------------------------------------
    SID, USERNAME  , TYPE, STATISTIC                                                 ,         DELTA, HDELTA/SEC,    %TIME, GRAPH
----------------------------------------------------------------------------------------------------------------------------------------
     99, SYS       , STAT, db block gets                                             ,             1,        .17,
     99, SYS       , STAT, db block gets from cache                                  ,             1,        .17,
     99, SYS       , STAT, consistent gets                                           ,         11865,      1.98k,
     99, SYS       , STAT, consistent gets from cache                                ,         11865,      1.98k,
     99, SYS       , STAT, consistent gets from cache (fastpath)                     ,         11528,      1.92k,
     99, SYS       , STAT, consistent gets - examination                             ,             1,        .17,
     99, SYS       , STAT, no work - consistent read gets                            ,         11851,      1.98k,
     99, SYS       , STAT, cleanouts only - consistent read gets                     ,             1,        .17,
     99, SYS       , STAT, table scans (long tables)                                 ,             1,        .17,
     99, SYS       , STAT, table scan rows gotten                                    ,        738834,    123.14k,
     99, SYS       , STAT, table scan blocks gotten                                  ,         11852,      1.98k,
     99, SYS       , TIME, parse time elapsed                                        ,            84,       14us,      .0%, |          |
     99, SYS       , TIME, DB CPU                                                    ,        109983,    18.33ms,     1.8%, |@         |
     99, SYS       , TIME, sql execute elapsed time                                  ,        116709,    19.45ms,     1.9%, |@         |
     99, SYS       , TIME, DB time                                                   ,        117102,    19.52ms,     2.0%, |@         |
     99, SYS       , WAIT, db file scattered read                                    ,         63956,    10.66ms,     1.1%, |@         |
     99, SYS       , WAIT, SQL*Net message to client                                 ,             8,     1.33us,      .0%, |          |
     99, SYS       , WAIT, SQL*Net message from client                               ,       5119722,   853.29ms,    85.3%, |@@@@@@@@@ |
--  End of Stats snap 1, end=2012-09-02 20:06:19, seconds=6

---------------------------------------------------------------------------------
Active% | SQL_ID          | EVENT                               | WAIT_CLASS
---------------------------------------------------------------------------------
     2% | 07sgczqj432mr   | db file scattered read              | User I/O

--  End of ASH snap 1, end=2012-09-02 20:06:19, seconds=5, samples_taken=46

Note that I deliberately forced a hard parse (with the “attempt 2″ comment) to compile a new cursor. The _direct_read_decision_statistics_driven parameter is not part of the optimizer environment, so a new child cursor would not be automatically created after the parameter change (the same applies to the _small_table_threshold and _serial_direct_read parameters, by the way). But when I change the SQL text, then an entirely new (parent and child) cursor will be compiled anyway.

But wait a minute! Why do I need to compile a new cursor to get Oracle to read the new block count value from optimizer stats?!

I have said in the beginning of this post (and in many other places) that the direct path read decision is not done by the optimizer anyway and is a runtime decision done during every execution, every time any segment (including individual partitions) is scanned during query runtime. This is true for the old (up to 11.2.0.1) Oracle versions, where a direct path decision is done based on the actual, current block count in the segment header, thus the decision can suddenly change when a segment grows by a few blocks, crossing the _small_table_threshold calculation threshold. Perhaps due to performance stability reasons, this seems to have changed.

My tests on 11.2.0.2 have so far shown that when using the new statistics-driven direct path read decisions, each segments’ block counts are stored somewhere in the compiled cursor and reused during next executions of it, even if the block count of the segment changes in the optimizer stats later on! This might result in somewhat better stability as long as you don’t gather new stats – and your buffer cache size (and already cached block counts) don’t change. However if the amount of cached blocks of a segment does change (due to other, index-based accesses for example), then the direct path decision can still change during runtime. It’s just the block counts which are stored in the cursor, but the other factors affecting the decision (buffer cache size, cached block counts) can still change.

This topic is especially relevant on Exadata, as the entire Smart Scanning functionality depends on whether a direct path read IO method gets picked for full segment scans. When experimenting with this, you’ve got to be pretty careful and thorough (to not come to wrong conclusions) as there multiple moving parts and factors involved in the decisions:

  1. Are block counts taken from optimizer stats or segment header
  2. Do the segment header and/or optimizer stats block counts change
  3. Does the buffer cache size change (thus the _small_table_threshold too)
  4. Does the amount of cached blocks of a segment change
  5. Parallel vs Serial execution
  6. Are buffered parallel full scans allowed (the in-memory PX feature of 11.2)
  7. Did a new child cursor get created or the old one reused
  8. etc :)