Combining Bloom Filter Offloading and Storage Indexes on Exadata

Here’s a little known feature of Exadata – you can use a Bloom filter computed from a join column of a table to skip disk I/Os against another table it is joined to. This not the same as the Bloom filtering of the datablock contents in Exadata storage cells, but rather avoiding reading in some storage regions from the disks completely.

So, you can use storage indexes to skip I/Os against your large fact table, based on a bloom filter calculated from a small dimension table!

This is useful especially for dimensional star schemas, as your SQL statements might not have direct predicates on your large fact tables at all, all results will be determined by looking up relevant dimension records and then performing a hash join to the fact table (whether you should have some direct predicates against the fact tables, for performance reasons, is a separate topic for some other day :-)

Let me show an example using the SwingBench Order Entry schema. The first output is from Oracle 11.2.0.3 BP21 on Cellsrv 12.1.1.1.0:

SQL> ALTER SESSION SET "_serial_direct_read"=ALWAYS;

Session altered.

SQL> SELECT
  2      /*+ LEADING(c)
  3          NO_SWAP_JOIN_INPUTS(o)
  4          INDEX_RS_ASC(c(cust_email))
  5          FULL(o)
  6          MONITOR
  7      */
  8      *
  9  FROM
 10      soe.customers c
 11    , soe.orders o
 12  WHERE
 13      o.customer_id = c.customer_id
 14  AND c.cust_email = 'florencio@ivtboge.com'
 15  /

CUSTOMER_ID CUST_FIRST_NAME                CUST_LAST_NAME  ...
----------- ------------------------------ --------------
  399999199 brooks                         laxton        

Elapsed: 00:00:55.81

You can ignore the hints in the query, I just used these for get the plan I wanted for my demo. I also forced the serial full segment scans to use direct path reads (and thus attempt Smart Scans on Exadata) using the _serial_direct_read parameter.

Note that while I do have a direct filter predicate on the “small” CUSTOMERS table, I don’t have any predicates on the “large” ORDERS table, so there’s no filter predicate to offload to storage layer on the ORDERS table (but the column projection and HCC decompression can still be offloaded on that table too). Anyway, the query ran in over 55 seconds.

Let’s run this now with PARALLEL degree 2 and compare the results:

SQL> ALTER SESSION FORCE PARALLEL QUERY PARALLEL 2;

Session altered.

SQL> SELECT
  2      /*+ LEADING(c)
  3          NO_SWAP_JOIN_INPUTS(o)
  4          INDEX_RS_ASC(c(cust_email))
  5          FULL(o)
  6          MONITOR
  7      */
  8      *
  9  FROM
 10      soe.customers c
 11    , soe.orders o
 12  WHERE
 13      o.customer_id = c.customer_id
 14  AND c.cust_email = 'florencio@ivtboge.com'
 15  /

CUSTOMER_ID CUST_FIRST_NAME                CUST_LAST_NAME       
----------- ------------------------------ ---------------------
  399999199 brooks                         laxton               

Elapsed: 00:00:03.80

Now the query ran in less than 4 seconds. How come did the same query run close to 15 times faster than in serial? The parallel degree 2 for this simple query should give me max 4 slaves doing the work… 15x speedup indicates that something else has changed as well.

The first thing I would normally suspect is that perhaps the direct path reads were not attempted for the serial query (and Smart Scans did not kick in). That would sure explain the big performance difference… but I did explicitly force the serial direct path reads in my session. Anyway, let’s stop guessing and instead know for sure by measuring these experiments!

SQL Monitoring reports are a good starting point (I have added the MONITOR hint to the query so that the SQL monitoring would kick in immediately also for serial queries).

Here’s the slow serial query:

Global Stats
====================================================================================================
| Elapsed |   Cpu   |    IO    | Application |  Other   | Fetch | Buffer | Read  | Read  |  Cell   |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   | Waits(s) | Calls |  Gets  | Reqs  | Bytes | Offload |
====================================================================================================
|      56 |      49 |     6.64 |        0.01 |     0.01 |     2 |     4M | 53927 |  29GB |  14.53% |
====================================================================================================

SQL Plan Monitoring Details (Plan Hash Value=2205859845)
==========================================================================================================================================
| Id |               Operation               |     Name      | Execs |   Rows   | Read  |  Cell   | Activity |      Activity Detail      |
|    |                                       |               |       | (Actual) | Bytes | Offload |   (%)    |        (# samples)        |
==========================================================================================================================================
|  0 | SELECT STATEMENT                      |               |     1 |        1 |       |         |          |                           |
|  1 |   HASH JOIN                           |               |     1 |        1 |       |         |    42.86 | Cpu (24)                  |
|  2 |    TABLE ACCESS BY GLOBAL INDEX ROWID | CUSTOMERS     |     1 |        1 |       |         |          |                           |
|  3 |     INDEX RANGE SCAN                  | CUST_EMAIL_IX |     1 |        1 |       |         |          |                           |
|  4 |    PARTITION HASH ALL                 |               |     1 |     458M |       |         |          |                           |
|  5 |     TABLE ACCESS STORAGE FULL         | ORDERS        |    64 |     458M |  29GB |  14.53% |    57.14 | Cpu (29)                  |
|    |                                       |               |       |          |       |         |          | cell smart table scan (3) |
==========================================================================================================================================

So, the serial query issued 29 GB worth of IO to the Exadata storage cells, but only 14.53% less data was sent back… not that great reduction in the storage interconnect traffic. Also, looks like all 458 Million ORDERS table rows were sent back from the storage cells (as the ORDERS table didn’t have any direct SQL predicates against it). Those rows were then fed to the HASH JOIN parent row source (DB CPU usage!) and it just threw all of the rows away but one. Talk about inefficiency!

Ok, this is the fast parallel query:

Global Stats
==============================================================================================================
| Elapsed | Queuing |   Cpu   |    IO    | Application |  Other   | Fetch | Buffer | Read  | Read  |  Cell   |
| Time(s) | Time(s) | Time(s) | Waits(s) |  Waits(s)   | Waits(s) | Calls |  Gets  | Reqs  | Bytes | Offload |
==============================================================================================================
|    4.88 |    0.00 |    0.84 |     4.03 |        0.00 |     0.01 |     2 |     4M | 30056 |  29GB |  99.99% |
==============================================================================================================

SQL Plan Monitoring Details (Plan Hash Value=2121763430)
===============================================================================================================================================
| Id |                 Operation                  |     Name      | Execs |   Rows   | Read  |  Cell   | Activity |      Activity Detail      |
|    |                                            |               |       | (Actual) | Bytes | Offload |   (%)    |        (# samples)        |       
===============================================================================================================================================
|  0 | SELECT STATEMENT                           |               |     3 |        1 |       |         |          |                           |       
|  1 |   PX COORDINATOR                           |               |     3 |        1 |       |         |          |                           |       
|  2 |    PX SEND QC (RANDOM)                     | :TQ10001      |     2 |        1 |       |         |          |                           |       
|  3 |     HASH JOIN                              |               |     2 |        1 |       |         |          |                           |       
|  4 |      BUFFER SORT                           |               |     2 |        2 |       |         |          |                           |       
|  5 |       PX RECEIVE                           |               |     2 |        2 |       |         |          |                           |       
|  6 |        PX SEND BROADCAST                   | :TQ10000      |     1 |        2 |       |         |          |                           |       
|  7 |         TABLE ACCESS BY GLOBAL INDEX ROWID | CUSTOMERS     |     1 |        1 |       |         |          |                           |       
|  8 |          INDEX RANGE SCAN                  | CUST_EMAIL_IX |     1 |        1 |       |         |          |                           |       
|  9 |      PX PARTITION HASH ALL                 |               |     2 |     4488 |       |         |          |                           |       
| 10 |       TABLE ACCESS STORAGE FULL            | ORDERS        |    64 |     4488 |  29GB |  99.99% |   100.00 | cell smart table scan (4) |
===============================================================================================================================================

The fast parallel query still issued 29 GB of IO, but only 0.01% of data (compared to the amount of IOs issued) was sent back from the storage cells, so the offload efficiency for that table scan is 99.99%. Also, only a few thousand rows were returned back from the full table scan (so the throw-away by the hash join later in the plan is smaller).

So, where does the big difference in runtime and offload efficiency metrics come from? It’s the same data, same query (looking for the same values) so how come do we have so different Offload Efficiency %?

The problem with looking into a single metric showing some percentage (of what?!) is that it hides a lot of detail. So, lets get systematic and look into some detailed metrics :-)

I’ll use the Exadata Snapper for this purpose, although I could just list the V$SESSTAT metrics it uses as its source. The ExaSnapper lists various Exadata I/O metrics of the monitored session all in one “chart”, in the same unit (MB) & scale – so it will be easy to compare these different cases. If you do not know what the Exadata Snapper is, then check out this article and video.

Here’s the slow serial query (you may need to scroll the output right to see the MB numbers):

SQL> SELECT * FROM TABLE(exasnap.display_snap(:t1,:t2));

NAME
-------------------------------------------------------------------------------------------------------------------------------------------
-- ExaSnapper v0.81 BETA by Tanel Poder @ Enkitec - The Exadata Experts ( http://www.enkitec.com )
-------------------------------------------------------------------------------------------------------------------------------------------
DB_LAYER_IO                    DB_PHYSIO_BYTES               |##################################################|    29291 MB    365 MB/sec
DB_LAYER_IO                    DB_PHYSRD_BYTES               |##################################################|    29291 MB    365 MB/sec
DB_LAYER_IO                    DB_PHYSWR_BYTES               |                                                  |        0 MB      0 MB/sec
AVOID_DISK_IO                  PHYRD_FLASH_RD_BYTES          |################################################# |    29134 MB    363 MB/sec
AVOID_DISK_IO                  PHYRD_STORIDX_SAVED_BYTES     |                                                  |        0 MB      0 MB/sec
REAL_DISK_IO                   SPIN_DISK_IO_BYTES            |                                                  |      157 MB      2 MB/sec
REAL_DISK_IO                   SPIN_DISK_RD_BYTES            |                                                  |      157 MB      2 MB/sec
REAL_DISK_IO                   SPIN_DISK_WR_BYTES            |                                                  |        0 MB      0 MB/sec
REDUCE_INTERCONNECT            PRED_OFFLOADABLE_BYTES        |##################################################|    29291 MB    365 MB/sec
REDUCE_INTERCONNECT            TOTAL_IC_BYTES                |##########################################        |    24993 MB    312 MB/sec
REDUCE_INTERCONNECT            SMART_SCAN_RET_BYTES          |##########################################        |    24993 MB    312 MB/sec
REDUCE_INTERCONNECT            NON_SMART_SCAN_BYTES          |                                                  |        0 MB      0 MB/sec
CELL_PROC_DEPTH                CELL_PROC_DATA_BYTES          |##################################################|    29478 MB    368 MB/sec
CELL_PROC_DEPTH                CELL_PROC_INDEX_BYTES         |                                                  |        0 MB      0 MB/sec
CLIENT_COMMUNICATION           NET_TO_CLIENT_BYTES           |                                                  |        0 MB      0 MB/sec
CLIENT_COMMUNICATION           NET_FROM_CLIENT_BYTES         |                                                  |        0 MB      0 MB/sec

The DB_PHYSDR_BYTES says this session submitted 29291 MB of I/O as far as the DB layer sees it. The PRED_OFFLOADABLE_BYTES says that all of this 29291 MB of I/O was attempted to be done in a smart way (smart scan). The SMART_SCAN_RET_BYTES tells us that 24993 MB worth of data was sent back from the storage cells as a result of the smart scan (about 14.5% less than the amount of IO issued).

However, the PHYRD_STORIDX_SAVED_BYTES shows zero, we couldn’t avoid doing (skip) any I/Os when scanning the ORDERS table. We couldn’t use the storage index as we did not have any direct filter predicates on the ORDERS table.

Anyway, the story is different for the fast parallel query:

SQL> SELECT * FROM TABLE(exasnap.display_snap(:t1,:t2));     

NAME
-------------------------------------------------------------------------------------------------------------------------------------------
-- ExaSnapper v0.81 BETA by Tanel Poder @ Enkitec - The Exadata Experts ( http://www.enkitec.com )              
-------------------------------------------------------------------------------------------------------------------------------------------
DB_LAYER_IO                    DB_PHYSIO_BYTES               |##################################################|    29291 MB   2212 MB/sec
DB_LAYER_IO                    DB_PHYSRD_BYTES               |##################################################|    29291 MB   2212 MB/sec
DB_LAYER_IO                    DB_PHYSWR_BYTES               |                                                  |        0 MB      0 MB/sec
AVOID_DISK_IO                  PHYRD_FLASH_RD_BYTES          |#######################                           |    13321 MB   1006 MB/sec
AVOID_DISK_IO                  PHYRD_STORIDX_SAVED_BYTES     |###########################                       |    15700 MB   1186 MB/sec
REAL_DISK_IO                   SPIN_DISK_IO_BYTES            |                                                  |      270 MB     20 MB/sec
REAL_DISK_IO                   SPIN_DISK_RD_BYTES            |                                                  |      270 MB     20 MB/sec
REAL_DISK_IO                   SPIN_DISK_WR_BYTES            |                                                  |        0 MB      0 MB/sec
REDUCE_INTERCONNECT            PRED_OFFLOADABLE_BYTES        |##################################################|    29291 MB   2212 MB/sec
REDUCE_INTERCONNECT            TOTAL_IC_BYTES                |                                                  |        4 MB      0 MB/sec
REDUCE_INTERCONNECT            SMART_SCAN_RET_BYTES          |                                                  |        4 MB      0 MB/sec
REDUCE_INTERCONNECT            NON_SMART_SCAN_BYTES          |                                                  |        0 MB      0 MB/sec
CELL_PROC_DEPTH                CELL_PROC_DATA_BYTES          |#######################                           |    13591 MB   1027 MB/sec
CELL_PROC_DEPTH                CELL_PROC_INDEX_BYTES         |                                                  |        0 MB      0 MB/sec
CLIENT_COMMUNICATION           NET_TO_CLIENT_BYTES           |                                                  |        0 MB      0 MB/sec
CLIENT_COMMUNICATION           NET_FROM_CLIENT_BYTES         |                                                  |        0 MB      0 MB/sec

The DB_PHYSIO_BYTES is still the same (29GB) as in the previous case – as this database layer metric knows only about the amount of I/Os it has requested, not what actually gets (or doesn’t get) done inside the storage cells. SMART_SCAN_RET_BYTES is only 4MB (compared to almost 25GB previously) so evidently there must be some early filtering going on somewhere in the lower layers.

And now to the main topic of this article – the PHYRD_STORIDX_SAVED_BYTES metric (that comes from the cell physical IO bytes saved by storage index metric in V$SESSTAT) shows that we have managed to avoid doing 15700 MB worth of IO completely thanks to the storage indexes. And this is without having any direct SQL filter predicates on the large ORDERS table! How is this possible? Keep reading :-)

Before going deeper, let’s look into the predicate section of these cursors.

The slow serial query:

--------------------------------------------------------------------------------------
| Id  | Operation                           | Name          | E-Rows | Pstart| Pstop |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |               |        |       |       |
|*  1 |  HASH JOIN                          |               |      1 |       |       |
|   2 |   TABLE ACCESS BY GLOBAL INDEX ROWID| CUSTOMERS     |      1 | ROWID | ROWID |
|*  3 |    INDEX RANGE SCAN                 | CUST_EMAIL_IX |      1 |       |       |
|   4 |   PARTITION HASH ALL                |               |    506M|     1 |    64 |
|   5 |    TABLE ACCESS STORAGE FULL        | ORDERS        |    506M|     1 |    64 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("O"."CUSTOMER_ID"="C"."CUSTOMER_ID")
   3 - access("C"."CUST_EMAIL"='florencio@ivtboge.com')

The fast parallel query:

-------------------------------------------------------------------------------------------
| Id  | Operation                                | Name          | E-Rows | Pstart| Pstop |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |               |        |       |       |
|   1 |  PX COORDINATOR                          |               |        |       |       |
|   2 |   PX SEND QC (RANDOM)                    | :TQ10001      |      1 |       |       |
|*  3 |    HASH JOIN                             |               |      1 |       |       |
|   4 |     BUFFER SORT                          |               |        |       |       |
|   5 |      PX RECEIVE                          |               |      1 |       |       |
|   6 |       PX SEND BROADCAST                  | :TQ10000      |      1 |       |       |
|   7 |        TABLE ACCESS BY GLOBAL INDEX ROWID| CUSTOMERS     |      1 | ROWID | ROWID |
|*  8 |         INDEX RANGE SCAN                 | CUST_EMAIL_IX |      1 |       |       |
|   9 |     PX PARTITION HASH ALL                |               |    506M|     1 |    64 |
|* 10 |      TABLE ACCESS STORAGE FULL           | ORDERS        |    506M|     1 |    64 |
-------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("O"."CUSTOMER_ID"="C"."CUSTOMER_ID")
   8 - access("C"."CUST_EMAIL"='florencio@ivtboge.com')
  10 - storage(SYS_OP_BLOOM_FILTER(:BF0000,"O"."CUSTOMER_ID"))
       filter(SYS_OP_BLOOM_FILTER(:BF0000,"O"."CUSTOMER_ID"))

See how the parallel plan has the storage() predicate with SYS_OP_BLOOM_FILTER in it and the bloom filter :BF0000 will be compared to the hashed CUSTOMER_ID column values when scanning the ORDERS table in storage cells. So, this shows that your session attempts to push the Bloom filter down to the storage layer.

However, merely seeing the Bloom filter storage predicate in the plan doesn’t tell you whether any IOs were avoided thanks to the storage indexes and Bloom filters. The only way to know would be to run the query and look into the cell physical IO bytes saved by storage index metric (or PHYRD_STORIDX_SAVED_BYTES in Exadata Snapper). Unfortunately it will not be easy to measure this at rowsource level when you have multiple smart scans happening in different locations of your execution plan (more about this in the next blog entry :-)

Anyway, how is Oracle able to avoid reading in blocks of one table based on the output of another table involved in the join?

The answer lies in the parameter _bloom_minmax_enabled. It’s description says “enable or disable bloom min max filtering”. This parameter is true by default and on Exadata it means that in addition to computing the Bloom filter bitmap (based on hashed values in the join column) we also keep track of the smallest and biggest value retrieved from the driving table of the join.

In our example, both the Bloom filter bitmap of all matching CUSTOMER_IDs retrieved from the CUSTOMERS table (after any direct predicate filtering) would be sent to the storage cells and also the biggest and smallest CUSTOMER_ID retrieved from the driving table of the join.

In our hand-crafted example where only a single customer was taken from the driving table, only one bit in the bloom filter would be set and both the MIN and MAX CUSTOMER_ID of interest (in the joined ORDERS table) was set to 399999199 (I had about 400 Million customers generated in my benchmark dataset).

So, knowing that we were looking for only CUSTOMER_IDs in the single value “range” of 399999199, the smart scan could start skipping IOs thanks to the storage indexes in memory! You can think of this as runtime BETWEEN predicate generation from a driving table in a join that then gets offloaded to the storage cell and applied when scanning the other table in the join. IOs can be avoided thanks to knowing the range of values we are looking for and then further row-level filtering can be done using the usual Bloom filter bitmap comparison.

Note that the above examples are from Oracle 11.2.0.3 – in this version the Bloom filtering features should kick in only for parallel queries (although Jonathan Lewis has blogged about a case when this happens also on 11.2.0.3).

And this is the ultimate reason why the serial execution was so much slower and less efficient than the parallel run – in my demos, on 11.2.0.3 the Bloom filter usage kicked in only when running the query in parallel.

However, this has changed on Oracle 11.2.0.4 – now also serial queries frequently compute bloom filters and push these to the storage, for the usual bloom filtering reasons and for IO skipping by comparing the bloom filter min/max values to the Exadata storage index memory structures.

This is an example from Oracle 11.2.0.4, the plan is slightly different because in this database my test tables are not partitioned (and I didn’t have any indexes on the tables):

------------------------------------------------------------------
| Id  | Operation                   | Name      | E-Rows |E-Bytes|
------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |        |       |
|*  1 |  HASH JOIN                  |           |      1 |   114 |
|   2 |   JOIN FILTER CREATE        | :BF0000   |      1 |    64 |
|*  3 |    TABLE ACCESS STORAGE FULL| CUSTOMERS |      1 |    64 |
|   4 |   JOIN FILTER USE           | :BF0000   |   4581K|   218M|
|*  5 |    TABLE ACCESS STORAGE FULL| ORDERS    |   4581K|   218M|
------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("O"."CUSTOMER_ID"="C"."CUSTOMER_ID")
   3 - storage("C"."CUST_EMAIL"='florencio@ivtboge.com')
       filter("C"."CUST_EMAIL"='florencio@ivtboge.com')
   5 - storage(SYS_OP_BLOOM_FILTER(:BF0000,"O"."CUSTOMER_ID"))
       filter(SYS_OP_BLOOM_FILTER(:BF0000,"O"."CUSTOMER_ID"))

The Bloom bitmap (and MIN/MAX value) would be computed in the step #2 and would then be used by the full table scan in the step #5 – having the filtering info pushed all the way to the storage layer (assuming that the smart scan did kick in).

If you look into the Outline hints section of this plan, you’ll see the PX_JOIN_FILTER hint that has showed up there, this instructs the optimizer to set up and use the bloom filters (and despite the name, it now can be used for serial queries):

Outline Data
-------------

  /*+
      BEGIN_OUTLINE_DATA
      IGNORE_OPTIM_EMBEDDED_HINTS
      OPTIMIZER_FEATURES_ENABLE('11.2.0.4')
      DB_VERSION('11.2.0.4')
      ALL_ROWS
      OUTLINE_LEAF(@"SEL$1")
      FULL(@"SEL$1" "C"@"SEL$1")
      FULL(@"SEL$1" "O"@"SEL$1")
      LEADING(@"SEL$1" "C"@"SEL$1" "O"@"SEL$1")
      USE_HASH(@"SEL$1" "O"@"SEL$1")
      PX_JOIN_FILTER(@"SEL$1" "O"@"SEL$1")
      END_OUTLINE_DATA
  */

So if the CBO doesn’t choose it automatically, you can use this hint for testing whether your queries could potentially benefit from this feature (note that it controls the whole join filter propagation logic, not only the Exadata stuff). You can also use the _bloom_serial_filter parameter to disable this behavior on Exadata (not that you should).

Note that I usually treat the storage index IO savings as just an added performance benefit at the system level, something that allows to keep the disk devices a little less busy, therefore yielding higher overall throughput. I do not design my applications’ performance overly dependent on the storage indexes as it’s not possible to easily control which columns end up in the storage index hashtables and that may give you unpredictable results.

That’s all – hopefully this shed some light on this cool combination of multiple different features (Smart Scans + Hash Joins + Bloom Filters + Storage Indexes).

Combining Bloom Filter Offloading and Storage Indexes on Exadata

Tanel Poder

2014-05-18