This post applies both to non-Exadata and Exadata systems.
Before Oracle 11.2 came out, it was true to say that Oracle Parallel Execution slaves always do direct path reads (bypassing buffer cache) when doing full segment scans. This should not be taken simplistically though. Even when you were doing full table scans, then yes the scanning was done with direct path multiblock reads – but if you had to visit other, additional blocks out of the scanning sequence, then these extra IOs were done with regular buffered reads. For example, next row piece fetching of chained rows or or undo block access for CR reads was done with buffered single block reads, or even buffered multiblock reads, if some form of prefetching kicked in.
In addition to that, random table/index accesses like index range scans and the following table block fetches are always done in a buffered way both in serial and parallel execution cases.
Starting from Oracle 11.2 though, Oracle parallel execution slaves can also do the parallel full segment scans via the buffer cache. The feature is called In-Memory Parallel Execution, not to be confused with the Oracle 12c upcoming In-Memory Option (which gives you a columnar, compressed, in-memory cache of your on-disk data).
The 11.2 in-memory parallel execution doesn’t introduce any new data formats, but just allows you to cache your hottest tables across buffer caches of multiple RAC nodes (different nodes hold different “fragments” of a segment) and allow PX slaves to avoid physical disk reads and even work entirely from the local buffer cache of the RAC node the PX slave is running on (data locality!). You should use Oracle 220.127.116.11+ for in-mem PX as it has some improvements and bugfixes for node- and NUMA-affinity for PX slaves.
So, this is a great feature – if disk scanning, data retrieval IO is your bottleneck (and you have plenty of memory). But on Exadata, the storage cells give you awesome disk scanning speeds and data filtering/projection/decompression offloading anyway. If you use buffered reads, then you won’t use Smart Scans – as smart scans need direct path reads as a prerequisite. And if you don’t use smart scans, the Exadata storage cells will just act as block IO servers for you – and even if you have all the data cached in RAM, your DB nodes (compute nodes) would be used for all the filtering and decompression of billions of rows. Also, there’s no storage indexing in memory (well, unless you use zone-maps which do a similar thing at higher level in 12c).
So, long story short, you likely do not want to use In-Memory PX on Exadata – and even on non-Exadata, you probably do not want it to kick in automatically at “random” times without you controlling this. So, this leads to the question of when does the in-memory PX kick in and how it is controlled?
Long story short – the in-memory PX can kick when either of the following options is true:
- When _parallel_cluster_cache_policy = CACHED (this will be set so from default when parallel_degree_policy = AUTO)
- When the segment (or partition) is marked as CACHE or KEEP, for example:
- ALTER TABLE t STORAGE (BUFFER_POOL KEEP)
- ALTER TABLE t CACHE
So, while #1 is relatively well known behavior, the #2 is not. So, the In-Memory PX can kick in even if your parallel_degree_policy = MANUAL!
The default value for _parallel_cluster_cache_policy is ADAPTIVE – and that provides a clue. I read it the way that when set to ADAPTIVE, then the in-memory PX caching decision is “adaptively” done based on whether the segment is marked as CACHE/KEEP in data dictionary or not. My brief tests showed that when marking an object for buffer pool KEEP, then Oracle (18.104.22.168) tries to do the in-memory PX even if the combined KEEP pool size in the RAC cluster was smaller than the segment itself. Therefore, all scans ended up re-reading all blocks in from disk to the buffer cache again (and throwing “earlier” blocks of the segment out). When the KEEP pool was not configured, then the DEFAULT buffer cache was used (even if the object was marked to be in the KEEP pool). So, when marking your objects for keep or cache, make sure you have enough buffer cache allocated for this.
Note that with parallel_degree_policy=AUTO & _parallel_cluster_cache_policy=CACHED, Oracle tries to be more intelligent about this, allowing only up to _parallel_cluster_cache_pct (default value 80) percent of total buffer cache in the RAC cluster to be used for in-memory PX. I haven’t tested this deep enough to say I know how the algorithm works. So far I have preferred the manual CACHE/KEEP approach for the latest/hottest partitions (in a few rare cases that I’ve used it).
So, as long as you don’t mark your tables/indexes/partitions as CACHE or KEEP and use parallel_degree_policy=MANUAL or LIMITED, you should not get the in-mem PX to kick in and all parallel full table scans should get nicely offloaded on Exadata (unless you hit any of the other limitations that block a smart scan from happening).
Update 1: I just wrote Part 2 for this article too.
Update 2: Frits Hoogland pointed me to a bug/MOS Note which may be interesting to you if you use KEEP pool:
KEEP BUFFER POOL Does Not Work for Large Objects on 11g (Doc ID 1081553.1):
- Due to this bug, tables with size >10% of cache size, were being treated as ‘large tables’ for their reads and this resulted in execution of a new SERIAL_DIRECT_READ path in 11g.
- With the bug fix applied, any object in the KEEP buffer pool, whose size is less than DB_KEEP_CACHE_SIZE, is considered as a small or medium sized object. This will cache the read blocks and avoid subsequent direct read for these objects .
NB! If you want to move to the "New World" - and benefit from the awesomeness of Hadoop, without having to re-engineer your existing applications - check out Gluent, my new startup that will make history! ;-)