In the previous article I explained that the main requirement for high-speed in-memory data scanning is column-oriented storage format for in-memory data. SIMD instruction processing is just icing on the cake. Let’s dig deeper. This is a long post, you’ve been warned.
I will cover full test results in the next article in this series. First, let’s look into the test setup, environment and what tools I used for peeking inside CPU hardware.
I was running the tests on a relatively old machine with 2 CPU sockets, with 6-core CPUs in each socket (2s12c24t):
$ egrep "MHz|^model name" /proc/cpuinfo | sort | uniq -c 24 cpu MHz : 2926.171 24 model name : Intel(R) Xeon(R) CPU X5670 @ 2.93GHz
The CPUs support SSE4.2 SIMD extensions (but not the newer AVX stuff):
$ grep ^flags /proc/cpuinfo | egrep "avx|sse|popcnt" | sed 's/ /\n/g' | egrep "avx|sse|popcnt" | sort | uniq popcnt sse sse2 sse4_1 sse4_2 ssse3
Even though the /proc/cpuinfo above shows the CPU clock frequency as 2.93GHz, these CPUs have Intel Turboboost feature that allows some cores run at up to 3.33GHz frequency when not all cores are fully busy and the CPUs aren’t too hot.
Indeed, the turbostat command below shows that the CPU core executing my Oracle process was running at 3.19GHz frequency:
# turbostat -p sleep 1 pk cor CPU %c0 GHz TSC SMI %c1 %c3 %c6 CTMP %pc3 %pc6 6.43 3.02 2.93 0 93.57 0.00 0.00 59 0.00 0.00 0 0 0 4.49 3.19 2.93 0 95.51 0.00 0.00 46 0.00 0.00 0 1 1 10.05 3.19 2.93 0 89.95 0.00 0.00 50 0 2 2 2.48 3.19 2.93 0 97.52 0.00 0.00 45 0 8 3 2.05 3.19 2.93 0 97.95 0.00 0.00 44 0 9 4 0.50 3.20 2.93 0 99.50 0.00 0.00 50 0 10 5 100.00 3.19 2.93 0 0.00 0.00 0.00 59 1 0 6 6.25 2.23 2.93 0 93.75 0.00 0.00 44 0.00 0.00 1 1 7 3.93 2.04 2.93 0 96.07 0.00 0.00 43 1 2 8 0.82 2.15 2.93 0 99.18 0.00 0.00 44 1 8 9 0.41 2.48 2.93 0 99.59 0.00 0.00 41 1 9 10 0.99 2.35 2.93 0 99.01 0.00 0.00 43 1 10 11 0.76 2.36 2.93 0 99.24 0.00 0.00 44
I will come back to this CPU frequency turbo-boosting later when explaining some performance metrics.
I ran the experiments in Oct/Nov 2014, so used a relatively early Oracle 22.214.171.124.1 version with a bundle patch (19189240) for in-memory stuff.
The test was deliberately very simple as I was researching raw in-memory scanning and filtering speed and was not looking into join/aggregation performance. I was running the query below with different hints and parameters to change access path options:
SELECT COUNT(cust_valid) FROM customers_nopart c WHERE cust_id > 0
I used the CUSTOMERS table of Swingbench Sales History schema. I deliberately didn’t use COUNT(*), but COUNT(col) on an actual column “cust_valid” that was nullable, so values in that actual column had to be accessed for correct counting.
Also, I picked the last column in the table as accessing columns in the physical “end” of a row (in row-oriented storage format) would cause more memory/cache accesses and CPU execution branch jumps due to the run-length encoded structure of a row in a datablock. Of course this depends on number of columns and width of the row too, plus hardware characteristics like cache line size (64 bytes on my machine).
Anyway, querying the last column helps to illustrate better what kind of overhead you may be suffering from when filtering that 500-column fact table using columns in the end of it.
SQL> @desc ssh.customers_nopart Name Null? Type ------------------------------- -------- ---------------------------- 1 CUST_ID NOT NULL NUMBER 2 CUST_FIRST_NAME NOT NULL VARCHAR2(20) 3 CUST_LAST_NAME NOT NULL VARCHAR2(40) ... 22 CUST_EFF_TO DATE 23 CUST_VALID VARCHAR2(1)
The table has 69,642,625 rows in it and its segment size is 1613824 blocks / 12608 MB on disk (actual used space in it was slightly lower due to some unused blocks in the segment). I set the table PCTFREE to zero to use all space in the blocks. I also created a HCC-compressed copy of the same table for comparison reasons.
SQL> @seg tanel.customers_nopart SEG_MB OWNER SEGMENT_NAME SEGMENT_TYPE BLOCKS -------- ------ ------------------------- ------------- ---------- 12608 TANEL CUSTOMERS_NOPART TABLE 1613824 6416 TANEL CUSTOMERS_NOPART_HCC_QL TABLE 821248
I made sure that the test tables were completely cached in Oracle buffer cache to eliminate any physical IO component from tests and also enabled in-memory columnar caching for the CUSTOMERS_NOPART table.
SQL> @imseg %.% SEG_MB INMEM_MB %POP IMSEG_OWNER IMSEG_SEGMENT_NAME SEGMENT_TYPE POP_ST ---------- ---------- ----- ------------- ------------------- ------------- ------ 12608 5913 100% TANEL CUSTOMERS_NOPART TABLE COMPLE ---------- ---------- 12608 5913
CPU Activity Measurement Tools
In addition to the usual suspects (Oracle SQL Monitoring reports and Snapper), I used the awesome Linux tool called perf, but not in the typical way you might have used it in past.
On Linux, perf can be used for profiling code executing on CPUs by sampling the instruction pointer and stack backtraces (perf top), but also for taking snapshots of internal CPU performance counters (perf stat). These CPU performance counters (CPC) tell us what happened inside the CPU during my experiments.
This way we can go way deeper than the high level tools like top utility or getrusage() syscall would ever allow us to go. We’ll be able to measure what physically happened inside the CPU. For example: for how many cycles the CPU core was actually doing useful work pushing instructions further in the execution pipeline vs. was stalled, waiting for requested memory to arrive or for some other internal condition to come true. Also, we can estimate the amount of traffic between the CPU and main memory, plus CPU cache hits/misses at multiple cache levels.
Perf can do CPC snapshotting and accounting also at OS process level. This means you can measure the internal CPU/memory activity of a single OS process under examination and that was great for my experiment.
Note that these kinds of tools are nothing new, they’ve been around with CPU vendor code profilers ever since CPUs were instrumented with performance counters (but undocumented in the early days). Perf stat just makes this stuff easily accessible on Linux. For example, since Solaris 8, you could use cputrack for extracting similar process-level CPU counter “usage”, other platforms have their own tools.
I used the following command (-p specifies the target PID) for measuring internal CPU activity when running my queries:
perf stat -e task-clock,cycles,instructions,branches,branch-misses \ -e stalled-cycles-frontend,stalled-cycles-backend \ -e cache-references,cache-misses \ -e LLC-loads,LLC-load-misses,LLC-stores,LLC-store-misses \ -p 92106 sleep 30
In RHEL6 equivalents (and later) you can use perf stat -d option for getting similar detailed output without specifying all the counters separately – but I was on OEL5.8. Also, different CPU versions support different performance counters. Read the manuals and start from simpler stuff.
Below is an example output from one test run – where I ran a full table scan against the last column of a regular row-oriented table (all cached in buffer cache) and took a perf stat snapshot of the entire SQL execution. Note that even though the table was cached in Oracle’s in-memory column store, I had disabled its use with the NO_INMEMORY hint, so this full table scan was done entirely via traditional Oracle buffer cache (no physical IOs!):
Performance counter stats for process id '34783': 27373.819908 task-clock # 0.912 CPUs utilized 86,428,653,040 cycles # 3.157 GHz [33.33%] 32,115,412,877 instructions # 0.37 insns per cycle # 2.39 stalled cycles per insn [40.00%] 7,386,220,210 branches # 269.828 M/sec [39.99%] 22,056,397 branch-misses # 0.30% of all branches [40.00%] 76,697,049,420 stalled-cycles-frontend # 88.74% frontend cycles idle [40.00%] 58,627,393,395 stalled-cycles-backend # 67.83% backend cycles idle [40.00%] 256,440,384 cache-references # 9.368 M/sec [26.67%] 222,036,981 cache-misses # 86.584 % of all cache refs [26.66%] 234,361,189 LLC-loads # 8.562 M/sec [26.66%] 218,570,294 LLC-load-misses # 93.26% of all LL-cache hits [ 6.67%] 18,493,582 LLC-stores # 0.676 M/sec [ 6.67%] 3,233,231 LLC-store-misses # 0.118 M/sec [ 6.67%] 7,324,946,042 L1-dcache-loads # 267.589 M/sec [13.33%] 305,276,341 L1-dcache-load-misses # 4.17% of all L1-dcache hits [20.00%] 36,890,302 L1-dcache-prefetches # 1.348 M/sec [26.66%] 30.000601214 seconds time elapsed
I ran perf for 30 seconds for the above experiment, kicked it running just before executing the Oracle SQL and it finished right after the SQL had completed.
Let’s go through some of the above metrics – top down. I’m explaining these metrics at a fairly high level and in the context of my experiment – fully measuring a single SQL execution in a single Oracle process:
Basic CPU Performance Counter Reference
- task-clock (~27373 milliseconds)
– This is a software event and shows how much the target Linux task (my Oracle process) spent running on CPU during the SQL execution, as far as the OS scheduler knows (roughly 27 seconds on CPU).
– So while perf took a 30 second snapshot of my process, my test SQL ran in a couple of seconds shorter time (so didn’t run on CPU all 30 seconds). That should explain the “0.912 CPU utilized” derived metric above.
- cycles – (86B cycles)
– This hardware metric shows how many CPU cycles did my process (running a SQL statement) consume during perf runtime.
– dividing 86B CPU cycles with 27 CPU seconds gives that the CPU core must have operated at around 3.1 GHz speed (on average) during my SQL run.
– Remember, earlier in this article I used turbostat to show how these 2.93 GHz CPU cores happened to be running at 3.19 GHz frequency thanks to turbo-boost!
- instructions – (32B instructions)
– This hardware metric shows how many instructions the CPU managed to successfully execute (and retire). This is where things get interesting:
– It’s worth mentioning that modern CPUs are superscalar and pipelined. They have multiple internal execution units, can have multiple instructions (decoded to µops) executed in its pipeline, memory loads & stores happening concurrently and possibly out-of-order – instruction level parallelism, data level parallelism.
– When dividing 32B executed instructions with 86B CPU cycles we see that we managed to execute only ~0.37 instructions per CPU cycle (IPC) on average!
– When inverting this number we get 86B/32B=2.69 Cycles Per Instruction (CPI). So, on average, every CPU instruction took ~2.69 CPU cycles to execute! We’ll get to the “why” part later.
- branches – (7.3B branches)
– This hardware metric shows how many branches the execututed code took
– A branch is basically a jump (unconditional JMP instruction or a conditional jump like JZ, JNZ and many more – this is how basic IF/THEN/ELSE, CASE and various LOOP statmenents work at CPU level).
– The more decision-points in your code, the more branches it takes.
– Branches are like speedbumps in a CPU execution pipeline, obstructing the execution flow and prefetching due to the uncertainty of which branch will be taken.
– That’s why features like branch prediction with speculative execution are built into modern CPUs to alleviate this problem.
– Knowing that we scanned through roughly 70M rows in this table, this is over 100 branches taken per row scanned (and counted)!
– Oracle’s traditional block format rows are stored in run-length encoded format, where you know where the following column starts only after reading (and testing) the previous column’s length byte(s). The more columns you need to traverse, the more branches you’ll take per row scanned.
- branch-misses – (22M, 0.3% of all branches)
– This hardware metric shows how many times the CPU branch predictor (described above) mispredicted which branch would be taken and causing a pipeline stall.
– Correctly predicting branches (where will the code execution jump next?) is good, as this allows to speculatively execute upcoming instructions and prefetch data required by them.
– However, the branch predictor doesn’t always predict the future correctly, despite various advancements in modern branch prediction, like branch history tables and branch target buffers etc.
– In a branch misprediction case, the mispredicted branch state has to be discarded, pipeline flushed and the correct branch’es instructions will be fetched & put into the start of execution pipeline (in short: mispredictions waste CPU cycles).
- stalled-cycles-frontend – (~76.6M cycles, 88.7% of all cycles)
– This hardware metric shows for how many cycles the front-end of the CPU were spent stalled, not producing new µops into pipeline for backend execution.
– The front-end of an Intel pipelined CPU is basically the unit that fetches the good old x86/x86_64 instructions from L1 instruction cache (or RAM if needed), decodes these to RISC-like µops (newer CPUs also cache those µops) and puts these into backend instruction queue for execution.
– The front-end also deals with remembering taken brances and branch prediction (decoding and sending the predicted branch’es instructions into the backend).
– The frontend can stall due to various reasons, like instruction cache misses (waiting for memory lines containing instructions to arrive from RAM or a lower level cache), branch mispredictions or simply because the backend can not accept more instructions into its pipeline due to some bottlenecks there.
- stalled-cycles-backend – (~58.6M cycles, 67.8% of all cycles)
– This hardware metric shows how many cycles in the back-end of the CPU were spent stalled instead of advancing the pipeline.
– The backend of the CPU is where actual computation on data happens – any computation referencing main memory (not only registers) will have to wait until the referenced memory locations have arrived in CPU L1 cache
– A common reason for backend stalls is due to waiting for a cache line to arrive from RAM (or a lower level cache), although there are many other reasons.
– To reduce memory-access related stalls, the program should be optimized to do less memory accesses or switch to more compact data structures to avoid loading data it doesn’t need.
– Simpler, more predictable data structures also help as the CPU hardware prefetcher may detect an “array scan” and start prefetching required memory lines in advance.
– In the context of this blog series – sequentially scanning and filtering a column of a table’s data is good for reducing memory access related CPU stalls. Walking through random pointers of linked lists (cache buffers chains) and skipping through row pointers in blocks, plus many columns’ length bytes before getting to your single column of interest causes memory access related stalls.
- cache-references – (256M references)
– Now we get into a series of CPU cache traffic related metrics, some of these overlap
– This metric shows how many Last Level Cache accesses (both read and write) were done.
– The memory location that CPU tried to access was not in a higher level (L1/L2) cache, thus the lowest cache, Last Level Cache, was checked.
– Last Level Cache, also called LLC or Lower Level Cache or Longest Latency Cache is usually L3 cache on modern CPUs (although there are some hints that some perf versions still report L2 cache as LLC). I need to read some more perf source code to figure this out, but for this experiment’s purposes it doesn’t matter much if it’s L2 or L3. If I scan through a multi-GB table, it won’t fit into either level cache anyway.
- cache-misses – (222M misses)
– This metric shows how many times the cache reference could not be satisified by the Last Level Cache and therefore RAM access was needed.
- LLC-loads – (234M loads)
– The following 4 metrics just break down the above two in more detail.
– This metric shows how many times a cache line was requested from LLC as it wasn’t available (or valid) in a higher level cache.
- LLC-load-misses – (218M misses)
– This metric shows how many LLC loads could not be satisfied from the LLC and therefore RAM access was needed.
- LLC-stores – (18M stores)
– This metric shows how many times a cache line was written into a LLC.
- LLC-store-misses – (3M misses)
– This metric shows how many times we had to first read the cache line into LLC before completing the LLC write.
– This may happen due to partial writes (for example: cache line size is 64 bytes and not currently present in LLC and the CPU tries to write into first 8 bytes of the line).
– This metric may get incremented due to other cache coherency related reasons where the store fails as other CPU(s) currently own the memory line and have locked and modified it since it was loaded into current CPU cache.
- L1-dcache-loads – (7300M loads)
– The following 3 metrics are similar as above, but for the small (but fast) L1 cache.
– This metric shows how many times the CPU attempted to load a cache line from L1-data cache into a register.
– The dcache in the metric name means data accesses from memory (icache means instruction cache – memory lines fetched from L1I cache with instructions for execution).
– Note how the L1D cache loads metric is way higher than LLC-loads (7300M vs 234M) as many of the repeated tight loops over small internal memory structures can be satisfied from L1 cache.
- L1-dcache-load-misses – (305M misses)
– This metric shows how many data cache loads from L1D cache couldn’t be satisfied from that cache and therefore a next (lower) level cache was needed.
– If you are wondering how come the L1D cache load misses is much larger than the LLC-loads (305M vs 234M – shouldn’t they be equal), one explanation is that as there’s a L2 cache between L1 & L3, some of the memory accesses got satisfied in L2 cache (and some more explanations illustrating the complexity of CPU cache metrics are here).
- L1-dcache-prefetches – (37M prefetches)
– This metric shows how many cache lines the CPU prefetched as the L1D cache prefetch (DCU prefetcher).
– Usually this simple prefetcher just fetches the next cache line to the “currently” accessed one.
– It would be interesting to know if this prefetcher is smart enough to prefetch previous cache lines as regular row-formatted Oracle datablocks are filled from bottom up (this does not apply to the column-oriented stuff).
– If the full table scan code walks the block’s row directory so that it jumps to the bottom of the block first and works its way upwards, this means that some memory accesses will look like scanning backwards – and may affect prefetching.
I hope that this is a useful reference when measuring what’s going on inside a CPU. This is actually pretty basic stuff in the modern CPU world, there’s much more that you can measure in CPUs (via raw performance counters for example) and also different tools that you can use, like Intel VTune. It’s not trivial though, as at such low level even different CPU models by the same vendor may have different meaning (and numbering & flags) for their performance counters.
I won’t pretend to be a CPU & cache coherency expert, however these basic metrics and my understanding looks to be correct enough for comparing different Oracle access paths and storage formats (more about this in next parts of the series).
One bit of warning: It’s the best to run these experiments on a bare-metal server, not in a virtual machine. This is a low-level measurement exercise and in a VM you could suffer from all kinds of additional noise, plus some of the hardware counters would not be available for perf. Some hypervisors do not allow the guest OS to access hardware performance counters by default. One interesting article (by Frits Hoogland) about running perf in VMs is here.
Ok, enough writing for today! I actually started this post more than a month ago and it got way longer than planned. In the next part of this series I will interpret this post’s full-table scan SQL CPU metrics using the above reference (and explain where the bottleneck/inefficiency is). And in Part 4 I’ll show you all the metrics from a series of experiments – testing memory access efficiency of different Oracle data access paths (indexes vs full table scan vs HCC vs in-memory column store).
Update: I have corrected a couple of typos, we had 86B cpu cycles and 32B instructions instead of 86M/32M as I had mistakenly typed in before. The instructions-per-cycle ratio calculation and the point remains the same though. Thanks to Mark Farnham for letting me know.
NB! If you want to move to the "New World" - offload your data and workloads to Hadoop, without having to re-write your existing applications - check out Gluent. We are making history! ;-)