Oracle In-Memory Column Store Internals – Part 1 – Which SIMD extensions are getting used?

This is the first entry in a series of random articles about some useful internals-to-know of the awesome Oracle Database In-Memory column store. I intend to write about Oracle’s IM stuff that’s not already covered somewhere else and also about some general CPU topics (that are well covered elsewhere, but not always so well known in the Oracle DBA/developer world).

Before going into further details, you might want to review the Part 0 of this series and also our recent Oracle Database In-Memory Option in Action presentation with some examples. And then read this doc by Intel if you want more info on how the SIMD registers and instructions get used.

There’s a lot of talk about the use of your CPUs’ SIMD vector processing capabilities in the Oracle inmemory module, let’s start by checking if it’s enabled in your database at all. We’ll look into Linux/Intel examples here.

The first generation of SIMD extensions in Intel Pentium world was called MMX. It added 8 new XMMn registers, 64 bits each. Over time the registers got widened, more registers and new features were added. The extensions were called Streaming SIMD Extensions (SSE, SSE2, SSSE3, SSE4.1, SSE4.2) and Advanced Vector Extensions (AVX and AVX2).

The currently available AVX2 extensions provide 16 x 256 bit YMMn registers and the AVX-512 in upcoming King’s Landing microarchitecture (year 2015) will provide 32 x 512 bit ZMMn registers for vector processing.

So how to check which extensions does your CPU support? On Linux, the “flags” column in /proc/cpuinfo easily provides this info.

Let’s check the Exadatas in our research lab:

Exadata V2:

$ grep "^model name" /proc/cpuinfo | sort | uniq
model name	: Intel(R) Xeon(R) CPU           E5540  @ 2.53GHz

$ grep ^flags /proc/cpuinfo | egrep "avx|sse|popcnt" | sed 's/ /\n/g' | egrep "avx|sse|popcnt" | sort | uniq
popcnt
sse
sse2
sse4_1
sse4_2
ssse3

So the highest SIMD extension support on this Exadata V2 is SSE4.2 (No AVX!)

Exadata X2:

$ grep "^model name" /proc/cpuinfo | sort | uniq
model name	: Intel(R) Xeon(R) CPU           X5670  @ 2.93GHz

$ grep ^flags /proc/cpuinfo | egrep "avx|sse|popcnt" | sed 's/ /\n/g' | egrep "avx|sse|popcnt" | sort | uniq
popcnt
sse
sse2
sse4_1
sse4_2
ssse3

Exadata X2 also has SSE4.2 but no AVX.

Exadata X3:

$ grep "^model name" /proc/cpuinfo | sort | uniq
model name	: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz

$ grep ^flags /proc/cpuinfo | egrep "avx|sse|popcnt" | sed 's/ /\n/g' | egrep "avx|sse|popcnt" | sort | uniq
avx
popcnt
sse
sse2
sse4_1
sse4_2
ssse3

The Exadata X3 supports the newer AVX too.

My laptop (Macbook Pro late 2013):
The Exadata X4 has not yet arrived to our lab, so I’m using my laptop as an example of a latest available CPU with AVX2:

Update: Jason Arneil commented that the X4 does not have AVX2 capable CPUs (but the X5 will)

$ grep "^model name" /proc/cpuinfo | sort | uniq
model name	: Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz

$ grep ^flags /proc/cpuinfo | egrep "avx|sse|popcnt" | sed 's/ /\n/g' | egrep "avx|sse|popcnt" | sort | uniq
avx
avx2
popcnt
sse
sse2
sse4_1
sse4_2
ssse3

The Core-i7 generation supports everything up to the current AVX2 extension set.

So, which extensions is Oracle actually using? Let’s check!

As Oracle needs to run different binary code on CPUs with different capabilities, some of the In-Memory Data (kdm) layer code has been duplicated into separate external libraries – and then gets dynamically loaded into Oracle executable address space as needed. You can run pmap on one of your Oracle server processes and grep for libshpk:

$ pmap 21401 | grep libshpk
00007f0368594000   1604K r-x--  /u01/app/oracle/product/12.1.0.2/dbhome_1/lib/libshpksse4212.so
00007f0368725000   2044K -----  /u01/app/oracle/product/12.1.0.2/dbhome_1/lib/libshpksse4212.so
00007f0368924000     72K rw---  /u01/app/oracle/product/12.1.0.2/dbhome_1/lib/libshpksse4212.so

My (educated) guess is that the “shpk” in libshpk above stands for oS dependent High Performance [K]ompression. “s” prefix normally means platform dependent (OSD) code and this low-level SIMD code sure is platform and CPU microarchitecture version dependent stuff.

Anyway, the above output from an Exadata X2 shows that SSE4.2 SIMD HPK libraries are used on this platform (and indeed, X2 CPUs do support SSE4.2, but not AVX).

Let’s list similar files from $ORACLE_HOME/lib:

$ cd $ORACLE_HOME/lib
$ ls -l libshpk*.so
-rw-r--r-- 1 oracle oinstall 1818445 Jul  7 04:16 libshpkavx12.so
-rw-r--r-- 1 oracle oinstall    8813 Jul  7 04:16 libshpkavx212.so
-rw-r--r-- 1 oracle oinstall 1863576 Jul  7 04:16 libshpksse4212.so

So, there are libraries for AVX and AVX2 in the lib directory too (the “12” suffix for all file names just means Oracle version 12). The AVX2 library is almost empty though (and the nm/objdump commands don’t show any Oracle functions in it, unlike in the other files).

Let’s run pmap on a process in my new laptop (which supports AVX and AVX2 ) to see if the AVX2 library gets used:

$ pmap 18969 | grep libshpk     
00007f85741b1000   1560K r-x-- libshpkavx12.so
00007f8574337000   2044K ----- libshpkavx12.so
00007f8574536000     72K rw--- libshpkavx12.so

Despite my new laptop supporting AVX2, only the AVX library is used (the AVX2 library is named libshpkavx212.so). So it looks like the AVX2 extensions are not used yet in this version (it’s the first Oracle 12.1.0.2 GA release without any patches). I’m sure this will be added soon, along with more features and bugfixes.

To be continued …

NB! I am running one more Advanced Oracle Troubleshooting training in 2018! You can attend the live online training and can download personal video recordings too. The Part 1 starts on 29th January 2018 - sign up here!

This entry was posted in InMemory, Oracle, Oracle 12c. Bookmark the permalink.

22 Responses to Oracle In-Memory Column Store Internals – Part 1 – Which SIMD extensions are getting used?

  1. jason arneil says:

    Having the X4 in the lab wouldn’t have helped with finding a CPU with AVX2 – the X4 doesn’t have that. But the new Intel E5 v3 chips that will be in the X5 do. i.e.:

    http://ark.intel.com/products/81061/Intel-Xeon-Processor-E5-2699-v3-45M-Cache-2_30-GHz

    • Tanel Poder says:

      Cool, thanks for letting me know, I assumed that X4 would have had AVX2, but didn’t check the CPU specs. Now I guess I’ll wait for the AVX-512 release and get a new laptop then :-)

  2. Yury Pudovchenko says:

    Hello, Tanel,
    Very interesting !
    Thanks for sharing your knowleges !

    My question: in Exadata is a 7 years old gcc compiler :( .
    It is obvious that such old gcc doesn’t know about latest Intel’s CPU and their features like AVX, SSE4 …
    How do you think, if upgrade gcc* packages to latest version on DB nodes – will it be possible to get a more performance ?

    Thank you !

    • Tanel Poder says:

      Oracle code on Linux/Intel platform is compiled by Intel’s C compiler as far as I know, so the GCC version doesn’t matter to Oracle DB performance. The libraries are shipped in binary form.

      • Tanel Poder says:

        Of course I’m talking about Oracle RDBMS kernel performance here, not any other application or OS library that could be recompiled in theory (in practice I would stick to the standard supported OS packages)

  3. Yury Pudovchenko says:

    Exadata X4-2:

    # grep “^model name” /proc/cpuinfo | sort | uniq
    model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz

    # grep ^flags /proc/cpuinfo | egrep “avx|sse|popcnt” | sed ‘s/ /\n/g’ | egrep “avx|sse|popcnt” | sort | uniq
    avx
    popcnt
    sse
    sse2
    sse4_1
    sse4_2
    ssse3

  4. Frank says:

    Hi tanel.
    This are your commands and the answers from the db node of “my” X4-2:

    $ grep “^model name” /proc/cpuinfo | sort | uniq
    model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz

    $ grep ^flags /proc/cpuinfo | egrep “avx|sse|popcnt” | sed ‘s/ /\n/g’ | egrep “avx|sse|popcnt” | sort | uniq
    avx
    popcnt
    sse
    sse2
    sse4_1
    sse4_2
    ssse3

  5. Scott says:

    What SIMD instructions are supported is also O/S dependent. Linux AVX support wasn’t added until 2009 but RHEL/OL 5 was shipped in 2007. Wonder if AXV support was ever backported to 2.6.18 kernel? If there Exadata X3 booting with the 2.6.18 kernel they could be missing out on AVX.

    • Tanel Poder says:

      Indeed, but doesn’t the /proc/cpuinfo just list the bits from CPUID as they are? (regardless of the OS support)

      Yep, whether the OS supports storing/loading the extended registers is a whole another story… not sure if there’s any more related stuff that the OS needs to be aware of?

    • Tanel Poder says:

      (Btw, I think all our Exadatas are running some UEK version…)

  6. Scott says:

    Hopefully all Exas are running UEK but wouldn’t be surprised to find a X2-2 that hasn’t upgraded in a long time.

    Not sure about /proc/meminfo, would be a good test to take a AVX CPU and boot RHEL 5 with the 2.6.18 kernel and see if 12c AVX library is loaded.

    From this linux kernel git commit it appears O/S support is needed for AVX. http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=a30469e7921a6dd2067e9e836d7787cfa0105627

    Windows didn’t support AVX until Server 2008 R2 http://msdn.microsoft.com/en-us/library/ff545910.aspx.

    My guess other OSes did require updates as well.

  7. Jakub Wartak says:

    Tanel,

    any plans to do develop ./smallTestWithSIMD_columnar_rows.c , ./smallTestWithoutSIMD_classic_memory_rows.c and ./smallTestWithoutSIMD_classic_memory_rows.c to really benchmark how finding value in memory would be different in those ? I’ve done my last assembly long time ago (and on POWER, sigh! does recent Linux’s objdump -c really shows SSE instructions in those libs?) but SSE XOR variant might looks like the right thing for the job. Pardon me being in the casual DBA corner – meaning I’m always skeptical ;) – but I’m still not “buying” this for Oracle’s price per core should I add ;)

    Why?

    Based on this http://stackoverflow.com/questions/17791892/c-use-sse-instructions-for-comparing-huge-vectors-of-ints – I can assume (or not?) that just by switching to SSE instructions on *columnar* memory when doing FTS to raw gcc/Intel C compilers would give me at most ~10% performance more. I think that we gain much more just by just storing data in columnar way than by storing them in classical row level… but then the thing that mostly would help the performance would be the CPU cache L[1-4] sequential data prefetch algorithms, right? ;)

    The whole thing would be interesting posts of low level journey (gcc with various options/objdump confirming the asm instructions/checks for CPU performance counters) and it should be possible to measure “under the hood” what really CPU is doing… plus maybe dumps of how how the DB block itself looks like in this columnar mode ;)

    -J, always skeptical! ;)

    • Tanel Poder says:

      Hi Jakub!

      Yep, agreed, it’s first about the columnar memory structures that greatly reduce the memory traffic. That’s why I said “less memory traffic” and “SIMD would be useless if you waited on main memory all the time” on the Secret Sauce slide of my recent OOW presentation (slide 20):

      http://www.slideshare.net/tanelp/oracle-database-inmemory-option-in-action/20

      But once you do have less memory traffic when filtering rows, you’ll be stuck waiting for RAM less, so vector operations start to pay off – you’ll have less loops, less CPU branching (less memory stalls, less pipeline stalls thanks to less complexity in the tight loops). A lot depends on how the memory structures have been set up (does the CPU have to perform pointer chasing for every single value or is it sequential access etc). Also a lot depends on the opcodes used (and whether all registers get used too – for better instruction level parallelism) – so if using an optimizing C compiler, you’d probably want to use the latest and greatest one. As far s I know, Oracle uses Intel’s compiler on Linux, which can probably uses the latest SIMD logic better.

      Yep I plan to write an article and some tests about this stuff – when I have time :-)

  8. Mitch Loren says:

    I know IMCU are stored inside the SGA. Will IMCU be placed in Exadata Smart Flash Cache, in main memory inside the compute nodes, or in memory inside storage cells, or a combination. If combination, in what order are these buffers utilized.

    Thanks

    • Tanel Poder says:

      IMCUs are just temporary structures created and stored in SGA memory only. They are created by scanning source data in DB layer and converting to the IMCU format. The source data can be regular uncompressed blocks, 9iR2 dictionary-compressed blocks or HCC blocks, doesn’t really matter, as long as the source table is a regular heap table (i.e. not an external table or IOT).

      So the IMCU format and creation is decoupled from the physical on-disk source format somewhat. IMCUs, being temporary database memory-layer structures, don’t get cached anywhere in storage layer (disk, flash or storage cell memory). There’s something called in-memory fast-start that ought to allow writing compressed IMCUs to disk for faster startup after an instance failure (no-need for CPU-intensive conversion work, just load the computed IMCUs back to memory from disk). But this feature is not enabled (probably even not completely built out) in the current 12.1.0.2 DB version.

      • Mitch Loren says:

        Tanel-thanks for the quick response. To get into this further. Are the in-memory data structures held in the SGA ever stored in the Exadata Smart Flash Cache or are they always maintained in main memory in the compute nodes. Thanks again for your insight.

        • Tanel Poder says:

          In current version they’re always main memory only. In memory data scanning only happens in main memory.

          In a future Oracle version (judging from some hidden parameters in a future version the IMCUs may be persisted on disk for faster startup time in case of a crash, but they won’t be scanned from there … it wouldn’t be much of “in-memory” in this case).

  9. Fairlie Rego says:

    Values from an X5-2 storage node. Will get my hands on a compute by the end of this month

    [root@/usr/local/bin/imageinfo ~]# /usr/sbin/dmidecode -s system-product-name
    ORACLE SERVER X5-2L

    [root@/usr/local/bin/imageinfo ~]# grep ^flags /proc/cpuinfo | egrep “avx|sse|popcnt” | sed ‘s/ /\n/g’ | egrep “avx|sse|popcnt” | sort | uniq
    avx
    popcnt
    sse
    sse2
    sse4_1
    sse4_2
    ssse3

Leave a Reply

Your email address will not be published. Required fields are marked *