Oracle Exadata Performance series – Part 1: Should I use Hugepages on Linux Database Nodes?

There was a question in a LinkedIn forum about whether Linux hugepages should be used in the Oracle Exadata database layer, as they aren’t enabled by default during the ACS install. I’m putting my answer into this blog entry – apparently LinkedIn forums have a limit of 4000 characters per reply… (an interestingly familiar number, by the way… :)

So, I thought it’s time to start writing the Oracle Exadata Performance series of articles that I’ve been planning for a while… with some war stories from the field, some issues I’ve overcome while researching for the Expert Oracle Exadata book, etc.

I’ve previously published an article about Troubleshooting Exadata Smart Scan performance and some slides from my experience with VLDB Data Warehouse migrations to Exadata.

Here’s the first article (initially planned as a short response on LinkedIn, but it turned out much longer):

As far as I’ve heard, the initial decision not to enable hugepages by default was made because hugepages aren’t flexible and dynamic enough – you always have to configure the hugepages at the OS level to match your desired SGA size (to avoid wastage). Different shops may want radically different SGA sizes: a larger SGA for single-block-read-oriented databases (like transactional/OLTP or OLAP cubes), but a smaller SGA for smart scan/parallel scan oriented DWs. If you configure 40 GB of hugepages on a node but only use 1 GB of SGA, then 39 GB of memory is just reserved, not used – wasted – as hugepages are pre-allocated. AMM, using regular pages, will only use the pages that it touches, so there’s no memory wastage due to pre-allocation.

So, Oracle chose an approach which is more universal and doesn’t require extra OS-level configuration (which isn’t hard at all if you pay attention, but not all people do). As a result, fewer people will end up in trouble with their first deployments, although they might not be getting the most out of their hardware.

However, before enabling hugepages “because it makes things faster”, you should ask yourself what exact benefit they would bring you.

There are three main reasons why hugepages may be useful on Linux:

1) Smaller kernel memory usage, as larger page sizes mean fewer PTEs

This means fewer page table entries (PTEs) and less kernel memory usage. The bigger your SGA and the more processes you have logged on, the bigger the page table memory usage.

You can measure this in your case – just run “grep Page /proc/meminfo” and see how big a portion of your RAM has been used by “PageTables”. Many people have blogged about this, but Kevin Closson’s blog is probably the best source to read about it.
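For example, a quick check could look like this (the output below is purely illustrative – your numbers will differ):

    $ grep PageTables /proc/meminfo
    PageTables:    4312520 kB

In this illustrative case, roughly 4 GB of RAM would be consumed by page table entries alone.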

2) Lower CPU usage thanks to fewer TLB misses and less soft page fault processing when accessing the SGA.

It’s harder to measure this on Linux with standard tools, although it is certainly possible (on Solaris you can just run prstat -m to get microstate accounting and look at the TFL, DFL and TRP stats).

Anyway, the catch here is that if you are running parallel scans and smart scans, you don’t access that much of the buffer cache in the SGA at all – all IOs or smart scan result sets are read directly into the PGAs of server processes, which don’t use large pages at all, regardless of whether hugepages for the SGA have been configured or not. There are some special cases – for example, when a block clone has to be rolled back for read consistency, you’ll have to access some undo blocks via the buffer cache – but again, this should be a small part of the total workload.

So, in a DW which mostly uses smart scans or direct path reads, there won’t be much of a CPU efficiency win from large pages, as you bypass the buffer cache anyway and use small pages of private process memory. All the sorting, hashing etc. happens using small pages anyway. Again I have to mention that on (my favorite OS) Solaris it is possible to configure even PGAs to use large pages (via the _realfree_heap_pagesize_hint parameter)… so it’ll be interesting to see how this would help DW workloads on the Exadata X2-8 monsters, which can run Solaris 11.
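On Solaris that could look something like the following init.ora line – a hypothetical sketch with an example value, and note that this is an undocumented underscore parameter, so test it and check with Oracle Support before relying on it:

    # hint the realfree PGA heap allocator to use 4 MB pages (value in bytes)
    *._realfree_heap_pagesize_hint = 4194304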

3) SGA pages are locked into RAM, so they won’t be paged out when a memory shortage happens (for whatever reason).

Hugepages are pre-allocated and never paged out. So, when you have an extreme memory shortage, your SGAs won’t be paged out “by accident”. Of course, it’s better to ensure that such memory shortages won’t happen in the first place – configure the SGA/PGA_AGGREGATE_TARGET sizes properly and don’t allow third-party programs to consume crazy amounts of memory. There is also the LOCK_SGA parameter in Oracle, which should allow you to do this on Linux with small pages too, but I have never used it on Linux, so I don’t know whether it works reliably – and with 11g AMM the mlock() calls perhaps aren’t supported on the /dev/shm files at all (I haven’t checked, and don’t care – it’s better to stay away from extreme memory shortages). Read more about how the AMM MEMORY_TARGET (/dev/shm) works in my article written back in 2007 when 11g came out ( Oracle 11g internals – Automatic Memory Management ).

So, the only realistic win (for a DW workload) would be the reduction of the kernel page table structure size – and you can measure this using the PageTables statistic in /proc/meminfo. Kevin demonstrated in his article that 500 connections to an instance with a ~8 GB SGA consisting of small pages resulted in 7 GB of kernel page table usage, while the usage with large pages (still 500 connections, 8 GB SGA) was about 265 MB. So you could win over 6 GB of RAM, which you could then give to PGA_AGGREGATE_TARGET or use to further increase the SGA. The more processes you have connected to Oracle, the more page table space is used… Similarly, the bigger the SGA is, the more page table space is used…
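A quick back-of-the-envelope check (assuming 4 KB base pages and 8 bytes per PTE) shows why the numbers come out like that:

    8 GB SGA / 4 KB pages  = ~2,097,152 PTEs to map the SGA
    2,097,152 PTEs x 8 B   = ~16 MB of page tables per process
    16 MB x 500 processes  = ~8 GB of kernel page tables

Each process maps the shared SGA with its own private page tables, so the cost multiplies with the connection count – the same ballpark as Kevin’s 7 GB measurement. With 2 MB hugepages the same SGA needs only about 4,096 entries per mapping.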

This is great, but the tradeoff here is manageability and the extra effort you have to put in to always check whether the large pages actually got used or not. After starting up your instance, you should really check whether HugePages_Free in /proc/meminfo shrank and HugePages_Rsvd increased (when the instance has just started up and Oracle hasn’t touched all the SGA pages yet, some pages will show up as Rsvd – reserved).
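A minimal post-startup sanity check could look like this (the numbers are made up for illustration):

    $ grep HugePages /proc/meminfo
    HugePages_Total:   20480
    HugePages_Free:     2048
    HugePages_Rsvd:     1024

If HugePages_Free is still equal to HugePages_Total after the instance is up, the SGA has silently fallen back to small pages.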

With a single instance per node this is trivial – you know how much SGA you want and you pre-allocate the corresponding amount of hugepages. If you want to increase the SGA, you’ll have to shut down the instance and increase the Linux hugepages setting too. This can be done dynamically by issuing a command like echo N > /proc/sys/vm/nr_hugepages (where N is the number of huge pages), BUT in real life this may not work out well: if the Linux kernel can’t free enough small pages from the right physical RAM locations to consolidate 2 or 4 MB contiguous pages, the above command may fail to create the requested number of new hugepages.

And then you would have to restart the whole node to make the change. Note that if you increase your SGA beyond what the configured number of hugepages can accommodate (or you forget to increase the memlock setting in /etc/security/limits.conf accordingly), then your instance will silently just use small pages, while all the memory pre-allocated for hugepages stays reserved for hugepages and is not usable for anything else!
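As a rough sketch, a persistent configuration for, say, a 16 GB SGA with 2 MB hugepages might look like this (the sizes here are just example values, not a recommendation):

    # /etc/sysctl.conf: 16 GB / 2 MB = 8192 hugepages
    vm.nr_hugepages = 8192

    # /etc/security/limits.conf: let the oracle user lock 16 GB (value in kB)
    oracle soft memlock 16777216
    oracle hard memlock 16777216

    # apply without a reboot (may fail if physical memory is too fragmented)
    sysctl -p
    grep HugePages_Total /proc/meminfo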

So, this may become more of a problem when you have multiple database instances per cluster node, or when you expect to start up and shut down instances on different nodes based on demand (or when some cluster nodes fail).

Long story short – I do configure hugepages in “static” production environments to save kernel memory (and some CPU time in OLTP-type environments that use the buffer cache heavily), also on Exadata. However, for various test and development environments with lots of instances per server and constant action, I don’t bother myself (or the client) with hugepages and make everyone’s life easier… Small instances with a small number of connections won’t use that many PTEs anyway…

For production environments with multiple database instances per node (and where failovers are expected), I would take the extra effort to ensure that whatever hugepages I have pre-allocated won’t get silently wasted because an instance wants more SGA than the available hugepages can accommodate. You can do this by monitoring the HugePages entries in /proc/meminfo as explained above. And remember, the ASM instance (which is started before the DB instances) will also grab itself some hugepages when it starts!
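A simple check along these lines – just a sketch, adjust the threshold and hook it into your own monitoring – could catch the silently-wasted case once all instances are supposed to be up:

    #!/bin/bash
    # Sketch: warn if a large share of pre-allocated hugepages is unused
    # after all instances (ASM + databases) should be up.
    total=$(awk '/^HugePages_Total/ {print $2}' /proc/meminfo)
    free=$(awk '/^HugePages_Free/   {print $2}' /proc/meminfo)
    rsvd=$(awk '/^HugePages_Rsvd/   {print $2}' /proc/meminfo)
    unused=$((free - rsvd))   # pages neither used nor reserved by a running SGA
    if [ "$total" -gt 0 ] && [ "$unused" -gt $((total / 2)) ]; then
        echo "WARNING: $unused of $total hugepages unused - did an instance fall back to small pages?"
    fi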

Note that this year’s only Advanced Oracle Troubleshooting class takes place at the end of April/May 2014, so sign up now if you plan to attend this year!

15 Responses to Oracle Exadata Performance series – Part 1: Should I use Hugepages on Linux Database Nodes?

  1. Scott says:

    Thanks for the post. With the release of Linux kernel 2.6.38, what are your thoughts on transparent huge pages making it easier to use hugepages with many instances on the same server?

    http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/vm/transhuge.txt;hb=HEAD

  2. Tanel Poder says:

    @Scott
    Thanks Scott – I wasn’t even aware that they had put this into the Linux mainline kernel… Sounds interesting.

    Here are my thoughts:

    1) First, Oracle would need to run on kernel 2.6.38 … Redhat 5.x runs on 2.6.18 + lots of patches… Redhat 6 has major code pieces from up to 2.6.34 + patches… Oracle’s unbreakable linux kernel is 2.6.32 + oracle patches… No 2.6.38 in sight…

    2) The doc you linked says this:

    “Currently it only works for anonymous memory mappings but in the future it can expand over the pagecache layer starting with tmpfs.”

    So, it only works for anonymous (private) memory. The SGA issues remain the same so far. But you could use hugepages for PGAs, just like on Solaris. This can be useful in DWs where each process may allocate tens to hundreds of MB for work areas. But Oracle on Linux would have to allocate/map a larger amount of private memory at a time… I’m not sure whether hugepages would be allocated when Oracle extends the top-level PGA heap by 64 kB at a time… Setting _realfree_heap_pagesize_hint to 2 or 4 MB (depending on the hugepage size) may help here.

    So, right now this feature is not useful for us… but perhaps when/if Oracle comes out with a new version of their unbreakable linux kernel, this could be the fastest way for us to run Oracle in a supported configuration and take advantage of this feature.

    This article does a pretty good job of explaining the CPU performance benefits you get from hugepages…

    • Kevin says:

      Sorry to comment on an old post, but I’d like to point out that the realfree heap is mmap()-based, while transparent hugepages are only for anon pages (malloc). To get hugepages behind the PGA, Oracle would still have to call mmap() with MAP_HUGETLB. Am I wrong on that, or missing your point?

      Thanks, Tanel.

  3. Pingback: Exadata CAN do smart scans on bitmap indexes | Tanel Poder's blog: IT & Mobile for Geeks and Pros

  4. Tanel Poder says:

    I was wrong – apparently Redhat has included the Transparent Huge Pages feature into Redhat 6! So, they’ve just added this functionality to their kernel version before it made it to the mainline kernel. Interesting! So, OEL 6 will have this feature too.

    Well, someone better do some quality research on this and publish the results – I’m too damn busy for this :(

  5. Scott says:

    @Tanel Poder
    Thanks for the insight. One downside of using hugepages is that you can’t use MEMORY_TARGET for AMM. With transparent huge pages perhaps coming to tmpfs in the near future, Oracle could use huge pages for /dev/shm.

    If I ever get enough free time, I will try out PGA with huge pages.

  6. Ofir Manor says:

    Hi Tanel,
    great writeup.
    I just wanted to add that hugepages configuration is officially recommended in some of the Oracle MAA best practices for Exadata here:
    http://www.oracle.com/technetwork/database/features/availability/exadata-maa-best-practices-155385.html

    For example, the “Oracle E-Business Suite on Exadata” paper says:

    Hugepages should be configured for the Oracle E-Business Suite database System Global Area
    (SGA). This will result in more efficient memory usage, especially with a large SGA or if there
    are high numbers of concurrent database connections.

    the “Siebel on Exadata” paper says:

    Siebel will typically run with many database connections and a large SGA and so configuring
    hugepages for the Siebel database instances is required.

    the “PeopleSoft on Exadata” paper says “Configure Linux huge pages on each database machine compute node”. That paper also has a test case with 42GB Huge Pages.

    So, while HugePages is less relevant for a classic DW with a small SGA, it is highly recommended for OLTP and consolidation. I would also think it is a must on the X2-8 with 1 TB of RAM per DB node…

  7. Pavol Babel says:

    Tanel,

    It is quite interesting to me that HP-UX uses large pages for Oracle by default – no extra configuration is needed. However, if I were asked to choose from the “big three” Unix players (HP-UX v3, Solaris, AIX), I would never pick HP-UX. I think the memory reserved for the HP-UX system (kernel memory and many buffers) is much bigger compared to AIX or Solaris (even without large pages configured).

  8. Dan Norris says:

    One small note regarding Solaris. The post implied that Solaris would be available only on the X2-8 – maybe you only mentioned the X2-8 because of the large memory size available in that configuration. Solaris, once available, will be supported on the X2-2 database servers as well as the X2-8.
