What the heck is the SQL Execution ID – SQL_EXEC_ID?

Ok, I think it’s time to write another blog entry. I’ve been traveling and dealing with jetlag from 10-hour time difference, then traveling some more, spoken at conferences, drank beer, had fun, then traveled some more, trained customers, hacked some Exadatas and now I’m back home.

Anyway, do you know what is the SQL_EXEC_ID in V$SESSION and ASH views?

Oh yeah, it’s the “SQL Execution ID” just like the documentation says … all clear. Um … is it? I’d like to know more about it – what does it actually stand for?! Is it session level, instance level or a RAC-global counter? And why does it start from 16 million, not 1?

 

SQL> SELECT sql_exec_id FROM v$session WHERE sid = USERENV('SID');

SQL_EXEC_ID
-----------
   16777216

 

This number 16777216 looks strangely familiar – indeed, it’s 2^24.

When I run the same query again (incrementing the SQL_EXEC_ID counter for the same SQL), I see the counter going up by 1:

 

SQL> SELECT sql_exec_id FROM v$session WHERE sid = USERENV('SID');

SQL_EXEC_ID
-----------
   16777217

SQL> SELECT sql_exec_id FROM v$session WHERE sid = USERENV('SID');

SQL_EXEC_ID
-----------
   16777218

SQL> SELECT sql_exec_id FROM v$session WHERE sid = USERENV('SID');

SQL_EXEC_ID
-----------
   16777219

SQL> SELECT sql_exec_id FROM v$session WHERE sid = USERENV('SID');

SQL_EXEC_ID
-----------
   16777220

 

Further executions of the same query keep incrementing this counter, one by one – even if I run this same SQL from another session in the same instance. So, this SQL_EXEC_ID is not a session-scope value for each SQL_ID, it’s at least instance-wide. It looks like the counting starts from 2^24 (the bit representing 2^24 is set) and ignoring that bit for now, the counting works normally, one by one, starting from zero.

 

Note that changing even a single character in the SQL text (see the extra space in the end before the semi-colon) causes the SQL_ID to change and a different SQL_EXEC_ID counter to be reported (which starts from “zero” again). A separate SQL_EXEC_ID counter is maintained in shared pool for each SQL_ID:

SQL> SELECT sql_exec_id FROM v$session WHERE sid = USERENV('SID') ;

SQL_EXEC_ID
-----------
   16777216

 

So, obviously, when I have just restarted my instance and still see 16777216 as the starting SQL_EXEC_ID for any SQL I execute, it must mean that the full SQL_EXEC_ID value contains something else than just the execution number of this SQL_ID. Whenever I see such familiar values (like powers of 2), then I like to look into the values in hex format to see whether some higher order bits are used for some special purpose. Let’s run a new SQL statement:

 

SQL> SELECT sql_exec_id, TO_CHAR(sql_exec_id,'XXXXXXXX') hex FROM v$session WHERE sid = USERENV('SID') ;

SQL_EXEC_ID HEX
----------- ---------
   16777216   1000000

SQL> SELECT sql_exec_id, TO_CHAR(sql_exec_id,'XXXXXXXX') hex FROM v$session WHERE sid = USERENV('SID') ;

SQL_EXEC_ID HEX
----------- ---------
   16777217   1000001

SQL> SELECT sql_exec_id, TO_CHAR(sql_exec_id,'XXXXXXXX') hex FROM v$session WHERE sid = USERENV('SID') ;

SQL_EXEC_ID HEX
----------- ---------
   16777218   1000002

 

Indeed, it looks like the 25th bit (2^24) is always pre-set to 1, while the least significant 24 bits represent how many times this SQL ID has been executed in an instance (I have tested this with a loop – the 24 least significant bits do get used fully for representing the SQL ID’s execution count in the instance and once it reaches 0xFFFFFF – or 0x1FFFFFF with that pre-set 25th bit, it wraps to 0×1000000 – the 25th bit still remaining set!). So the SQL_EXEC_ID can reliably only track 2^24 – 1 SQL executions in an instance and then the counter wraps to beginning. This is why you should include SQL_EXEC_START (date datatype with 1 sec precision) column in your performance monitoring queries as well, to distinguish between SQL executions with a colliding SQL_EXEC_ID. As long as you’re executing your SQL statement less than 16.7 million times per second per instance, this should be fine :-)

 

Anyway, so what’s the magic 25th bit then? Well, in RAC it would be very hard to somehow coordinate the incrementing of a single counter globally (that’s why you want to keep your sequences cached in RAC), I figure that there are different counters for the same SQL ID in different RAC instances. Let’s check – I will log in to another RAC node (node 2) and run this:

 

SQL> SELECT sql_exec_id, TO_CHAR(sql_exec_id,'XXXXXXXX') hex FROM v$session WHERE sid = USERENV('SID') ;

SQL_EXEC_ID HEX
----------- ---------
   33554433   2000001

SQL> SELECT sql_exec_id, TO_CHAR(sql_exec_id,'XXXXXXXX') hex FROM v$session WHERE sid = USERENV('SID') ;

SQL_EXEC_ID HEX
----------- ---------
   33554434   2000002

SQL> SELECT sql_exec_id, TO_CHAR(sql_exec_id,'XXXXXXXX') hex FROM v$session WHERE sid = USERENV('SID') ;

SQL_EXEC_ID HEX
----------- ---------
   33554435   2000003

 

Whoa – the SQL Execution ID in the 2nd instance starts from 33 Million! And when you convert the value to hex, you’ll see that now the 26th bit is set – showing that this SQL was executed in instance #2!

So, it very much looks like that while the 24 least significant bits are used for the SQL execution ID counter, the more significant bits are used for showing which instance_id ran that SQL. Assuming that 32 bits are used for the whole SQL_EXEC_ID value, then up to 8 higher order bits could be used for storing the instance_id – supporting up to 256-node RAC clusters. This is very useful when analyzing past ASH data as you can aggregate data (count min/max exec ID difference to get the execution counts in a time range) either in each separate instance or globally – by stripping out the instance_id part from the value.

I haven’t tested the instance_id part with 256-node RAC clusters (as Santa Claus is cutting back due to poor economy), but at least on an 8-node full rack Exadata all 8 instance_ids were reported properly. Note that for serial queries, the SQL_EXEC_ID shows you the instance_id of the instance where the session is logged on to, but for inter-instance parallel query, you will see the instance_id of the query coordinator for all PX slaves, regardless of in which instances they run. Here’s a little script from a 8-node Exadata cluster to show it. I’ll leave it up to you to fully figure it out what, how and why it’s doing, but basically what it shows is that the SQL_EXEC_ID consists of the query coordinator’s instance_id value and the execution number for a SQL_ID in the instance where the query coordinator session was logged in:

 

SQL> SELECT qc_instance_id, MIN(TO_CHAR(sql_exec_id,'XXXXXXXX'))
  2  , MAX(TO_CHAR(sql_exec_id,'XXXXXXXX'))
  3* FROM gv$active_session_history GROUP BY qc_instance_id order by 1
SQL> /

QC_INSTANCE_ID MIN(TO_CH MAX(TO_CH
-------------- --------- ---------
             1   1000000   100540F
             2   2000000   20009BF
             3   3000000   300541E
             4   4000000   40000DD
             5   5000000   50C5035
             6   6000000   600018C
             7   7000000   700023D
             8   8000000   8000755
                 1000000   803DF3B

9 rows selected.

 

That’s all for today – more cool stuff is coming, I promise :-)

 

And oh, next week I’ll start another run of my Advanced Oracle Troubleshooting seminar, so check it out! ;-)

ORA-4031 errors, contention, cursor management issues and shared pool fragmentation – free secret seminar!

Free stuff! Free stuff! Free stuff! :-)

The awesome dudes at E2SN have done it again! (and yes, Tom, this time the “we at E2SN Ltd” doesn’t mean only me alone ;-)

On Tuesday 22nd March I’ll hold two (yes two) Secret Oracle Hacking Sessions – about ORA-04031: unable to allocate x bytes of shared memory errors, cursor management issues and other shared pool related problems (like fragmentation). This event is free for all! You’ll just need to be fast enough to register, both events have 100 attendee limit (due to my GotoWebinar accont limitations).

I am going to run this online event twice, so total 200 people can attend (don’t register for both events, please). One event is in the morning (my time) to cater for APAC/EMEA region and the other session is for EMEA/US/Americas audience.

The content will be the same in both sessions. There will be no slides (you cant fix your shared pool problems with slides!) but there will be demos, scripts, live examples and fun (for the geeks among us anyway – others go and read some slides instead ;-)!

Oracle Exadata Performance series – Part 1: Should I use Hugepages on Linux Database Nodes?

There was a question in LinkedIn forum about whether Linux Hugepages should be used in Oracle Exadata Database layer, as they aren’t enabled by default during ACS install. I’m putting my answer into this blog entry – apparently LinkedIn forums have a limit of 4000 characters per reply… (interestingly familiar number, by the way…:)

So, I thought that it’s time to start writing my Oracle Exadata Performance series articles what I’ve planned for a while… with some war stories from the field, some stuff what I’ve overcome when researching for writing the Expert Oracle Exadata book etc.

I’ve previously published an article about Troubleshooting Exadata Smart Scan performance and some slides from my experience with VLDB Data Warehouse migrations to Exadata.

Here’s the first article (initially planned as a short response in LinkedIn, but it turned out much longer though):

As far as I’ve heard, the initial decision to not enable hugepages by default was that the hugepages aren’t flexible & dynamic enough – you’ll have to always configure the hugepages at OS level to match your desired SGA size (to avoid wastage). So, different shops may want radically different SGA sizes (larger SGA for single-block read oriented databases like transactional/OLTP or OLAP cubes), but smaller SGA for smart scan/parallel scan oriented DWs. If you configure 40GB of hugepages on a node, but only use 1GB of SGA, then 39GB memory is just reserved, not used, wasted – as hugepages are pre-allocated. AMM, using regular pages, will only use the pages what it touches, so there’s no memory wastage due to any pre-allocation issues…

So, Oracle chose to use an approach which is more universal and doesn’t require extra OS level configuration (which isn’t hard at all though if you pay attention, but not all people do). So, less people will end up in trouble with their first deployments although they might not be getting the most out of their hardware.

However, before enabling hugepages “because it makes things faster” you should ask yourself what exact benefit would they bring you?

There are 3 main reasons why hugepages may be useful in Linux:

1) Smaller kernel memory usage thanks to less PTEs thanks to larger pagesizes

This means less pagetable entries (PTEs) and less kernel memory usage. The bigger your SGA and the more processes you have logged on, the bigger the memory usage.

You can measure this in your case – just “grep Page /proc/meminfo” and see how big portion of your RAM has been used by “PageTables”. Many people have blogged about this, but Kevin Closson’s blog is probably the best source to read about this:

2) Lower CPU usage due to less TLB misses in CPU and soft page-fault processing when accessing SGA.

It’s harder to measure this on Linux with standard tools, although it is sure possible (on Solaris you can just run prstat -m to get microstate accounting and look into TFL,DFL,TRP stats).

Anyway, the catch here is that if you are running parallel scans and smart scans, then you don’t access that much of buffer cache in SGA at all, all IOs or smart scan result-sets are read directly to PGAs of server processes – which don’t use large pages at all, regardless of whether hugepages for SGA have been configured or not. There are some special cases, like when a block clone has to be rolled back for read consistency, you’ll have to access some undo blocks via buffer cache… but again this should be a small part of total workload.

So, in a DW, which using mostly smarts scans or direct path reads, there won’t be much CPU efficiency win from large pages as you bypass buffer cache anyway and use small pages of private process memory. All the sorting, hashing etc all happens using small pages anyway. Again I have to mention that on (my favorite OS) Solaris it is possible to configure even PGAs to use large pages (via _realfree_heap_pagesize_hint parameter) … so it’ll be interesting to see how this would help DW workloads on the Exadata X2-8 monsters which can run Solaris 11.

3) Lock SGA pages into RAM so they won’t be paged out when memory shortage happens (for whatever reason).

Hugepages are pre-allocated and never paged out. So, when you have extreme memory shortage, your SGAs won’t be paged out “by accident”. Of course it’s better to ensure that such memory shortages won’t happen – configure the SGA/PGA_AGGREGATE_TARGET sizes properly and don’t allow third party programs consume crazy amounts of memory etc. Of course there’s the lock_sga parameter in Oracle which should allow to do this on Linux with small pages too, but first I have never used it on Linux so I don’t know whether it works ok at all and also in 11g AMM perhaps the mlock() calls aren’t supported on the /dev/shm files at all (haven’t checked and don’t care – it’s better to stay away from extreme memory shortages). Read more about how the AMM MEMORY_TARGET (/dev/shm) works from my article written back in 2007 when 11g came out ( Oracle 11g internals – Automatic Memory Management ).

So, the only realistic win (for DW workload) would be the reduction of kernel pagetables structure size – and you can measure this using PageTables statistic in /proc/meminfo. Kevin demonistrated in his article that 500 connections to an instance with ~8 GB SGA consisting of small pages resulted in 7 GB of kernel pagetables usage, while the usage with large pages (still 500 connections, 8 GB SGA) was about 265 MB. So you could win over 6 GB of RAM, which you can then give to PGA_AGGREGATE_TARGET or to further inrease SGA. The more processes you have connected to Oracle, the more pagetable space is used… Similarly, the bigger the SGA is, the more pagetable space is used…

This is great, but the tradeoff here is manageability and some extra effort you have to put in to always check whether the large pages actually got used or not. After starting up your instance, you should really check whether the HugePages_Free in /proc/meminfo shrank and HugePages_Rsvd increased (when instance has just started up and Oracle hasn’t touched all the SGA pages yet, some pages will show up as Rsvd – reserved).

With a single instance per node this is trivial – you know how much SGA you want and pre-allocate the amount of hugepages for that. If you want to increase the SGA, you’ll have to shut down the instance and increase the Linux hugepages setting too. This can be done dynamically by issuing a command like echo N > /proc/sys/vm/nr_hugepages (where N is the number of huge pages), BUT in real life this may not work out well as if Linux kernel can’t free enough small pages from right physical RAM locations to consolidate 2 or 4 MB contiguous pages, the above command may fail to create the requested amount of new hugepages.

And this means you should restart the whole node to do the change. Note that if you increase your SGA larger to the number of hugepages (or you forget to increase the memlock setting in /etc/security/limits.conf accordingly) then your instance will silently just use the small pages, while all the memory pre-allocated for hugepages stays reserved for hugepages and is not usable for anything else!).

So, this may become more of a problem when you have multiple database instances per cluster node or you expect to start up and shut down instances on different nodes based on demand (or when some cluster nodes fail).

Long story short – I do configure hugepages in “static” production environments, to save kernel memory (and some CPU time for OLTP type environments using buffer cache heavily), also on Exadata. However for various test and development environments with lots of instances per server and constant action, I don’t bother myself (and the client) with hugepages and make everyone’s life easier… Small instances with small number of connections won’t use that many PTEs anyway…

For production environments with multiple database instances per node (and where failovers are expected) I would take the extra effort to ensure that whatever hugepages I have preallocated, won’t get silently wasted because an instance wants more SGA than the available hugepages can accommodate. You can do this by monitoring /proc/meminfo’s HugePage entries as explained above. And remember, the ASM instance (which is started before DB instances) will also grab itself some hugepages when it starts!

A little new feature for shared pool geeks :-)

If you’ve taken any shared pool dumps from 11g+ databases lately, you might have wondered about this:

http://tech.e2sn.com/oracle/troubleshooting/shared-pool

The full power of Oracle’s diagnostic events, part 2: ORADEBUG DOC and 11g improvements

I haven’t written any blog entries for a while, so here’s a very sweet treat for low-level Oracle troubleshooters and internals geeks out there :)

Over a year ago I wrote that Oracle 11g has a completely new low-level kernel diagnostics & tracing infrastructure built in to it. I wanted to write a longer article about it with comprehensive examples and use cases, but by now I realize I won’t ever have time for this, so I’ll just point you to the right direction :)

Basically, since 11g, you can use SQL_Trace, kernel undocumented traces, various dumps and other actions at much better granularity than before.

For example, you can enable SQL_Trace for a specific SQL_ID only:

SQL> alter session set events 'sql_trace[SQL: 32cqz71gd8wy3{pgadep: exactdepth 0} {callstack: fname opiexe}
plan_stat=all_executions,wait=true,bind=true';


Session altered.

Actually I have done more in above example, I have also said that trace only when the PGA depth (the dep= in tracefile) is zero. This means that trace only top-level calls, issued directly by the client application and not recursively by some PL/SQL or by dictionary cache layer. Additionally I have added a check whether we are currently servicing opiexe function (whether the current call stack contains opiexe as a (grand)parent function) – this allows to trace & dump only in specific cases of interest!

The syntax is actually more powerful than that, in this example I’m running kernel tracing for a kernel component plus instructing Oracle to dump various other things at level 1 (callstack,process state and query block debug info) whenever a tracepoint (event) in the SQL Transformation component family is hit:

SQL> alter session set events 'trace[RDBMS.SQL_Transform[SQL: 32cqz71gd8wy3]
disk=high RDBMS.query_block_dump(1) processstate(1) callstack(1)';


Session altered.

And by now you are probably asking that where is this syntax formally documented? Google and MOS searches don’t return anything useful. Well, as with many other things, a good reference is stored within Oracle kernel itself!

Just log on as sysdba and type ORADEBUG DOC:

ORADEBUG DOC

SQL> oradebug doc
Internal Documentation
**********************
EVENT                           Help on events (syntax, event list, ...)
COMPONENT       [<comp_name>]   List all components or describe <comp_name>

This gives you the index page, now you can navigate on by running ORADEBUG DOC EVENT and take it from there. There’s lots of documentation there!

I have put the output with some comments and examples into my website too:

http://tech.e2sn.com/oracle/troubleshooting/oradebug-doc

Note that this feature is quite fresh, almost not used at all in the real (production) world, so I consider this quite experimental. I have managed to crash my session with some tests, so take the usual advice about any undocumented stuff (and oradebug) – don’t use it in production without thinking first and if you do use it, then use it at your own risk!

cursor: pin S waits, sporadic CPU spikes and systematic troubleshooting

I recently consulted one big telecom and helped to solve their sporadic performance problem which had troubled them for some months. It was an interesting case as it happened in the Oracle / OS touchpoint and it was a product of multiple “root causes”, not just one, an early Oracle mutex design bug and a Unix scheduling issue – that’s why it had been hard to resolve earlier despite multiple SRs opened etc.

Martin Meyer, their lead DBA, posted some info about the problem and technical details, so before going on, you should read his blog entry and read my comments below after this:

Problem:

So, the problem was, that occasionally the critical application transactions which should have taken very short time in the database (<1s), took 10-15 seconds or even longer and timed out.

Symptoms:

  1. When the problem happened, the CPU usage also jumped up to 100% for the problem duration (from few tens of seconds up to few minutes).
  2. In AWR snapshots (taken every 20 minutes), the “cursor: pin S” popped into top TOP5 waits (around 5-10% of total instance wait time) and sometimes also “cursor: pin S wait on X” which is a different thing, also “latch: library cache” and interestingly “log file sync”. These waits had then much higher average wait times per wait occurrence than normal (tens or hundreds of milliseconds per wait, on average).
  3. The V$EVENT_HISTOGRAM view showed lots of cursor: pin S waits taking very long time (over a second, some even 30+ seconds) and this certainly isn’t normal (Martin has these numbers in his blog entry)

AWR and OS CPU usage measurement tools are system-wide tools (as opposed to session-wide tools).

Troubleshooting:

I can’t give you exact numbers or AWR data here, but will explain the flow of troubleshooting and reasoning.

  • As the symptoms involved CPU usage spikes, I first checked whether there were perhaps logon storms going on due a bad application server configuration, where the app server suddenly decides to fire up hundreds of more connections at the same time (that happens quite often, so it’s a usual suspect when troubleshooting such issues). A logon storm can consume lots of CPU as all these new processes need to be started up in OS, they attach to SGA (syscalls, memory pagetable set-up operations) and eventually they need to find & allocate memory from shared pool and initialize session structures. This all takes CPU.However the logons cumulative statistic in AWR didn’t go up almost at all during the 20 minute snapshot, so that ruled out a logon storm. As the number of sessions in the end of AWR snapshot (compared to the beginning of it) did not go down, this ruled out a logoff storm too (which also consumes CPU as now the exiting processes need to release their resources etc).
  • It’s worth mentioning that log file sync waits also went up by over an order of magnitude (IIRC from 1-2ms to 20-60 ms per wait) during the CPU spikes. However as log file parallel write times didn’t go up so radically, this indicated that the log file sync wait time was wasted somewhere else too – which is very likely going to be CPU scheduling latency (waiting on the CPU runqueue) when CPUs are busy.
  • As one of the waits which popped up during the problem was cursor: pin S, then I chcecked V$MUTEX_SLEEP_HISTORY and it did not show any specific cursor as a significant contention point (all contention recorded in that sleep history buffer was evenly spread across many different cursors), so that indicated to me that the problem was likely not related to a single cursor related issue (a bug or just too heavy usage of that cursor). Note that this view was not queried during the worst problem time, so there was a chance that some symptoms were not in there anymore (V$MUTEX_SLEEP_HISTORY is a circular buffer of few hundred last mutex sleeps).
  • So, we had CPU starvation and very long cursor: pin S waits popping up at the same time. cursor: pin S operation should happen really fast as it’s a very simple operation (few tens of instructions only) of marking the cursor’s mutex in-flux so its reference count could be bumped up for a shared mutex get.
  • Whenever you see CPU starvation (CPUs 100% busy and runqueues are long) and latch or mutex contention, then the CPU starvation should be resolved first, as the contention may just be a symptom of the CPU starvation. The problem is that if you get unlucky and a latch or mutex holder process is preempted and taken off CPU by the scheduler, the latch/mutex holder can’t release the latch before it gets back onto CPU to complete its operation! But OS doesn’t have a clue about this, as latches/mutexes are just Oracle’s memory structures in SGA. So the latch/mutex holder is off CPU and everyone else who gets onto CPU may want to take the same latch/mutex. They can’t get it and spin shortly in hope that the holder releases it in next few microseconds, which isn’t gonna happen in this case, as the latch/mutex holder is still off CPU!
  • And now comes a big difference between latches and mutexes in Oracle 10.2: When a latch getter can’t get the latch after spinning, it will go to sleep to release the CPU. Even if there are many latch getters in the CPU runqueue before the latch holder, they all spin quickly and end up sleeping again. But when a mutex getter doesn’t get the mutex after spinning, it will not go to sleep!!! It will yield() the CPU instead, which means that it will go to the end of runqueue and try to get back onto CPU as soon as possible. So, mutex getters in 10.2 are much less graceful, they can burn a lot of CPU when the mutex they want is held by someone else for long time.
  • But so what, if a mutex holder is preempted and taken off CPU by OS scheduler – it should get back onto CPU pretty fast, once it works its way through the CPU runqueue?
  • Well, yes IF all the processes in the system have the same priority.
  • This is where a second problem comes into play – Unix process priority decay. When a process eats a lot of CPU (and does little IO / voluntary sleeping) then the OS lowers that processes CPU scheduling priority so that other, less CPU hungry processes would still get their fair share of CPU (especially when coming back from an IO wait for example etc).
  • When a mutex holder has a lower priority than most other processes and is now taken off CPU, a thing called priority inversion happens. Even though other processes do have higher priority, they can not proceed, as the critical lock or resource they need, is already held by the other process with a lower priority who can’t complete its work as the “high priority” processes keep the CPUs busy.
  • In case of latches, the problem is not that bad as the latch getters go to sleep until they are posted when the latch is released by the holder process (I’ve written about it here). But the priority inversion takes a crazy turn in case of mutexes – as their getters don’t sleep (not even for a short time) by default, but yield the CPU and try to get back to it immediately and so on until they get the mutex. That can lead to huge CPU runqueue spikes, unresponsive systems and even hangs.
  • This is why starting from Oracle 11g the mutex getters do sleep instead of just yielding the CPU and also Oracle has backported the fix into Oracle 10.2.0.4, where a patch must be applied and where the _first_spare_parameter will specify the sleep duration in centiseconds.
  • So, knowing how mutexes worked in 10.2, all the symptoms led me to suspect this priority inversion problem, greatly amplified by how the mutex getters do never sleep by default. And we checked the effective priorities of all Oracle processes in the server, and we hit the jackpot – there was a number of processes with significantly lower priorities than all other processes had. And it takes only one process with low priority to cause all this trouble, just wait until it starts modifying a mutex and is preempted while doing this.
  • So, in order to fix both of the problems which amplified each other, we had to enable HPUX_SCHED_NOAGE Oracle parameter, to prevent priority decay of the processes and set the _first_spare_parameter to 10, which meant that default mutex sleep time will be 10 centiseconds (which is pretty long time in mutex/latching world, but better than crazily retrying without any sleeping at all). That way no process (the mutex holder) is pushed back and kept away from CPU for long periods of time.

This was not a trivial problem, as it happened in Oracle / OS touchpoint and happened not because a single reason, but as a product of multiple separate reasons, amplifying each other.

There are few interesting, non-technical points here:

  1. When troubleshooting, don’t let performance tools like AWR (or any other tool!) tell you what your problem is! Your business, your users should tell you what the problem is and the tools should only be used for symptom drilldown (This is what Cary Millsap has been constantly telling us). Note how I mentioned the problem and symptoms separately in the beginning of my post – and the problem was that some business transactions (systemwide) timed out because the database response time was 5-15 seconds!
  2. The detail and scope of your performance data must have at least as good detail and scope of your performance problem!
    In other words, if your problem is measured in few seconds, then your performance data should also be sampled at least every few seconds in order to be fully systematic.The classic issue in this case was that the 20 minute AWR reports still showed IO wait times as main DB time consumers, but that was averaged over 20 minutes. But our problem happened severely and shortly within few seconds in that 20 minutes, so the averaging and aggregation over long period of time did hide the extreme performance issue that happened in a very short time.

Next time when it seems to be impossible to diagnose a problem and if the troubleshooting effort ends up going in circles, then you should ask, “what’s the real problem and who and how is experiencing it” and see if your performance data’s detail and scope matches that problem!

Oh, this is a good point to mention that in addition to my Advanced Oracle Troubleshooting/SQL Tuning seminars I also actually perform advanced Oracle troubleshooting consulting too! I eat mutexes for breakfast ;-)

Update: Srivenu Kadiyala has experienced the same problem and has written about it here.

 

Oracle 11gR2 has been released – and with column oriented storage option

You may already have noticed that Oracle 11gR2 for Linux is available for download on Oracle.com website, with documentation.

And this document ends speculation about whether Oracle 11.2 will support column-oriented storage – yes it will:

http://www.oracle.com/technology/products/database/oracle11g/pdf/oracle-database-11g-release2-overview.pdf

However, this is apparently available on Exadata storage only as a new error message below indicates:

ORA-64307: hybrid columnar compression is only supported in tablespaces residing on Exadata storage
Cause: An attempt was made to use hybrid columnar compression on unsupported storage.
Action: Create this table in a tablespace residing on Exadata storage or use a different compression type.

Update: Kevin Closson mentioned that 11gR2 doesn’t really have column oriented storage as some other products like Vertica’s and Sybase IQ use, but its rather just column oriented compression option where storage is still organized by row but individual fields in these rows use compression dictionaries whichcan span multiple block boundaries (we’ll thats my interpretation at least).

The 11gR2 release overview doc seems to be wrong in this case, as it says:

Hybrid columnar compression is a new method for organizing how data is stored. Instead of
storing the data in traditional rows, the data is grouped, ordered and stored one column at a time.

Read Kevin’s note here:

http://kevinclosson.wordpress.com/2009/09/01/oracle-switches-to-columnar-store-technology-with-oracle-database-11g-release-2/

ORA-64307: hybrid columnar compression is only supported in tablespaces residing on Exadata storage
Cause: An attempt was made to use hybrid columnar compression on unsupported storage.

Action: Create this table in a tablespace residing on Exadata storage or use a different compression type.

How to detect when a cursor was closed from SQL trace output?

After Randolf’s comment on my last post about identifying cursor SQL text from sql trace file I think one thing needs elaboration.

I mentioned earlier in this post that this cursor dumping technique works “as long as the cursor of interest is still open”.

So how do you know whether this cursor of interest is still open or has been closed and that slot reused by some other statement instead? You would not want to get misled to wrong SQL statement…

Luckily all the info you need is in SQL tracefile.

In 11g you would see “CLOSE #2″ line appearing in tracefile when a cursor is closed.

Before 11g you would see “STAT #2″ lines dumping out SQL plan execution stats – before 11g this is done when cursor is closed, so when you see STAT #2 lines, you know that at that location the cursor in slot #2 was closed.

There are few cases when the stat lines are not printed even if the cursor is closed – for example when the cursor’s library cache lookup (finding suitable shared cursor) happened before SQL trace was enabled. In this case the cursor would not use the extra instrumentation rowsources which generate the STAT lines.

In such case you would just grep the tracefile for PARSING IN CURSOR #2 after your WAIT #2′s. Whenever someone is parsing in a new cursor into slot #2, this means that existing cursor in slot #2 must have been closed somewhere between the PARSING IN CURSOR #2 line and last WAIT/FETCH/EXEC #2 line.

However, there’s one more special case; when the cursor was parsed in before SQL trace was enabled and another OPI call (like EXEC,FETCH) against this cursor is executed, then some Oracle versions actually retrieve the current SQL text and still dump it into trace just like before parsing. This means that if you just have lots of WAIT lines appearing in the trace for a SQL parsed in past, you won’t see the PARSING IN CURSOR lines dumped into tracefile, but if some OPI call is executed against this cursor, then the PARSING IN CURSOR line is dumped just before that OPI call.

I don’t in which exact version this appeared, but in my 10.2.0.1 test instance  this feature gives such output (I ran a select from dba_objects which started fetching lots of rows and then enabled SQL trace from another session):

PARSING IN CURSOR #1 len=29 dep=0 uid=0 oct=3 lid=0 tim=558589701689 hv=3336193532 ad='24f2e864'
select owner from dba_objects
END OF STMT
FETCH #1:c=0,e=5362,p=0,cr=102,cu=0,mis=0,r=500,dep=0,og=1,tim=558589701683
WAIT #1: nam='SQL*Net more data to client' ela= 23 driver id=1111838976 #bytes=2047 p3=0 obj#=39 tim=558589703066
WAIT #1: nam='SQL*Net message from client' ela= 57993 driver id=1111838976 #bytes=1 p3=0 obj#=39 tim=558589761409
WAIT #1: nam='SQL*Net message to client' ela= 3 driver id=1111838976 #bytes=1 p3=0 obj#=39 tim=558589761753
WAIT #1: nam='SQL*Net more data to client' ela= 13 driver id=1111838976 #bytes=2001 p3=0 obj#=39 tim=558589762331
FETCH #1:c=0,e=1193,p=0,cr=93,cu=0,mis=0,r=500,dep=0,og=1,tim=558589762865
WAIT #1: nam='SQL*Net message from client' ela= 60762 driver id=1111838976 #bytes=1 p3=0 obj#=39 tim=558589823801
WAIT #1: nam='SQL*Net message to client' ela= 2 driver id=1111838976 #bytes=1 p3=0 obj#=39 tim=558589824139
WAIT #1: nam='SQL*Net more data to client' ela= 19 driver id=1111838976 #bytes=2001 p3=0 obj#=39 tim=558589825284

Do you notice something missing?

The PARSE #1 and EXEC #1 lines are missing as the actual parsing and execute calls happened way before the SQL trace was enabled. But Oracle was helpful enough to dump out the SQL text when next OPI call (FETCH) was issued under SQL tracing mode (it “noticed” that the SQL text corresponding to cursor #1 had not been dumped to trace yet).

This is helpful, but of course can cause some confusion, as this PARSING IN CURSOR doesn’t really mean parsing as such (as parsing is done under PARSE call), the PARSING IN CURSOR should probably be named as “ABOUT TO PARSE CURSOR” or something like that.

“Free” DBA_HIST AWR views in 11g…

I just noticed this in 11g Licensing doc ( http://download.oracle.com/docs/cd/B28359_01/license.111/b28287/options.htm#sthref69 ):

  • All data dictionary views beginning with the prefix DBA_HIST_ are part of this pack, along with their underlying tables.The only exception are the views: DBA_HIST_SNAPSHOT, DBA_HIST_DATABASE_INSTANCE, DBA_HIST_SNAP_ERROR, DBA_HIST_SEG_STAT, DBA_HIST_SEG_STAT_OBJ, and DBA_HIST_UNDOSTAT. They can be used without the Diagnostic Pack license.

This exception is not present in 10.2 license guide, so before 11g you can query V$SEGMENT_STATISTICS and V$UNDOSTAT’s history “for free” :)

Of course, collecting this data manually with a 1-line PL/SQL loop script isn’t hard either :)

Oracle 11g: Reading alert log via SQL

Oracle has done some major improvements in the diagnosability infrastructure in version 11g. Here’s one little detail.

Before Oracle 11g it is possible to access the alert log via SQL using an external table or a pipelined function which in turn uses utl_file.
After reading the text you need to parse it to extract the information you need from there.

Starting from 11g Oracle does all this work for you. There is a fixed table X$DBGALERTEXT, when you query it, Oracle reads the log.xml from alert directory (which contains all the data what alert.log does), parses it and returns the details back as rows:

SQL> select message_text from X$DBGALERTEXT where rownum <= 20;

MESSAGE_TEXT
-----------------------------------------------------------------------------------------------------------------
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Shared memory segment for instance monitoring created
Picked latch-free SCN scheme 2
Using LOG_ARCHIVE_DEST_10 parameter default value as USE_DB_RECOVERY_FILE_DEST
Autotune of undo retention is turned on.
IMODE=BR
ILAT =18
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up ORACLE RDBMS Version: 11.1.0.7.0.
Using parameter settings in client-side pfile /u01/app/oracle/admin/LIN11G/pfile/init.ora on machine linux03
System parameters with non-default values:
  processes                = 150
  memory_target            = 404M
  control_files            = "/u01/oradata/LIN11G/control01.ctl"
  control_files            = "/u01/oradata/LIN11G/control02.ctl"
  control_files            = "/u01/oradata/LIN11G/control03.ctl"
  db_block_size            = 8192

20 rows selected.

This is the text representation, but you can get individual details from other columns as listed below:

[Read more...]