Quiz: Explaining index creation

Did you know that it’s possible to use EXPLAIN PLAN FOR CREATE INDEX ON table(col1,col2,col3) syntax for explaining what exactly would be done when an index is created?

That’s useful for example for seeing the Oracle’s estimated index size without having to actually create the index.

You can also use EXPLAIN PLAN FOR ALTER INDEX i REBUILD to see whether this operation would use a FULL TABLE SCAN or a FAST FULL INDEX SCAN (offline index rebuilds of valid indexes can use this method).

Anyway, you can experiment with this yourself, but here’s a little quiz (with a little gotcha :)

What kind of index creation statement would create such an execution plan?

SQL> explain plan for create index hack_index on hack_table ....the rest is a secret for now....
Explained.

PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 457720527
-------------------------------------------------------------------------------------------------
| Id  | Operation                | Name                 | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------
|   0 | CREATE INDEX STATEMENT   |                      | 88868 |       |   657   (1)| 00:00:12 |
|   1 |  INDEX BUILD NON UNIQUE  | HACK_INDEX           |       |       |            |          |
|   2 |   SORT AGGREGATE         |                      |     1 |       |            |          |
|   3 |    VIEW                  | HACK_VIEW            | 74062 |       |   318   (1)| 00:00:06 |
|*  4 |     HASH JOIN            |                      | 74062 |  1012K|   210   (1)| 00:00:04 |
|   5 |      TABLE ACCESS FULL   | TEST_USERS           |    46 |   368 |     4   (0)| 00:00:01 |
|   6 |      INDEX FAST FULL SCAN| IDX2_INDEXED_OBJECTS | 74062 |   433K|   206   (1)| 00:00:04 |
|   7 |   SORT CREATE INDEX      |                      | 88868 |       |            |          |
|   8 |    TABLE ACCESS FULL     | HACK_TABLE           | 88868 |       |   544   (1)| 00:00:10 |
-------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - access("U"."USERNAME"="O"."OWNER")
Note
-----
- estimated index size: 3145K bytes

cursor: pin S waits, sporadic CPU spikes and systematic troubleshooting

I recently consulted one big telecom and helped to solve their sporadic performance problem which had troubled them for some months. It was an interesting case as it happened in the Oracle / OS touchpoint and it was a product of multiple “root causes”, not just one, an early Oracle mutex design bug and a Unix scheduling issue – that’s why it had been hard to resolve earlier despite multiple SRs opened etc.

Martin Meyer, their lead DBA, posted some info about the problem and technical details, so before going on, you should read his blog entry and read my comments below after this:

Problem:

So, the problem was, that occasionally the critical application transactions which should have taken very short time in the database (<1s), took 10-15 seconds or even longer and timed out.

Symptoms:

  1. When the problem happened, the CPU usage also jumped up to 100% for the problem duration (from few tens of seconds up to few minutes).
  2. In AWR snapshots (taken every 20 minutes), the “cursor: pin S” popped into top TOP5 waits (around 5-10% of total instance wait time) and sometimes also “cursor: pin S wait on X” which is a different thing, also “latch: library cache” and interestingly “log file sync”. These waits had then much higher average wait times per wait occurrence than normal (tens or hundreds of milliseconds per wait, on average).
  3. The V$EVENT_HISTOGRAM view showed lots of cursor: pin S waits taking very long time (over a second, some even 30+ seconds) and this certainly isn’t normal (Martin has these numbers in his blog entry)

AWR and OS CPU usage measurement tools are system-wide tools (as opposed to session-wide tools).

Troubleshooting:

I can’t give you exact numbers or AWR data here, but will explain the flow of troubleshooting and reasoning.

  • As the symptoms involved CPU usage spikes, I first checked whether there were perhaps logon storms going on due a bad application server configuration, where the app server suddenly decides to fire up hundreds of more connections at the same time (that happens quite often, so it’s a usual suspect when troubleshooting such issues). A logon storm can consume lots of CPU as all these new processes need to be started up in OS, they attach to SGA (syscalls, memory pagetable set-up operations) and eventually they need to find & allocate memory from shared pool and initialize session structures. This all takes CPU.However the logons cumulative statistic in AWR didn’t go up almost at all during the 20 minute snapshot, so that ruled out a logon storm. As the number of sessions in the end of AWR snapshot (compared to the beginning of it) did not go down, this ruled out a logoff storm too (which also consumes CPU as now the exiting processes need to release their resources etc).
  • It’s worth mentioning that log file sync waits also went up by over an order of magnitude (IIRC from 1-2ms to 20-60 ms per wait) during the CPU spikes. However as log file parallel write times didn’t go up so radically, this indicated that the log file sync wait time was wasted somewhere else too – which is very likely going to be CPU scheduling latency (waiting on the CPU runqueue) when CPUs are busy.
  • As one of the waits which popped up during the problem was cursor: pin S, then I chcecked V$MUTEX_SLEEP_HISTORY and it did not show any specific cursor as a significant contention point (all contention recorded in that sleep history buffer was evenly spread across many different cursors), so that indicated to me that the problem was likely not related to a single cursor related issue (a bug or just too heavy usage of that cursor). Note that this view was not queried during the worst problem time, so there was a chance that some symptoms were not in there anymore (V$MUTEX_SLEEP_HISTORY is a circular buffer of few hundred last mutex sleeps).
  • So, we had CPU starvation and very long cursor: pin S waits popping up at the same time. cursor: pin S operation should happen really fast as it’s a very simple operation (few tens of instructions only) of marking the cursor’s mutex in-flux so its reference count could be bumped up for a shared mutex get.
  • Whenever you see CPU starvation (CPUs 100% busy and runqueues are long) and latch or mutex contention, then the CPU starvation should be resolved first, as the contention may just be a symptom of the CPU starvation. The problem is that if you get unlucky and a latch or mutex holder process is preempted and taken off CPU by the scheduler, the latch/mutex holder can’t release the latch before it gets back onto CPU to complete its operation! But OS doesn’t have a clue about this, as latches/mutexes are just Oracle’s memory structures in SGA. So the latch/mutex holder is off CPU and everyone else who gets onto CPU may want to take the same latch/mutex. They can’t get it and spin shortly in hope that the holder releases it in next few microseconds, which isn’t gonna happen in this case, as the latch/mutex holder is still off CPU!
  • And now comes a big difference between latches and mutexes in Oracle 10.2: When a latch getter can’t get the latch after spinning, it will go to sleep to release the CPU. Even if there are many latch getters in the CPU runqueue before the latch holder, they all spin quickly and end up sleeping again. But when a mutex getter doesn’t get the mutex after spinning, it will not go to sleep!!! It will yield() the CPU instead, which means that it will go to the end of runqueue and try to get back onto CPU as soon as possible. So, mutex getters in 10.2 are much less graceful, they can burn a lot of CPU when the mutex they want is held by someone else for long time.
  • But so what, if a mutex holder is preempted and taken off CPU by OS scheduler – it should get back onto CPU pretty fast, once it works its way through the CPU runqueue?
  • Well, yes IF all the processes in the system have the same priority.
  • This is where a second problem comes into play – Unix process priority decay. When a process eats a lot of CPU (and does little IO / voluntary sleeping) then the OS lowers that processes CPU scheduling priority so that other, less CPU hungry processes would still get their fair share of CPU (especially when coming back from an IO wait for example etc).
  • When a mutex holder has a lower priority than most other processes and is now taken off CPU, a thing called priority inversion happens. Even though other processes do have higher priority, they can not proceed, as the critical lock or resource they need, is already held by the other process with a lower priority who can’t complete its work as the “high priority” processes keep the CPUs busy.
  • In case of latches, the problem is not that bad as the latch getters go to sleep until they are posted when the latch is released by the holder process (I’ve written about it here). But the priority inversion takes a crazy turn in case of mutexes – as their getters don’t sleep (not even for a short time) by default, but yield the CPU and try to get back to it immediately and so on until they get the mutex. That can lead to huge CPU runqueue spikes, unresponsive systems and even hangs.
  • This is why starting from Oracle 11g the mutex getters do sleep instead of just yielding the CPU and also Oracle has backported the fix into Oracle 10.2.0.4, where a patch must be applied and where the _first_spare_parameter will specify the sleep duration in centiseconds.
  • So, knowing how mutexes worked in 10.2, all the symptoms led me to suspect this priority inversion problem, greatly amplified by how the mutex getters do never sleep by default. And we checked the effective priorities of all Oracle processes in the server, and we hit the jackpot – there was a number of processes with significantly lower priorities than all other processes had. And it takes only one process with low priority to cause all this trouble, just wait until it starts modifying a mutex and is preempted while doing this.
  • So, in order to fix both of the problems which amplified each other, we had to enable HPUX_SCHED_NOAGE Oracle parameter, to prevent priority decay of the processes and set the _first_spare_parameter to 10, which meant that default mutex sleep time will be 10 centiseconds (which is pretty long time in mutex/latching world, but better than crazily retrying without any sleeping at all). That way no process (the mutex holder) is pushed back and kept away from CPU for long periods of time.

This was not a trivial problem, as it happened in Oracle / OS touchpoint and happened not because a single reason, but as a product of multiple separate reasons, amplifying each other.

There are few interesting, non-technical points here:

  1. When troubleshooting, don’t let performance tools like AWR (or any other tool!) tell you what your problem is! Your business, your users should tell you what the problem is and the tools should only be used for symptom drilldown (This is what Cary Millsap has been constantly telling us). Note how I mentioned the problem and symptoms separately in the beginning of my post – and the problem was that some business transactions (systemwide) timed out because the database response time was 5-15 seconds!
  2. The detail and scope of your performance data must have at least as good detail and scope of your performance problem!
    In other words, if your problem is measured in few seconds, then your performance data should also be sampled at least every few seconds in order to be fully systematic.The classic issue in this case was that the 20 minute AWR reports still showed IO wait times as main DB time consumers, but that was averaged over 20 minutes. But our problem happened severely and shortly within few seconds in that 20 minutes, so the averaging and aggregation over long period of time did hide the extreme performance issue that happened in a very short time.

Next time when it seems to be impossible to diagnose a problem and if the troubleshooting effort ends up going in circles, then you should ask, “what’s the real problem and who and how is experiencing it” and see if your performance data’s detail and scope matches that problem!

Oh, this is a good point to mention that in addition to my Advanced Oracle Troubleshooting/SQL Tuning seminars I also actually perform advanced Oracle troubleshooting consulting too! I eat mutexes for breakfast ;-)

Update: Srivenu Kadiyala has experienced the same problem and has written about it here.

 

KGH: NO ACCESS – Buffer cache inside streams pool too!

Some time ago I wrote that since Oracle 10.2, some of the buffer cache can physically reside within shared pool granules.

I just noticed this in an 11.2 instance:

SQL> select * from v$sgastat where name like ‘KGH%’;
POOL         NAME                            BYTES
------------ -------------------------- ----------
streams pool KGH: NO ACCESS                4186144
So, it looks that also streams pool can surrender parts of its memory granules to buffer cache, if it’s unable to flush everything out from the granule for complete granule handover.
Let’s see whether that’s the case:

SQL> select last_oper_type, last_oper_mode from v$sga_dynamic_components where component = 'streams pool';
LAST_OPER_TYP LAST_OPER
------------- ---------
SHRINK        DEFERRED
Yep, the last streams pool shrink operation was left in DEFERRED status, which means the granule wasn’t handed over – streams pool kept the granule for itself, marked everything it could flush out as KGH: NO ACCESS in its heap header and handed these chunks over to buffer cache manager.
Lets check whether these chunks are actually used by buffer cache buffers:
(NB! Think twice before running this query in production as it may hold your shared pool latches for very long time):
SQL> select ksmchidx,ksmchdur,ksmchcom,ksmchptr,ksmchsiz,ksmchcls from x$ksmsst where ksmchcom = 'KGH: NO ACCESS';
  KSMCHIDX   KSMCHDUR KSMCHCOM         KSMCHPTR           KSMCHSIZ KSMCHCLS
---------- ---------- ---------------- ---------------- ---------- --------
         1          4 KGH: NO ACCESS   00000003A2401FE0    4186144 no acce
So, there’s only one chunk flushed & handed over to buffer cache in the streams pool heap. The KSMCHPTR column shows the starting address of that chunk in SGA and the KSMCHSIZ is the size of that chunk.
So, let’s see if there are any buffers within that address range. First I’ll calculate the end address of that chunk (start address + size -1 = end address)

SQL> @calc 0x00000003A2401FE0 + 4186143
                     DEC                  HEX
------------------------ --------------------
         15611199487.000            3A27FFFFF
And now lets query X$BH using that address range to see if / how many buffer cache buffers have been placed in there:

SQL> select count(*) from x$bh where rawtohex(ba) between '00000003A2401FE0' and '00000003A27FFFFF';

  COUNT(*)
----------
       483
We have just proven, that there are 483 buffer cache buffers (~3.8MB, 8kB buffers) which reside physically in streams pool heap!

Calculate SQL_ID and SQL_HASH_VALUE from SQL text

Some time ago I wrote an article about the 10g+ SQL_ID being just a hash value of the SQL statement text. It’s just like the “old” SQL_HASH_VALUE, only twice longer (8 last bytes instead of 4 last bytes of the MD5 hash value of SQL text).

Slavik Markovich has written a nice python script for calculating SQL_IDs and SQL hash values from SQL text using that approach.

Slavik’s article is available here:

http://www.slaviks-blog.com/2010/03/30/oracle-sql_id-and-hash-value/

Oracle Latch Contention Troubleshooting

I wrote a latch contention troubleshooting article for IOUG Select journal last year (it was published earlier this year). I have uploaded this to tech.E2SN too, I recommend you to read it if you want to become systematic about latch contention troubleshooting:

http://tech.e2sn.com/oracle/troubleshooting

I’m working on getting the commenting & feedback work at tech.E2SN site too, but for now you can comment here at this blog entry…

Oracle Troubleshooting: How to read Oracle ERRORSTACK output?!

I have written the first article to the troubleshooting section of my new website tech.E2SN.com:

It’s about a very valuable Oracle troubleshooting tool -> ERRORSTACK trace.

I cover 4 frequently asked questions there:

  1. Reading the current executing SQL statement text from errorstack trace
  2. Reading the current executing PL/SQL package and PL/SQL source code line number from errorstack trace
  3. Reading the current bind variable values from errostack trace
  4. Identifying how much private (UGA) memory a cursor is using

You can read it here:

By the way, if you like my new website, feel free to link to it !!! ;-)

Oracle Wait Event reference

Kyle Hailey has started putting together a much needed Oracle wait event reference.

You can access it here.

By the way, Oracle documentation also has a wait event reference section, it has more events, but it’s less detailed…

I have plans to go deep into some wait events and cover some less common ones in tech.E2SN too… in the future ;-)

Beyond Oracle Wait Interface – Part 2

This is the second part of the joint blog “project” with James Morle, called “The Wait Interface Is Useless (Sometimes)”.

We already did a joint presentation on this topic at UKOUG and more conferences will follow :) Read the first part by James here for intro.

So, where do we go when Oracle’s wait interface doesn’t help us? We will show multiple techniques over time, but here’s where I normally continue when wait interface is “useless”.

I use V$SESSTAT.

Oh, were you expecting something more “advanced” instead of boring old V$SESSTAT’s performance counters which has been available in Oracle for ages? ;-)

Well, there is a reason why the V$SESSTAT has been available in Oracle for ages, very likely even before Wait Interface was introduced (but I wouldn’t know for sure, I was still in elementary school back then or something :).

And the reason is that session level performance counters are VERY useful about finding out WHAT Oracle is doing. Every time you parse, a counter goes up by one in V$SESSTAT for the session. Every time you execute, a counter goes up. Every time you do a logical IO, a counter goes up. Every time you commit, a counter goes up. Every time an index block is split, a counter goes up. You get the point.

Oracle’s V$SESSTAT tells you WHAT Oracle is doing. Wait interface tells you how much TIME is spent waiting for something, but V$SESSTAT counters just tell you how many times some operation was done. You can not conclude how much time was spent by looking just at the number of times something has happened (as Cary will tell you) but nevertheless, the V$SESSTAT counters definitely give you a good clue into WHAT THE [FU|HE]CK is an Oracle session doing – especially when wait interface says you’re not waiting for anything and SQL trace doesn’t print a line.

How many different operations are counted for each session in Oracle? Let’s check (Oracle 11.2):


SQL> select count(*) from v$sesstat where sid = 14;

COUNT(*)
----------
611

Oracle 11.2 keeps track of the counts of 611 different operations, for each session!

That’s VERY valuable source of information for understanding what Oracle is doing and its relatively easy to use.

And that’s exactly the reason why I wrote my Snapper script – I wanted this information to be VERY EASY to use!

And here’s an example – session 14 is stuck, doesn’t respond, user complains. Let’s check the wait interface:

If I want to know what a session is doing, I sample V$SESSION_WAIT first, with my sw.sql script (or I could just query ASH which gives me the same data with history):

SQL> @sw 14

    SID STATE   EVENT                                          SEQ# SEC_IN_WAIT P1                 P2                 P3                 P1TRANSL
------- ------- ---------------------------------------- ---------- ----------- ------------------ ------------------ ------------------ ------------------------------------------
     14 WORKING On CPU / runqueue                              3486         130

1 row selected.

SQL>
SQL> @sw 14

    SID STATE   EVENT                                          SEQ# SEC_IN_WAIT P1                 P2                 P3                 P1TRANSL
------- ------- ---------------------------------------- ---------- ----------- ------------------ ------------------ ------------------ ------------------------------------------
     14 WORKING On CPU / runqueue                              3486         137

1 row selected.

SQL>
SQL>
SQL> @sw 14

    SID STATE   EVENT                                          SEQ# SEC_IN_WAIT P1                 P2                 P3                 P1TRANSL
------- ------- ---------------------------------------- ---------- ----------- ------------------ ------------------ ------------------ ------------------------------------------
     14 WORKING On CPU / runqueue                              3486         139

1 row selected.

SQL>

From here we see 2 things:

  1. 3 samples out of 3 the session was ON CPU (not waiting). 3 samples is not too good for precise statistical sampling but there’s something else I spot in the output
  2. SEC_IN_WAIT (read: seconds in current state) is over 130. This means that this session has been in the current state (ON CPU) for over 130 seconds in row, without waiting for anything in between (the moment the wait state changes, the SEC_IN_WAIT goes back to 0).

So, it’s pretty evident that this session is just burning CPU and wait interface is useless here (as we are not waiting for anything, as far as Oracle sees it of course). Alternatively I could just run top or prstat and check whether the process is 100% on CPU or not.

So, before going on to Snapper, lets look into what kind of SQL this session executes right now (we could sample this also multiple times, to check whether we are constantly running the same statement):

SQL> <strong>select * from table(select dbms_xplan.display_cursor(sql_id,sql_child_number) from v$session where sid = 14);</strong>

PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  5vy5qjd3fsn5c, child number 0
-------------------------------------
SELECT     MIN(t1.created), MAX(t1.created) FROM     t1   , t2   , t3
WHERE     t1.object_id = t2.object_id AND t2.object_id = t3.object_id
AND t1.owner = :v AND t2.owner = :v AND t3.owner = :v

Plan hash value: 3271631391

----------------------------------------------------------------------------------------
| Id  | Operation                       | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |      |       |       |     4 (100)|          |
|   1 |  SORT AGGREGATE                 |      |     1 |    66 |            |          |
|*  2 |   HASH JOIN                     |      |     7 |   462 |     4  (25)| 00:00:01 |
|   3 |    NESTED LOOPS                 |      |       |       |            |          |
|   4 |     NESTED LOOPS                |      |     7 |   329 |     2   (0)| 00:00:01 |
|   5 |      TABLE ACCESS BY INDEX ROWID| T1   |     7 |   196 |     1   (0)| 00:00:01 |
|*  6 |       INDEX RANGE SCAN          | I1   |     7 |       |     1   (0)| 00:00:01 |
|*  7 |      INDEX RANGE SCAN           | I2   |    27 |       |     1   (0)| 00:00:01 |
|*  8 |     TABLE ACCESS BY INDEX ROWID | T2   |     1 |    19 |     1   (0)| 00:00:01 |
|   9 |    TABLE ACCESS BY INDEX ROWID  | T3   |    40 |   760 |     1   (0)| 00:00:01 |
|* 10 |     INDEX RANGE SCAN            | I3   |    40 |       |     1   (0)| 00:00:01 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("T2"."OBJECT_ID"="T3"."OBJECT_ID")
6 - access("T1"."OWNER"=:V)
7 - access("T2"."OWNER"=:V)
8 - filter("T1"."OBJECT_ID"="T2"."OBJECT_ID")
10 - access("T3"."OWNER"=:V)

33 rows selected.

SQL>

Hmm… can anyone reliably tell why this SQL statement is slow and burns all the CPU time?

The answer is no. An execution PLAN is a plan of actions which would be executed when someone runs the execution plan. It doesn’t tell you anything about what’s exactly going on right now, it doesn’t tell you what exactly is using all the CPU time. Anything we would say at this point, would be a guess! Lets use V$SESSTAT instead, for gathering more hard evidence!

(The syntax of Snapper is documented inside the script or just search for snapper in my blog, plenty of examples :)

SQL> <strong>@snapper out 5 1 14
</strong>
-- Session Snapper v2.02 by Tanel Poder ( http://www.tanelpoder.com )

--
----------------------------------------------------------------------------------------------------------------------
  SID, USERNAME  , TYPE, STATISTIC                               ,         DELTA, HDELTA/SEC,    %TIME, GRAPH
----------------------------------------------------------------------------------------------------------------------
   14, SYS       , STAT, session logical reads                   ,         81351,     16.27k,
   14, SYS       , STAT, consistent gets                         ,         81351,     16.27k,
   14, SYS       , STAT, consistent gets from cache              ,         81351,     16.27k,
   14, SYS       , STAT, consistent gets from cache (fastpath)   ,         81352,     16.27k,
   14, SYS       , STAT, no work - consistent read gets          ,         81381,     16.28k,
   14, SYS       , STAT, table fetch by rowid                    ,       2904784,    580.96k,
   14, SYS       , STAT, index scans kdiixs1                     ,            93,       18.6,
   14, SYS       , STAT, buffer is pinned count                  ,       5736560,      1.15M,
   14, SYS       , STAT, buffer is not pinned count              ,         75248,     15.05k,
   14, SYS       , STAT, no buffer to keep pinned count          ,             1,         .2,
   14, SYS       , TIME, DB CPU                                  ,       6050000,      1.21s,   121.0%, |@@@@@@@@@@|
   14, SYS       , TIME, sql execute elapsed time                ,       6059922,      1.21s,   121.2%, |@@@@@@@@@@|
   14, SYS       , TIME, DB time                                 ,       6059922,      1.21s,   121.2%, |@@@@@@@@@@|
--  End of snap 1, end=2010-01-15 17:33:22, seconds=5

PL/SQL procedure successfully completed.

As our previous sw.sql output showed  – we are not waiting for anything. If we were, then snapper would show you WAIT lines from wait interface (V$SESSION_EVENT) as well. But we don’t see any waits.

So, now its the time to go through all the V$SESSTAT stats reported in Snapper! These are the lines with STAT in them (TIME lines are V$SESS_TIME_MODEL breakdown available since 10g, but they are not detailed enough for diagnosing complex problems).

One thing we see, there is a statistic session logical reads and from the last column of Snapper output we see that this session did over 16000 logical reads (buffer gets) per second during snapper run time (of 5 seconds). That’s already a good indication of why we burn so much CPU time – we are doing lots of logical IOs.

Also, we see that we did 18.6 index range scans per second (the statistic index scans kdiixs1 shows that). That doesn’t seem “much” does it. However, check the value for buffer is pinned count statistic! This counter is updated every time we go to a buffer to get some data from it AND it already happens to be open by my session! Oracle keeps buffers pinned  and relevant buffer handles cached during a database call in some cases (like nested loop joins and index scans) to avoid reopening the buffer again (getting a buffer again if its closed would mean another logical IO).

Nevertheless, we can see that we re-visited some buffers again and again and again – over a million times per second! When checking my index structures I see that every index has only a few hundred blocks, so having millions and millions of buffer visits (buffer is pinned count + session logical reads) means we are heavily re-visiting the same blocks again and again and again! This points directly to NESTED LOOPS (and also INDEX RANGE SCANS) in the execution plan. Nested loops, as the name says loops through inner rowsource and revisits some of its data as many times as the number of rows coming from the outer rowsource (with few special cases, as usual).

And there’s one more statistic in Snapper output which I do NOT see – execute count. This statistic shows how many times the session executed cursors (executed new ones or re-executed the same one). NB! Snapper only reports these V$SESSTAT statistics which changed during the snapshot – as execute count is not reported, it means that the execute count did not increase during the snapshot, thus the session still executed the same statement! (so for example we were not hard parsing and lots of different statements to “justify” this CPU usage, Snapper would have shown parse count (hard) going up if this session had done so).

So, using the above diagnosis, this is the place where I would take a serious look whether the NESTED LOOPS in the above execution plan are a right thing to do. I would check how many rows would actually match the bind variable values (more about that later) in the SQL statement predicates – NESTED LOOPS is not a good operation for joining large number of rows. More about SQL tuning very soon ;-)

To finish this blog entry, this is the sequence of troubleshooting which I usually use (if a user complains or “something” is slow):

  1. Identify the session(s) servicing the user/task with the problem
  2. Check wait interface for these sessions (sw.sql, ASH query) – this gives quick overview whether the session is 100% stuck waiting for something, 100% busy working and burning CPU or somewhere in between
  3. Run Snapper on the sessions to (very easily) see which V$SESSTAT counters have gone up (some counters such memory usage and open cursors current can go down too)
  4. If all this fails or doesn’t give enough evidence of the problem for whatever reason, then it’s time to take a screwdriver and open up Oracle from outside – using stack traces, truss, DTrace etc. We will be blogging about this ;-)

Note that Snapper is just an anonymous PL/SQL block and it doesn’t require any changes to the database!

Ok, back to James now!

Finding the reasons for excessive logical IOs

There’s another interesting thread going on in Oracle-L, about understanding logical IOs and drilling down into their reasons. Of course sometimes (or rather usually) the excessive logical IOs come from a bad execution plan (when a nested loop loops over lots of datablocks again and again or a wrong index is used for driving a query etc), but sometimes the excessive LIOs are caused by some internal issues, like space management etc.

A convenient tool I use for reporting logical IO reasons is (again) my Snapper! It has  an option “b” for reporting Buffer get reasons or as I use below – option “a” shows All information Snapper can show.

There are couple of gotchas though which make this approach imperfect:

  1. The X$ tables Snapper uses for LIO reason reporting contain instance-wide counters, not specific to a single testing session. Thus you either need to be the single user in your database when experimenting and even then the background activity may increment some counters while you are testing too. I have sometimes suspended all other processes (kill -STOP and kill -CONT to resume)  or used Flash Freeze (oradebug ffbegin and ffresumeinst) to hang the whole instance that there would be no other activity going on.
  2. These buffer get reason counters are not maintained properly in Oracle 11g, probably due an optimization effort and some changes for faster pinning of buffer cache buffers (there’s a parameter called _fastpin_enable which is set to 1 in 11g and it enables so called fastpath buffer gets. If you see v$sesstat statistics such “consistent gets from cache (fastpath) or “db block gets from cache (fastpath)” being inremented, then fastpath buffer gets/pins are used. Note that I do have a script which works also on 11g but I’ll write about that one some time in the future :)

Anyway, if you are testing in an environment exclusively used by you, on Oracle 10.2 or lower, then you can run snapper with the gather=a option to report a bunch instance-level statistics in addition to the standard session-level stats:

  1. BUFG – Buffer get reasons (both consistent and current mode gets)
  2. LATG – Latch gets (both willing to wait and immediate gets)
  3. ENQG – Enqueue gets

Here’s an example, prepare for long output:

[Read more...]

Detect chained and migrated rows in Oracle – Part 1

I received a question about migrated rows recently.

It was about how to detect migrated rows in a 200TB data warehouse, with huge tables – as the ANALYZE TABLE xyz LIST CHAINED ROWS INTO command can not be automatically parallelized at table level (as DBMS_STATS can be, but oh, DBMS_STATS doesn’t gather the migrated/chained row info). Therefore the analyze command would pretty much run forever before returning (and committing) the chained row info in the output table. Also as there are regular maintenance jobs running on these tables (I suspect partition maintentance for example), then it wouldn’t be nice to keep running ANALYZE on the whole table constantly.

So, is there any faster or better way for finding the amount of migrated rows?

Ihave two answers to this.

Answer 1:

As we are dealing with a huge 200+ TB data warehouse its tables/indexes are most likely partitioned. Thus you could use the ANALYZE TABLE xyz PARTITION (abc) LIST CHAINED ROWS command to analyze individual partitions, even in parallel (sqlplus sessions) if you like. This would allow you to focus only on the partitions of interest (the latest ones, with the heaviest activity perhaps).

SQL> create table CHAINED_ROWS (
2    owner_name         varchar2(30),
3    table_name         varchar2(30),
4    cluster_name       varchar2(30),
5    partition_name     varchar2(30),
6    subpartition_name  varchar2(30),
7    head_rowid         rowid,     -- actual chained row's head piece address in the segment
8    analyze_timestamp  date
9  );
Table created.
SQL>
SQL> analyze table tmp partition (sys_p501) list chained rows; -- the default table name used for output is "CHAINED_ROWS"
Table analyzed.
SQL> analyze table tmp partition (sys_p502) list chained rows;
Table analyzed.
SQL> select partition_name, count(*) from chained_rows group by partition_name;
PARTITION_NAME                   COUNT(*)
------------------------------ ----------
SYS_P502                              252
SYS_P501                             5602

SQL>

So, from above you see its possible to find out partition (or even sub-partition level row chaining).

However this above command lists you both CHAINED rows and MIGRATED rows (even though Oracle calls them all chained rows internally, as the chaining mechanism is the same for both cases).

Chained row is a row which is too large to fit into a block, so will always have to be split between multiple different blocks – with an exception of intra-block chaining which is used for rows with more than 255 columns. Migrated row on the other hand is a row which has been updated larger than it initially was – and if as a result it doesn’t fit into its original block, the row itself is moved to a new block, but the header (kind of a stub pointer) of the row remains in original location. This is needed so that any indexes on the table would still be able to find that row using original ROWIDs stored in them). If Oracle didn’t leave the row head piece in place then it would always go and update all indexes which have the ROWID of the migrating row in them.

Why should we care whether a row is a real chained row or just a migrated row?

It’s because if the row is chained, then any reorgs would not help you – if a row is too big to fit into a block, its too big to fit into a block no matter how many times you move around the table. (Note that if you have large tables full of rows longer than 8KB there’s likely something wrong with your design).

But migrated rows on the other hand are “chained” into another block due some update which made them not fit into existing block anymore. This happens when PCTFREE is set too low compared to real row growth factor and sometimes you may want to fix it by reorganizing the table/partition with ALTER TABLE/PARTITION MOVE or by backing the rows up, deleting them and reinserting them back to the table (that one makes sense when only a small amount of rows in a table are migrated).

If you are completely sure that you don’t have any rows longer than the free space in an empty block (thus all individual rows would fit into a block and would need to be split among multiple blocks) then you can conclude that all the rows reported were migrated due their growth.

Another option would be to query out all or a sample of these chained/migrated rows and actually measure how long they are if all columns are put together. This could be done using vsize() function (or also dump() and lengthb() in some cases). Of course the column and row header overhead would need to be accounted in as well.

So, this already gets pretty complex and there are more tiny details which we should take into account… thus I will introduce another way to look into the row migration/chaining thing:

Answer 2: (Alternatively called “should we care?”)

[Read more...]