<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Tanel Poder's blog: Core IT for Geeks and Pros &#187; Unix/Linux</title>
	<atom:link href="http://blog.tanelpoder.com/category/unixlinux/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.tanelpoder.com</link>
	<description>Oracle troubleshooting, internals and performance tuning</description>
	<lastBuildDate>Sat, 31 Jul 2010 05:44:10 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>cursor: pin S waits, sporadic CPU spikes and systematic troubleshooting</title>
		<link>http://blog.tanelpoder.com/2010/04/21/cursor-pin-s-waits-sporadic-cpu-spikes-and-systematic-troubleshooting/</link>
		<comments>http://blog.tanelpoder.com/2010/04/21/cursor-pin-s-waits-sporadic-cpu-spikes-and-systematic-troubleshooting/#comments</comments>
		<pubDate>Wed, 21 Apr 2010 21:55:09 +0000</pubDate>
		<dc:creator>Tanel Poder</dc:creator>
				<category><![CDATA[Internals]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Oracle 11g]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Troubleshooting]]></category>
		<category><![CDATA[Unix/Linux]]></category>

		<guid isPermaLink="false">http://blog.tanelpoder.com/?p=673</guid>
		<description><![CDATA[I recently consulted one big telecom and helped to solve their sporadic performance problem which had troubled them for some months. It was an interesting case as it happened in the Oracle / OS touchpoint and it was a product of multiple &#8220;root causes&#8221;, not just one, an early Oracle mutex design bug and a [...]]]></description>
			<content:encoded><![CDATA[<p>I recently consulted for a big telecom and helped solve a sporadic performance problem that had troubled them for some months. It was an interesting case because it happened at the Oracle / OS touchpoint and was the product of multiple &#8220;root causes&#8221;, not just one: an early Oracle mutex design bug and a Unix scheduling issue. That&#8217;s why it had been hard to resolve earlier, despite multiple SRs being opened.</p>
<p><a href="http://blog.tanelpoder.com/wp-content/plugins/wordpress-feed-statistics/feed-statistics.php?url=aHR0cDovL21hcnRpbm1leWVyLmJsb2dzcG90LmNvbS8=" target="_blank">Martin Meyer</a>, their lead DBA, posted some info about the problem and its technical details, so before going on, you should read his blog entry and then come back to my comments below:</p>
<ul>
<li><a href="http://blog.tanelpoder.com/wp-content/plugins/wordpress-feed-statistics/feed-statistics.php?url=aHR0cDovL21hcnRpbm1leWVyLmJsb2dzcG90LmNvbS8yMDEwLzA0L2xvbmctd2FpdC10aW1lcy1mb3ItY3Vyc29yLXBpbi1zLWFuZC5odG1s" target="_blank">http://martinmeyer.blogspot.com/2010/04/long-wait-times-for-cursor-pin-s-and.html</a></li>
</ul>
<p><strong>Problem:</strong></p>
<p>The problem was that, occasionally, critical application transactions which should have taken a very short time in the database (&lt;1s) took 10-15 seconds or even longer and timed out.</p>
<p><strong>Symptoms:</strong></p>
<ol>
<li>When the problem happened, CPU usage also jumped to 100% for the duration of the problem (from a few tens of seconds up to a few minutes).</li>
<li>In AWR snapshots (taken every 20 minutes), &#8220;cursor: pin S&#8221; popped into the Top 5 waits (around 5-10% of total instance wait time), sometimes along with &#8220;cursor: pin S wait on X&#8221; (which is a different thing), &#8220;latch: library cache&#8221; and, interestingly, &#8220;log file sync&#8221;. During those periods these waits had much higher average wait times per occurrence than normal (tens or hundreds of milliseconds per wait, on average).</li>
<li>The V$EVENT_HISTOGRAM view showed lots of cursor: pin S waits taking a very long time (over a second, some even 30+ seconds), which certainly isn&#8217;t normal (Martin has these numbers in his blog entry).</li>
</ol>
<p>AWR and OS CPU usage measurement tools are system-wide tools (as opposed to session-wide tools).</p>
<p><strong>Troubleshooting:</strong></p>
<p><em>I can&#8217;t give you exact numbers or AWR data here, but will explain the flow of troubleshooting and reasoning.</em></p>
<ul>
<li>As the symptoms involved CPU usage spikes, I first checked whether there were perhaps <em>logon storms</em> going on due to a bad application server configuration, where the app server suddenly decides to fire up hundreds of additional connections at the same time (that happens quite often, so it&#8217;s a usual suspect when troubleshooting such issues). A logon storm can consume lots of CPU, as all these new processes need to be started up in the OS, they attach to the SGA (syscalls, memory pagetable set-up operations) and eventually they need to find &amp; allocate memory from the shared pool and initialize session structures. This all takes CPU. However, the <em>logons cumulative</em> statistic in AWR hardly went up at all during the 20-minute snapshot, so that ruled out a logon storm. As the number of sessions at the end of the AWR snapshot (compared to the beginning of it) did not go down, this ruled out a <em>logoff</em> storm too (which also consumes CPU, as the exiting processes need to release their resources).</li>
</ul>
<ul>
<li>It&#8217;s worth mentioning that <em>log file sync</em> waits also went up by over an order of magnitude (IIRC from 1-2ms to 20-60 ms per wait) during the CPU spikes. However as <em>log file parallel write</em> times didn&#8217;t go up so radically, this indicated that the log file sync wait time was wasted somewhere else too &#8211; which is very likely going to be CPU scheduling latency (waiting on the CPU runqueue) when CPUs are busy.</li>
</ul>
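<p>The reasoning in the bullet above is simple arithmetic: whatever part of a <em>log file sync</em> wait is not covered by the <em>log file parallel write</em> itself was spent somewhere else. Here is a minimal sketch of that subtraction. Only the 1-2 ms and 20-60 ms ranges come from this case; the exact figures below, and the function name, are made up for illustration.</p>

```python
# Hypothetical per-wait averages in milliseconds. Only the rough ranges for
# "log file sync" come from the text; everything else is an assumption.
def unexplained_ms(log_file_sync_ms, log_file_parallel_write_ms):
    """Part of a commit wait that the redo write I/O does not explain."""
    return max(0.0, log_file_sync_ms - log_file_parallel_write_ms)

normal = unexplained_ms(1.5, 1.0)
spike = unexplained_ms(40.0, 2.0)
print(f"normal: {normal:.1f} ms, spike: {spike:.1f} ms not explained by I/O")
```

<p>When the unexplained part grows while the write time stays roughly flat, CPU scheduling latency becomes the prime suspect, which is the conclusion drawn above.</p>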
<ul>
<li>As one of the waits which popped up during the problem was cursor: pin S, I checked V$MUTEX_SLEEP_HISTORY. It did not show any specific cursor as a significant contention point (all contention recorded in that sleep history buffer was evenly spread across many different cursors), which indicated to me that the problem was likely not related to a single cursor (a bug or just too heavy usage of that cursor). Note that this view was not queried during the worst problem time, so there was a chance that some symptoms were no longer in there (V$MUTEX_SLEEP_HISTORY is a circular buffer of the last few hundred mutex sleeps).</li>
</ul>
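<p>The "is one cursor dominating?" check above can be sketched in a few lines. This is only an illustration: it assumes the sleep history rows have already been fetched as (mutex identifier, sleeps) pairs, and both sample data sets are invented.</p>

```python
from collections import Counter

def top_mutex_share(rows):
    """rows: iterable of (mutex_identifier, sleeps) pairs.
    Returns (identifier, share of all sleeps) for the busiest mutex."""
    totals = Counter()
    for mutex_id, sleeps in rows:
        totals[mutex_id] += sleeps
    total = sum(totals.values())
    if total == 0:
        return None, 0.0
    mutex_id, sleeps = totals.most_common(1)[0]
    return mutex_id, sleeps / total

# Made-up sample: sleeps spread evenly over many cursors, as in this case.
even = [(1000 + i, 10) for i in range(50)]
# Made-up sample: one hot cursor dominating the buffer.
hot = even + [(42, 4500)]

print(top_mutex_share(even))
print(top_mutex_share(hot))
```

<p>A small top share (2% in the first sample) points away from a single hot cursor; a large one (90% in the second) would have pointed straight at it.</p>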
<ul>
<li>So, we had CPU starvation and very long cursor: pin S waits popping up at the same time. The cursor: pin S operation should happen really fast, as it&#8217;s very simple (only a few tens of instructions): it marks the cursor&#8217;s mutex <em>in-flux</em> so that its reference count can be bumped up for a shared mutex get.</li>
</ul>
<ul>
<li>Whenever you see CPU starvation (CPUs 100% busy and long runqueues) <em>and</em> latch or mutex contention, the CPU starvation should be resolved first, as the contention may just be a symptom of it. The problem is that if you get unlucky and a latch or mutex holder process is preempted and taken off CPU by the scheduler, the holder can&#8217;t release the latch/mutex before it gets back onto CPU to complete its operation! But the OS doesn&#8217;t have a clue about this, as latches/mutexes are just Oracle&#8217;s memory structures in the SGA. So the latch/mutex holder is off CPU and everyone else who gets onto CPU may want to take the same latch/mutex. They can&#8217;t get it, so they spin briefly in the hope that the holder releases it within the next few microseconds, which isn&#8217;t going to happen in this case, as the holder is still off CPU!</li>
</ul>
<ul>
<li>And now comes a big difference between latches and mutexes in Oracle 10.2: when a latch getter can&#8217;t get the latch after spinning, it goes to sleep to release the CPU. Even if there are many latch getters in the CPU runqueue ahead of the latch holder, they all spin quickly and end up sleeping again. But when a mutex getter doesn&#8217;t get the mutex after spinning, it does not go to sleep!!! It will yield() the CPU instead, which means that it goes to the end of the runqueue and tries to get back onto CPU as soon as possible. So mutex getters in 10.2 are much less graceful: they can burn a lot of CPU when the mutex they want is held by someone else for a long time.</li>
<li>But so what, if a mutex holder is preempted and taken off CPU by OS scheduler &#8211; it should get back onto CPU pretty fast, once it works its way through the CPU runqueue?</li>
</ul>
<ul>
<li>Well, yes IF all the processes in the system have the same priority.</li>
</ul>
<ul>
<li>This is where a second problem comes into play &#8211; Unix process priority decay. When a process eats a lot of CPU (and does little IO / voluntary sleeping), the OS lowers that process&#8217;s CPU scheduling priority so that other, less CPU-hungry processes still get their fair share of CPU (especially when coming back from an IO wait, for example).</li>
</ul>
<ul>
<li>When a mutex holder has a lower priority than most other processes and is taken off CPU, a thing called <em>priority inversion</em> happens. Even though the other processes have higher priority, they cannot proceed, as the critical lock or resource they need is already held by the lower-priority process, which can&#8217;t complete its work because the &#8220;high priority&#8221; processes keep the CPUs busy.</li>
</ul>
<ul>
<li>In the case of latches, the problem is not that bad, as the latch getters go to sleep until they are posted when the latch is released by the holder process (I&#8217;ve written about it <a href="http://blog.tanelpoder.com/wp-content/plugins/wordpress-feed-statistics/feed-statistics.php?url=aHR0cDovL2Jsb2cudGFuZWxwb2Rlci5jb20vMjAwOS8wMS8yMC9yZWxpYWJsZS1sYXRjaC13YWl0cy1hbmQtYS1uZXctYmxvZy8=" target="_blank">here</a>). But the priority inversion takes a crazy turn in the case of mutexes, as their getters don&#8217;t sleep by default (not even for a short time); they yield the CPU and try to get back onto it immediately, over and over, until they get the mutex. That can lead to huge CPU runqueue spikes, unresponsive systems and even hangs.</li>
</ul>
<ul>
<li>This is why, starting from Oracle 11g, mutex getters do sleep instead of just yielding the CPU. Oracle has also backported the fix to Oracle 10.2.0.4, where a patch must be applied and where the <em>_first_spare_parameter</em> specifies the sleep duration in centiseconds.</li>
</ul>
<ul>
<li>So, knowing how mutexes worked in 10.2, all the symptoms led me to suspect this priority inversion problem, greatly amplified by the fact that mutex getters never sleep by default. We checked the effective priorities of all Oracle processes on the server and hit the jackpot &#8211; there were a number of processes with significantly lower priorities than all the others. And it takes only one low-priority process to cause all this trouble: just wait until it starts modifying a mutex and is preempted while doing so.</li>
</ul>
<ul>
<li>So, in order to fix both of the problems which amplified each other, we had to enable the HPUX_SCHED_NOAGE Oracle parameter to prevent priority decay of the processes, and set _first_spare_parameter to 10, which meant that the default mutex sleep time would be 10 centiseconds (a pretty long time in the mutex/latching world, but better than crazily retrying without any sleeping at all). That way no process (such as the mutex holder) is pushed back and kept away from CPU for long periods of time.</li>
</ul>
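<p>To make the difference between "yield and retry" and "sleep and retry" more tangible, here is a deliberately crude toy model, not a description of how any real scheduler or Oracle itself works. It assumes one CPU, a strict-priority scheduler in which the mutex getters always outrank the mutex holder, and abstract "ticks" instead of real time. The function name and all parameter values are invented for the illustration.</p>

```python
def simulate(getters, hold_ticks, sleep_ticks, max_ticks=1000):
    """Toy model: one CPU, getters always outrank the low-priority holder.
    sleep_ticks == 0 models "yield only" (the getter is runnable again at once).
    Returns (tick at which the mutex was released or None, ticks burned by getters)."""
    sleeping = [0] * getters   # remaining sleep ticks per getter (0 = runnable)
    remaining = hold_ticks     # CPU ticks the holder still needs to finish
    burned = 0
    turn = 0                   # round-robin position among runnable getters
    for tick in range(max_ticks):
        runnable = [g for g in range(getters) if sleeping[g] == 0]
        # Sleepers get one tick closer to waking up.
        sleeping = [max(0, s - 1) for s in sleeping]
        if runnable:
            # A higher-priority getter takes the CPU, spins and fails.
            g = runnable[turn % len(runnable)]
            turn += 1
            burned += 1
            sleeping[g] = sleep_ticks
        else:
            # Only when every getter sleeps does the holder get the CPU.
            remaining -= 1
            if remaining == 0:
                return tick + 1, burned
    return None, burned

print("yield only:", simulate(getters=3, hold_ticks=5, sleep_ticks=0))
print("with sleep:", simulate(getters=3, hold_ticks=5, sleep_ticks=10))
```

<p>In this model the yield-only run never releases the mutex within the tick limit and burns every tick on failed gets, while the run with sleeping getters lets the holder finish after a handful of ticks. Real schedulers are not strictly priority-based, so the real-world effect is long stalls rather than an outright permanent hang, but the mechanism is the same.</p>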
<p>This was not a trivial problem, as it happened at the Oracle / OS touchpoint and was caused not by a single reason but by multiple separate reasons amplifying each other.</p>
<p>There are a few interesting, non-technical points here:</p>
<ol>
<li>When troubleshooting, don&#8217;t let performance tools like AWR (or any other tool!) tell you what your <em>problem</em> is! Your business and your users should tell you what the problem is, and the tools should only be used for symptom drilldown (this is what <a href="http://blog.tanelpoder.com/wp-content/plugins/wordpress-feed-statistics/feed-statistics.php?url=aHR0cDovL2NhcnltaWxsc2FwLmJsb2dzcG90LmNvbS8=" target="_blank">Cary Millsap</a> has been constantly telling us). Note how I mentioned the problem and the symptoms separately at the beginning of my post &#8211; and the problem was that some business transactions (systemwide) timed out because the database response time was 5-15 seconds!</li>
<li><p>The detail and scope of your performance data must be <em>at least</em> as good as the detail and scope of your performance problem!<br />
In other words, if your problem is measured in a few seconds, then your performance data should also be sampled at least every few seconds in order to be fully systematic.</p>
<p>The classic issue in this case was that the 20-minute AWR reports still showed IO wait times as the main DB time consumers, but that was averaged over 20 minutes. Our <em>problem</em>, however, happened severely and briefly, within a few seconds of those 20 minutes, so the averaging and aggregation over a long period of time hid the extreme performance issue that happened in a very short time.</p></li>
</ol>
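<p>The averaging effect described in the second point is easy to reproduce with made-up numbers. The sketch below assumes one response-time sample per second over a 20-minute (1200-second) window, with a hypothetical 15-second burst of 12-second responses in the middle. Only the 20-minute snapshot length comes from this case; the rest is invented.</p>

```python
def bucket_means(samples, bucket):
    """Average consecutive samples in buckets of `bucket` items."""
    means = []
    for i in range(0, len(samples), bucket):
        chunk = samples[i:i + bucket]
        means.append(sum(chunk) / len(chunk))
    return means

# Hypothetical per-second response times (s): 0.2 s normally, 12 s for a 15 s burst.
samples = [0.2] * 1200
samples[600:615] = [12.0] * 15

whole = bucket_means(samples, 1200)[0]     # one 20-minute "snapshot"
worst_5s = max(bucket_means(samples, 5))   # worst 5-second bucket

print(f"20-minute average: {whole:.2f} s, worst 5-second average: {worst_5s:.2f} s")
```

<p>The 20-minute average stays around a third of a second and looks harmless, while the 5-second buckets show the full 12-second responses that the users actually experienced.</p>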
<p>Next time it seems impossible to diagnose a problem and the troubleshooting effort ends up going in circles, ask &#8220;what&#8217;s the real problem, and who is experiencing it and how?&#8221; and see whether your performance data&#8217;s detail and scope match that problem!</p>
<p>Oh, this is a good point to mention that in addition to my Advanced Oracle Troubleshooting/SQL Tuning <a href="http://blog.tanelpoder.com/wp-content/plugins/wordpress-feed-statistics/feed-statistics.php?url=aHR0cDovL3RlY2guZTJzbi5jb20vb3JhY2xlLXRyYWluaW5nLXNlbWluYXJz">seminars</a> I also actually perform advanced Oracle troubleshooting <a href="http://blog.tanelpoder.com/wp-content/plugins/wordpress-feed-statistics/feed-statistics.php?url=aHR0cDovL2Jsb2cudGFuZWxwb2Rlci5jb20vY29udGFjdC8=">consulting</a> too! I eat mutexes for breakfast ;-)</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.tanelpoder.com/2010/04/21/cursor-pin-s-waits-sporadic-cpu-spikes-and-systematic-troubleshooting/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
		<item>
		<title>Sometimes things are easy (Part 1): How to fix wrapped execution plan text?</title>
		<link>http://blog.tanelpoder.com/2010/01/18/sometimes-things-are-easy-part-1-how-to-fix-wrapped-execution-plan-text/</link>
		<comments>http://blog.tanelpoder.com/2010/01/18/sometimes-things-are-easy-part-1-how-to-fix-wrapped-execution-plan-text/#comments</comments>
		<pubDate>Mon, 18 Jan 2010 16:26:33 +0000</pubDate>
		<dc:creator>Tanel Poder</dc:creator>
				<category><![CDATA[Administration]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Productivity]]></category>
		<category><![CDATA[Tools]]></category>
		<category><![CDATA[Unix/Linux]]></category>

		<guid isPermaLink="false">http://blog.tanelpoder.com/?p=568</guid>
		<description><![CDATA[What you see below is a common problem. Someone sends you (or posts to a forum) a wide execution plan, which is unreadable because of wrapped lines. For example, this one below: -------------------------------------------------------------------------------- ------------------- &#124; Id  &#124; Operation                   &#124; Name                    &#124; E-Rows &#124;  OMem &#124; 1Mem &#124; Used-Mem &#124; -------------------------------------------------------------------------------- ------------------- &#124;   0 &#124; SELECT [...]]]></description>
			<content:encoded><![CDATA[<p>What you see below is a common problem. Someone sends you (or posts to a forum) a wide execution plan, which is unreadable because of wrapped lines. For example, this one below:</p>
<pre>--------------------------------------------------------------------------------
-------------------

| Id  | Operation                   | Name                    | E-Rows |  OMem |
 1Mem | Used-Mem |

--------------------------------------------------------------------------------
-------------------

|   0 | SELECT STATEMENT            |                         |        |       |
 |          |

|   1 |  SORT AGGREGATE             |                         |      1 |       |
 |          |

|*  2 |   HASH JOIN                 |                         |     13 |  1102K|
 1102K|  355K (0)|

|*  3 |    HASH JOIN                |                         |     13 |   988K|
 988K|  367K (0)|

|*  4 |     HASH JOIN               |                         |     13 |   921K|
 921K|  621K (0)|

|*  5 |      HASH JOIN OUTER        |                         |     13 |   836K|
 836K| 1224K (0)|

|*  6 |       HASH JOIN             |                         |     13 |   821K|
 821K|  501K (0)|

|*  7 |        HASH JOIN            |                         |     13 |  1102K|
 1102K|  501K (0)|

|   8 |         MERGE JOIN CARTESIAN|                         |      1 |       |
 |          |

|*  9 |          TABLE ACCESS FULL  | PROFILE$                |      1 |       |
 |          |

|  10 |          BUFFER SORT        |                         |      1 | 73728 |
 73728 |          |

|* 11 |           TABLE ACCESS FULL | PROFILE$                |      1 |       |
 |          |

|* 12 |         TABLE ACCESS FULL   | USER$                   |     36 |       |
 |          |

|  13 |        TABLE ACCESS FULL    | PROFNAME$               |      1 |       |
 |          |

|* 14 |       TABLE ACCESS FULL     | RESOURCE_GROUP_MAPPING$ |      1 |       |
 |          |

|  15 |      TABLE ACCESS FULL      | TS$                     |      7 |       |
 |          |

|  16 |     TABLE ACCESS FULL       | TS$                     |      7 |       |
 |          |

|  17 |    TABLE ACCESS FULL        | USER_ASTATUS_MAP        |      9 |       |
 |          |

--------------------------------------------------------------------------------
-------------------
</pre>
<p>So now you either try to manually edit and fix the execution plan text so you could read it or ask the developer to send the execution plan again. Both approaches take time.</p>
<p>Well, sometimes things are easy &#8211; in this particular case I saved the above into a file called /tmp/x and ran the following command:</p>
<pre>$ cat /tmp/x | <strong>awk '{ printf "%s", $0 ; if (NR % 3 == 0) print } END { print }'</strong>
---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name                    | E-Rows |  OMem | 1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                         |        |       | |          |
|   1 |  SORT AGGREGATE             |                         |      1 |       | |          |
|*  2 |   HASH JOIN                 |                         |     13 |  1102K| 1102K|  355K (0)|
|*  3 |    HASH JOIN                |                         |     13 |   988K| 988K|  367K (0)|
|*  4 |     HASH JOIN               |                         |     13 |   921K| 921K|  621K (0)|
|*  5 |      HASH JOIN OUTER        |                         |     13 |   836K| 836K| 1224K (0)|
|*  6 |       HASH JOIN             |                         |     13 |   821K| 821K|  501K (0)|
|*  7 |        HASH JOIN            |                         |     13 |  1102K| 1102K|  501K (0)|
|   8 |         MERGE JOIN CARTESIAN|                         |      1 |       | |          |
|*  9 |          TABLE ACCESS FULL  | PROFILE$                |      1 |       | |          |
|  10 |          BUFFER SORT        |                         |      1 | 73728 | 73728 |          |
|* 11 |           TABLE ACCESS FULL | PROFILE$                |      1 |       | |          |
|* 12 |         TABLE ACCESS FULL   | USER$                   |     36 |       | |          |
|  13 |        TABLE ACCESS FULL    | PROFNAME$               |      1 |       | |          |
|* 14 |       TABLE ACCESS FULL     | RESOURCE_GROUP_MAPPING$ |      1 |       | |          |
|  15 |      TABLE ACCESS FULL      | TS$                     |      7 |       | |          |
|  16 |     TABLE ACCESS FULL       | TS$                     |      7 |       | |          |
|  17 |    TABLE ACCESS FULL        | USER_ASTATUS_MAP        |      9 |       | |          |
---------------------------------------------------------------------------------------------------</pre>
<p>All I did here was strip the line feeds from all lines except every third one (which is the real end of the original line).</p>
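<p>For readers more at home in Python than AWK, the same idea can be sketched as follows. This is a rough equivalent rather than a byte-for-byte port of the one-liner above (for instance, it does not emit a trailing newline), and the function name and the tiny sample are made up.</p>

```python
def unwrap(text, lines_per_row=3):
    """Join every `lines_per_row` physical lines back into one logical line."""
    lines = text.splitlines()
    rows = []
    for i in range(0, len(lines), lines_per_row):
        rows.append("".join(lines[i:i + lines_per_row]))
    return "\n".join(rows)

# Tiny made-up sample in the same shape as the wrapped plan:
# first part, wrapped remainder, then an empty line.
wrapped = "|   0 | SELECT\n STATEMENT |\n\n|   1 | SORT\n AGGREGATE |\n"
print(unwrap(wrapped))
```

<p>As with the AWK version, the group size of 3 is specific to this particular wrapping and would need adjusting for other inputs.</p>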
<p>Note that if your linesize is very wide (and trimspool/trimout settings are ON) then this script would need some adjustment&#8230;</p>
<p>I&#8217;m sure this trivial approach doesn&#8217;t work in all situations, but with this article I wanted to illustrate that sometimes things which seem hard can be made much easier with a little scripting knowledge. If you are wondering which technology to learn next, check out a Perl, Python or shell+AWK book :)</p>
<p>By the way, if you want real flexibility displaying your execution plans (from library cache), then check this out:</p>
<p><a href="http://blog.tanelpoder.com/wp-content/plugins/wordpress-feed-statistics/feed-statistics.php?url=aHR0cDovL2Jsb2cudGFuZWxwb2Rlci5jb20vMjAwOS8wNS8yNi9zY3JpcHRzLWZvci1zaG93aW5nLWV4ZWN1dGlvbi1wbGFucy12aWEtcGxhaW4tc3FsLWFuZC1hbHNvLWluLW9yYWNsZS05aS8=">http://blog.tanelpoder.com/2009/05/26/scripts-for-showing-execution-plans-via-plain-sql-and-also-in-oracle-9i/</a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.tanelpoder.com/2010/01/18/sometimes-things-are-easy-part-1-how-to-fix-wrapped-execution-plan-text/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Identifying shared memory segment users using lsof</title>
		<link>http://blog.tanelpoder.com/2009/01/22/identifying-shared-memory-segment-users-using-lsof/</link>
		<comments>http://blog.tanelpoder.com/2009/01/22/identifying-shared-memory-segment-users-using-lsof/#comments</comments>
		<pubDate>Thu, 22 Jan 2009 12:42:45 +0000</pubDate>
		<dc:creator>Tanel Poder</dc:creator>
				<category><![CDATA[Administration]]></category>
		<category><![CDATA[Networking]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Tools]]></category>
		<category><![CDATA[Troubleshooting]]></category>
		<category><![CDATA[Unix/Linux]]></category>

		<guid isPermaLink="false">http://blog.tanelpoder.com/?p=192</guid>
		<description><![CDATA[Lsof (list open files) is a really useful tool for troubleshooting open file descriptors which prevent a deleted file from being released or a shared memory segment from being removed. Here&#8217;s a little situation on Linux where an Oracle shared memory segment was not released as someone was still using it. $ ipcs -ma ------ [...]]]></description>
			<content:encoded><![CDATA[<p>Lsof (list open files) is a really useful tool for troubleshooting open file descriptors which prevent a deleted file from being released or a shared memory segment from being removed.</p>
<p>Here&#8217;s a little situation on Linux where an Oracle shared memory segment was not released as someone was still using it.</p>
<pre><code>$ ipcs -ma

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
<b>0x00000000 393216     oracle    640        289406976  1          dest
</b>0xbfb94e30 425985     oracle    640        289406976  18
0x3cf13430 557058     oracle    660        423624704  22

------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0xe2260ff0 1409024    oracle    640        154
0x9df96b74 1671169    oracle    660        154

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages

</code></pre>
<p>The bold line should have disappeared after instance shutdown, but it didn&#8217;t. From the &#8220;nattch&#8221; (number of attached processes) column I could see that some process was still using the shared memory segment. Thus the segment was not released, and even the <b>ipcrm</b> command did not remove it (just like with normal files when someone has them open).</p>
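<p>Spotting such leftover segments can be scripted. The sketch below parses text in the column layout shown above (key, shmid, owner, perms, bytes, nattch, status) and reports segments that are marked <b>dest</b> but still have attached processes. The layout is an assumption taken from this one listing and may differ between platforms and ipcs versions; the function name is made up.</p>

```python
def orphaned_segments(ipcs_output):
    """Return (shmid, nattch) for segments marked 'dest' that still have attachments."""
    found = []
    for line in ipcs_output.splitlines():
        fields = line.split()
        # Data rows look like: key shmid owner perms bytes nattch [status]
        if len(fields) == 7 and fields[0].startswith("0x") and fields[6] == "dest":
            nattch = int(fields[5])
            if nattch:
                found.append((int(fields[1]), nattch))
    return found

sample = """\
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 393216     oracle    640        289406976  1          dest
0xbfb94e30 425985     oracle    640        289406976  18
0x3cf13430 557058     oracle    660        423624704  22
"""
print(orphaned_segments(sample))
```

<p>For the listing above this reports shmid 393216 with one attached process, which is the segment in question.</p>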
<p>So, I needed to identify which process was still using the memory segment. If it had been a normal existing file, I could have used the <b>/sbin/fuser</b> command to see which process still held it open, but this only works for existing files with existing directory entries.</p>
<p>However, for deleted files, sockets and shared memory segments, you can use the lsof command (it&#8217;s normally installed by default on Linux, but on other Unixes you need to download and install it separately).</p>
<p>The SHM ID of that segment was 393216, as ipcs -ma showed, so I simply ran lsof to show all open file descriptors and grepped for that SHM ID:</p>
<pre><code>$ lsof | egrep "393216|COMMAND"
COMMAND     PID      USER   FD      TYPE     DEVICE       SIZE       NODE NAME
python    18811    oracle  DEL       REG        0,8                393216 /SYSVbfb94e30

</code></pre>
<p>See how the NODE column corresponds to SHM ID in ipcs output.</p>
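<p>The egrep above matches the number anywhere on the line, so it could in principle also match an unrelated PID or size. A slightly stricter filter is sketched below. It assumes, based only on the single row shown here, that for System V segments the NAME column starts with /SYSV and that NODE is the column right before it; the function name is made up.</p>

```python
def shm_attachers(lsof_output, shmid):
    """Return (command, pid) for lsof rows whose NODE equals `shmid`
    and whose NAME looks like a System V shared memory segment."""
    hits = []
    for line in lsof_output.splitlines():
        fields = line.split()
        if len(fields) > 3 and fields[-1].startswith("/SYSV") and fields[-2] == str(shmid):
            hits.append((fields[0], int(fields[1])))
    return hits

sample = """\
COMMAND     PID      USER   FD      TYPE     DEVICE       SIZE       NODE NAME
python    18811    oracle  DEL       REG        0,8                393216 /SYSVbfb94e30
"""
print(shm_attachers(sample, 393216))
```

<p>Counting columns from the right sidesteps the empty SIZE column, which would otherwise shift the fields when splitting on whitespace.</p>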
<p>So I killed PID 18811, which was still attached to the SHM segment:</p>
<pre><code>$ kill 18811

$ ipcs -ma

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0xbfb94e30 425985     oracle    640        289406976  18
0x3cf13430 557058     oracle    660        423624704  25

------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0xe2260ff0 1409024    oracle    640        154
0x9df96b74 1671169    oracle    660        154

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages

</code></pre>
<p>Now the shared memory segment is gone and its memory released.</p>
<p>Note that the <b>lsof</b> command is very useful for many other tasks as well. For example, it lets you list open sockets by network protocol, IP, port etc., so you can determine from the OS level which client a given server process is talking to:</p>
<pre><code>$ <b>lsof -i:1521</b>
COMMAND   PID   USER   FD   TYPE DEVICE SIZE NODE NAME
tnslsnr  6212 oracle   11u  IPv4  49486       TCP *:1521 (LISTEN)
tnslsnr  6212 oracle   13u  IPv4 276708       TCP linux03:1521->linux03:37277 (ESTABLISHED)
tnslsnr  6212 oracle   14u  IPv4 264894       TCP linux03:1521->linux03:41122 (ESTABLISHED)
oracle  22687 oracle   20u  IPv4 264893       TCP linux03:41122->linux03:1521 (ESTABLISHED)
oracle  25250 oracle   15u  IPv4 276707       TCP linux03:37277->linux03:1521 (ESTABLISHED)
oracle  25530 oracle   15u  IPv4 279910       TCP linux03:1521->192.168.247.1:nimsh (ESTABLISHED)

</code></pre>
<p>Unfortunately lsof is not installed by default on classic Unixes, although in some shops the sysadmins have chosen to install it. Even then, it may not work for regular users, as lsof requires access to kernel memory structures through /dev/kmem or similar. If you can&#8217;t get access to lsof, there may be other tools available which can do some of the tricks lsof can do. For example, on Solaris there&#8217;s a useful command <b>pfiles</b> which can show the open files of a process, and since Solaris 9 (I think) it can also report the TCP connection endpoints of network sockets&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.tanelpoder.com/2009/01/22/identifying-shared-memory-segment-users-using-lsof/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Reliable latch waits and a new blog</title>
		<link>http://blog.tanelpoder.com/2009/01/20/reliable-latch-waits-and-a-new-blog/</link>
		<comments>http://blog.tanelpoder.com/2009/01/20/reliable-latch-waits-and-a-new-blog/#comments</comments>
		<pubDate>Tue, 20 Jan 2009 09:12:16 +0000</pubDate>
		<dc:creator>Tanel Poder</dc:creator>
				<category><![CDATA[Internals]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Unix/Linux]]></category>

		<guid isPermaLink="false">http://blog.tanelpoder.com/2009/01/20/reliable-latch-waits-and-a-new-blog/</guid>
		<description><![CDATA[Here&#8217;s a link to Alex Fatkulin&#8217;s blog if you haven&#8217;t seen it already: http://afatkulin.blogspot.com/ He has some good Oracle internals information in there, I also like his research style. Alex just blogged about a finding (on Oracle 11g on Linux) that when Oracle process doesn&#8217;t get a latch after spinning, it goes to sleep using [...]]]></description>
			<content:encoded><![CDATA[<p>Here&#8217;s a link to Alex Fatkulin&#8217;s blog if you haven&#8217;t seen it already: <a href="http://blog.tanelpoder.com/wp-content/plugins/wordpress-feed-statistics/feed-statistics.php?url=aHR0cDovL2FmYXRrdWxpbi5ibG9nc3BvdC5jb20v">http://afatkulin.blogspot.com/</a></p>
<p>He has some good Oracle internals information in there, I also like his research style.</p>
<p>Alex just blogged about <a href="http://blog.tanelpoder.com/wp-content/plugins/wordpress-feed-statistics/feed-statistics.php?url=aHR0cDovL2FmYXRrdWxpbi5ibG9nc3BvdC5jb20vMjAwOS8wMS9sb25naG9sZC1sYXRjaC13YWl0cy1vbi1saW51eC5odG1s">a finding</a> (on Oracle 11g on Linux) that when an Oracle process doesn&#8217;t get a latch after spinning, it goes to sleep using the <strong>semop()</strong> system call and does not wake up until the semaphore is posted by another process. In past versions, Oracle processes went to sleep for a short period of time, woke up, tried to get the latch and slept again for a longer period of time if unsuccessful (up to _max_exponential_sleep centiseconds). This kind of sleeping with a timeout is done using the sem<strong>timed</strong>op() syscall on Linux.</p>
<p>Also, latch wait posting was available for some latches. If a process failed to get a latch, it put a pointer to its state object into the <em>waiter list</em> for that latch and went to sleep for some centiseconds. When the latch holder released the latch, it scanned through the waiter list for that latch and posted the waiters, so that they would not have to sleep until the end of their <em>x</em> centisecond sleep. This used to be controllable using the <em>_latch_wait_posting</em> parameter, but since 9i this parameter has been removed and most latches have wait posting enabled by default.</p>
<p>With semaphores and posting, most Unixes have had problems in the past with missed posts/wakeups, sometimes due to bugs, sometimes just due to the implementation of signals and semaphore operations in the Unix kernel. That&#8217;s why past Oracle versions always had some kind of timeout for latch sleeps (and for most enqueue sleeps and buffer busy waits as well, as a matter of fact). If a process manages to miss the wakeup call, it will wake up after some timeout anyway. Performance suffers, but at least the process won&#8217;t hang forever. This was achieved using the sem<strong>timed</strong>op() system call (on Linux).</p>
<p>So, how come Alex saw just semop() calls in <a href="http://blog.tanelpoder.com/wp-content/plugins/wordpress-feed-statistics/feed-statistics.php?url=aHR0cDovL2FmYXRrdWxpbi5ibG9nc3BvdC5jb20vMjAwOS8wMS9sb25naG9sZC1sYXRjaC13YWl0cy1vbi1saW51eC5odG1s">his test</a>?</p>
<p>The answer is that apparently the minimum Linux kernel version required for Oracle 10.2+ supports reliable posting using semaphores, so Oracle is making use of this.</p>
<p>There is a new parameter called <strong>_enable_reliable_latch_waits</strong> in 10.2+ and (at least) on Linux it is true. When it&#8217;s true, Oracle trusts that all wakeup calls (through semaphores) are received by the sleeping processes, thus there&#8217;s no need to periodically wake up to check whether the latch has become &#8220;available&#8221;.</p>
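<p>The difference between the two sleeping styles can be sketched with a small Python analogy. This is not Oracle code: the function names, the thread-based poster and the timings are all assumptions made for illustration. A blocking acquire stands in for semop(), and an acquire with a timeout in a retry loop stands in for semtimedop().</p>

```python
import threading


def wait_for_post(sem, reliable, timeout_s=0.3):
    """Sleep until the semaphore is posted; return how many sleeps it took.

    reliable=True  : one blocking acquire (analogous to semop()).
    reliable=False : acquire with a timeout in a loop (analogous to semtimedop()).
    """
    sleeps = 0
    while True:
        sleeps += 1
        if reliable:
            sem.acquire()                    # blocks until another thread posts
            return sleeps
        if sem.acquire(timeout=timeout_s):   # True when posted, False on timeout
            return sleeps
        # Timed out: a real latch waiter would retry the latch get here.


def demo(reliable, post_after_s=0.25, timeout_s=0.1):
    """Hypothetical demo: another thread posts the semaphore after a delay."""
    sem = threading.Semaphore(0)
    poster = threading.Timer(post_after_s, sem.release)
    poster.start()
    sleeps = wait_for_post(sem, reliable, timeout_s)
    poster.join()
    return sleeps


print("reliable waits:", demo(True))
print("timed waits:   ", demo(False))
```

<p>With reliable posting the waiter sleeps once. With timeouts it wakes up periodically even though nothing has changed, which is the extra work that reliable latch waits avoid.</p>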
<p>Here&#8217;s a little test case:<br />
<span id="more-190"></span></p>
<p>I ran my <a href="http://blog.tanelpoder.com/wp-content/plugins/wordpress-feed-statistics/feed-statistics.php?url=aHR0cDovL3d3dy50YW5lbHBvZGVyLmNvbS9maWxlcy9zY3JpcHRzL2xvdHNocGFyc2VzLnNxbA==">lotshparses.sql</a> script in three sqlplus sessions (on a 2-CPU machine) to create some shared pool latch contention (and sleeping for the latches).</p>
<p>Then I used strace to see what kind of semaphore operations one of the processes is using.</p>
<p>This is what I see when _enable_reliable_latch_waits = true ( the default in 10.2+, <em>reliable wakeups</em> )</p>
<p>$ <strong>strace -e trace=semop,semtimedop</strong> -p 15423<br />
Process 15423 attached &#8211; interrupt to quit<br />
<strong>semop</strong>(622592, 0xbf8afc70, 1) = 0<br />
semop(622592, 0xbf8b0f30, 1) = 0<br />
semop(622592, 0xbf8afc1c, 1) = 0<br />
semop(622592, 0xbf8b0f30, 1) = 0<br />
semop(622592, 0xbf8b0f30, 1) = 0<br />
semop(622592, 0xbf8b0f30, 1) = 0<br />
semop(622592, 0xbf8afc70, 1) = 0</p>
<p>This is what I see when I set _enable_reliable_latch_waits = false ( old fashioned behaviour, <em>non-reliable wakeups</em>, thus need to wake up every <em>300000000</em> nanoseconds ):</p>
<p>$ strace -e trace=semop,semtimedop -p 15423<br />
Process 15423 attached &#8211; interrupt to quit<br />
<strong>semtimedop</strong>(622592, 0xbf8af4b8, 1, {0, 300000000}) = 0<br />
semtimedop(622592, 0xbf8aef10, 1, {0, 300000000}) = 0<br />
semtimedop(622592, 0xbf8ac9c4, 1, {0, 300000000}) = 0<br />
semtimedop(622592, 0xbf8ac9c4, 1, {0, 300000000}) = 0<br />
semtimedop(622592, 0xbf8b23f8, 1, {0, 300000000}) = 0<br />
semtimedop(622592, 0xbf8b083c, 1, {0, 300000000}) = 0<br />
semtimedop(622592, 0xbf8b083c, 1, {0, 300000000}) = 0<br />
semtimedop(622592, 0xbf8b083c, 1, {0, 300000000}) = 0<br />
semtimedop(622592, 0xbf8b0f30, 1, {0, 300000000}) = 0<br />
semtimedop(622592, 0xbf8b0468, 1, {0, 300000000}) = 0</p>
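<p>If you capture a longer strace output into a file, counting the calls by hand gets tedious. Here is a minimal Python sketch that tallies syscall names and collects the distinct timeouts. The regular expressions only assume the line format shown above, and the function name is made up for this example.</p>

```python
import re
from collections import Counter

# Matches lines such as: semtimedop(622592, 0xbf8af4b8, 1, {0, 300000000}) = 0
SYSCALL_RE = re.compile(r"^\s*(\w+)\(.*\)\s*=\s*(-?\d+)")
# Matches the trailing {seconds, nanoseconds} timeout argument, if present
TIMEOUT_RE = re.compile(r"\{(\d+),\s*(\d+)\}\)")


def summarize_strace(lines):
    """Count calls per syscall name and collect distinct timeouts in seconds."""
    calls = Counter()
    timeouts = set()
    for line in lines:
        m = SYSCALL_RE.match(line)
        if not m:
            continue  # skip "Process ... attached" and other noise
        calls[m.group(1)] += 1
        t = TIMEOUT_RE.search(line)
        if t:
            timeouts.add(int(t.group(1)) + int(t.group(2)) / 1e9)
    return calls, timeouts


sample = [
    "Process 15423 attached - interrupt to quit",
    "semtimedop(622592, 0xbf8af4b8, 1, {0, 300000000}) = 0",
    "semtimedop(622592, 0xbf8aef10, 1, {0, 300000000}) = 0",
    "semop(622592, 0xbf8afc70, 1) = 0",
]
calls, timeouts = summarize_strace(sample)
print(dict(calls), sorted(timeouts))
```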
]]></content:encoded>
			<wfw:commentRss>http://blog.tanelpoder.com/2009/01/20/reliable-latch-waits-and-a-new-blog/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Oracle memory troubleshooting, Part 1: Heapdump Analyzer</title>
		<link>http://blog.tanelpoder.com/2009/01/02/oracle-memory-troubleshooting-part-1-heapdump-analyzer/</link>
		<comments>http://blog.tanelpoder.com/2009/01/02/oracle-memory-troubleshooting-part-1-heapdump-analyzer/#comments</comments>
		<pubDate>Fri, 02 Jan 2009 11:55:36 +0000</pubDate>
		<dc:creator>Tanel Poder</dc:creator>
				<category><![CDATA[Internals]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Tools]]></category>
		<category><![CDATA[Troubleshooting]]></category>
		<category><![CDATA[Unix/Linux]]></category>

		<guid isPermaLink="false">http://blog.tanelpoder.com/2009/01/02/oracle-memory-troubleshooting-part-1-heapdump-analyzer/</guid>
		<description><![CDATA[When troubleshooting Oracle process memory issues like ORA-4030&#8242;s or just excessive memory usage, you may want to get a detailed breakdown of PGA, UGA and Call heaps to see which component in there is the largest one. The same goes for shared pool memory issues and ORA-4031&#8242;s &#8211; sometimes you need to dump the shared [...]]]></description>
			<content:encoded><![CDATA[<p>When troubleshooting Oracle process memory issues like ORA-4030&#8242;s or just excessive memory usage, you may want to get a detailed breakdown of PGA, UGA and Call heaps to see which component in there is the largest one.</p>
<p>The same goes for shared pool memory issues and ORA-4031&#8242;s &#8211; sometimes you need to dump the shared pool heap metadata for understanding what kind of allocations take most of space in there.</p>
<p>The heap dumping can be done using a HEAPDUMP event, see <a href="http://blog.tanelpoder.com/wp-content/plugins/wordpress-feed-statistics/feed-statistics.php?url=aHR0cDovL3d3dy5qdWxpYW5keWtlLmNvbS9EaWFnbm9zdGljcy9EdW1wcy9EdW1wcy5odG1s">http://www.juliandyke.com/Diagnostics/Dumps/Dumps.html</a> for syntax.</p>
<p><strong>NB! </strong>Note that when dumping SGA heaps (like shared, large, java and streams pools), your process holds shared pool latches for the entire dump duration so this should be used only as a last resort in busy production instances. Dumping a big shared pool could hang your instance for quite some time. Dumping private process heaps is safer as that way only the target process is affected.</p>
<p>The heapdump output file structure is actually very simple, all you need to look at is the HEAP DUMP header to see in which heap the following chunks of memory belong (as there may be multiple heaps dumped into a single tracefile).</p>
<pre><code>HEAP DUMP heap name="<strong>sga heap(1,1)</strong>"  desc=04EA22D0
 extent sz=0xfc4 alt=108 het=32767 rec=9 flg=-125 opc=0
 parent=00000000 owner=00000000 nex=00000000 xsz=0x400000
EXTENT 0 addr=20800000
  <strong>Chunk 20800038 sz=   374904    free      "               "</strong>
  Chunk 2085b8b0 sz=      540    recreate  "KGL handles    "  latch=00000000
  Chunk 2085bacc sz=      540    recreate  "KGL handles    "  latch=00000000
  Chunk 2085bce8 sz=     1036    freeable  "parameter table"
  Chunk 2085c0f4 sz=     1036    freeable  "parameter table"
  Chunk 2085c500 sz=     1036    freeable  "parameter table"
  Chunk 2085c90c sz=     1036    freeable  "parameter table"
  Chunk 2085cd18 sz=     1036    freeable  "parameter table"
  Chunk 2085d124 sz=      228    recreate  "KGL handles    "  latch=00000000
  Chunk 2085d208 sz=      228    recreate  "KGL handles    "  latch=00000000
  Chunk 2085d2ec sz=      228    recreate  "KGL handles    "  latch=00000000
  Chunk 2085d3d0 sz=      228    recreate  "KGL handles    "  latch=00000000
  Chunk 2085d4b4 sz=      228    recreate  "KGL handles    "  latch=00000000
  Chunk 2085d598 sz=      540    recreate  "KQR PO         "  latch=2734AA00
  Chunk 2085d7b4 sz=      540    recreate  "KQR PO         "  latch=2734AA00
  Chunk 2085d9d0 sz=      228    recreate  "KGL handles    "  latch=00000000
...
</code></pre>
<p>The first list of chunks after HEAP DUMP (the list above) is the list of all chunks in the heap. There are more lists, such as freelists and LRU lists, in a regular heap, but let&#8217;s ignore those for now. I&#8217;ll write more about heaps in an upcoming post.</p>
<p>After identifying the heap name from the HEAP DUMP line, you can see all individual chunks from the &#8220;Chunk&#8221; lines. The second column after Chunk shows the start address of a chunk, <em>sz=</em> means chunk size, and the next column shows the type of a chunk (free, freeable, recreate, perm, R-free, R-freeable).</p>
<p>The next column is the important one for troubleshooting: it shows the reason why a chunk was allocated (such as <em>KGL handles</em> for library cache handles, <em>KQR PO</em> for dictionary cache parent objects etc). Every chunk in a heap has a fixed 16 byte area in the chunk header which stores the allocation reason (comment) of the chunk. Whenever a client layer (calling a kghal* chunk allocation function) allocates heap memory, it needs to pass in a comment of up to 16 bytes, which is stored in the newly allocated chunk header.</p>
<p>This gives us a simple technique for troubleshooting memory leaks and other memory allocation problems. When you have memory issues, you can just dump the sizes of all the chunks in the heap and aggregate them by allocation reason/comment. That shows you the biggest heap occupier and gives further hints on where to look next.</p>
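<p>The aggregation described above can be sketched in a few lines of Python. This is an illustration only, not the heapdump_analyzer script itself: the regular expressions assume just the HEAP DUMP and Chunk line formats shown in the excerpt above, and the sketch makes no attempt to skip the freelist and LRU sections of a real dump.</p>

```python
import re
from collections import defaultdict

HEAP_RE = re.compile(r'^HEAP DUMP heap name="([^"]*)"')
CHUNK_RE = re.compile(r'^\s*Chunk\s+\w+\s+sz=\s*(\d+)\s+(\S+)\s+"([^"]*)"')


def aggregate_chunks(lines):
    """Return {(heap, chunk_type, reason): [total_bytes, chunk_count]}."""
    totals = defaultdict(lambda: [0, 0])
    heap = None
    for line in lines:
        h = HEAP_RE.match(line)
        if h:
            heap = h.group(1)      # the following chunks belong to this heap
            continue
        c = CHUNK_RE.match(line)
        if c:
            size, ctype, reason = int(c.group(1)), c.group(2), c.group(3).strip()
            entry = totals[(heap, ctype, reason)]
            entry[0] += size
            entry[1] += 1
    return dict(totals)


sample = [
    'HEAP DUMP heap name="sga heap(1,1)"  desc=04EA22D0',
    'EXTENT 0 addr=20800000',
    '  Chunk 20800038 sz=   374904    free      "               "',
    '  Chunk 2085b8b0 sz=      540    recreate  "KGL handles    "  latch=00000000',
    '  Chunk 2085bacc sz=      540    recreate  "KGL handles    "  latch=00000000',
    '  Chunk 2085bce8 sz=     1036    freeable  "parameter table"',
]
totals = aggregate_chunks(sample)
# Print the biggest consumers first
for key, (total, count) in sorted(totals.items(), key=lambda kv: kv[1][0], reverse=True):
    print(total, count, key)
```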
<p>As there can be lots of chunks in large heaps, aggregating the data manually would be time consuming (and boring). Here&#8217;s a little shell script which can summarize Oracle heapdump output tracefile contents for you:</p>
<p><span id="more-173"></span></p>
<pre><code><a href="http://blog.tanelpoder.com/wp-content/plugins/wordpress-feed-statistics/feed-statistics.php?url=aHR0cDovL3d3dy50YW5lbHBvZGVyLmNvbS9maWxlcy9zY3JpcHRzL2hlYXBkdW1wX2FuYWx5emVy">http://www.tanelpoder.com/files/scripts/heapdump_analyzer</a>

</code></pre>
<p>After taking a heapdump, you just run the script against the tracefile to get a heap summary: total allocation sizes grouped by parent heap, chunk comment and chunk size.</p>
<pre><code>heapdump_analyzer <em>tracefile.trc</em></code></pre>
<p>Here&#8217;s an example of a shared pool dump analysis (heapdump at level 2):</p>
<pre><code>SQL&gt; alter session set events 'immediate trace name heapdump level 2';

Session altered.

SQL&gt; exit
...

$ <strong>heapdump_analyzer</strong> lin10g_ora_7145.trc

  -- Heapdump Analyzer v1.00 by Tanel Poder ( http://www.tanelpoder.com )

  Total_size #Chunks  Chunk_size,        From_heap,       Chunk_type,  Alloc_reason
  ---------- ------- ------------ ----------------- ----------------- -----------------
    <strong>11943936       3    3981312 ,    sga heap(1,3),             free,
</strong>     3981244       1    3981244 ,    sga heap(1,0),             perm,  perm
     3980656       1    3980656 ,    sga heap(1,0),             perm,  perm
     3980116       1    3980116 ,    sga heap(1,0),             perm,  perm
     3978136       1    3978136 ,    sga heap(1,0),             perm,  perm
     3977156       1    3977156 ,    sga heap(1,1),         recreate,  KSFD SGA I/O b
     3800712       1    3800712 ,    sga heap(1,0),             perm,  perm
     3680560       1    3680560 ,    sga heap(1,0),             perm,  perm
     3518780       1    3518780 ,    sga heap(1,0),             perm,  perm
     3409016       1    3409016 ,    sga heap(1,0),             perm,  perm
     3394124       1    3394124 ,    sga heap(1,0),             perm,  perm
     2475420       1    2475420 ,    sga heap(1,1),             free,
     2319892       1    2319892 ,    sga heap(1,3),             free,
     2084864     509       4096 ,    sga heap(1,3),         freeable,  sql area
...

</code></pre>
<p>It shows that the biggest component in the shared pool is 11943936 bytes. It consists of 3 <em>free</em> chunks, which reside in shared pool subpool 1 and sub-sub-pool 3 (see the <em>sga heap(1,3)</em> section).</p>
<p>Note that my script is very simple as of now: it reports different sized chunks on different lines, so you may still need to do some manual aggregation if there&#8217;s no obvious troublemaker at the top of the list.</p>
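<p>That extra aggregation step could look roughly like the Python sketch below, which sums the analyzer output lines over chunk sizes, keyed by heap, chunk type and allocation reason. It is not part of heapdump_analyzer, and the regular expression assumes only the column layout printed above (note that heap names such as <em>sga heap(1,3)</em> contain a comma themselves).</p>

```python
import re
from collections import defaultdict

# Total_size #Chunks Chunk_size , From_heap, Chunk_type, Alloc_reason
ROW_RE = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s*,\s*(.+?),\s+(\S+),\s*(.*)$")


def rollup(lines):
    """Sum total size and chunk count per (heap, chunk_type, reason)."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        m = ROW_RE.match(line)
        if not m:
            continue  # header, separator and banner lines
        total_size, chunks = int(m.group(1)), int(m.group(2))
        key = (m.group(4).strip(), m.group(5), m.group(6).strip())
        totals[key][0] += total_size
        totals[key][1] += chunks
    return dict(totals)


sample = [
    "  Total_size #Chunks  Chunk_size,        From_heap,       Chunk_type,  Alloc_reason",
    "    11943936       3    3981312 ,    sga heap(1,3),             free,",
    "     3981244       1    3981244 ,    sga heap(1,0),             perm,  perm",
    "     3980656       1    3980656 ,    sga heap(1,0),             perm,  perm",
    "     2319892       1    2319892 ,    sga heap(1,3),             free,",
]
summary = rollup(sample)
for key, (total, count) in sorted(summary.items(), key=lambda kv: kv[1][0], reverse=True):
    print(total, count, key)
```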
<p>Here&#8217;s an example of a summarized heapdump level 29 ( PGA + UGA + call heaps ):</p>
<pre><code>$ heapdump_analyzer lin10g_ora_7145_0002.trc

  -- Heapdump Analyzer v1.00 by Tanel Poder ( http://www.tanelpoder.com )

  Total_size #Chunks  Chunk_size,        From_heap,       Chunk_type,  Alloc_reason
  ---------- ------- ------------ ----------------- ----------------- -----------------
     7595216     116      65476 ,     top uga heap,         freeable,  session heap
     6779640     105      64568 ,     session heap,         freeable,  kxs-heap-w
     2035808       8     254476 ,         callheap,         freeable,  kllcqas:kllsltb
     1017984       4     254496 ,    top call heap,         freeable,  callheap
      987712       8     123464 ,     top uga heap,         freeable,  session heap
      987552       8     123444 ,     session heap,         freeable,  kxs-heap-w
      196260       3      65420 ,     session heap,         freeable,  kxs-heap-w
      159000       5      31800 ,     session heap,         freeable,  kxs-heap-w
      112320      52       2160 ,         callheap,             free,
       93240     105        888 ,     session heap,             free,
       82200       5      16440 ,     session heap,         freeable,  kxs-heap-w
       65476       1      65476 ,     top uga heap,         recreate,  session heap
       65244       1      65244 ,    top call heap,             free,
       56680      26       2180 ,    top call heap,         freeable,  callheap
       55936       1      55936 ,     session heap,         freeable,  kxs-heap-w
...

</code></pre>
<p>You can also use <em>-t</em> option to show total heap sizes in the output (this total is not computed by my script, I just take the &#8220;Total&#8221; lines from the heapdump tracefile):</p>
<pre><code>$ <strong>heapdump_analyzer -t</strong> lin10g_ora_7145_0002.trc | grep Total
  Total_size #Chunks  Chunk_size,        From_heap,       Chunk_type,  Alloc_reason
     8714788       1    8714788 ,     top uga heap,            TOTAL,  Total heap size
     8653464       1    8653464 ,     session heap,            TOTAL,  Total heap size
     2169328       2    1084664 ,         callheap,            TOTAL,  Total heap size
     1179576       1    1179576 ,    top call heap,            TOTAL,  Total heap size
      191892       1     191892 ,         pga heap,            TOTAL,  Total heap size

</code></pre>
<p><strong>References:<br />
</strong></p>
<ul>
<li>Metalink note 396940.1 &#8211; <a href="http://blog.tanelpoder.com/wp-content/plugins/wordpress-feed-statistics/feed-statistics.php?url=aHR0cHM6Ly9tZXRhbGluazIub3JhY2xlLmNvbS9tZXRhbGluay9wbHNxbC9tbDJfZG9jdW1lbnRzLnNob3dEb2N1bWVudD9wX2RhdGFiYXNlX2lkPU5PVCZhbXA7cF9pZD0zOTY5NDAuMQ==">Troubleshooting and Diagnosing ORA-4031 Error</a></li>
<li>Heapdump syntax &#8211; <a href="http://blog.tanelpoder.com/wp-content/plugins/wordpress-feed-statistics/feed-statistics.php?url=aHR0cDovL3d3dy5qdWxpYW5keWtlLmNvbS9EaWFnbm9zdGljcy9EdW1wcy9EdW1wcy5odG1s">http://www.juliandyke.com/Diagnostics/Dumps/Dumps.html</a></li>
<li>Heapdump analyzer &#8211; <a href="http://blog.tanelpoder.com/wp-content/plugins/wordpress-feed-statistics/feed-statistics.php?url=aHR0cDovL3d3dy50YW5lbHBvZGVyLmNvbS9maWxlcy9zY3JpcHRzL2hlYXBkdW1wX2FuYWx5emVy">http://www.tanelpoder.com/files/scripts/heapdump_analyzer</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://blog.tanelpoder.com/2009/01/02/oracle-memory-troubleshooting-part-1-heapdump-analyzer/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Advanced Oracle Troubleshooting Guide, Part 6: Understanding Oracle execution plans with os_explain</title>
		<link>http://blog.tanelpoder.com/2008/06/15/advanced-oracle-troubleshooting-guide-part-6-understanding-oracle-execution-plans-with-os_explain/</link>
		<comments>http://blog.tanelpoder.com/2008/06/15/advanced-oracle-troubleshooting-guide-part-6-understanding-oracle-execution-plans-with-os_explain/#comments</comments>
		<pubDate>Sun, 15 Jun 2008 14:40:12 +0000</pubDate>
		<dc:creator>Tanel Poder</dc:creator>
				<category><![CDATA[Cool stuff]]></category>
		<category><![CDATA[Internals]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[Tools]]></category>
		<category><![CDATA[Troubleshooting]]></category>
		<category><![CDATA[Unix/Linux]]></category>

		<guid isPermaLink="false">http://tanelpoder.wordpress.com/2008/06/15/advanced-oracle-troubleshooting-guide-part-6-understanding-oracle-execution-plans-with-os_explain/</guid>
		<description><![CDATA[Get ready for some more adventures in Oracle process stack! Before proceeding though, please read this post about safety of different stack sampling approaches. I have had few non-trivial Oracle troubleshooting cases, related to query hangs and bad performance, where I&#8217;ve wanted to know where exactly in execution plan the current execution is. Remember, Oracle [...]]]></description>
			<content:encoded><![CDATA[<p>Get ready for some more adventures in Oracle process stack!</p>
<p>Before proceeding though, please read <a href="http://blog.tanelpoder.com/wp-content/plugins/wordpress-feed-statistics/feed-statistics.php?url=aHR0cDovL2Jsb2cudGFuZWxwb2Rlci5jb20vMjAwOC8wNi8xNC9kZWJ1Z2dlci1kYW5nZXJzLw==" target="_blank">this post</a> about the safety of different stack sampling approaches.</p>
<p>I have had a few non-trivial Oracle troubleshooting cases, related to query hangs and bad performance, where I&#8217;ve wanted to know exactly where in the execution plan the current execution is.<br />
Remember, Oracle is just another program executing instructions clustered in functions on your server, so stack sampling can help out here as well.</p>
<p>So, I was looking into the following stack trace taken from an Oracle 10.1 database on Solaris SPARC, running a SQL with <a href="http://blog.tanelpoder.com/wp-content/plugins/wordpress-feed-statistics/feed-statistics.php?url=aHR0cDovL3d3dy50YW5lbHBvZGVyLmNvbS9maWxlcy9zYW1wbGVzL29zX2V4cGxhaW5fcGxhbi50eHQ=" target="_blank">this execution plan</a>.</p>
<p><span id="more-81"></span></p>
<pre><code>$ cat pstack.txt
 ------------------------------------------------------------------------
 0000000101eca6e4 kdirfrs (ffffffff78eca5d8, 1, 0, 2, 86521ee00, 0) + 24
 0000000102502adc qerixFetchFastFullScan (ffffffff78eca3e0, 10248ebc0, 0, 1, 86521ebf8, fffd) + 33c
 0000000102558720 qergiFetch (ffffffff78ecaca8, 10248ebc0, ffffffff7fff5ee8, 10449dba0, 1, 86bb6bb38) + 80
 000000010248edd8 qerjoFetch (45, 10248ebc0, ffffffff7fff5ee8, ffffffff78ecad60, 86bb6bc08, 114) + 118
 000000010248eeb8 qerjoFetch (7ffe, 10248ebc0, ffffffff7fff5fe8, ffffffff78ecb5e8, 7f399b6f8, 7ffe) + 1f8
 00000001025171b8 rwsfcd (8177747a8, 10248e1e0, ffffffff7fff6168, 7fff, f0, 10449dba0) + 78
 000000010248e4dc qeruaFetch (745c0ab40, 10248e1e0, ffffffff7fff6168, ffffffff78ecb620, 10449dba8, 8177747a8) + 11c
 000000010248d840 qervwFetch (1d, 1d, ffffffff7fff6238, 7fff, 7cfc87c50, 10449d000) + a0
 00000001025171b8 rwsfcd (7876e8d18, 10248e1e0, ffffffff7fff63b8, 7fff, c0, 10449dba0) + 78
 000000010248e4dc qeruaFetch (817d1e438, 10248e1e0, ffffffff7fff63b8, ffffffff78e7d318, 10449dba8, 7876e8d18) + 11c
 000000010248d840 qervwFetch (1d, 1d, ffffffff7fff6488, 7fff, 82355a348, 10449d000) + a0
 00000001025171b8 rwsfcd (7454b0408, 10249e4e0, ffffffff7fff6608, 7fff, c0, 10449dba0) + 78
 00000001024a4620 qerhjFetch (86b1d64d8, 0, 0, 1, ffffffff78e7df08, 0) + 300
 000000010248f99c qerjotFetch (78d8ea1d8, 0, 0, 1, ffffffff78e7e8d0, 10449dba0) + dc
 000000010248eeb8 qerjoFetch (1, 10248ebc0, ffffffff7fff6838, ffffffff78e7f2b0, 7d53bb730, 1) + 1f8
 000000010248eeb8 qerjoFetch (1, 10248ebc0, ffffffff7fff6938, ffffffff7b9b2f78, 74b306ba0, 7fff) + 1f8
 000000010248d840 qervwFetch (5, 5, ffffffff7fff6a08, 7fff, 80b0faa18, 10449d000) + a0
 00000001025171b8 rwsfcd (865e988f8, 10249e4e0, ffffffff7fff6b88, 7fff, c0, 10449dba0) + 78
 00000001024a4620 qerhjFetch (7a57ae350, 10249e4e0, ffffffff7fff6d88, 1, ffffffff7bb89f00, 0) + 300
 00000001025171b8 rwsfcd (823642298, 10249e4e0, ffffffff7fff6d88, 7fff, 6f0, 10449dba0) + 78
 00000001024a4620 qerhjFetch (751158588, 102517980, 7511587f0, 1, ffffffff7b9b5090, 0) + 300
 000000010251d1f0 qersoFetch (94, 10506adf8, 10449d000, ffffffff7b9b53b8, 799854d60, 7511587f0) + 350
 0000000101aa6f24 opifch2 (7, f, 150, 1, 104400, 1050685e8) + a64
 0000000101aa6384 opifch (5, 2, ffffffff7fff79f8, 105000, 0, 10434c0a8) + 44
 0000000101ad81ec opipls (104000, 10434c, 1, ffffffff7bba3e6a, 0, 140010) + f4c
 00000001002d0058 opiodr (6, 10506ae10, 10434cfb0, 10506a, 105000, 104000) + 598
 00000001002d4ec0 rpidrus (ffffffff7fff8820, 105067f18, 105068860, ffffffff7bb5f560, 4a8c, 200000) + a0
 0000000102f615e4 skgmstack (ffffffff7fff8a48, ffffffff7f87cf8f, ffffffff7fff898f, 1002d4e20, ffffffff7fff8a70, acc01800) + a4
 00000001002d504c rpidru (ffffffff7fff9140, 10422b000, 10422a918, 104229598, 410, 82) + ac
 00000001002d4808 rpiswu2 (0, 104556000, ffffffff7fff8b88, 2, 104556418, ffffffff7fff92c0) + 1a8
 00000001002d61cc rpidrv (a, ffffffff7fff9044, 9, ffffffff7fff90c0, ffffffff7fff9140, 1002d5180) + 62c
 0000000102797f90 psddr0 (104400, 86e2fc628, ffffffff7fff92c0, 9, 1050769d8, 1050769d8) + 1f0
 0000000102798e04 psdnal (ffffffff7fffa0b8, ffffffff7fffa258, 105000, ffffffff7bc5eb00, 7f7aa86d0, 40) + 184
 000000010376d268 pevm_BFTCHC (0, 7f7aa86d0, 50, ffffffff7bb5f560, ffffffff7bc5eb00, 0) + 188
 000000010373dff4 pfrinstr_FTCHC (10, 15d0000000000000, ffffffff7bb5f5c8, ffffffff7bb5f560, 4a8c, ffffffff7bb66330) + b4
 00000001037362c8 pfrrun_no_tool (ffffffff7bb5f560, 779953684, ffffffff7bb5f5c8, 10457c9d8, 2001, 2001) + 48
 00000001037372d0 pfrrun (ffffffff7bb5f5c8, 200000, 0, 200000, ffffffff7bb5f560, 779abb110) + 2f0
 0000000103783374 plsql_run (ffffffff7bb55408, 1, 0, ffffdfff, ffffffff7fffa0b8, 0) + 274
 0000000103722554 peicnt (ffffffff7fffa0b8, 105068860, 6, ffffffff7fff9f28, 41d8, 1050685e8) + d4
 000000010327b784 kkxexe (105000, 104000, 105068, 104296000, 1050685e8, ffffffff7bb5f560) + 284
 0000000101ad0228 opiexe (4, ffffffff7bc68ee8, ffffffff7fffab00, 0, 0, ffffffff7bc70480) + 33c8
 0000000101a4c0a8 kpoal8 (40008, 1, ffffffff7fffd890, 0, 0, 4) + 648
 00000001002d0058 opiodr (14, 10506ae10, 10434ce70, 10506a, 105000, 104000) + 598
 0000000102cded94 ttcpip (105071450, 18, ffffffff7fffd890, ffffffff7fffcb88, 104229c98, ffffffff7fffcb84) + 694
 00000001002cd3e8 opitsk (1002cf000, 1, 0, ffffffff7fffd9e8, 105071450, 105071458) + 428
 0000000101aaf564 opiino (105070000, 1050683c0, 0, 0, f5, 105070290) + 404
 00000001002d0058 opiodr (4, 10506ae10, 10434c920, 10000, 105071, 105000) + 598
 00000001002cc174 opidrv (0, 4, 10506a, 105071450, 0, 3c) + 354
 00000001002c9828 sou2o (ffffffff7fffe6b8, 3c, 4, ffffffff7fffe698, 104aa6000, 104aa6) + 48
 00000001002a7b34 main (2, ffffffff7fffe798, ffffffff7fffe7b0, 0, 0, 100000000) + 94
 00000001002a7a7c _start (0, 0, 0, 0, 0, 0) + 17c

</code></pre>
<p>Not too encouraging, huh?</p>
<p>So, let&#8217;s run this stack trace through my <a href="http://blog.tanelpoder.com/wp-content/plugins/wordpress-feed-statistics/feed-statistics.php?url=aHR0cDovL3d3dy50YW5lbHBvZGVyLmNvbS9maWxlcy9zY3JpcHRzL29zX2V4cGxhaW4=" target="_blank">os_explain</a> script:</p>
<pre><code>$ ./os_explain pstack.txt
   SELECT FETCH:
    SORT: Fetch
     HASH JOIN: Fetch
    * HASH JOIN: Fetch
     * VIEW: Fetch
        NESTED LOOP OUTER: Fetch
         NESTED LOOP OUTER: Fetch
          NESTED LOOP JOIN: Fetch
           HASH JOIN: Fetch
          * VIEW: Fetch
             UNION-ALL: Fetch
            * VIEW: Fetch
               UNION-ALL: Fetch
              * NESTED LOOP OUTER: Fetch
                 NESTED LOOP OUTER: Fetch
                  GRANULE ITERATOR: Fetch
                   INDEX: FetchFastFullScan
                    kdirfrs

</code></pre>
<p>Now this is much more understandable!</p>
<p>All my script did was:</p>
<ul>
<li>remove the bottom part of the stack not relevant to plan execution</li>
<li>reverse the order of stack trace lines for better human readability</li>
<li>translate Query Execution Rowsource (qer) function prefixes to their corresponding names using info provided in Metalink note 175982.1</li>
</ul>
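<p>The three steps above can be sketched in Python as follows. This is not the os_explain script: the prefix table contains only the pairs that can be read off the sample output in this post, the handling of rwsfcd as the source of the asterisk is my inference from comparing the stack with that output, and the indentation is only approximate. The sample stack is a made-up, shortened one in the same format.</p>

```python
# Prefix-to-name pairs inferred only from the sample output in this post;
# a real translation table would be much longer.
QER_NAMES = {
    "qerso": "SORT",
    "qerhj": "HASH JOIN",
    "qervw": "VIEW",
    "qerjot": "NESTED LOOP JOIN",
    "qerjo": "NESTED LOOP OUTER",
    "qerua": "UNION-ALL",
    "qergi": "GRANULE ITERATOR",
    "qerix": "INDEX",
}


def function_names(pstack_lines):
    """Return the function names (second field) from pstack-style lines."""
    names = []
    for line in pstack_lines:
        fields = line.split()
        if len(fields) >= 2:
            try:
                int(fields[0], 16)      # first field must be a hex address
            except ValueError:
                continue
            names.append(fields[1])
    return names


def translate(name):
    """Map a qer* function to 'OPERATION: Suffix', trying longer prefixes first."""
    for prefix in sorted(QER_NAMES, key=len, reverse=True):
        if name.startswith(prefix):
            return QER_NAMES[prefix] + ": " + name[len(prefix):]
    return name


def explain(pstack_lines):
    names = function_names(pstack_lines)
    if "opifch2" not in names:
        return []
    # pstack prints the innermost frame first: keep the frames above opifch2
    # (dropping the bottom of the stack), then reverse them.
    plan_frames = names[:names.index("opifch2")][::-1]
    out, filtered, depth = ["SELECT FETCH:"], False, 1
    for name in plan_frames:
        if name == "rwsfcd":            # assumed filter step: star the next line
            filtered = True
            continue
        marker = "* " if filtered else ""
        out.append(" " * depth + marker + translate(name))
        filtered, depth = False, depth + 1
    return out


sample = [
    " 0000000101eca6e4 kdirfrs (...) + 24",
    " 0000000102502adc qerixFetchFastFullScan (...) + 33c",
    " 0000000102558720 qergiFetch (...) + 80",
    " 00000001025171b8 rwsfcd (...) + 78",
    " 00000001024a4620 qerhjFetch (...) + 300",
    " 000000010251d1f0 qersoFetch (...) + 350",
    " 0000000101aa6f24 opifch2 (...) + a64",
    " 0000000101aa6384 opifch (...) + 44",
]
result = explain(sample)
print("\n".join(result))
```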
<p>Easy :)</p>
<p>So, how to read this?</p>
<p>First, by now it should be obvious that in Oracle, each rowsource operator (the different lines you see in SQL execution plans) is actually just a function inside the Oracle kernel. These are the row source functions, starting with &#8220;qer&#8221;. QER stands for Query Execution Rowsource as far as I know.</p>
<p>Whenever an Oracle process fetches data from a cursor, it calls opifch2() which in turn calls the root rowsource function in the execution plan. In my case that function was qersoFetch and my os_explain script just substituted the &#8220;qerso&#8221; part with SORT (as per the Metalink note I mentioned above). The first child function of qersoFetch was qerhjFetch, which is a hash join rowsource, and so on. Note that os_explain prefixes some lines with an asterisk (*). This indicates that the output of the given function is in turn filtered by a filter operation (the same filter ops that you normally see at the bottom of DBMS_XPLAN explained plans).</p>
<p>So, logically you can imagine an execution plan as a tree of Oracle functions:</p>
<ul>
<li>For a SELECT query the OPI fetch function (opifch2) would be the root.</li>
<li>Various join and union functions like qerhjFetch (HASH JOIN) and qerjotFetch (NESTED LOOPS JOIN) would be branches.</li>
<li>The leaves would always be some sort of access path functions like qertbFetch (TABLE ACCESS FULL) or qerixFetch ( INDEX UNIQUE / FULL / RANGE SCAN ).</li>
</ul>
<p>But physically, an execution plan is just a memory structure in subheap 6 of a child cursor (x$kglcursor.kglobhd6), which has a bunch of rowsource function opcodes in it.<br />
During plan execution an Oracle process &#8220;just&#8221; traverses through those opcodes, looks up the starting address of the corresponding rowsource function using a lookup table and calls it. That function does its task (probably calling other rowsource functions recursively) and returns to the caller.</p>
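<p>As a toy illustration of this &#8220;look up the opcode, call the function&#8221; idea, here is a Python sketch. Everything in it is hypothetical: the opcode numbers, the dictionary-based plan nodes and the function names are invented for the example and say nothing about how Oracle actually lays out its structures.</p>

```python
def table_scan(node, run):
    """Leaf rowsource: returns the rows stored on the node."""
    return list(node["rows"])


def sort_rows(node, run):
    """Branch rowsource: calls its child recursively, then sorts the result."""
    return sorted(run(node["child"]))


# Hypothetical opcode numbers mapped to rowsource functions (the lookup table)
ROWSOURCE_TABLE = {1: table_scan, 2: sort_rows}


def run(node):
    """Look up the function for this node's opcode and call it."""
    opcode = node["opcode"]
    fetch = ROWSOURCE_TABLE[opcode]
    return fetch(node, run)


plan = {"opcode": 2, "child": {"opcode": 1, "rows": [3, 1, 2]}}
print(run(plan))
```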
<p>Note that many rowsource functions are designed to be <em>cascading</em>: they do only the work needed for returning a small subset of rows and return only a few rows at a time, as opposed to the whole resultset.<br />
This is a very good thing as rows can be cascaded, or pipelined, back to parent functions as they become available. For example, a table fetch only fetches a handful of rows (and not the whole table) at a time and returns these &#8220;up&#8221; for further processing. Also, a nested loop join is able to pass matching rows &#8220;up&#8221; from the moment the first matches are found; again there&#8217;s no need to perform the join on the full dataset before returning the first rows.<br />
This also means that there is no need to store the whole intermediate resultset somewhere in memory before passing it up to the parent function; instead we just revisit that branch of the execution plan tree whenever we need more rows from it. <strong>And the os_explain script shows you exactly in which execution branch the execution currently is</strong>.</p>
<p><em>Addition: I will elaborate on how to match the execution plan with stack trace in an upcoming post &#8211; it&#8217;s too much material for an introductory post.</em></p>
<p>So, cascading rowsources allow us to &#8220;incrementally&#8221; execute plans involving large datasets, without the need to keep the intermediate resultsets in memory. On the other hand, a few rowsource operators in your execution plan, like SORT, cannot return any rows up before all of their children&#8217;s rows are processed. With SORT (and aggregate operations which also use SORT) you just have to process all the source rows before returning any meaningful result back. You can&#8217;t go through only half of the source data, order it and start returning rows in the hope that none of the remaining rows should have come first in the order. This is where the SQL cursor workareas come into play for such operations.</p>
<p>SORT, HASH and BITMAP rowsources can allocate SQL workareas for themselves, while others can&#8217;t. This can easily be seen from the execution plan statistics of the following sample query:</p>
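<p>The contrast between cascading and blocking rowsources can be shown with Python generators. This is only an analogy with made-up function names: a filter passes rows up as soon as they match, while a sort has to buffer every child row (its &#8220;workarea&#8221;) before it can return the first one. The sketch counts how many source rows were read by the time the first row comes out.</p>

```python
def table_scan(rows, stats):
    """Leaf rowsource: hands rows up one at a time and counts the reads."""
    for row in rows:
        stats["read"] += 1
        yield row


def filter_rows(child, predicate):
    """Cascading rowsource: passes a row up as soon as it matches."""
    for row in child:
        if predicate(row):
            yield row


def sort_rows(child):
    """Blocking rowsource: must consume every child row before returning any."""
    buffered = sorted(child)   # plays the role of the workarea
    yield from buffered


def rows_read_for_first_row(make_plan, data):
    """Return (first row produced, number of source rows read to get it)."""
    stats = {"read": 0}
    plan = make_plan(table_scan(data, stats))
    first = next(plan)
    return first, stats["read"]


data = [5, 8, 2, 9, 4, 7]
print(rows_read_for_first_row(lambda src: filter_rows(src, lambda r: r % 2 == 0), data))
print(rows_read_for_first_row(sort_rows, data))
```

<p>The filter plan produces its first row after reading two source rows, while the sort plan has to read all six.</p>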
<pre><code>SQL&gt; select /*+ gather_plan_statistics */ owner, count(*) from dba_source group by owner;

OWNER                            COUNT(*)
------------------------------ ----------
WKSYS                                8988
HR                                     34
<em>[...some output snipped...]</em>
SYS                                129299
WMSYS                                 704

25 rows selected.

SQL&gt; select * from table(dbms_xplan.display_cursor(null,null,'<strong>MEMSTATS LAST</strong>'));

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------------------------
SQL_ID  dcp37kxt02m9f, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ owner, count(*) from dba_source
group by owner

Plan hash value: 114136443

----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name            | Starts | E-Rows | A-Rows |   A-Time   |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------
<strong>|   1 |  HASH GROUP BY                |                 |      1 |    593K|     25 |00:00:18.24 |   821K|   821K| 5213K (0)|
</strong>|   2 |   VIEW                        | DBA_SOURCE      |      1 |    593K|    595K|00:00:16.84 |       |       |          |
|   3 |    UNION-ALL                  |                 |      1 |        |    595K|00:00:15.06 |       |       |          |
|*  4 |     FILTER                    |                 |      1 |        |    595K|00:00:10.88 |       |       |          |
<strong>|*  5 |      HASH JOIN                |                 |      1 |    595K|    595K|00:00:08.50 |   884K|   884K| 1316K (0)|
|*  6 |       HASH JOIN               |                 |      1 |   6527 |   6292 |00:00:00.12 |   870K|   870K| 1179K (0)|
</strong>|   7 |        TABLE ACCESS FULL      | USER$           |      1 |     98 |     98 |00:00:00.01 |       |       |          |
<strong>|*  8 |        HASH JOIN              |                 |      1 |   6527 |   6292 |00:00:00.09 |   909K|   909K| 1181K (0)|
</strong>|   9 |         INDEX FULL SCAN       | I_USER2         |      1 |     98 |     98 |00:00:00.01 |       |       |          |
|* 10 |         INDEX FAST FULL SCAN  | I_OBJ2          |      1 |   6527 |   6292 |00:00:00.04 |       |       |          |
|  11 |       INDEX FAST FULL SCAN    | I_SOURCE1       |      1 |    595K|    595K|00:00:01.56 |       |       |          |
|  12 |      NESTED LOOPS             |                 |      0 |      1 |      0 |00:00:00.01 |       |       |          |
|* 13 |       INDEX FULL SCAN         | I_USER2         |      0 |      1 |      0 |00:00:00.01 |       |       |          |
|* 14 |       INDEX RANGE SCAN        | I_OBJ4          |      0 |      1 |      0 |00:00:00.01 |       |       |          |
|  15 |     NESTED LOOPS              |                 |      1 |      1 |      0 |00:00:00.01 |       |       |          |
|  16 |      NESTED LOOPS             |                 |      1 |      1 |      0 |00:00:00.01 |       |       |          |
|  17 |       NESTED LOOPS            |                 |      1 |      1 |      0 |00:00:00.01 |       |       |          |
|* 18 |        INDEX FAST FULL SCAN   | I_OBJ2          |      1 |      6 |      0 |00:00:00.01 |       |       |          |
|* 19 |        FIXED TABLE FIXED INDEX| X$JOXFS (ind:1) |      0 |      1 |      0 |00:00:00.01 |       |       |          |
|* 20 |       INDEX RANGE SCAN        | I_USER2         |      0 |      1 |      0 |00:00:00.01 |       |       |          |
|  21 |      TABLE ACCESS CLUSTER     | USER$           |      0 |      1 |      0 |00:00:00.01 |       |       |          |
|* 22 |       INDEX UNIQUE SCAN       | I_USER#         |      0 |      1 |      0 |00:00:00.01 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------

</code></pre>
<p>You can see that the HASH operations have their last three columns (OMem, 1Mem and Used-Mem) populated, so those rowsource functions did allocate a SQL workarea. Others, like NESTED LOOPS joins and data access rowsources, did not have any SQL workarea memory allocated, as they are completely cascading.</p>
<p>Note that os_explain can also read the stack from STDIN, as seen in the example below. Also, the <strong>-a</strong> option tells os_explain to show all functions in the stack and not only the execution plan ones and their children.</p>
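<p>If you have the plan as text, picking out the workarea-using operations can be scripted. The following is a minimal sketch, assuming the column layout of the MEMSTATS output above with Used-Mem as the last column; the helper name workarea_operations is hypothetical and the sample contains only a few lines copied from that plan.</p>

```python
def workarea_operations(plan_text):
    """Return (id, operation, used_mem) for plan lines whose Used-Mem column is filled in."""
    found = []
    for line in plan_text.splitlines():
        cols = [c.strip() for c in line.strip().strip("|").split("|")]
        # Data lines start with an optional '*' and a numeric Id; skip everything else.
        plan_id = cols[0].lstrip("*").strip()
        if not plan_id.isdigit():
            continue
        used_mem = cols[-1]  # assumption: Used-Mem is the last column
        if used_mem:
            found.append((int(plan_id), cols[1], used_mem))
    return found


sample_plan = """
| Id  | Operation                     | Name            | Starts | E-Rows | A-Rows |   A-Time   |  OMem |  1Mem | Used-Mem |
|   1 |  HASH GROUP BY                |                 |      1 |    593K|     25 |00:00:18.24 |   821K|   821K| 5213K (0)|
|   2 |   VIEW                        | DBA_SOURCE      |      1 |    593K|    595K|00:00:16.84 |       |       |          |
|*  5 |      HASH JOIN                |                 |      1 |    595K|    595K|00:00:08.50 |   884K|   884K| 1316K (0)|
|  11 |       INDEX FAST FULL SCAN    | I_SOURCE1       |      1 |    595K|    595K|00:00:01.56 |       |       |          |
"""

for plan_id, operation, used_mem in workarea_operations(sample_plan):
    print(plan_id, operation, used_mem)
```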
<p>Command:</p>
<pre><code>SQL&gt; alter session set statistics_level=typical;

Session altered.

SQL&gt; select avg(length(text)) from dba_source where owner = 'SYS';

AVG(LENGTH(TEXT))
-----------------
       125.032127

</code></pre>
<p>Plan:</p>
<pre><code>---------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name            | E-Rows |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE                     |                 |      1 |       |       |          |
|   2 |   VIEW                              | DBA_SOURCE      |  17972 |       |       |          |
|   3 |    UNION-ALL                        |                 |        |       |       |          |
|*  4 |     FILTER                          |                 |        |       |       |          |
|   5 |      NESTED LOOPS                   |                 |        |       |       |          |
|   6 |       NESTED LOOPS                  |                 |  18056 |       |       |          |
|*  7 |        HASH JOIN                    |                 |    198 |   909K|   909K| 1193K (0)|
|   8 |         INDEX FULL SCAN             | I_USER2         |     98 |       |       |          |
|   9 |         NESTED LOOPS                |                 |    198 |       |       |          |
|  10 |          TABLE ACCESS BY INDEX ROWID| USER$           |      1 |       |       |          |
|* 11 |           INDEX UNIQUE SCAN         | I_USER1         |      1 |       |       |          |
|* 12 |          INDEX RANGE SCAN           | I_OBJ5          |    198 |       |       |          |
|* 13 |        INDEX RANGE SCAN             | I_SOURCE1       |     93 |       |       |          |
|  14 |       TABLE ACCESS BY INDEX ROWID   | SOURCE$         |     91 |       |       |          |
|  15 |      NESTED LOOPS                   |                 |      1 |       |       |          |
|* 16 |       INDEX FULL SCAN               | I_USER2         |      1 |       |       |          |
|* 17 |       INDEX RANGE SCAN              | I_OBJ4          |      1 |       |       |          |
|  18 |     NESTED LOOPS                    |                 |      1 |       |       |          |
|  19 |      NESTED LOOPS                   |                 |      1 |       |       |          |
|  20 |       NESTED LOOPS                  |                 |      1 |       |       |          |
|  21 |        TABLE ACCESS BY INDEX ROWID  | USER$           |      1 |       |       |          |
|* 22 |         INDEX UNIQUE SCAN           | I_USER1         |      1 |       |       |          |
|* 23 |        INDEX RANGE SCAN             | I_OBJ5          |      1 |       |       |          |
|* 24 |       FIXED TABLE FIXED INDEX       | X$JOXFS (ind:1) |      1 |       |       |          |
|* 25 |      INDEX RANGE SCAN               | I_USER2         |      1 |       |       |          |
---------------------------------------------------------------------------------------------------

</code></pre>
<p>Stack:</p>
<pre><code>$ <strong>pstack 23740 | ./os_explain -a
</strong>   main
    ssthrdmain
     opimai_real
      sou2o
       opidrv
        opiodr
         opiino
          opitsk
           ttcpip
            opiodr
             kpoal8
              SELECT FETCH:
               GROUP BY SORT: Fetch
                VIEW: Fetch
                 UNION-ALL: Fetch
                * FILTER DEFINITION: FetchOutside
                   UNION-ALL: RowProcedure
                    VIEW: RowProcedure
                     qesaAggNonDistSS.
                      evaopn2
                       evalen
                        lxsCntChar

</code></pre>
<p>The FILTER rowsources (not talking about filter predicates on normal rowsources here) can make things more complicated though as they can introduce their own logic and loops into the execution plan (for running correlated subqueries etc).</p>
<p>By the way, see what happens when I run exactly the same query with rowsource level statistics collection enabled:</p>
<pre><code>SQL&gt; alter session set statistics_level=all;

Session altered.

SQL&gt; select avg(length(text)) from dba_source where owner = 'SYS';

AVG(LENGTH(TEXT))
-----------------
       125.032127

</code>
<code>$ pstack 23740 | ./os_explain -a
   main
    ssthrdmain
     opimai_real
      sou2o
       opidrv
        opiodr
         opiino
          opitsk
           ttcpip
            opiodr
             kpoal8
              SELECT FETCH:
<strong>               QUERY EXECUTION STATISTICS: Fetch
</strong>                GROUP BY SORT: Fetch
<strong>                 QUERY EXECUTION STATISTICS: Fetch
</strong>                  VIEW: Fetch
<strong>                   QUERY EXECUTION STATISTICS: Fetch
</strong>                    UNION-ALL: Fetch
                     QUERY EXECUTION STATISTICS: Fetch
                    * QUERY EXECUTION STATISTICS: Fetch
                       FILTER DEFINITION: FetchOutside
                        QUERY EXECUTION STATISTICS: Fetch
                         NESTED LOOP JOIN: Fetch
                          QUERY EXECUTION STATISTICS: Fetch
                           QUERY EXECUTION STATISTICS: SnapStats
                            sltrgftime64
<strong>                             gettimeofday
</strong>                              __kernel_vsyscall

</code></pre>
<p>Every rowsource is wrapped into a QUERY EXECUTION STATISTICS wrapper, whose task is just to count the number of rows sent &#8220;up&#8221; to parents in the tree and the logical IOs, and also to record rowsource timing info whenever an internal counter (set by the _rowsource_statistics_sampfreq parameter) wraps.</p>
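<p>The wrapper idea can be sketched in Python as a generator that sits between a rowsource and its parent. This is only an analogy under stated assumptions: the class and function names are invented, logical IO counting is left out, and what Oracle actually does with the sampled timestamps is not modelled here.</p>

```python
import time


class RowsourceStats:
    """Counters kept per wrapped rowsource (names are illustrative only)."""

    def __init__(self):
        self.rows = 0
        self.snapshots = []  # timestamps taken when the sampling counter wraps


def with_stats(child, stats, sample_every):
    """Wrap a row iterator: count every row, read the clock only on every Nth row."""
    countdown = sample_every
    for row in child:
        stats.rows += 1
        countdown -= 1
        if countdown == 0:
            # The comparatively expensive clock call happens only here.
            stats.snapshots.append(time.perf_counter())
            countdown = sample_every
        yield row


stats = RowsourceStats()
rows = list(with_stats(iter(range(10)), stats, 4))
print(stats.rows, len(stats.snapshots))
```

<p>Every row still passes through the wrapper, which is why the extra frames show up at each level of the stack, but the clock is read only on a fraction of the rows.</p>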
<p>This is just an intro to Oracle execution plan internals and troubleshooting. Hopefully you don&#8217;t need this technique too often; however, it has helped me to successfully pinpoint the root cause of a few non-trivial database problems.</p>
<p>Note that in Oracle 11g there&#8217;s an excellent new feature called Real-time SQL Monitoring. It allows you to monitor the <em>progress of currently running SQL statements</em>. For serially running statements, monitoring kicks in after a statement has used a total of 5 seconds of CPU or IO time (this threshold is controlled by the _sqlmon_threshold parameter; for PX, monitoring is always enabled). After that you can query V$SQL_MONITOR and V$SQL_PLAN_MONITOR to see how much time the session has spent and how many rows and LIOs it has processed while executing the statement. You can see these details even at the SQL execution plan line level. Alternatively you can use the DBMS_SQLTUNE.REPORT_SQL_MONITOR function to get this info nicely formatted. Greg Rahn has written a good <a href="http://blog.tanelpoder.com/wp-content/plugins/wordpress-feed-statistics/feed-statistics.php?url=aHR0cDovL3N0cnVjdHVyZWRkYXRhLm9yZy8yMDA4LzAxLzA2L29yYWNsZS0xMWctcmVhbC10aW1lLXNxbC1tb25pdG9yaW5nLXVzaW5nLWRibXNfc3FsdHVuZXJlcG9ydF9zcWxfbW9uaXRvci8=" target="_blank">blog entry</a> about it.</p>
<p>However, it&#8217;s important to note that using either DBMS_SQLTUNE or V$SQL_MONITOR/V$SQL_PLAN_MONITOR requires an Oracle Tuning Pack license, which in turn requires an Oracle Diagnostic Pack license. Details are in the <a href="http://blog.tanelpoder.com/wp-content/plugins/wordpress-feed-statistics/feed-statistics.php?url=aHR0cDovL2Rvd25sb2FkLm9yYWNsZS5jb20vZG9jcy9jZC9CMjgzNTlfMDEvbGljZW5zZS4xMTEvYjI4Mjg3L29wdGlvbnMuaHRt" target="_blank">Oracle 11g Licensing Guide</a>.</p>
<p>So another, low-level approach to real-time monitoring will still be handy even after 11g becomes mainstream. In a future post I will be showing how to measure the progress and get an execution profile of your plan by aggregating multiple stack traces, and also some cool opportunities with DTrace.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.tanelpoder.com/2008/06/15/advanced-oracle-troubleshooting-guide-part-6-understanding-oracle-execution-plans-with-os_explain/feed/</wfw:commentRss>
		<slash:comments>17</slash:comments>
		</item>
		<item>
		<title>Oracle hidden costs revealed, part 1 &#8211; Does a batch job run faster when executed locally?</title>
		<link>http://blog.tanelpoder.com/2008/02/05/oracle-hidden-costs-revealed-part-1/</link>
		<comments>http://blog.tanelpoder.com/2008/02/05/oracle-hidden-costs-revealed-part-1/#comments</comments>
		<pubDate>Mon, 04 Feb 2008 16:09:06 +0000</pubDate>
		<dc:creator>Tanel Poder</dc:creator>
				<category><![CDATA[Internals]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Troubleshooting]]></category>
		<category><![CDATA[Unix/Linux]]></category>

		<guid isPermaLink="false">http://tanelpoder.wordpress.com/?p=59</guid>
		<description><![CDATA[This series is about revealing some of Oracle&#8217;s internal execution costs and inefficiencies. I will analyze a few situations and special cases where you can experience a performance hit where you normally wouldn&#8217;t expect to. The first topic is about a question I saw in a recent Oracle Forum thread. The question goes like this: &#8220;Is there [...]]]></description>
			<content:encoded><![CDATA[<p>This series is about revealing some of Oracle&#8217;s internal execution costs and inefficiencies. I will analyze a few situations and special cases where you can experience a performance hit where you normally wouldn&#8217;t expect to.</p>
<p>The first topic is about a question I saw in a recent <a target="_blank" href="http://blog.tanelpoder.com/wp-content/plugins/wordpress-feed-statistics/feed-statistics.php?url=aHR0cDovL2ZvcnVtcy5vcmFjbGUuY29tL2ZvcnVtcy90aHJlYWQuanNwYT90aHJlYWRJRD02MDU5NzAmYW1wO3RzdGFydD0w">Oracle Forum thread</a>.</p>
<p>The question goes like this: <i>&#8220;Is there any benefit if I run long sql queries from the server (by using telnet,etc) or from the remote by sql client.&#8221;</i></p>
<p>To leave out the network transfer cost of the resultset for simplicity, I will rephrase the question like this: <b>&#8220;Do I get better performance when I execute my server-side batch jobs (which don&#8217;t return any data to the client) locally from the database server versus a remote application server or workstation?&#8221;</b></p>
<p>The obvious answer would be <i>&#8220;NO, it does not matter where you execute your batch job from, as Oracle is a client-server database system. All execution is done locally regardless of the client&#8217;s location, thus the performance is the same&#8221;</i>.</p>
<p>While this sounds plausible in theory, there is (at least) one practical issue which can affect Oracle server performance depending on the client&#8217;s platform and client library version.</p>
<p>It is caused by regular <b>in-band break checking</b> in the client-server communication channel where <b>out-of-band break signalling</b> is not available. A test case is below:</p>
<p><span id="more-59"></span></p>
<p>I have a little script called <a target="_blank" href="http://blog.tanelpoder.com/wp-content/plugins/wordpress-feed-statistics/feed-statistics.php?url=aHR0cDovL3d3dy50YW5lbHBvZGVyLmNvbS9maWxlcy9zY3JpcHRzL2xvdHNsaW9zLnNxbA==">lotslios.sql</a> which just joins the OBJ$ table to itself and forces a long nested loop join. This causes lots of LIOs, as the script name says.</p>
<p>In my tests I&#8217;m connecting <i>to</i> an Oracle 10.2.0.3 database on Solaris x64 (and the server is otherwise idle).</p>
<p>I will connect to the same database using Windows sqlplus and Linux sqlplus clients (both on the same LAN) and run the lotslios.sql script with the same parameter:</p>
<p>Using Windows sqlplus client (11.1.0.6):</p>
<pre><code>SQL&gt; @lotslios 10000

  COUNT(*)
----------
     10000

Elapsed: 00:00:<b>29.28</b>
SQL&gt;

</code></pre>
<p>Using Linux sqlplus client (11.1.0.6):</p>
<pre><code>SQL&gt; @lotslios 10000

  COUNT(*)
----------
     10000

Elapsed: 00:00:<b>27.24</b>
SQL&gt;

</code></pre>
<p>I ran multiple iterations of these tests to get a better statistical sample, and running the same statement from the Windows sqlplus client always gave a worse response time:</p>
<pre><code>( 29.28 / 27.24 ) - 1 = ~7.5%

</code></pre>
<p>So, running this &#8220;batch&#8221; from Windows sqlplus took about 7.5% more time than from a Linux client.</p>
<p>So, how do we find out where the extra time is going?</p>
<p>Here are the steps I usually go through in such cases:</p>
<ol>
<li>Check for differences in the session&#8217;s wait event profile (using Snapper) &#8211; this would have clearly shown on which event the extra time was spent. However, it didn&#8217;t show any differences (in fact, all test runs spun 100% on CPU, as expected in a single-threaded LIO burner test)</li>
<li>Check for differences in the session&#8217;s statistic counters (using Snapper) &#8211; a consistent and reproducible difference in V$SESSTAT performance counters would have given some hints on where to look next &#8211; however, there was virtually no difference</li>
<li>Check for differences in the Oracle server process&#8217;s execution profile (using for example truss -c, DTrace profiler, OProfile or process stack sampling in a loop with either pstack or a debugger) &#8211; and this gave me a clear indication of where to look next (and of a huge number of areas on which NOT to waste any more time)</li>
</ol>
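<p>The stack sampling mentioned in the third step can be sketched as follows. This is a minimal illustration, assuming Solaris-style pstack output of the kind shown later in this post; the helper names are hypothetical, and the two sample stacks are shortened, hand-made inputs rather than real measurements. In practice you would collect the samples by running pstack against the server process in a loop.</p>

```python
from collections import Counter


def top_frame(pstack_output):
    """Return the function name of the innermost frame in Solaris-style pstack output."""
    for line in pstack_output.splitlines():
        parts = line.split()
        # Frame lines look like "address function (args...)"; skip blanks and the "pid: name" header.
        if len(parts) in (0, 1) or parts[0].endswith(":"):
            continue
        return parts[1]
    return None


def profile(samples):
    """Count how often each function was on top of the stack across all samples."""
    return Counter(top_frame(sample) for sample in samples)


# Shortened, hand-made samples for illustration only.
sample_a = """19200:  oracleSOL10G (LOCAL=NO)
 fffffd7ffdd52caa pollsys  (fffffd7fffdfb4f8, 1, fffffd7fffdfb4b0, 0)
 fffffd7ffdcf9dc2 poll () + 52
 000000000508d37b sntpoltsts () + 12b
"""
sample_b = """19200:  oracleSOL10G (LOCAL=NO)
 0000000003706bb0 kdst_fetch () + 4e0
 000000000375d14b kdstf0100101km () + 22b
"""

print(profile([sample_a, sample_a, sample_b]).most_common())
```

<p>The more samples you aggregate, the closer the counts get to a real execution profile of the process.</p>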
<p>Note that there are <i>only</i> three steps which usually are enough for <i>seeing</i> in which operation or kernel function the extra time is spent &#8211; and armed with the <i>knowledge</i> in which component the extra time is spent you can direct your investigation focus, narrow down your Metalink bug search, open a specific TAR or blame the Sysadmins ;-)</p>
<p>This is the beauty of a systematic approach and using the <i>right tool for the right problem</i> (versus starting to peek around in random places, changing parameters, restarting instances or comparing all configuration files you can possibly find).</p>
<p>So, I used truss as the first tool for reporting the system call execution profile of the server process.</p>
<p>For reference, this is what autotrace measured when running lotslios.sql (note the 4 million logical IOs):</p>
<pre><code>SQL&gt; @lotslios 10000

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
<b>    4089670  consistent gets
</b>          0  physical reads
          0  redo size
        335  bytes sent via SQL*Net to client
        350  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

SQL&gt;

</code></pre>
<p>This is what the lotslios.sql generated when executed from <b>Linux </b>sqlplus:</p>
<pre><code>solaris02$ <b>truss -cp 18284</b>
^C
syscall               seconds   calls  errors
read                     .000       2
write                    .000       2
times                    .000      27
yield                    .000     159
                     --------  ------   ----
sys totals:              .001     190      0
usr time:              27.191
elapsed:               31.080

</code></pre>
<p>This is what the lotslios.sql executed by <b>Windows</b> sqlplus generated:</p>
<pre><code>solaris02$ truss -cp 19200
^C
syscall               seconds   calls  errors
read                     .000       2
write                    .000       2
times                    .000      45
yield                    .000     196
<b>pollsys                 5.608 1355248
</b>                     --------  ------   ----
sys totals:             5.610 1355493      0
usr time:              33.505
elapsed:               80.360
solaris02$

</code></pre>
<p>So, there is a big difference in the number of <b>pollsys()</b> system calls, depending on which client was used for connecting. The pollsys syscall is normally used for checking whether there is any data that can be read from a file descriptor (or whether the file descriptor is ready for receiving more writes). As TCP sockets on Unix are also accessed through file descriptors, Oracle could be polling the client TCP connection file descriptor&#8230; but (without prior knowledge) we cannot be sure.</p>
<p>So, can we dig deeper? Yes we can! Using my favourite digging shovel &#8211; pstack.</p>
<p>What we want to do is stop the Oracle server process exactly when the pollsys system call is invoked. Solaris truss can help here (read the descriptions of the switches I use in the truss manpage if you&#8217;re not familiar with them).</p>
<p>So I attach to the server process with truss again and run the lotslios.sql:</p>
<pre><code>solaris02$ truss -tpollsys -Tpollsys -p 19200
pollsys(0xFFFFFD7FFFDFB4F8, 1, 0xFFFFFD7FFFDFB4B0, 0x00000000) = 0
solaris02$

</code></pre>
<p>Thanks to the -T switch, truss suspends the process immediately when the first pollsys syscall is made. And now, <i>finally</i>, we can run pstack to see which functions ended up calling pollsys():</p>
<pre><code>solaris02$ pstack 19200
19200:  oracleSOL10G (LOCAL=NO)
 fffffd7ffdd52caa pollsys  (fffffd7fffdfb4f8, 1, fffffd7fffdfb4b0, 0)
 fffffd7ffdcf9dc2 poll () + 52
 000000000508d37b sntpoltsts () + 12b
<b> 000000000507bc1b snttmoredata () + 2b
 0000000004f7bcdf nsmore2recv () + 25f</b>
 0000000004f9a777 nioqts () + c7
 0000000003706bb0 kdst_fetch () + 4e0
 000000000375d14b kdstf0100101km () + 22b
 00000000036f9dba kdsttgr () + 68a
<b> 00000000033c66bc qertbFetch () + 2ac
</b> 00000000033bebb3 qerjotFetch () + b3
 000000000340edfc qercoFetch () + dc
 0000000003461963 qergsFetch () + 723
 0000000002932e9d opifch2 () + a0d
 00000000028cf6cb kpoal8 () + e3b
 0000000000e97c6c opiodr () + 41c
 0000000003d9f6da ttcpip () + 46a
 0000000000e939d3 opitsk () + 503
 0000000000e96f18 opiino () + 3a8
 0000000000e97c6c opiodr () + 41c
 0000000000e924d1 opidrv () + 2f1
 0000000000e8f90b sou2o () + 5b
 0000000000e552e4 opimai_real () + 84
 0000000000e551b4 main () + 64
 0000000000e54ffc ???????? ()
solaris02$

</code></pre>
<p>The main function of interest is snttmoredata(). SNTT prefix means (in my interpretation) System Network Transport TCP<i>?</i> and moredata indicates that a check is made if there&#8217;s more data coming from the network connection/socket.</p>
<p>The snttmoredata function was called by nsmore2recv&#8230; which (in my interpretation again) means Network Service check for more data to receive.</p>
<p>So based on those two functions in the stack it&#8217;s quite clear that the polls are caused by Oracle&#8217;s network layer. That&#8217;s kind of unexpected, as a client-server database shouldn&#8217;t care much about the network while it&#8217;s still executing something for the client. This is the time to ask &#8211; what if this client wants to <b>cancel</b> the query?</p>
<p>The observations above are all about user&#8217;s ability to cancel a running query.</p>
<p>Oracle client-server communication normally works in RPC fashion &#8211; for example a client sends a command to Oracle and Oracle doesn&#8217;t return anything until the command is completed.</p>
<p>Now if a user tries to cancel their query (using CTRL+C in sqlplus or calling OCIBreak in non-blocking OCI), a cancel packet is sent to the server over TCP. The packet will be stored in the server-side receive buffer of the OS TCP stack and becomes available for reading by the server process (via a TCP socket). However, if the server process is in a long-running loop executing a query, it needs to periodically check the TCP receive socket for any outstanding packets. And this is exactly what the <b>pollsys()</b> system call does.</p>
<p>This approach for cancelling an operation is called <b>in-band break</b>, as the break packet is sent in-band with all other traffic. The server process has to be programmed to periodically check for any newly arrived packets, even if it is already busy working on something else.</p>
<p>There are several functions in the Oracle kernel where the developers have put the check for in-band breaks. This means that in some highly repetitive operations (like a nested loop join) the same functions are hit again and again &#8211; causing frequent polling on the TCP socket. And too frequent polling is what causes the performance degradation.</p>
<p>However, the Oracle network layer has a sqlnet.ora parameter called <b>break_poll_skip</b>, which can help in such situations. This parameter defines how many times to just silently skip the TCP socket polling when the nsmore2recv() function is called. The parameter defaults to 3 in recent versions, which means that only 1 in 3 polls is actually executed (the test case above shows that for 4 million consistent gets, roughly 1/3, or 1.3 million, pollsys() calls were executed).</p>
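<p>The mechanism can be sketched in Python with a local socket pair. This is a simplified model and not how Oracle implements it: the function name, the one-byte &#8220;cancel packet&#8221; and the exact skip semantics (poll on every Nth check) are assumptions made for illustration.</p>

```python
import select
import socket


def run_with_break_checks(server_sock, client_sock, iterations, poll_skip, cancel_at=None):
    """Busy loop that polls server_sock for a break on every poll_skip-th check.

    Returns (outcome, iterations_done, polls_made).
    """
    polls = 0
    for i in range(1, iterations + 1):
        if i == cancel_at:
            client_sock.send(b"!")  # stand-in for the cancel packet sent by the client
        # ... one unit of real work (e.g. a logical IO) would happen here ...
        if i % poll_skip == 0:
            polls += 1
            # Zero timeout: just ask whether anything is waiting, never block.
            readable, _, _ = select.select([server_sock], [], [], 0)
            if readable:
                return ("cancelled", i, polls)
    return ("completed", iterations, polls)


server_sock, client_sock = socket.socketpair()
print(run_with_break_checks(server_sock, client_sock, 3000, 3))
print(run_with_break_checks(server_sock, client_sock, 3000, 1000, cancel_at=100))
server_sock.close()
client_sock.close()
```

<p>With a skip value of 3, the 3000 iterations cost 1000 polls. With a skip value of 1000, the polling overhead almost disappears, but a cancel sent at iteration 100 goes unnoticed until the next poll at iteration 1000.</p>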
<p>Let&#8217;s change the break_poll_skip to 10:</p>
<pre><code>solaris02$ echo break_poll_skip=10 &gt; $ORACLE_HOME/network/admin/sqlnet.ora
solaris02$
solaris02$
solaris02$ truss -cp 3316
^C
syscall               seconds   calls  errors
read                     .000       2
write                    .000       2
times                    .000      33
yield                    .000     226
<b>pollsys                 1.726  406574
</b>                     --------  ------   ----
sys totals:             1.727  406837      0
usr time:              29.526
elapsed:               48.530

</code></pre>
<p>This time, only 1 in 10 of the potential pollsys calls was actually executed.</p>
<p>Let&#8217;s change the parameter so that only 1/1000th of the pollsys calls are executed:</p>
<pre><code>solaris02$ echo break_poll_skip=1000 &gt; $ORACLE_HOME/network/admin/sqlnet.ora
solaris02$
solaris02$ truss -cp 3429
^C
syscall               seconds   calls  errors
read                     .000       2
write                    .000       2
times                    .000      29
yield                    .000     178
<b>pollsys                  .029    4066
</b>                     --------  ------   ----
sys totals:              .030    4277      0
usr time:              27.572
elapsed:              107.850
solaris02$

</code></pre>
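<p>As a quick cross-check, the truss counts above can be compared with the number of consistent gets. The numbers are copied from the outputs in this post; the only assumption is that each run of lotslios.sql did roughly the same 4 million consistent gets that autotrace reported.</p>

```python
consistent_gets = 4089670  # from the autotrace output

# pollsys() call counts reported by truss for each break_poll_skip setting
pollsys_calls = {
    "default": 1355248,
    "10": 406574,
    "1000": 4066,
}

for setting, calls in pollsys_calls.items():
    gets_per_poll = consistent_gets / calls
    print(f"break_poll_skip={setting}: {gets_per_poll:.1f} consistent gets per pollsys()")
```

<p>The ratios come out very close to 3, 10 and 1000, which fits the idea that the break check is hit about once per consistent get in this test and that break_poll_skip decides how many of those checks turn into a real system call.</p>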
<p>So, setting break_poll_skip can help us reclaim some of the precious CPU time &#8211; however this is achieved at the expense of command cancellation responsiveness. If you set Oracle to skip 1000 break polls, then you need to wait longer until Oracle checks the TCP socket and realizes that a cancel packet has arrived. This may not be such a problem for a heavy nested loop, where instead of 100 microseconds you get a response in 100 milliseconds. However, if your session happens to be waiting for an enqueue, then you might end up waiting a loooong time for the cancel to happen, as Oracle calls the nsmore2recv() function only once every timeout, which usually is 3 seconds. Multiply this by the break_poll_skip value (1000 in this example) and you may end up waiting for almost an hour.</p>
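<p>The worst case is simple arithmetic, shown here as a small sketch. The function name is made up; the 3-second enqueue timeout is the usual value mentioned above, and the result is an upper bound rather than a measured delay.</p>

```python
def worst_case_cancel_delay(check_interval_seconds, break_poll_skip):
    """Upper bound on the wait before a pending cancel is noticed."""
    return check_interval_seconds * break_poll_skip


enqueue_timeout = 3  # seconds between nsmore2recv() calls while waiting on an enqueue
for skip in (3, 10, 1000):
    delay = worst_case_cancel_delay(enqueue_timeout, skip)
    print(f"break_poll_skip={skip}: up to {delay} s ({delay / 60:.1f} min)")
```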
<p>For now I recommend you NOT to use break_poll_skip &#8211; until you have read my 2nd article (which I&#8217;ll post in a couple of weeks) with more details on in-band breaks and also on <b>out-of-band breaks</b>, which I haven&#8217;t covered so far.</p>
<p>Note that the polling is not a Solaris-only problem; it happens on all Unixes as far as I know, and likely on Windows as well.</p>
<p>An example from Linux is below:</p>
<pre><code>LIN11G1$ strace -cp 28730
Process 28730 attached - interrupt to quit
Process 28730 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
<b>100.00    0.001383           0     54550           poll
</b>  0.00    0.000000           0         1           read
  0.00    0.000000           0         2           write
  0.00    0.000000           0         3           times
  0.00    0.000000           0        12           getrusage
  0.00    0.000000           0        19           gettimeofday
------ ----------- ----------- --------- --------- ----------------
100.00    0.001383                 54587           total
LIN11G1$

</code></pre>
]]></content:encoded>
			<wfw:commentRss>http://blog.tanelpoder.com/2008/02/05/oracle-hidden-costs-revealed-part-1/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Systematic application troubleshooting in Unix</title>
		<link>http://blog.tanelpoder.com/2008/01/05/systematic-application-troubleshooting-in-unix/</link>
		<comments>http://blog.tanelpoder.com/2008/01/05/systematic-application-troubleshooting-in-unix/#comments</comments>
		<pubDate>Sat, 05 Jan 2008 11:55:59 +0000</pubDate>
		<dc:creator>Tanel Poder</dc:creator>
				<category><![CDATA[Troubleshooting]]></category>
		<category><![CDATA[Unix/Linux]]></category>

		<guid isPermaLink="false">http://blog.tanelpoder.com/2008/01/05/systematic-application-troubleshooting-in-unix/</guid>
		<description><![CDATA[How many times have you seen the following case, where a user or developer complains that their Oracle session is stuck or running very slowly and the person who starts investigating the issue does the following: Checks the database for locks Checks free disk space Checks alert log Goes back to the client saying &#8220;we did [...]]]></description>
			<content:encoded><![CDATA[<p>How many times have you seen the following case, where a user or developer complains that their Oracle session is stuck or running very slowly and the person who starts investigating the issue does the following:</p>
<ol>
<li>Checks the database for locks</li>
<li>Checks free disk space</li>
<li>Checks alert log</li>
<li>Goes back to the client saying &#8220;we did a healthcheck and everything looks ok&#8221; and closes the case or asks the user/developer to contact application support team or tune their SQL</li>
</ol>
<p>The point here is: what the heck do database locks, the alert log or disk space have to do with <i>first round session troubleshooting</i>, when Oracle provides just about everything you need in one simple view?</p>
<p>Yes, I am talking about sampling V$SESSION_WAIT here. Database locks, free space and potential errors in the alert log <i>may</i> have something to do with your user&#8217;s problems, but not necessarily. As there are many more possible causes, like network issues, which could affect your user (and the whole database), it doesn&#8217;t make sense to go through all those random &#8220;healthchecks&#8221; every time you receive a user phone call. Moreover, even if you find that there is a shortage of disk space or that there are many database locks &#8211; so what? They may not have anything to do with the user&#8217;s problem.</p>
<p>The issue here is that many people <i>still</i> do not know about V$SESSION_WAIT, which in most cases shows your problem immediately or at least points you in the right direction (e.g. there&#8217;s no need to check for locks if your session is waiting on a &#8220;log file switch (archiving needed)&#8221; wait &#8211; and vice versa). Even if &#8220;these people&#8221; have heard of V$SESSION_WAIT and may be able to drop it in during a job interview, they may not know how to use it in a systematic troubleshooting context. Many hours of service downtime and user frustration would be saved if all DBAs knew this extremely simple concept of looking at V$SESSION_WAIT.</p>
<p>This blog entry is not about Oracle though, so I will leave this rant for a future blog post.</p>
<p>This post is about a similar problem in the Unix world. Having been involved in resolving some serious production issues lately, I have been surprised quite a few times by corporate Unix support people who seem to behave in a similar manner. For example, a user calls in saying that their scheduled Unix job, which normally takes 5 minutes, has been running for hours now. The &#8220;senior unix support analyst&#8221; will do the following:</p>
<p><span id="more-53"></span></p>
<ol>
<li>Check for free swap space</li>
<li>Check for free disk space</li>
<li>Check for <i>number</i> of network connections</li>
<li>Maybe run top, sar or vmstat to see what the system-wide CPU utilization is</li>
</ol>
<p>And the reply goes: &#8220;We did a healthcheck and everything looks OK from our side&#8221;.</p>
<p>By now I know that the phrase above <i>really</i> means that &#8220;We have no idea how to check what&#8217;s wrong with your job and we don&#8217;t really care&#8221;</p>
<p>OK, in order to avoid my blog becoming a collection of essays about the essence of life rather than a technical information source, here comes the tech part. It&#8217;s about an issue I hit today.</p>
<p>I was copying one directory with lots of files from one Solaris 10 box to another &#8211; using <b>scp</b>. The scp printed the name of every file it copied, so I could see the progress. After many files were copied, the copy process suddenly got stuck.</p>
<p>Now, I could have started checking for random things like disk space, swap space, CPU or memory utilization, which would have led me nowhere&#8230; Instead I chose the simple and systematic approach which allowed me to diagnose the issue with 2 commands only:</p>
<pre><code>$ ps -ef | grep scp
  oracle  1768   694   0 20:00:59 pts/3       0:00 grep scp
  oracle  <b>1602</b>  1601   0 19:13:09 ?           3:11 scp -r -p -f copy
$

</code></pre>
<p>Ok, my scp process is there (it hasn&#8217;t died or anything).</p>
<p>Let&#8217;s check what this <i>actual process</i> is doing (rather than checking some system-wide aggregations which don&#8217;t show anything about individual processes):</p>
<pre><code>$ truss -p <strong>1602</strong>
open64("<strong>copy/tmp/pipe</strong>", O_RDONLY) (sleeping...)

&lt;...no further lines returned...&gt;

^C$
$ </code></pre>
<p>The above command returned only one line &#8211; showing that my scp process was stuck trying to open a named pipe (copy/tmp/pipe)&#8230; which blocked my scp from proceeding, as there was no one writing into the other end of this pipe. Apparently the scp I was using didn&#8217;t know how to handle named pipes.</p>
<p>As I did not want to kill and restart my scp process I resolved the issue in a simple way:</p>
<pre><code>$ echo blah &gt; copy/tmp/pipe
$
</code></pre>
<p>This command allowed the open64() syscall to complete; my scp read the &#8220;blah&#8221; string, reached EOF, and knew to close this &#8220;file&#8221; and proceed to the next one.</p>
<p>So, the point of this post is &#8211; you need to use the <strong>right tool for the right problem</strong>. A single session or process problem cannot be diagnosed using system-wide tools.</p>
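<p>A minimal sketch for reproducing this behaviour on a scratch machine, assuming a POSIX shell with <code>mkfifo</code> and <code>mktemp</code> available; the temporary directory and the variable names are made up for illustration. A background <b>cat</b> stands in for scp: it blocks while opening the pipe until a writer shows up, then reads the string, hits EOF and exits.</p>
<pre><code>#!/bin/sh
dir=$(mktemp -d)
mkfifo "$dir/pipe"

# Reader: blocks while opening the pipe, as no writer exists yet
cat "$dir/pipe" &gt; "$dir/out" &amp;
reader=$!

sleep 1
# ps -p exits 0 while the process still exists
ps -p "$reader" &gt; /dev/null &amp;&amp; echo "reader $reader is still waiting"

# Writer: opening the pipe releases the reader, closing it delivers EOF
echo blah &gt; "$dir/pipe"
wait "$reader"

result=$(cat "$dir/out")
echo "reader got: $result"
rm -rf "$dir"
</code></pre>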
<ol>
<li>Whenever you diagnose a single session hang or performance problem in <b>Oracle</b>, you should first look into V$SESSION_WAIT (sample it a few times to see whether the SEQ# and P1,2,3 values change). If you see the SEQ# value changing fast, you can sample and calculate wait time deltas from V$SESSION_EVENT as <a target="_blank" href="http://blog.tanelpoder.com/wp-content/plugins/wordpress-feed-statistics/feed-statistics.php?url=aHR0cDovL2Jsb2cudGFuZWxwb2Rlci5jb20vMjAwNy8xMi8wNi9vcmFjbGUtc2Vzc2lvbi1zbmFwcGVyLXYxMDYtcmVsZWFzZWQ=">Snapper</a> does.</li>
<li>Whenever you diagnose a single process hang or performance problem in <b>Unix</b>, you should first use ps, prstat or top to see whether the process still exists and whether it&#8217;s mostly on CPU or not, and then use truss/strace or pstack to diagnose exactly what the process is doing.</li>
</ol>
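<p>The Unix half of that checklist can be sketched as a small shell function. This is only an illustration: <code>check_process</code> is a made-up name, the 2-second interval is arbitrary, and because ps reports CPU time in whole seconds the verdict is coarse.</p>
<pre><code>#!/bin/sh
# Report whether a process is gone, sleeping or accumulating CPU time
check_process() {
    pid=$1
    # Step 1: does the process still exist?
    if ! ps -p "$pid" &gt; /dev/null 2&gt;&amp;1; then
        echo "gone"
        return
    fi
    # Step 2: did its CPU time grow between two samples?
    t1=$(ps -o time= -p "$pid")
    sleep 2
    t2=$(ps -o time= -p "$pid")
    if [ "$t1" = "$t2" ]; then
        echo "sleeping"
    else
        echo "on-cpu"
    fi
}

# Step 3 (run by hand): see what the process is actually doing
#   truss -p PID     # Solaris: trace system calls
#   strace -p PID    # Linux: trace system calls
#   pstack PID       # print the current stack
</code></pre>
<p>For a process stuck like the scp above, <code>check_process 1602</code> would be expected to print &#8220;sleeping&#8221;, which is the cue to reach for truss rather than for system-wide CPU numbers.</p>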
<p>There will be follow-up blog posts on usage details of those tools&#8230;</p>
<p> Update:</p>
<p>Here&#8217;s a link to the version of v$session_wait sampling script I use:</p>
<p><a rel="nofollow" href="http://blog.tanelpoder.com/wp-content/plugins/wordpress-feed-statistics/feed-statistics.php?url=aHR0cDovL3d3dy50YW5lbHBvZGVyLmNvbS9maWxlcy9zY3JpcHRzL3N3LnNxbA=="><font color="#a0522d">http://www.tanelpoder.com/files/scripts/sw.sql</font></a></p>
<p>See the comments of this blog entry for usage examples.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.tanelpoder.com/2008/01/05/systematic-application-troubleshooting-in-unix/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Advanced Oracle Troubleshooting Guide, Part 3: More adventures in process stack</title>
		<link>http://blog.tanelpoder.com/2007/09/06/advanced-oracle-troubleshooting-guide-part-3-more-adventures-in-process-stack/</link>
		<comments>http://blog.tanelpoder.com/2007/09/06/advanced-oracle-troubleshooting-guide-part-3-more-adventures-in-process-stack/#comments</comments>
		<pubDate>Wed, 05 Sep 2007 16:50:32 +0000</pubDate>
		<dc:creator>Tanel Poder</dc:creator>
				<category><![CDATA[Internals]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Tools]]></category>
		<category><![CDATA[Troubleshooting]]></category>
		<category><![CDATA[Unix/Linux]]></category>

		<guid isPermaLink="false">http://tanelpoder.wordpress.com/2007/09/06/advanced-oracle-troubleshooting-guide-part-3-more-adventures-in-process-stack/</guid>
		<description><![CDATA[&#8230;or rather thread stack as nowadays decent operating systems execute threads (or tasks as they&#8217;re called in Linux kernel). Anyway, stack trace gives you the ultimate truth on what your program is doing, exactly right now. There are couple of but&#8217;s like stack corruptions and missing symbol information which may make the traces less useful [...]]]></description>
			<content:encoded><![CDATA[<p>&#8230;or rather thread stack as nowadays decent operating systems execute threads (or tasks as they&#8217;re called in Linux kernel).</p>
<p>Anyway, a stack trace gives you the ultimate truth on what your program is doing, exactly right now. There are a couple of buts, like stack corruptions and missing symbol information, which may make the traces less useful for us, but for detailed hang &amp; performance troubleshooting the stack traces are a goldmine.</p>
<p>So, I present another case study &#8211; how to diagnose a complete database hang when you can&#8217;t even log on to the database.</p>
<p><span id="more-40"></span></p>
<p>I was doing some performance diagnosis work and wanted to find out where some pointers in some stack traces were pointing in the shared pool. X$KSMSP is the place where you&#8217;d find out what type of chunk happens to own the given address and which one is the parent heap of that chunk.</p>
<p>However, at <a href="http://blog.tanelpoder.com/wp-content/plugins/wordpress-feed-statistics/feed-statistics.php?url=aHR0cDovL3d3dy5ob3Rzb3MuY29tL3N5bTA4Lmh0bWw=" target="_blank">Hotsos Symposium</a> a year or two ago, I remember Jonathan Lewis warning people to be very careful with selecting from X$KSMSP on busy systems with large shared pools.</p>
<p>The reason is that X$KSMSP will take the shared pool latches and start scanning through shared pool heaps, reporting all shared pool chunks in it.</p>
<ul>
<li>This means that the shared pool latches are held for a very long time (the larger the shared pool and the more chunks you&#8217;ve got in it, the longer it takes).</li>
<li>This in turn means that you will not be able to allocate or deallocate any chunk from these pools while these latches are held</li>
<li>This in turn means that you will not be able to log on to the database (as it allocates space in the shared pool for session parameter values), you will not be able to hard parse a statement (as it needs to allocate SP space for your cursor and execution plan), and you may not even be able to do some soft parses (as those may also use shared pool latches in some cases).
</li>
</ul>
<p>The shared pool size was set to 5GB, due to some memory leak issues in the past.</p>
<p>I decided not to run a plain <b>select * from x$ksmsp</b> on the full X$ table; instead I ran a select * from x$ksmsp <b>where rownum &lt;=10</b>, just to see whether this approach of fetching a few rows at a time would work better. Obviously I did this in a test environment first&#8230; and it got completely hung. Oh man.</p>
<p>So I started up another sqlplus session to see what&#8217;s going on &#8211; and it got hung too. By then it was pretty clear to me that it was this killer X$KSMSP query, which didn&#8217;t care about my &#8220;where rownum &lt;= 10&#8221; trick and probably kept sweeping through the whole shared pool &#8211; holding the shared pool latch all the time.</p>
<p>As I wasn&#8217;t able to log on to Oracle, the next thing was to check what my process was doing from the OS level. The problem was that I had no idea what my session&#8217;s process ID was, and I didn&#8217;t know exactly when I had logged on with my session. I didn&#8217;t have access to /dev/kmem to use the <b>lsof</b> tool either, which would have shown me which SPIDs correspond to TCP connections from my workstation&#8217;s IP.</p>
<p>Note that even though this was a test environment, it was a busy system, used heavily by various other processes and batch jobs. So I needed to be really sure which process was mine before killing it at the OS level.</p>
<p>I started with ps and prstat:</p>
<pre><code>$ prstat 1
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 24331 xxx_user 8312M 8254M cpu515  10    0   1:22.44 4.1% oracle/1
 26136 oracle   8347M 8285M cpu528  10    0   0:10.25 4.1% oracle/1
 23949 xxx_user 8306M 8248M cpu520   0    0   1:23.21 4.1% oracle/1
 21853 oracle     20G   59M sleep   58    0   0:00.00 2.2% oracle/21
 21849 oracle     20G   59M cpu16   31    0   0:19.52 2.2% oracle/21
 21860 oracle     20G   58M cpu512  32    0   0:23.32 2.0% oracle/19
 24681 oracle     20G   35M cpu531  42    0   8:06.28 0.9% oracle/15
 26516 oracle     20G   13M sleep   58    0  11:34.29 0.0% oracle/1
 27522 oracle     20G   11M sleep   58    0   0:57.33 0.0% oracle/152
 13742 oracle   2064K 1800K cpu19   56    0   0:00.00 0.0% prstat/1
 27526 oracle     20G   11M sleep   58    0   0:58.45 0.0% oracle/149
 27528 oracle     20G   11M sleep   58    0   0:57.07 0.0% oracle/148
 27520 oracle     20G   11M sleep   58    0   0:57.28 0.0% oracle/157
 27524 oracle     20G   11M sleep   58    0   0:58.18 0.0% oracle/166

</code></pre>
<p>Even though the CPU usage and the process owner names gave me some hints as to which one could have been the troublemaker, I wanted to get proof before taking any action. And pstack was very helpful for that (note that due to this hang I wasn&#8217;t even able to log on with another session!).</p>
<p>So I ran pstack on the main suspect process, 26136. I picked this one as it was the only remote Oracle process using 100% of one CPU (which shows up as 4.1% on a 24-CPU server). The other two top processes were spawned by a local user using bequeath, thus their username was different.</p>
<pre><code>$ <b>pstack 26136</b>
26136:  oracleDBNAME01 (LOCAL=NO)
 000000010301f6c0 kghsrch (105068700, 0, ffffffff7bc77a68, 118, ffffffff7bc77ad8, 0) + 20
 0000000103021f20 kghprmalo (7fffffff, 0, ffffffff7bc77a68, 0, ffffffff77130b90, 140) + 460
 0000000103024804 kghalp (ffffffff7fffa898, ffffffff7bc77a68, 58, 7ffffffc, fffffffffffffffc, 104b86ce8) + 884
 0000000100bb78a4 ksmspc (0, 38004d988, ffffffff7bc7ad80, 54e50a7d0, 458, 104e22180) + 44
 000000010303564c <b>kghscn</b> (104e22180, 38004d988, 1000000000000000, 7ffffff8, 105068700, 54e50a7d0) + 68c
 0000000103034750 kghnwscn (0, ffffffff7bc7ad80, 100bb7860, 1858, 2fa0, 105068700) + 270
 0000000100bb7a44 <b>ksmshp</b> (100bb7, 1, 380000030, ffffffff7bc77a68, 100bb7860, ffffffff7bc7ad80) + 64
<b> 00000001024e67a8 qerfxFetch (10449c000, 0, 10506ae10, ffffffff7bc7ad40, 5651c5518, 1043ae680) + 328
 00000001024e7edc qercoFetch (541d07900, 10449dba0, 4e, 1, ffffffff7bc7ad98, 1024e6480) + fc</b>
 0000000101aa6f24 <b>opifch2</b> (2, 5, 60, 1, 104400, 1050685e8) + a64
 0000000101a4c694 kpoal8 (0, 1, ffffffff7fffdc70, 0, 10434c0a8, 1) + c34
 00000001002d0058 opiodr (14, 10506ae10, 10434ce70, 10506a, 105000, 104000) + 598
 0000000102cded94 ttcpip (105071450, 18, ffffffff7fffdc70, ffffffff7fffcf68, 104229c98, ffffffff7fffcf64) + 694
 00000001002cd3e8 opitsk (1002cf000, 1, 0, ffffffff7fffddc8, 105071450, 105071458) + 428
 0000000101aaf564 opiino (105070000, 105000, 57a321e48, 105000, dc, 105070290) + 404
 00000001002d0058 opiodr (4, 10506ae10, 10434c920, 10000, 105071, 105000) + 598
 00000001002cc174 opidrv (0, 4, 10506a, 105071450, 0, 3c) + 354
 00000001002c9828 sou2o (ffffffff7fffea98, 3c, 4, ffffffff7fffea78, 104aa6000, 104aa6) + 48
 00000001002a7b34 main (2, ffffffff7fffeb78, ffffffff7fffeb90, 0, 0, 100000000) + 94
 00000001002a7a7c _start (0, 0, 0, 0, 0, 0) + 17c

</code></pre>
<p>So, reading the bold sections from bottom up (and using the Metalink note 175982.1 I&#8217;ve <a href="http://blog.tanelpoder.com/wp-content/plugins/wordpress-feed-statistics/feed-statistics.php?url=aHR0cDovL2Jsb2cudGFuZWxwb2Rlci5jb20vMjAwNy8wOC8yNy9hZHZhbmNlZC1vcmFjbGUtdHJvdWJsZXNob290aW5nLWd1aWRlLXBhcnQtMi1uby1tYWdpYy1pcy1uZWVkZWQtc3lzdGVtYXRpYy1hcHByb2FjaC13aWxsLWRvLw==" target="_blank">mentioned earlier</a>):</p>
<ol>
<li>opifch2: this is a FETCH call being executed</li>
<li>qercoFetch: is a COUNT row source in execution plan</li>
<li>qerfxFetch: is a FIXED TABLE row source (in other words, accessing an X$ table)</li>
<li>ksmshp: looks like this is the function which is called under the covers when X$KSMSP is accessed</li>
<li>kghscn: sounds a lot like Kernel Generic Heap SCaN :)</li>
</ol>
<p>So, this stack trace proves that the process was definitely executing code which was doing a fetch from an X$ table (the qerfxFetch) and this had eventually resulted in an operation which sounded like a heap scanner ( kghscn ). This is exactly what X$KSMSP does: it scans through the entire shared pool heap and returns a row for every memory chunk in there.</p>
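<p>When there are many stacks to read, the bottom-up reading can be mechanized. A rough sketch, assuming pstack output in the format shown above (a hex address followed by the function name); <code>call_chain</code> is a made-up helper name, not part of pstack:</p>
<pre><code>#!/bin/sh
# Print the functions of one pstack dump in call order (outermost first)
call_chain() {
    awk '
        # Stack frame lines start with a hex address; field 2 is the function
        $1 ~ /^[0-9a-f]+$/ { fn[++n] = $2 }
        END {
            for (i = n; i &gt;= 1; i--)
                printf "%s%s", fn[i], (i &gt; 1 ? " -&gt; " : "\n")
        }
    '
}

# Four frames copied from the trace above
call_chain &lt;&lt;'EOF'
26136:  oracleDBNAME01 (LOCAL=NO)
 000000010303564c kghscn (104e22180, 38004d988, 1000000000000000, 7ffffff8, 105068700, 54e50a7d0) + 68c
 0000000103034750 kghnwscn (0, ffffffff7bc7ad80, 100bb7860, 1858, 2fa0, 105068700) + 270
 0000000100bb7a44 ksmshp (100bb7, 1, 380000030, ffffffff7bc77a68, 100bb7860, ffffffff7bc7ad80) + 64
 00000001024e67a8 qerfxFetch (10449c000, 0, 10506ae10, ffffffff7bc7ad40, 5651c5518, 1043ae680) + 328
EOF
</code></pre>
<p>For these four frames the function prints qerfxFetch -&gt; ksmshp -&gt; kghnwscn -&gt; kghscn. Against a live process it would be used as <code>pstack 26136 | call_chain</code>.</p>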
<p>Why was there a qercoFetch function involved even though my query did not have a count(*) in it? The explanation is below:</p>
<pre><code>SQL&gt; select * from x$ksmsp where rownum &lt;=1;

ADDR           INDX    INST_ID   KSMCHIDX   KSMCHDUR KSMCHCOM
-------- ---------- ---------- ---------- ---------- ---------------
04AA0A00          0          1          1          1 free memory

SQL&gt; @x

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------

--------------------------------------------------------------------
| Id  | Operation            |  Name       | Rows  | Bytes | Cost  |
--------------------------------------------------------------------
|   0 | SELECT STATEMENT     |             |       |       |       |
<b>|*  1 |  COUNT STOPKEY       |             |       |       |       |
</b>|   2 |   FIXED TABLE FULL   | X$KSMSP     |       |       |       |
--------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

<b>   1 - filter(ROWNUM&lt;=1)

</b></code></pre>
<p>The <b>qercoFetch</b> rowsource function was the one which should have enforced the &#8220;where rownum &lt;=10&#8221; part of my query. It should have stopped calling the qer<b>fx</b>Fetch function after it had received its 10 rows, but for some reason it continued on. Or the reason could be that the qerfxFetch function is not sophisticated enough to return control back to the calling function with only part of its resultset, giving qercoFetch no chance to decide whether to stop fetching half-way through. (Many row sources do have the flexibility of returning data in a cascading fashion up the stack. For example, nested loops are cascading (qerjoFetch, qerjotFetch), and hash joins are cascading from the moment the hash join build partition has been built in memory (qerhjFetch). Sort merge joins are not cascading from a high-level perspective, as both of the joined row sources need to be sorted before any rows can be returned. However, once both row sources have been sorted (either in memory or in the temp tablespace), the matched rows can be returned in a cascading fashion again, meaning that the calling function can fetch only X rows, do its processing, fetch the next X rows and so on. The X normally depends on arraysize &#8211; the number of rows you fetch from Oracle at a time.) Anyway, I&#8217;ll leave this interesting topic for a future post ;)</p>
<p>Continuing with my hang diagnosis, after I had identified the troublemaker X$KSMSP scanner process using prstat and pstack, I remembered that I could also have checked the sqlplus.exe&#8217;s start timestamp against the server process start time (accounting for the time zone difference and clock drift, of course). This can be helpful when all your processes are either trying to be on CPU or sleeping &#8211; unless you like to pstack through all of them.</p>
<p>I identified one such process with a logon time close to my sqlplus.exe&#8217;s start time and ran pstack on it:</p>
<pre><code>$ pstack 4848
4848:   oracleDBNAME01 (LOCAL=NO)
 ffffffff7c9a5288 <b>semsys</b>   (2, 6d0a000a, ffffffff7fffa79c, 1, 10)
 0000000102f782c8 <b>sskgpwwait</b> (ffffffff7fffab48, 10501dba8, 0, 200, ffffffffffffffff, 2b) + 168
 0000000100af4b74 <b>kslges</b> (<b>31c</b>, 0, 57f338090, 0, 876, 578802b68) + 4f4
 0000000102d70ea0 <b>kglhdgn</b> (3000, ffffffff7fffb3d0, 12, 0, 0, 380003910) + 180
 0000000102d58974 <b>kglget</b> (105068700, ffffffff7fffb360, fc, 1000, 0, 1) + b94
 000000010128f454 <b>kkttrex</b> (8, 1050683c0, 10506ae08, 105000, 10506a000, 9) + 4b4
 000000010128e6c8 <b>kktexeevt0</b> (31a8, 57a3ae0a0, 105068f74, 2, 4, 10506ae08) + 3a8
 0000000101a77b60 <b>kpolon</b> (51, 51, 40002d91, 105071450, 80000, ffffffff7fffdc70) + c0
 00000001002d0058 opiodr (1a, 10506ae10, 10434cc68, 10506a, 105000, 105000) + 598
 0000000102cded94 ttcpip (105071450, 20, ffffffff7fffdc70, ffffffff7fffcf68, 104228a98, ffffffff7fffcf64) + 694
 00000001002cd3e8 opitsk (1002cf000, 1, 0, ffffffff7fffddc8, 105071450, 105071458) + 428
 0000000101aaf564 opiino (105070000, 105000, 57f337fd8, 105000, e3, 105070290) + 404
 00000001002d0058 opiodr (4, 10506ae10, 10434c920, 10000, 105071, 105000) + 598
 00000001002cc174 opidrv (0, 4, 10506a, 105071450, 0, 3c) + 354
 00000001002c9828 sou2o (ffffffff7fffea98, 3c, 4, ffffffff7fffea78, 104aa6000, 104aa6) + 48
 00000001002a7b34 main (2, ffffffff7fffeb78, ffffffff7fffeb90, 0, 0, 100000000) + 94
 00000001002a7a7c _start (0, 0, 0, 0, 0, 0) + 17c

</code></pre>
<p>Looks like this one was my other session I tried to start up after my first session got hung.</p>
<p>Let&#8217;s read the highlighted sections of stack from the top this time:</p>
<ol>
<li>semsys is a Solaris system call that allows waiting for a semaphore count (value) to increase above 0. In other words, this system call allows a thread to sleep until someone posts it through the semaphore</li>
<li>sskgpwwait is an Oracle operating system dependent (OSD) layer function which allows an Oracle process to wait for an event</li>
<li>kslges is a latch get function (with sleep and timeout capability)</li>
<li>kglhdgn is a call for creating a new library cache object (handle)</li>
<li>kglget is a lookup call for locating a library cache object (if it doesn&#8217;t find one, a library cache miss is incremented and kglhdgn() above is called)</li>
<li>kkt calls are related to firing some internal triggers (logon triggers and auditing, anyone?)</li>
<li>kpolon is related to logon and session setup as far as I know</li>
</ol>
<p>So this stack indicates ( I might not be entirely correct ) that this process had got stuck while trying to create a library cache object in shared pool for an internal trigger during logon ( this database used session auditing by the way ).</p>
<p>Anyway, when I killed my original session (the one that was scanning through X$KSMSP), the database worked OK again.<br />
The advantage of stack tracing was once again shown in a case where Oracle&#8217;s instrumentation was not usable (due to my stupid mistake ;). This also illustrates once more that you should test everything out thoroughly in test environments, even if you think it <i>should</i> work OK.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.tanelpoder.com/2007/09/06/advanced-oracle-troubleshooting-guide-part-3-more-adventures-in-process-stack/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Operating systems are lazy allocating memory</title>
		<link>http://blog.tanelpoder.com/2007/08/28/operating-systems-are-lazy-allocating-memory/</link>
		<comments>http://blog.tanelpoder.com/2007/08/28/operating-systems-are-lazy-allocating-memory/#comments</comments>
		<pubDate>Tue, 28 Aug 2007 15:02:24 +0000</pubDate>
		<dc:creator>Tanel Poder</dc:creator>
				<category><![CDATA[Internals]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Unix/Linux]]></category>
		<category><![CDATA[Windows]]></category>

		<guid isPermaLink="false">http://tanelpoder.wordpress.com/2007/08/28/operating-systems-are-lazy-allocating-memory/</guid>
		<description><![CDATA[There was a discussion about whether Oracle really allocates all memory for SGA immediately on instance startup or not. And further, whether Oracle allocates memory beyond the SGA_TARGET if SGA_MAX_SIZE is larger than it. It&#8217;s worth reading this thread first: http://forums.oracle.com/forums/thread.jspa?threadID=535400&#38;tstart=0 I will paste an edited version of my reply to here as well: Don&#8217;t [...]]]></description>
			<content:encoded><![CDATA[<p>There was a discussion about whether Oracle really <strong>allocates</strong> all memory for the SGA immediately on instance startup or not. And further, whether Oracle allocates memory beyond SGA_TARGET if SGA_MAX_SIZE is larger than it.<br />
It&#8217;s worth reading this thread first: <a href="http://blog.tanelpoder.com/wp-content/plugins/wordpress-feed-statistics/feed-statistics.php?url=aHR0cDovL2ZvcnVtcy5vcmFjbGUuY29tL2ZvcnVtcy90aHJlYWQuanNwYT90aHJlYWRJRD01MzU0MDAmYW1wO3RzdGFydD0w">http://forums.oracle.com/forums/thread.jspa?threadID=535400&amp;tstart=0</a></p>
<p>I will paste an edited version of my reply here as well:</p>
<p><span id="more-36"></span></p>
<p>Don&#8217;t confuse address space set-up with allocating physical memory pages from RAM!</p>
<p>Even if ipcs -m shows x GB as the SGA shm segment length, it doesn&#8217;t mean this memory has actually been initialized and taken from RAM.</p>
<p>Decent OSes initialize pageable memory pages only when they&#8217;re touched for the first time, so a shm segment showing 10GB in ipcs -m output may really be only 10% &#8220;used&#8221;, as some pages have never been touched. There are many things which affect when and if the memory is actually *allocated*; the ones I remember right now are:</p>
<blockquote><p>1) using Solaris ISM &#8211; means Oracle will be using non-pageable large pages &#8211; the shm seg size you see in ipcs is fully allocated from RAM and locked in RAM.</p>
<p>2) using Solaris DISM, the SGA shm segment is pageable (small pages in Solaris 8, large pages from Solaris 9) and may not necessarily be allocated from RAM</p>
<p>3) using lock_sga=true -&gt; the SGA shm segment is allocated from RAM and locked in RAM</p>
<p>4) using _lock_sga_areas -&gt; some ranges of pages in SGA shm segment are locked to memory, some pages of SGA shm segment may still be uninitialized</p>
<p>5) using _pre_page_sga=true -&gt; all pages of SGA shm segment are touched on startup</p>
<p>6) a few others like _db_cache_pre_warm which affect memory page touching on startup&#8230;</p>
<p>7) using memory_target on Oracle 11g</p></blockquote>
<p>So, there are *many* things which affect physical memory allocation, but generally, unless you&#8217;re using non-pageable pages, not all of the SGA-size worth of memory is allocated from the OS during instance startup.</p>
<p>Normally these artificial instance startup errors after setting sga_max_size to xxxGB come from hitting the max shm segment size or max RAM + swap size (on Unixes). On Linux, on the other hand, you can overallocate memory, as Linux doesn&#8217;t reserve swap space to back anonymous memory mappings (Linux starts killing &#8220;random&#8221; processes instead when running out of memory. Nice, huh?)</p>
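<p>On Linux, one rough way to see this gap for a single process is to compare the mapped size and the resident size of its System V shared memory mappings in <code>/proc/PID/smaps</code>. The sketch below is an illustration under assumptions: <code>shm_touched</code> is a made-up name, it relies on the smaps layout of current Linux kernels (a mapping header line followed by <code>Size:</code> and <code>Rss:</code> lines in kB), and it only counts pages that this particular process has mapped in. On Solaris, <code>pmap -x</code> gives comparable per-mapping numbers.</p>
<pre><code>#!/bin/sh
# Sum mapped vs. resident kB over SysV shm mappings in smaps-format input
shm_touched() {
    awk '
        # Mapping header, e.g. "7f00-7f80 rw-s ... /SYSV00000000 (deleted)"
        /^[0-9a-f]+-[0-9a-f]+ / { in_shm = ($0 ~ /SYSV/) }
        in_shm &amp;&amp; $1 == "Size:" { size += $2 }
        in_shm &amp;&amp; $1 == "Rss:"  { rss  += $2 }
        END { printf "mapped_kb=%d resident_kb=%d\n", size, rss }
    '
}

# Usage (PID is a placeholder for an Oracle server process ID):
#   shm_touched &lt; /proc/PID/smaps
</code></pre>
<p>A large difference between the two numbers would suggest that much of the segment has never been touched by that process.</p>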
<p>This means that if your SGA_TARGET is lower than SGA_MAX_SIZE during startup then the pages &#8220;above&#8221; SGA_TARGET will never be touched, thus not allocated!</p>
<p>And if you ramp down SGA_TARGET during your instance lifetime, then the pages &#8220;above&#8221; the new SGA_TARGET won&#8217;t be touched anymore (after MMAN completes the downsizing), which means these pages will be paged out from physical memory if there&#8217;s a shortage of free physical memory.</p>
<p>Note that this &#8220;lazy&#8221; allocation behaviour comes from how modern operating systems work; it&#8217;s not a feature of Oracle. Oracle just has an option to request some specific behaviour from the OS on some platforms (like requesting ISM using the SHM_SHARE_MMU flag on Solaris when setting up the SGA SHM segment).</p>
<p>Thanks to this heavy &#8220;virtualization&#8221; of virtual memory pages and short codepath requirements for VM handling, it&#8217;s often hard to get a complete and accurate picture of the physical memory usage of individual processes &amp; SHM segments.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.tanelpoder.com/2007/08/28/operating-systems-are-lazy-allocating-memory/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
	</channel>
</rss>
