My Blog is 1 year old! 8-)

Very big thanks to everyone who are reading (and commenting) my blog !!!

I wrote my first real blog entry on 18th June 2007, about advanced Oracle troubleshooting in cases where conventional methods are not enough.

Also, I had my 50th post and 100,000th pageview couple days ago!

So 100000 divided by 365 days is 274 pageviews per day. The last week has been crazy though, no day with less than 1000 pageviews and the best day with 1319 views. Thanks :)

Top post over time is this one about HTMLizing your sqlplus output.
It’s not about internals, but thanks to this feature I will probably stay with sqlplus forever. I’m sure that with enough tweaking you can get a sqlplus script to prepare coffee and maybe even fetch beer for you (free beer if you have RAC licenses).

Thanks once more to all readers, I hope I’ve been useful and also thanks to all the people I’ve learnt from (a lot of people btw!)

I’m not gonna post anything for another 5 days as I’m going to vacation (and I need to wake up in 3 hours so I better stop that stress test script and get some sleep).

Have a good weekend!

Another use case for WaitProf – diagnosing “events in waitclass Other”

I recently diagnosed a performance issue where the “events in waitclass Other” occasionally took significant part of the session’s response time. For example Snapper (which reads wait event data from V$SESSION_EVENT) reported that during measuring 39.9% of the response time was spent on “events in waitclass Other”.

SQL> @sn 1 119

-- Session Snapper v1.07 by Tanel Poder ( http://www.tanelpoder.com )

---------------------------------------------------------------------------------------------------------------------------------------------
HEAD,     SID, SNAPSHOT START   ,  SECONDS, TYPE, STATISTIC                               ,         DELTA,  DELTA/SEC,     HDELTA, HDELTA/SEC
---------------------------------------------------------------------------------------------------------------------------------------------
DATA,     119, 20080621 05:22:05,        1, STAT, session logical reads                   ,         18284,      18284,     18.28k,     18.28k
DATA,     119, 20080621 05:22:05,        1, STAT, consistent gets                         ,         15301,      15301,      15.3k,      15.3k
DATA,     119, 20080621 05:22:05,        1, STAT, consistent gets from cache              ,         15228,      15228,     15.23k,     15.23k
DATA,     119, 20080621 05:22:05,        1, STAT, consistent gets from cache (fastpath)   ,         15136,      15136,     15.14k,     15.14k
DATA,     119, 20080621 05:22:05,        1, STAT, calls to get snapshot scn: kcmgss       ,            89,         89,         89,         89
DATA,     119, 20080621 05:22:05,        1, STAT, no work - consistent read gets          ,         14883,      14883,     14.88k,     14.88k
DATA,     119, 20080621 05:22:05,        1, STAT, table scans (short tables)              ,            21,         21,         21,         21
DATA,     119, 20080621 05:22:05,        1, STAT, table scan rows gotten                  ,       1429227,    1429227,      1.43M,      1.43M
DATA,     119, 20080621 05:22:05,        1, STAT, table scan blocks gotten                ,         17440,      17440,     17.44k,     17.44k
DATA,     119, 20080621 05:22:05,        1, WAIT, events in waitclass Other               ,        399831,     399831,   399.83ms,   399.83ms
--  End of snap 1

From Oracle 10g Oracle has consolidated lots of the events into “events in waitclass Other”. This is because saving all 900+ wait event stats for every session (in V$SESSION_EVENT array) would waste too much memory with giving little benefit (normally there’s only a handful of troublemaking events anyway). Therefore makes sense to aggregate the least likely happening events under some common category. Looks like Oracle kernel coders have set a threshold in event number above which all events are grouped under the “other” waitclass.

See below, this is from 11g:

[Read more...]

Killing an Oracle process from inside Oracle

I had a following situation few days ago – I was running a CREATE TABLE AS SELECT over a heterogenous services dblink. However I cancelled this command via pressing CTRL+C twice in Windows sqlplus (this actually just kills the client sqlplus and not the call).

Anyway, when I wanted to drop that table involved, this happened:

SQL> drop table MYTABLE;
drop table MYTABLE
           *
ERROR at line 1:
ORA-00054: resource busy and acquire with NOWAIT specified

I can’t drop a table as someone is holding a lock on it. Fair enough, this was a dev environment used only by me, so I used DBA_OBJECTS.OBJECT_ID to find out the object ID of that table:

SQL> @o MYTABLE

owner                     object_name                    object_type        CREATED           LAST_DDL_TIME     status           OID      D_OID
------------------------- ------------------------------ ------------------ ----------------- ----------------- --------- ---------- ----------
XYZ_DEV01_OWNER           MYTABLE                        TABLE              20080616 11:08:44 20080616 11:08:44 VALID          63764      63764

…and then I queried what enqueue locks were held on that object:

SQL> select * from v$lock where id1=63764;

ADDR     KADDR           SID TY        ID1        ID2      LMODE    REQUEST      CTIME      BLOCK
-------- -------- ---------- -- ---------- ---------- ---------- ---------- ---------- ----------
40034278 40034290        130 TM      63764          0          6          0       2662          0

Ok, I see session 130 holding a TM lock on that table. I queried the corresponding SERIAL# from v$session as well and killed the session:

SQL> alter system kill session '130,8764';
alter system kill session '130,8764'
*
ERROR at line 1:
ORA-00031: session marked for kill

SQL> select * from v$lock where id1=63764;

ADDR     KADDR           SID TY        ID1        ID2      LMODE    REQUEST      CTIME      BLOCK
-------- -------- ---------- -- ---------- ---------- ---------- ---------- ---------- ----------
40034278 40034290        130 TM      63764          0          6          0       2668          0

After hanging for 60 seconds, my kill command gave up (and marked my session for kill), but my lock was still not released… Now what?

[Read more...]

Experiments with Google Custom Search engine

This post is loosely related to Oracle, as I wanted to have a single point of entry search page for my Oracle-related searching – which I could bring up with pressing a hotkey combination.

Google has a thing called Custom Search Engine, which allows you define custom search engine rules, sites to include, exclude etc. Therefore you can demote or completely exclude some unwanted sites from your search results. Alberto Dell’Era just commented on this on Jonathan Lewis’es blog entry about Firefox.

CSE sounded promising, but the problem is that if you sign up for the free custom search service, Google will show its adverts on the result page without such filtering, which still can lead you to some crap sites out there.

Anyway, the search page I use is available here (both for download and online use), feel free to use it if the adverts don’t bother you.

http://tanelpoder.com/search/

To be fair to you, I also mention that if anyone clicks on advert links in CSE results, Google pays the search engine creator (me) money… I doubt if I ever become a billionaire because of that, but if it pays the yearly $25 I spend on WordPress domain mapping and custom CSS, I’d be happy :)
if you don’t like that fact, feel free to take the simple source code of my page and modify it for your own needs.

Ok, next post will be about a good use-case for ORADEBUG for crashing reluctant-to-die server processes, I hope this makes up for my spam post here… (hmm… can one actually spam their own blog?)

Update: Note that I haven’t explicitly excluded any sites from the search results, I have just marked the ones I like and those are given a better ranking than other sites. However its still possible you see junk sites in results when you search for some specific keywords. Anyway, my search page is intended just as an example, feel free to customize yourself a search engine matching exactly your needs.

 

cursor_space_for_time To Be Deprecated

If you haven’t seen the Meatlink note 565424.1 in the news yet, cursor_space_for_time parameter will be deprecated in Oracle 10.2.0.5 and 11.1.0.7.

That’s kind of good news, I hope this will eventually reduce the number of expert DBAs who set this parameter to true whenever they see any kind of shared pool / library cache latch contention.

On the other hand, spin_count was made an undocumented parameter long time ago, but is still heavily abused worldwide so I wouldn’t be surprised if the same happens to future _cursor_space_for_time…

Advanced Oracle Troubleshooting Guide, Part 6: Understanding Oracle execution plans with os_explain

Get ready for some more adventures in Oracle process stack!

Before proceeding though, please read this post about safety of different stack sampling approaches.

I have had few non-trivial Oracle troubleshooting cases, related to query hangs and bad performance, where I’ve wanted to know where exactly in execution plan the current execution is.
Remember, Oracle is just another program executing instructions clustered in functions on your server, so stack sampling can help out here as well.

So, I was looking into the following stack trace taken from an Oracle 10.1 database on Solaris SPARC, running a SQL with this execution plan.

[Read more...]

Debugger dangers

Whenever I deliver training or conference presentations on advanced troubleshooting topics, I usually spend some time demonstrating how to get and interpret Oracle server process stack traces.
As I’ve mentioned before, stack traces are the ultimate indicators showing where in Oracle kernel (or whatever application) code the execution currently is (or where it was when a crash occurred). This is the reason Oracle Support asks for stack traces whenever there’s a crash or non-trivial hang involved, that’s why Oracle database dumps errorstacks when ORA-600′s and other exceptions occur.

There are multiple ways for getting stack traces for Oracle, but not all ways are equal. Some give you more contextual info, some less, but what I’m blogging about today is that some ways are less safe than others.

I was using pstack on Linux for diagnosing an IO related performance issue. I executed a create table as select statement and ran pstack in a loop for getting stack traces from the running process.

However in one of the test runs I got following error in my Oracle session:

SQL> create table t as select * from dba_source;
create table t as select * from dba_source
                                *
ERROR at line 1:
ORA-01115: IO error reading block from file 1 (block # 11161)
ORA-01110: data file 1: '/u01/oradata/LIN10G/system01.dbf'
ORA-27091: unable to queue I/O
ORA-27072: File I/O error
Additional information: 3
Additional information: 11145
Additional information: 32768

I suspected that this issue was due Linux pstack, stopped the pstack script and ran my CTAS from the same Oracle session again:

[Read more...]

Short note on KGX Mutexes

I received a question on what’s the point of the use of Mutexes for Oracle cursors in library cache. For short intro, I’m pasting one of my fairly recent answers in Oracle forums about Oracle mutexes here:

In Oracle, latches and mutexes are different things and managed using different modules. KSL* modules for latches and KGX* for mutexes.

General mutex operatins require less CPU instructions than latch operations (as they aren’t as sophisticated as latches and don’t maintain get/miss counts as latches do).

But the main scalability benefit comes from that there’s a mutex structure in each child cursor handle and the mutex itself acts as cursor pin structure. So if you have a cursor open (or cached in session cursor cache) you don’t need to get the library cache latch (which was previously needed for changing cursor pin status), but you can modify the cursor’s mutex refcount directly (with help of pointers in open cursor state area in sessions UGA).

Therefore you have much higher scalability when pinning/unpinning cursors (no library cache/library cache pin latching needed, virtually no false contention) and no separate pin structures need to be allocated/maintained.

Few notes:

  1. library cache latching is still needed for parsing etc, the mutexes address only the pinning issue in library cache
  2. mutexes are currently used for library cache cursors (not other objects like PL/SQL stored procs, table defs etc)
  3. As mutexes are a generic mechanism (not library cache specific) they’re used in V$SQLSTATS underlying structures too
  4. When mutexes are enabled, you won’t see cursor pins from X$KGLPN anymore (as X$KGLPN is a fixed table based on the KGL pin array – which wouldn’t be used for cursors anymore)

Generating lots of rows using connect by – safely!

Every now and then I need to generate a large number of rows into some table.

Not that I have any personal urge to do so, but you know, this is needed for some test cases and so on ;-)

It’s quite well known that it’s possible to generate lots of rows using CONNECT BY LEVEL <= #rows syntax.

However as the CONNECT BY LEVEL syntax is recursive by nature, it will consume more memory the more rows you query in one shot ( I think it was Mark Bobak who once posted this observation to Oracle-L, but I didn’t manage to find that thread anymore ).

So, here’s a test case:

SQL> select count(r)
  2  from (
  3     select rownum r from dual connect by rownum <= 100000000
  4  )
  5  /
        select rownum r from dual connect by rownum <= 100000000
                             *
ERROR at line 3:
ORA-04030: out of process memory when trying to allocate 44 bytes (kxs-heap-w,cursor work heap)

After running for a while the server process run out of private memory, used for the CONNECT BY cursor work heap.

Let’s investigate:

[Read more...]

Advanced Oracle Troubleshooting Guide, Part 5: Sampling V$ stuff with WaitProf. Really fast. Using SQL!

I bet you thought I’ll be writing about direct SGA access?! ;)

Nope!

Direct SGA access has excellent troubleshooting potential (as long as you know the shared memory data structures), but it has one major drawback – very few companies have such tools already in place in their production systems.

I have occasionally been called in to solve an urgent performance problem, happening right now and it needs solving immediately! And did I mention, these are critical production systems. Where you can’t just install binary executables freshly downloaded off internet. In fact you would want to diagnose the issue with minimal impact and changes required to those production environments (and that leaves sql tracing out the first round troubleshooting tools for me as well!)

So, I’ve developed myself a toolset for such purpose, Snapper and sw.sql and some process stack reading techniques I already have introduced in my blog.

Next in line is waitprof.sql which is a high-frequency V$SESSION_WAIT sampler – implemented in plain SQL (not PL/SQL).

Waitprof is basically a sampling session wait profiler. It’s like running a select against V$SESSION_WAIT in a very tight loop and aggregating results – but I have used a trick to do all this in plain SQL, which gives me performance advantage over PL/SQL based loops. Waitprof is able to sample V$SESSION_WAIT for a session up to 100 000 times per second!

This depends on your hardware of course and Oracle version too, but normally you’ll get 50-70kHz sampling rate with it.

Ok, you want to see an example? ;-)

[Read more...]