Is this valid SQL syntax? :-)

I’m talking about this:

select-1from from dual;

Looks like invalid, right? Well, let’s run it:

SQL> select-1from from dual;

       ROM
----------
 -1.0E+000

This is because:

  1. Oracle doesn’t need whitespace for tokenizing the SQL statement (differences in character classes will do – as I’ve explained here)
  2. The first from “keyword” in the above statement is broken down to two tokens as an “F” right after a digit means that the preceding number is a FLOAT (and “D” means DOUBLE) and the tokenizer stops right there, knowing that whatever comes after this character (“ROM”) is a next token, which according to the Oracle SQL syntax rules will be assigned as the output column alias

The following funky-looking SQL statements are also valid:

SQL> select.1e2ffrom dual;

     .1E2F
----------
  1.0E+001

SQL> select.1e2fas"."from dual;

         .
----------
  1.0E+001

In the upper example, the “.1e2f” means number .1 * 10^2 (scientific notation) represented as a FLOAT internally and in the lower one I’ve just added a column alias with “AS” keyword just to make the SQL look a bit crazier.

:-)

Asynch descriptor resize wait event in Oracle

A lot of people have started seeing “asynch descriptor resize” wait event in Oracle 11gR2. Here’s my understanding of what it is. Note that I didn’t spend too much time researching it, so some details may be not completely accurate, but my explanation will at least give you an idea of why the heck you suddenly see this event in your database.

FYI, there’s a short, but incomplete explanation of this wait event also documented in MOS Note 1081977.1

Update: There’s a bug and a patch related to this wait event too.

The “direct path loader” (KCBL) module is used for performing direct path IO in Oracle, such as direct path segment scans and reading/writing spilled over workareas in temporary tablespace. Direct path IO is used whenever you see “direct path read/write*” wait events reported in your session. This means that IOs aren’t done from/to buffer cache, but from/to PGA directly, bypassing the buffer cache.

 

This KCBL module tries to dynamically scale up the number of asynch IO descriptors (AIO descriptors are the OS kernel structures, which keep track of asynch IO requests) to match the number of direct path IO slots a process uses. In other words, if the PGA workarea and/or spilled-over hash area in temp tablespace gets larger, Oracle also scales up the number of direct IO slots. Direct IO slots are PGA memory structures helping to do direct IO between files and PGA.

 

In order to be able to perform this direct IO asynchronously, Oracle also dynamically scales up the number of OS asynch IO descriptors, one for each slot (up to 4096 descriptors per process). When Oracle doesn’t need the direct IO slots anymore (when the direct path table scan has ended or a workarea/tempseg gets cancelled) then it scales down the number of direct IO slots and asynch IO descriptors. Scaling asynch IO descriptors up/down requires issuing syscalls to OS (as the AIO descriptors are OS kernel structures).

 

I guess this is supposed to be an optimization, to avoid running out of OS AIO descriptors, by releasing them when not they’re not needed, but as that Metalink note mentioned, the resize apparently sucks on Linux. Perhaps that’s why other ports also suffer and have seen the same wait event.

 

The “asynch descriptor resize” event itself is really an IO wait event (recorded in the wait class Other though), waiting for reaping outstanding IOs. Once this wait is over, then the OS call to change the amount of asynch IO descriptors (allocated to that process) is made. There’s no wait event recorded for the actual “resize” OS call as it shouldn’t block.

 

So, the more direct IO you do, especially when sorting/hashing to temp with frequent workarea closing/opening, the more of this event you’ll see (and it’s probably the same for regular tablespace direct path IO too).

 

This problem wouldn’t be noticeable if Oracle kept async io descriptors cached and wouldn’t constantly allocated/free them. Of course then you may end up running out of aio descriptors in the whole server easier. Also I don’t know whether there would be some OS issues with reusing cached aio descriptors, perhaps there is a good reason why such caching isn’t done.

 

Nevertheless, what’s causing this wait event is too frequent aio descriptor resize due to changes in direct IO slot count (due to changes in PGA workarea/temp segment and perhaps when doing frequent direct path scans through lots of tables/partitions too).

 

So, the obvious question here is what to do about this wait event? Well, first you should check how big part of your total response time this event takes at all?

 

  1. If it’s someting like 1% of your response time, then this is not your problem anyway and troubleshooting this further would be not practical – it’s just how Oracle works :)
  2. If it’s something like 20% or more of your response time, then it’s clearly a problem and you’d need to talk to Oracle Support to sort out the bug
  3. If it’s anything in between, make sure you don’t have an IO problem first, before telling that this is a bug. In one recent example I saw direct path reads take over a second on average when this problem popped up. The asynch descriptor resize wait event may well disappear from the radar once you fix the root cause – slow IO (or SQL doing too much IO). Remember, the asynch descriptor resize wait event, at least on Linux, is actually an IO wait event, the process is waiting for outstanding IO completion before the descriptor count increase/decrease can take place.

A little new feature for shared pool geeks :-)

If you’ve taken any shared pool dumps from 11g+ databases lately, you might have wondered about this:

http://tech.e2sn.com/oracle/troubleshooting/shared-pool

A: The most fundamental difference between hash and nested loop joins

Ok guys, thanks for waiting!

I ended up expanding the article quite a lot compared to what I had originally planned. In fact I only wrote 50% of what I plan to write, I’ll update the rest… um… later… Instead of just stating the difference between the joins I took a step back and elaborated something what I often see people doing (and talking about in newsgroups and lists too).

Basically the most fundamental (or biggest or most important) difference between nested loop and hash joins is that:

  • Hash joins can not look up rows from the inner (probed) row source based on values retrieved from the outer (driving) row source, nested loops can.

In other words, when joining table A and B (A is driving table, B is the probed table), then a nested loop join can take 1st row from A and perform a lookup to B using that value (of the column(s) you join by). Then nested loop takes the next row from A and performs another lookup to table B using the new value. And so on and so on and so on.

This opens up additional access paths to the table B, for example when joining ORDERS and ORDER_ITEMS by ORDER_ID (and ORDER_ID is leading column of PK in ORDER_ITEMS table), then for whatever orders are taken from ORDERS table, we can perform a focused, narrow index range scan on ORDER_ITEMS for every ORDER_ID retrieved from the driving ORDERS table. A hash join can’t do that.

Of course this doesn’t mean that hash joins can’t use any indexes for tables they read – index range scans and unique lookups can still be used under a hash join, but only if there are constant values in the query text (in form of literal or bind variables). If there are no such constant (filter) conditions under a hash join, then the other options to use that index would be to do an INDEX FULL SCAN (which is a range scan from end to end of the index) or INDEX FAST FULL SCAN (which is like a full table scan through the entire index segment). However none of these opportunities give the same benefits as nested loops looking up rows from row source B dynamically based on what was retrieved from A during runtime.

Note that this nested loops benefit isn’t limited to indexes only on table B, the difference is more fundamental than just a specific access path. For example, if table B happens to be a single table hash cluster or indexed X$ table, then the nested loop is also able to do “optimized” lookups from these row-sources, based on the values retrieved from table A.

So, my article with a lot of (loosely) related details is here:

In the comments section of my question, Tom, Bernard Polarski, Christian Antognini and Marc Musette got the closest to what I had in my mind when I asked the question. However, of course your mileage may vary somewhat depending on what kind of problems you have experienced the most over all the years. Also, Jonathan Lewis had a valid comment regarding that the answer depends on what exactly does one mean by “fundamental” and yeah this was open to interpretation.

Nevertheless, I wanted to emphasize that there’s a more important difference between NL and hash joins, than the usual stuff you see in training material which talk about implementation details like hash tables and memory allocation…

Some day I will complete that article, I plan to add some design advice in there, like denormalization opportunities for getting the best of the both worlds etc. But now I’m gonna get a beer instead.

Thanks for reading and answering my blog, I was quite impressed by the volume of comments & answers to my question. I must do this more often!

Oracle Closed World presentation links

Thanks to everybody who attended my OCW hacking session!

Sorry to guys who attended via webinar – I’ll do the session again in a few weeks, with audio from end to end hopefully! And I will get someone to assist me with monitoring the transmission quality and attendee questions etc.

Note that this stuff is mostly for hacking and fun – don’t use the undocumented stuff in production!

The links are below:

Download scripts & Tools:

Rlwrap links:

Diagnostic events:

X$TRACE
I haven’t written any articles on X$TRACE yet, but you can find some stuff from one of my very old presentations:

Or just type:

ALTER TRACING ENABLE “10704:4:ALL”

Where 10704 is the KST event number, 4 is the level and ALL means all Oracle PIDs (Oracle PIDs, not OSPID’s).


Which number takes more space in an Oracle row?

So, which number takes more bytes inside an Oracle row?

A: 123

B:  1000000000000000000000000000000000000

And the correct answer is … (drumroll) … A! The “big” number 1000000000000000000000000000000000000 actually takes less space than the “small” 123!

Let’s verify this:

SQL> select vsize(123) A, vsize(1000000000000000000000000000000000000) B from dual;

         A          B
---------- ----------
         3          2

WTF? Why does such a small number 123 take more space than  1000000000000000000000000000000000000 ?

Well, the answer lies in how Oracle stores numbers. Oracle NUMBER datatype doesn’t store numbers in their platform-native integer format. Oracle uses it’s own format which stores numbers in scientific notation, in exponent-mantissa form. More details about this here.

You can use the DUMP sql function to see the actual binary value of the number data stored:

select dump(123) from dual;

DUMP(123)
---------------------
Typ=2 Len=3: 194,2,24

SQL> select dump(1000000000000000000000000000000000000) from dual;

DUMP(10000000000000
-------------------
Typ=2 Len=2: Typ=2 Len=2: 211,2

So, although the number 1000000000000000000000000000000000000 is bigger than 123, when stored in base-10 exponent form, it really carries much less information in it than 123 (1 x 10^36 vs 123 x 10^0). Oracle doesn’t need many bits for keeping the precision of this large value as it happens to be a power of 10.

See what happens when I store a number only slightly bigger or smaller than the original large number, now the stored number requires much more storage for keeping the required precision:

SQL> select dump(1000000000000000000000000000000000000+1) from dual;

DUMP(1000000000000000000000000000000000000+1)
-------------------------------------------------------
Typ=2 Len=20: 211,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2

SQL> select dump(1000000000000000000000000000000000000-1) from dual;

DUMP(1000000000000000000000000000000000000-1)
-----------------------------------------------------------------------------------------
Typ=2 Len=19: 210,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100

Dropping and creating tables in read only tablespaces?!

You probably already know that it’s possible to drop tables in Oracle read only tablespaces… (You did know that already, right? ;-) Here’s a little example:

SQL> create tablespace ronly datafile '/u03/oradata/LIN112/ronly.01.dbf' size 10m;

Tablespace created.

SQL> create table test tablespace ronly as select * from all_users;

Table created.

SQL> alter tablespace ronly READ ONLY;

Tablespace altered.

SQL> drop table test;

Table dropped.

I just dropped a table from a read only tablespace! Well, perhaps it’s because that instead of dropping the table was put into recyclebin instead (which is a data dictionary update)? Let’s check which segments remain in the RONLY tablespace:

SQL> select owner,segment_name,segment_type from dba_segments where tablespace_name = 'RONLY';

OWNER   SEGMENT_NAME                    SEGMENT_TYPE
------- ------------------------------- ------------------
TANEL   BIN$ix7rAUXZfB3gQKjAgS4LXg==$0  TABLE

Indeed, it seems that the table segment wasn’t actually dropped. Well, let’s purge the recycle bin to try to actually drop the table segment:

SQL> purge recyclebin;

Recyclebin purged.

SQL> select owner,segment_name,segment_type from dba_segments where tablespace_name = 'RONLY';

OWNER    SEGMENT_NAME                   SEGMENT_TYPE
-------- ------------------------------ ------------------
TANEL    9.130                          TEMPORARY

Wow, Oracle has converted the table segment into a temporary segment instead (see segment_type)! Bur our tablespace is read only, how can it do that?! The answer is that neither the regular DROP nor DROP PURGE need to write anything into the tablespace where the segment resides! The initial DROP operation just updated data dictionary, like renaming the table to BIN$… in OBJ$ and so on. The second DROP PURGE operation just ran a bunch of deletes against data dictionary to indicate that the table object is gone. But why is the TEMPORARY segment left behind? This has to do with locally managed tablespaces. Before LMT days, when you dropped a segment, then the segment space was released and acquired back to tablespace through inserts/updates to UET$/FET$ (used/free extents) base tables, which resided in system tablespace like all other data dictionary base tables. But with LMTs, the free space information is kept in bitmaps in the tablespace files themselves! Thus, if you drop a table in a read only LMT tablespace, the table will be gone, but the space will not be physically released (as you can’t update the LMT bitmaps in read only tablespace files). However, Oracle doesn’t want to lose that space should someone make the tablespace read write later on, so the table segment is updated to be a TEMPORARY segment instead of completely deleting it from data dictionary. That’s how the SMON can clean it up later on should that tablespace become read-write again. The 9.130 in SEGMENT_NAME column means relative file# 9 and starting block# 130, that’s a segment’s unique identifier in a tablespace. Let’s move on. This example is executed on Oracle 11.2, while logged on to a non-SYS/SYSTEM user:

SQL> select status from dba_tablespaces where tablespace_name = 'RONLY';

STATUS
---------
READ ONLY

The tablespace is still in read only status. Let’s try to CREATE a table into that tablespace:

SQL> create table test(a int) tablespace ronly;

Table created.

What? I can also CREATE a table into a read only tablespace?! Well, this is the behavior you get starting from 11gR2 onwards, it’s called deferred segment creation, Oracle doesn’t need to create any segments for a table until you actually insert rows into it! So, Oracle creates all needed metadata about the new table in data dictionary, but doesn’t actually allocate any space from the tablespace. This applies to other segment types like index and table/index partitions as well. There’s new syntax which controls the deferred segment creation:

SQL> drop table test purge;

Table dropped.

SQL> create table test (a int) segment creation IMMEDIATE tablespace ronly;
create table test (a int) segment creation IMMEDIATE tablespace ronly
*
ERROR at line 1:
ORA-01647: tablespace 'RONLY' is read-only, cannot allocate space in it

In the above case I disabled the deferred segment creation and got an error immediately as Oracle tried to allocate space from the read only tablespace. Let’s try the other way:

SQL> create table test (a int) segment creation DEFERRED tablespace ronly;

Table created.

Now, with deferred segment creation enabled, Oracle didn’t try to allocate space from the read only tablespace. Let’s look into the segments in that tablespace again:

SQL> select owner,segment_name,segment_type from dba_segments where tablespace_name = 'RONLY';

OWNER        SEGMENT_NAME    SEGMENT_TYPE
------------ --------------- ---------------
TANEL        9.130           TEMPORARY

We still have that old segment from an earlier DROP TABLE operation waiting to be cleaned up (when the tablespace goes into read-write mode again), but no segment for our latest test table created. Let’s make the tablespace read write and check again:

SQL> alter tablespace ronly read write;

Tablespace altered.

SQL> select owner,segment_name,segment_type from dba_segments where tablespace_name = 'RONLY';

no rows selected

Apparently SMON has kicked in already and cleaned up the table segment which had been marked temporary earlier when I dropped the table in a read only tablespace.

Note that tables in SYS and SYSTEM schema can not use deferred segment creation:

SQL> create table sys.test (a int) segment creation deferred;
create table sys.test (a int) segment creation deferred
*
ERROR at line 1:
ORA-14223: Deferred segment creation is not supported for this table

SQL> create table system.test (a int) segment creation deferred;
create table system.test (a int) segment creation deferred
*
ERROR at line 1:
ORA-14223: Deferred segment creation is not supported for this table

SQL> create table tanel.test (a int) segment creation deferred;

Table created.

There’s also a parameter, deferred_segment_creation which controls the default behavior:

SQL> show parameter deferred

NAME_COL_PLUS_SHOW_PARAM     TYPE        VALUE
---------------------------- ----------- --------
deferred_segment_creation    boolean     TRUE

SQL>

The full power of Oracle’s diagnostic events, part 2: ORADEBUG DOC and 11g improvements

I haven’t written any blog entries for a while, so here’s a very sweet treat for low-level Oracle troubleshooters and internals geeks out there :)

Over a year ago I wrote that Oracle 11g has a completely new low-level kernel diagnostics & tracing infrastructure built in to it. I wanted to write a longer article about it with comprehensive examples and use cases, but by now I realize I won’t ever have time for this, so I’ll just point you to the right direction :)

Basically, since 11g, you can use SQL_Trace, kernel undocumented traces, various dumps and other actions at much better granularity than before.

For example, you can enable SQL_Trace for a specific SQL_ID only:

SQL> alter session set events 'sql_trace[SQL: 32cqz71gd8wy3{pgadep: exactdepth 0} {callstack: fname opiexe}
plan_stat=all_executions,wait=true,bind=true';


Session altered.

Actually I have done more in above example, I have also said that trace only when the PGA depth (the dep= in tracefile) is zero. This means that trace only top-level calls, issued directly by the client application and not recursively by some PL/SQL or by dictionary cache layer. Additionally I have added a check whether we are currently servicing opiexe function (whether the current call stack contains opiexe as a (grand)parent function) – this allows to trace & dump only in specific cases of interest!

The syntax is actually more powerful than that, in this example I’m running kernel tracing for a kernel component plus instructing Oracle to dump various other things at level 1 (callstack,process state and query block debug info) whenever a tracepoint (event) in the SQL Transformation component family is hit:

SQL> alter session set events 'trace[RDBMS.SQL_Transform[SQL: 32cqz71gd8wy3]
disk=high RDBMS.query_block_dump(1) processstate(1) callstack(1)';


Session altered.

And by now you are probably asking that where is this syntax formally documented? Google and MOS searches don’t return anything useful. Well, as with many other things, a good reference is stored within Oracle kernel itself!

Just log on as sysdba and type ORADEBUG DOC:

ORADEBUG DOC

SQL> oradebug doc
Internal Documentation
**********************
EVENT                           Help on events (syntax, event list, ...)
COMPONENT       [<comp_name>]   List all components or describe <comp_name>

This gives you the index page, now you can navigate on by running ORADEBUG DOC EVENT and take it from there. There’s lots of documentation there!

I have put the output with some comments and examples into my website too:

http://tech.e2sn.com/oracle/troubleshooting/oradebug-doc

Note that this feature is quite fresh, almost not used at all in the real (production) world, so I consider this quite experimental. I have managed to crash my session with some tests, so take the usual advice about any undocumented stuff (and oradebug) – don’t use it in production without thinking first and if you do use it, then use it at your own risk!

Oracle memory troubleshooting article

Randolf Geist has written a good article about systematic troubleshooting of a PL/SQL memory allocation & CPU utilization problem – and he has used some of my tools too!

http://oracle-randolf.blogspot.com/2010/05/advanced-oracle-troubleshooting-session.html

Execution plan Quiz: Shouldn’t these row sources be the other way around ;-)

Here’s a little trick question. Check out the execution plan below.

What the hell, shouldn’t the INDEX/TABLE access be the other way around?!

Also, how come it’s TABLE ACCESS FULL (and not by INDEX ROWID) in there?

This question is with a little gotcha, but can you come up with a query which produced such plan? ;-)

----------------------------------------------
| Id  | Operation          | Name   | E-Rows |
----------------------------------------------
|   0 | SELECT STATEMENT   |        |        |
|*  1 |  INDEX RANGE SCAN  | PK_EMP |      1 |
|*  2 |   TABLE ACCESS FULL| EMP    |      1 |
----------------------------------------------