A lot of people have started seeing “asynch descriptor resize” wait event in Oracle 11gR2. Here’s my understanding of what it is. Note that I didn’t spend too much time researching it, so some details may be not completely accurate, but my explanation will at least give you an idea of why the heck you suddenly see this event in your database.
FYI, there’s a short, but incomplete explanation of this wait event also documented in MOS Note 1081977.1
Update: There’s a bug and a patch related to this wait event too.
The “direct path loader” (KCBL) module is used for performing direct path IO in Oracle, such as direct path segment scans and reading/writing spilled over workareas in temporary tablespace. Direct path IO is used whenever you see “direct path read/write*” wait events reported in your session. This means that IOs aren’t done from/to buffer cache, but from/to PGA directly, bypassing the buffer cache.
This KCBL module tries to dynamically scale up the number of asynch IO descriptors (AIO descriptors are the OS kernel structures, which keep track of asynch IO requests) to match the number of direct path IO slots a process uses. In other words, if the PGA workarea and/or spilled-over hash area in temp tablespace gets larger, Oracle also scales up the number of direct IO slots. Direct IO slots are PGA memory structures helping to do direct IO between files and PGA.
- If it’s someting like 1% of your response time, then this is not your problem anyway and troubleshooting this further would be not practical – it’s just how Oracle works :)
- If it’s something like 20% or more of your response time, then it’s clearly a problem and you’d need to talk to Oracle Support to sort out the bug
- If it’s anything in between, make sure you don’t have an IO problem first, before telling that this is a bug. In one recent example I saw direct path reads take over a second on average when this problem popped up. The asynch descriptor resize wait event may well disappear from the radar once you fix the root cause – slow IO (or SQL doing too much IO). Remember, the asynch descriptor resize wait event, at least on Linux, is actually an IO wait event, the process is waiting for outstanding IO completion before the descriptor count increase/decrease can take place.




There’s one more important question to ask – is such resizing behavior new in Oracle 11gR2 or was it done in earlier versions too, but we just didn’t know about it, because this wait wasn’t instrumented properly? There have been plenty of issues in past with incomplete IO instrumentation in Oracle.
This will be your home-work ;-)
And no I don’t think that increasing /proc/sys/fs/aio-max-nr kernel parameter on Linux would avoid this problem. You see these wait events even on systems with almost no activity and with majority of aio descriptors unused.
Quote:
“The “asynch descriptor resize” event itself is really an IO wait event (recorded in the wait class Other though), waiting for reaping outstanding IOs.
*** end ***
I am confused, why waiting on outstanding IOs would be categorized under a “resize” operation?
Aren’t those IO operations needed? Is it for writes only (similar to cache flush) or for reads as well?
Just a few questions that came to mind that I need to follow up on.
@Christo Kutrovsky
You are waiting for IO because the resize needs them to complete. So you’re waiting for IOs to complete.
The reason you are waiting for these IOs is that resize operation (perhaps a coding bug, perhaps a design bug, perhaps intended so), but this doesn’t change that you’re waiting for the IOs to complete, *before* the resize happens. The resize operation itself isn’t timed by any wait event (at least on Linux), it’s the IO completion wait *before* resize which will show up as “asynch descriptor resize” wait.
Hi Tanel,
So, if I understand you correctly, this event is just a special case of a ‘direct path read’ or ‘direct path write’ wait event? Those events also, you’re simply waiting for I/O to complete, correct? So, if that’s the case, is there any reason for the distinction? Why would I care if I’m waiting for I/O to complete so that the number of allocated async descriptors can be modified (async descriptor resize) or if I’m waiting for I/O to complete for any other reason (direct path read/write)?
Thanks,
-Mark
@Tanel Poder
Mark Bobak: That’s exactly what I am wondering…
Christo, Mark,
I recently wrote more about this in a separate discussion, I’ll paste it here too:
The wait occurs on any change in number of async descriptors allocated for the process, both upsize and downsize.
The wait event measures the time taken to wait for outstanding IOs to complete before the resize operation ever happens.
So, this is an IO completion wait event, not a resize wait event. Although this IO completion wait happens because of this resize feature in Oracle.
I did a little more checking today – the actual descriptor resize on Linux is is done by first dropping the AIO queue and then recreating it with new number of events/descriptors (using io_destroy() and io_setup() syscalls). So that’s the reason for this symptom on Linux, it doesn’t allow you to resize existing AIO context, instead you’ll need to:
1) wait for all IOs to complete (this is what this wait event is about)
2) drop the AIO context
3) reallocate AIO context with new number of descriptors
Does this affect other platforms too (Windows, Solaris etc)? Or is just Linux?
In 10.2.0.5/11.2.0.1, there is pretty bug regarding asynchronous I/O tuning.
It seems to be redisovered every now and then under different conditions
Bug 9772888 “WARNING:Could not lower the asynch I/O … It is set to -1″
This can hang streams (increase delays) and cause peaks in the database performance.
Have you seen it?
This is confirmed on Solaris/Linux in the environments I work with
@Viljo Hakala
Nope haven’t seen it yet. Yeah it seems Oracle tries to resize the aio descriptor count down, issues the syscall and this (after taking some time) fails – returning -1.
I haven’t tested this enough to recommend it, but for the adventurous out there, there’s an event which ought to fix the number of AIO slots, preventing the constant resize “optimization”.
10353, 00000, “number of slots”
// *Cause:
// *Action: sets a given number of slots to use
// *Comment: a slot is a unit of I/O and this factor controls the number
// *Comment: of outstanding I/Os.
Note that I’m not recommending to use this (in production), so better to wait until these bugs are patched or at least get Oracle Support’s blessing before trying this.
@John
John, it’s possible on other platforms too, depending on the OS capabilities and how Oracle’s OSD layer uses it. I haven’t seen it as a problem on other than Linux myself though. Basically if this wait shows up on some other platform and takes significant part of response time, then yep you have a problem there :-)
I can confirm that this also happen on SUN Solaris.
In my opinion this is an old bug in new clothes. In 10.2.0.3 we had the issue that it sometimes did not work when several objects in different schemas had the same name. In this case some accesses where very slow.
I actual have a similar case. There are two schemas with two views having the same name.
One of them hit
asynch descriptor resize
waiting on this always takes nearby 180 seconds. If I create the same view but with different name, then it works fine.
So in my opinion this is an old unpublished bug which looks a little bit different in the past.
So far I try to find a workaround for this. In 10.2.0.3 it was, so far I remember correct, an underscore parameter to fix it. It was related to index or optimizer behavior. Maybe this won’t work for 11.2. In my opinion a really strange issue.
I found a solution for this. Very crazy. by asking V$SESSION_EVENT using for example
prompt ==============
prompt CURRENT WAITS
prompt ==============
column event format A30
column sid format 9999
column “Latchname” format a40
SELECT A.SID SID, SEQ# “N-Times”, A.EVENT “EVENT”, b.name “Latchname” , state, wait_time “Wait Time”, seconds_in_wait “Waiting (s)”
FROM V$SESSION_WAIT A , V$latchname b
WHERE
A.SID=&1 and a.p2= b.latch#(+) order by 1;
I always got that the session waited for this “asynch” stuff but this was a false information. After I had a 10046 trace I got from the trace that the reason was “cache buffer chains”.
Then I added db_cache_size and shared_pool_size to have a limit to this 11.2 database to reduce memory contention. In 10.x we had in this case “buffer examiniation waits” but well, 11.2 is just a new database.
After that the query worked well.
Maybe this will give all of you an idea how to solve this stupid issue.
@Steven
Well, how you use V$SESSION_WAIT is wrong. This was probably what misled you. Whenever querying V$SESSION or V$SESSION_WAIT for current wait, you should always include the STATE column. If the STATE column does ***NOT*** say WAITING then you are NOT waiting, you’re on CPU. This is what causes A LOT of confusion out there. My sw.sql and Snapper avoid this problem by using a CASE statement and showing “on cpu” instead of the event when the session isn’t actually waiting for anything (thus on cpu). ASH also does this kind of replacement like I do in my scripts. But V$SESSION_WAIT and V$SESSION have this problem.
You probably mean “buffer extermination” waits, not examination… this happens when the auto SGA manager downsizes buffer cache (in favor of some other pool). The opposite would be shared pool and library cache latch/mutex contention, when shared pool gets downsized in favor of buffer cache. These are classic SGA_TARGET/MEMORY_TARGET related problems and yep, setting minimums for various pools (so they wouldn’t fluctuate in size so much) is one way to avoid it. Another option is to completely disable SGA_TARGET/MEMORY_TARGET and that’s what a lot of high-activity OLTP sites do.
Well the result of this query was always correct in the past. I use it since roundabout 12 years and I was able to find a lot of issues using this query. So maybe it does not fit to 11.2 anymore. You are right that the “false” output was the reason for the delay of the analysis.
I do not like to have SGA_TARGET/MEMORY_TARGET. I agree with your statement.
You are right. A typo. “buffer extermination” is correct.
Since it is required to have proper calculated min values for the caches it makes no sense to use SGA_TARGET/MEMORY_TARGET.
One point I figured out for 11.2 is that, if MEMORY_TARGET is used than it makes sense to have set limits for cache and shared pool but also for the underscore and double underscore parameters. If all of them where set proper (some of them have to have same value), than the dynamic cache management work the way documented and in this case the library cache/mutex contention disappears.
On the other hand there are some poor progamming techniques used by progammers related to the usage of ref cursors. A new, false programming technique is to open masses of cursors during runtime by using a stored procedure and not to close these and also not close the connection but let the session die.
In this case you will also see masses of library cache/mutex contention. In my opinion related to the fact that if there are masses of such sessions born and died in parallel than the “garbage collection” queue of Oracle made by PMON becomes larger and larger and larger and this also blocks shared pool and library cache especially if they are huge..
One point the query already get STATE.
The output in this case was
==============
CURRENT WAITS
==============
SID N-Times EVENT Latchname STATE Wait Time Waiting (s)
—– ———- —————————— —————————————- ——————- ———- ———–
57 1130 asynch descriptor resize * WAITED SHORT TIME -1 180
Repeated executions only showed changes at “Time Waiting”. The number just increased so in my opinion a bug because the 10046 trace shows a different wait event chain.
So I guess that under some circumstances the content of V$SESSION_WAIT is still false.
PARSE #18446744071458606432:c=160000,e=162969,p=0,cr=14,cu=0,mis=1,r=0,dep=0,og=1,plh=3811922696,tim=3341058141780
EXEC #18446744071458606432:c=0,e=217,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3811922696,tim=3341058142162
WAIT #18446744071458606432: nam=’SQL*Net message to client’ ela= 14 driver id=1413697536 #bytes=1 p3=0 obj#=-1 tim=3341058142228
WAIT #18446744071458606432: nam=’asynch descriptor resize’ ela= 10 outstanding #aio=0 current aio limit=4294967295 new aio limit=676 obj#=-1 tim=3341058167185
WAIT #18446744071458606432: nam=’asynch descriptor resize’ ela= 1 outstanding #aio=0 current aio limit=4294967295 new aio limit=676 obj#=-1 tim=3341058168397
WAIT #18446744071458606432: nam=’asynch descriptor resize’ ela= 14 outstanding #aio=0 current aio limit=4294967295 new aio limit=676 obj#=-1 tim=3341058246952
WAIT #18446744071458606432: nam=’asynch descriptor resize’ ela= 13 outstanding #aio=0 current aio limit=4294967295 new aio limit=676 obj#=-1 tim=3341058342651
*** 2011-05-25 11:50:29.448
WAIT #18446744071458606432: nam=’latch: cache buffers chains’ ela= 11 address=16056105368 number=155 tries=0 obj#=-1 tim=3341166517672
*** 2011-05-25 11:50:51.560
WAIT #18446744071458606432: nam=’latch: cache buffers chains’ ela= 26 address=16056418448 number=155 tries=0 obj#=-1 tim=3341188629227
@Steven
The state above shows you were on CPU. And sql_trace output shows the waits are only a few microseconds each, with much longer time in between of the waits (on CPU). I would profile the session (with snapper for example) and see where is the most time going (probably CPU) and what else is happening in the session (v$sesstat STAT section in snapper).
By the way, there’s a bug 9829397, which apparently causes this wait event to show up on HP-UX Itanium.
@Tanel Poder
Hi, how exactly do we set that event in our init.ora file?
I am experiencing this problem on an epic scale (11.2.02, Windows Server 2003 SP2). Instance runs fine for months, job completes nightly in about 2 hours. On some seemingly random run, the job will run for > 24 hours, which ends up killing the session as one of the remote DBs used during the process bounces and causes everything to fail (why it is bounced every 24 hours is another discussion). When looking at a trace file for this session, the only waits listed for hours at a time look like this:
WAIT #5: nam=’asynch descriptor resize’ ela= 2 outstanding #aio=0 current aio limit=4294967295 new aio limit=1375 obj#=13389 tim=7972868357171
WAIT #5: nam=’asynch descriptor resize’ ela= 2 outstanding #aio=0 current aio limit=4294967295 new aio limit=1375 obj#=13389 tim=7972868473834
WAIT #5: nam=’asynch descriptor resize’ ela= 2 outstanding #aio=0 current aio limit=4294967295 new aio limit=1310 obj#=13389 tim=7972868488147
WAIT #5: nam=’asynch descriptor resize’ ela= 2 outstanding #aio=0 current aio limit=4294967295 new aio limit=1310 obj#=13389 tim=7972868494096
WAIT #5: nam=’asynch descriptor resize’ ela= 1 outstanding #aio=0 current aio limit=4294967295 new aio limit=1375 obj#=13389 tim=7972868654485
WAIT #5: nam=’asynch descriptor resize’ ela= 2 outstanding #aio=0 current aio limit=4294967295 new aio limit=1375 obj#=13389 tim=7972868766024
WAIT #5: nam=’asynch descriptor resize’ ela= 2 outstanding #aio=0 current aio limit=4294967295 new aio limit=1310 obj#=13389 tim=7972868779279
WAIT #5: nam=’asynch descriptor resize’ ela= 2 outstanding #aio=0 current aio limit=4294967295 new aio limit=1310 obj#=13389 tim=7972868784715
WAIT #5: nam=’asynch descriptor resize’ ela= 1 outstanding #aio=0 current aio limit=4294967295 new aio limit=1375 obj#=13389 tim=7972868940973
WAIT #5: nam=’asynch descriptor resize’ ela= 1 outstanding #aio=0 current aio limit=4294967295 new aio limit=1375 obj#=13389 tim=7972869052582
WAIT #5: nam=’asynch descriptor resize’ ela= 1 outstanding #aio=0 current aio limit=4294967295 new aio limit=1310 obj#=13389 tim=7972869065874
WAIT #5: nam=’asynch descriptor resize’ ela= 2 outstanding #aio=0 current aio limit=4294967295 new aio limit=1310 obj#=13389 tim=7972869071356
This section is repeated about once a second, over and over, for long durations. Any ideas what might cause this?
Just added this patch info to the post:
https://supporthtml.oracle.com/ep/faces/secure/km/DocumentDisplay.jspx?id=9829397.8