latch: cache buffers chains latch contention – a better way for finding the hot block

Tanel Poder

2009/08/27

Here's a treat for Oracle performance professionals and geeks who are looking for more systematic ways for cache buffers chains (CBC) latch contention troubleshooting. Cache buffers chains latches are taken when a process wants to walk through a cache buffer hash chain, looking if the block with required DBA (data block address) is in buffer cache. If the block happens to be in cache, then in most cases it has to be pinned first before use and unpinned after use, to make sure no-one else can perform an incompatible operation on that block at the same time. The modification of pin structures (pinning/unpinning) is also protected by CBC latches.

CBC latch contention can happen for multiple reasons, but one reason is that there is some really hot block in a SMP system with high number of CPUs (or CMT system with high number of threads like Sun T-series servers). Sometimes there happen to be multiple moderately hot blocks "under" the same CBC latch, which can result in latch contention again.

Traditionally DBAs used to look up the child latch address from V$SESSION_WAIT, sql_trace output or ASH and then look up all buffers protected by that latch from X$BH, using HLADDR column (HLADDR stands for Hash Latch Address). I also have a script for that, bhla.sql (Buffer Headers by Latch Address), which reports me all blocks currently in buffer cache, "under" that particular latch and the corresponding data block addresses and object names:

SQL> @bhla 27E5A780  <-- latch address reported by wait interface (Parameter1 in latch: cache buffers chains wait event)

FLG_LRUFLG                 OBJ OBJECT_TYPE         object                                          TCH DBA     
------------------- ---------- ------------------- ---------------------------------------- ---------- ---------
80000:8                      2 TABLE               SYS.SUBCOLTYPE$                                   0 1 7873  
80000:8                      2 TABLE               SYS.ATTRCOL$                                      0 1 7873  
0:0                      50472 INDEX               SYSMAN.MGMT_JOB_EXEC_IDX04                        1 3 27699 
0:0                      50472 INDEX               SYSMAN.MGMT_JOB_EXEC_IDX04                        1 3 27699 
0:4                      50472 INDEX               SYSMAN.MGMT_JOB_EXEC_IDX04                        1 3 27699 
2202000:8                 3710 TABLE               SYS.WRI$_ADV_REC_ACTIONS                          1 3 972   
0:0                      50472 INDEX               SYSMAN.MGMT_JOB_EXEC_IDX04                        1 3 27699 
0:0                      50472 INDEX               SYSMAN.MGMT_JOB_EXEC_IDX04                        1 3 27699 
2202000:8                94227 TABLE PARTITION     SYS.WRH$_ROWCACHE_SUMMARY                         2 3 36142 
80000:8                  54880 INDEX               PERFSTAT.STATS$UNDOSTAT_PK                       14 4 118331
0:8                         37 INDEX               SYS.I_OBJ2                                       24 1 55591 
2000:8                    3680 TABLE               SYS.WRI$_ADV_TASKS                               26 3 739   
0:8                         75 TABLE               SYS.IDL_UB2$                                     35 1 11745 
0:8                         73 TABLE               SYS.IDL_UB1$                                     38 1 12211

From above we see that this child latch is protecting many blocks in buffer cache. So which one is the troublemaker? This is where the touchcount (TCH column) has been traditionally used, the higher the TCH the hotter the buffer, right?

Actually this may not always be true. For example if there is a somewhat hot block which has been steadily in buffer cache for weeks, it may have very high touch count. Now if there is an extremely hot block used by every morning’s batch job (and causing the trouble) but not used later on in the day, this block will get aged out later in the day. Once a block is aged out, its metadata, including TCH is gone! So when the block is loaded back in next morning (causing trouble again), its touch count starts from zero again.

So, by just looking into touchcount numbers you can’t always reliably detect who’s the current troublemaker. One option here would be to query the touchcounts, then wait for a minute, query the touchcounts again and see who’s touchcount increased the most during the minute. This would give a bit better picture about who is causing trouble now.

But still, it would not be always reliable for another reason – touchcounts are incremented only after 3 seconds have passed since last increment! This factor has been coded in to avoid situation such a short but crazy nested loop join hammering a single buffer hundreds of thousands of times in few seconds and then finishing. The buffer wouldn’t be hot anymore but the touchcount would be hundreds of thousands due a single SQL execution. So, unless 3 seconds (of SGA internal time) has passed since last TCH update, the touchcounts would not be increased during buffer access.

This time is controlled by SGA variable kcbatt_ by the way:

SQL> oradebug dumpvar sga kcbatt_
ub4 kcbatt_ [3C440F4, 3C440F8) = 00000003

This 3-second delay leaves us in the following situation, let say there are 2 blocks protected by a CBC child latch:

One block has been accessed once every 3 seconds for 24 hours in a row. A block accessed once per 3 seconds is definitely not a hot block, but its touchcount would be around 28800 (86400 seconds per 24 hours / 3 = 28800).

And there is another block which is accessed crazily for 2 seconds in a row and this happens every 10 seconds. 2 seconds of consecutive access would increase the touchcount by 1. If such access pattern has been going on every 10 seconds over last 24 hours, then the touch count for that buffer would be 86400 / 10 = 8640.

In the first case we can have a very cold block with TCH = 28800 and in second case a very hot block with TCH = 8640 only and this can mislead DBAs to fixing the wrong problem.

Luckily this problem can be solved with LatchProfX :-)

If you don’t know what LatchProfX is, you may want to read these articles first - LatchProf gives you a better way for latch contention troubleshooting compared to old approaches:

And now I introduce a new feature in LatchProfX.

In addition to being able to report the SID of the latch holder and the kernel function why this latch was taken by the session, the newest LatchProfX can also report the object what was protected by the latch under investigation. For cache buffers chains the object is data block address of the block being accessed.

Here’s an example, let say I have CBC latch contention problems and from monitoring tools or sql_trace I have identified the hot latch, I use my sw.sql script here, to report what a session is waiting for:

SQL> @sw 138
 SID STATE   EVENT                        SEC_IN_WAIT                 P1
------- ------- ---------------------------- ----------- ------------------
    138 WAITING latch: cache buffers chains            0 address=0x27E5A780

Sw reports that this session is waiting on cache buffer chains child latch with address 0x27E5a780.

Now we can run latchprofx and report sid, name, hold mode and protected object for that child latch only (instead of latch name we specified its address):

SQL> @latchprofx sid,name,hmode,object % 27E5A780 100000

 SID NAME                   HMODE         OBJECT   Held  Gets  Held %  Held ms
---- ---------------------- ------------ ------- ------ ----- ------- --------
 138 cache buffers chains   exclusive     40EB02   3928  3797    3.93   41.637
 151 cache buffers chains   exclusive     40EB02   3711  3660    3.71   39.337
 138 cache buffers chains   shared        40EB02    623   623     .62    6.604
 151 cache buffers chains   shared        40EB02    544   544     .54    5.766

As you see, LatchProfX has reported a data block address in the object column. This is the datablock we are trying to access in buffer cache under protection of the given child latch.

Lets see what this DBA translates to:

SQL> @dba 40EB02

 RFILE#     BLOCK#
---------- ----------
         1      60162

Press enter to find the segment using V$BH (this may take CPU time), CTRL+C to cancel:

STATE      BLOCK_CLASS        OBJECT_TYPE         object                                          TCH  MODE_HELD
---------- ------------------ ------------------- ---------------------------------------- ---------- ----------
xcur       data block         INDEX               SYS.SYS_IOT_TOP_94276                           331          1

Press enter to query what segment resides there using DBA_EXTENTS (this can be IO intensive), CTRL+C to cancel:

OWNER                          SEGMENT_NAME         PARTITION_NAME  TABLESPACE_NAME
------------------------------ -------------------- --------------- -----------------
SYS                            SYS_IOT_TOP_94276                    SYSTEM

So, the troublemaker is an IOT index segment block belonging to table with object_id 94276. I can find the table name from dba_objects by object_id, or by using my script:

SQL> @oid 94276

owner                     object_name                    object_type        CREATED
------------------------- ------------------------------ ------------------ -----------------
SYS                       KILL_CPU                       TABLE              20090825 23:19:49

Also, as we know the segment name and relative file number (RFILE#) and block number (BLOCK#)  from above output we can dump the datablock with an ALTER SYSTEM DUMP DATAFILE command.

As that command requires absolute file id (not relative) we have to do a little conversion first (in cases where there are multiple files with same relative file number in database):

SQL> select file_id from dba_extents where relative_fno = 1 and segment_name = 'SYS_IOT_TOP_94276';

 FILE_ID
----------
 1

Of course in my little test case all relative file numbers match with absolute numbers (but if you have lots of datafiles, have directly migrated from Oracle 7 or use transportable tablespaces, you may hit this issue).

Finally I can run ALTER SYSTEM DUMP DATAFILE 1 BLOCK 60162 to find out more about the contents of that block. Another option would be to construct a rowid from the data_object_id, rfile and block and query the table using that but as that’s complicated for IOTs, I’ll leave this to a future post.

You can download latest version of LatchProfX here:


NB! Check out my 2018/2019 online training classes: Advanced Oracle SQL Tuning training, Advanced Oracle Troubleshooting training and the new Practical Linux Performance & Application Troubleshooting training!