Hard Drive Predictive Failures on Linux and Exadata

This post also applies to non-Exadata systems as hard drives work the same way in other storage arrays too – just the commands you would use for extracting the disk-level metrics would be different. Scroll down to smartctl if you wan’t to skip the Oracle stuff and get straight to the Linux disk diagnosis commands.

I just noticed that one of our Exadatas had a disk put into “predictive failure” mode and thought to show how to measure why the disk is in that mode (as opposed to just replacing it without really understanding the issue ;-)

SQL> @exadata/cellpd
Show Exadata cell versions from V$CELL_CONFIG....

DISKTYPE             CELLNAME             STATUS                 TOTAL_GB     AVG_GB  NUM_DISKS   PREDFAIL   POORPERF WTCACHEPROB   PEERFAIL   CRITICAL
-------------------- -------------------- -------------------- ---------- ---------- ---------- ---------- ---------- ----------- ---------- ----------
FlashDisk            192.168.12.3         normal                      183         23          8
FlashDisk            192.168.12.3         not present                 183         23          8                     3
FlashDisk            192.168.12.4         normal                      366         23         16
FlashDisk            192.168.12.5         normal                      366         23         16
HardDisk             192.168.12.3         normal                    20489       1863         11
HardDisk             192.168.12.3         warning - predictive       1863       1863          1          1
HardDisk             192.168.12.4         normal                    22352       1863         12
HardDisk             192.168.12.5         normal                    22352       1863         12

So, one of the disks in storage cell with IP 192.168.12.3 has been put into predictive failure mode. Let’s find out why!

To find out which exact disk, I ran one of my scripts for displaying Exadata disk topology (partial output below):

SQL> @exadata/exadisktopo2
Showing Exadata disk topology from V$ASM_DISK and V$CELL_CONFIG....

CELLNAME             LUN_DEVICENAME       PHYSDISK                       PHYSDISK_STATUS                                                                  CELLDISK                       CD_DEVICEPART                                                                    GRIDDISK                       ASM_DISK                       ASM_DISKGROUP                  LUNWRITECACHEMODE
-------------------- -------------------- ------------------------------ -------------------------------------------------------------------------------- ------------------------------ -------------------------------------------------------------------------------- ------------------------------ ------------------------------ ------------------------------ ----------------------------------------------------------------------------------------------------
192.168.12.3         /dev/sda             35:0                           normal                                                                           CD_00_enkcel01                 /dev/sda3                                                                        DATA_CD_00_enkcel01            DATA_CD_00_ENKCEL01            DATA                           "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
                     /dev/sda             35:0                           normal                                                                           CD_00_enkcel01                 /dev/sda3                                                                        RECO_CD_00_enkcel01            RECO_CD_00_ENKCEL01            RECO                           "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
                     /dev/sdb             35:1                           normal                                                                           CD_01_enkcel01                 /dev/sdb3                                                                        DATA_CD_01_enkcel01            DATA_CD_01_ENKCEL01            DATA                           "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
                     /dev/sdb             35:1                           normal                                                                           CD_01_enkcel01                 /dev/sdb3                                                                        RECO_CD_01_enkcel01            RECO_CD_01_ENKCEL01            RECO                           "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
                     /dev/sdc             35:2                           normal                                                                           CD_02_enkcel01                 /dev/sdc                                                                         DATA_CD_02_enkcel01            DATA_CD_02_ENKCEL01            DATA                           "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
                     /dev/sdc             35:2                           normal                                                                           CD_02_enkcel01                 /dev/sdc                                                                         DBFS_DG_CD_02_enkcel01         DBFS_DG_CD_02_ENKCEL01         DBFS_DG                        "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
                     /dev/sdc             35:2                           normal                                                                           CD_02_enkcel01                 /dev/sdc                                                                         RECO_CD_02_enkcel01            RECO_CD_02_ENKCEL01            RECO                           "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
                     /dev/sdd             35:3                           warning - predictive                                                             CD_03_enkcel01                 /dev/sdd                                                                         DATA_CD_03_enkcel01                                                                          "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
                     /dev/sdd             35:3                           warning - predictive                                                             CD_03_enkcel01                 /dev/sdd                                                                         DBFS_DG_CD_03_enkcel01                                                                       "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
                     /dev/sdd             35:3                           warning - predictive                                                             CD_03_enkcel01                 /dev/sdd                                                                         RECO_CD_03_enkcel01                                                                          "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"

Ok, looks like /dev/sdd (with address 35:3) is the “failed” one.

When listing the alerts from the storage cell, indeed we see that a failure has been predicted, warning raised and even handled – XDMG process gets notified and the ASM disks get dropped from the failed grid disks (as you see from the exadisktopo output above if you scroll right).

CellCLI> LIST ALERTHISTORY WHERE alertSequenceID = 456 DETAIL;
	 name:              	 456_1
	 alertDescription:  	 "Data hard disk entered predictive failure status"
	 alertMessage:      	 "Data hard disk entered predictive failure status.  Status        : WARNING - PREDICTIVE FAILURE  Manufacturer  : HITACHI  Model Number  : H7220AA30SUN2.0T  Size          : 2.0TB  Serial Number : 1016M7JX2Z  Firmware      : JKAOA28A  Slot Number   : 3  Cell Disk     : CD_03_enkcel01  Grid Disk     : DBFS_DG_CD_03_enkcel01, DATA_CD_03_enkcel01, RECO_CD_03_enkcel01"
	 alertSequenceID:   	 456
	 alertShortName:    	 Hardware
	 alertType:         	 Stateful
	 beginTime:         	 2013-11-27T07:48:03-06:00
	 endTime:           	 2013-11-27T07:55:52-06:00
	 examinedBy:        	 
	 metricObjectName:  	 35:3
	 notificationState: 	 1
	 sequenceBeginTime: 	 2013-11-27T07:48:03-06:00
	 severity:          	 critical
	 alertAction:       	 "The data hard disk has entered predictive failure status. A white cell locator LED has been turned on to help locate the affected cell, and an amber service action LED has been lit on the drive to help locate the affected drive. The data from the disk will be automatically rebalanced by Oracle ASM to other disks. Another alert will be sent and a blue OK-to-Remove LED will be lit on the drive when rebalance completes. Please wait until rebalance has completed before replacing the disk. Detailed information on this problem can be found at https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&id=1112995.1  "

	 name:              	 456_2
	 alertDescription:  	 "Hard disk can be replaced now"
	 alertMessage:      	 "Hard disk can be replaced now.  Status        : WARNING - PREDICTIVE FAILURE  Manufacturer  : HITACHI  Model Number  : H7220AA30SUN2.0T  Size          : 2.0TB  Serial Number : 1016M7JX2Z  Firmware      : JKAOA28A  Slot Number   : 3  Cell Disk     : CD_03_enkcel01  Grid Disk     : DBFS_DG_CD_03_enkcel01, DATA_CD_03_enkcel01, RECO_CD_03_enkcel01 "
	 alertSequenceID:   	 456
	 alertShortName:    	 Hardware
	 alertType:         	 Stateful
	 beginTime:         	 2013-11-27T07:55:52-06:00
	 examinedBy:        	 
	 metricObjectName:  	 35:3
	 notificationState: 	 1
	 sequenceBeginTime: 	 2013-11-27T07:48:03-06:00
	 severity:          	 critical
	 alertAction:       	 "The data on this disk has been successfully rebalanced by Oracle ASM to other disks. A blue OK-to-Remove LED has been lit on the drive. Please replace the drive."

The two alerts show that we first detected a (soon) failing disk (event 456_1) and then ASM kicked in and dropped the ASM disks from the failing disk and rebalanced the data elsewhere (event 456_2).

But we still do not know why we are expecting the disk to fail! And the alert info and CELLCLI command output do not have this detail. This is where the S.M.A.R.T monitoring comes in. Major hard drive manufacturers support the SMART standard for both reactive and predictive monitoring of the hard disk internal workings. And there are commands for querying these metrics.

Let’s find the failed disk info at the cell level with CELLCLI:

CellCLI> LIST PHYSICALDISK;
	 35:0     	 JK11D1YAJTXVMZ	 normal
	 35:1     	 JK11D1YAJB4V0Z	 normal
	 35:2     	 JK11D1YAJAZMMZ	 normal
	 35:3     	 JK11D1YAJ7JX2Z	 warning - predictive failure
	 35:4     	 JK11D1YAJB3J1Z	 normal
	 35:5     	 JK11D1YAJB4J8Z	 normal
	 35:6     	 JK11D1YAJ7JXGZ	 normal
	 35:7     	 JK11D1YAJB4E5Z	 normal
	 35:8     	 JK11D1YAJ8TY3Z	 normal
	 35:9     	 JK11D1YAJ8TXKZ	 normal
	 35:10    	 JK11D1YAJM5X9Z	 normal
	 35:11    	 JK11D1YAJAZNKZ	 normal
	 FLASH_1_0	 1014M02JC3    	 not present
	 FLASH_1_1	 1014M02JYG    	 not present
	 FLASH_1_2	 1014M02JV9    	 not present
	 FLASH_1_3	 1014M02J93    	 not present
	 FLASH_2_0	 1014M02JFK    	 not present
	 FLASH_2_1	 1014M02JFL    	 not present
	 FLASH_2_2	 1014M02JF7    	 not present
	 FLASH_2_3	 1014M02JF8    	 not present
	 FLASH_4_0	 1014M02HP5    	 normal
	 FLASH_4_1	 1014M02HNN    	 normal
	 FLASH_4_2	 1014M02HP2    	 normal
	 FLASH_4_3	 1014M02HP4    	 normal
	 FLASH_5_0	 1014M02JUD    	 normal
	 FLASH_5_1	 1014M02JVF    	 normal
	 FLASH_5_2	 1014M02JAP    	 normal
	 FLASH_5_3	 1014M02JVH    	 normal

Ok, let’s look into the details, as we also need the deviceId for querying the SMART info:

CellCLI> LIST PHYSICALDISK 35:3 DETAIL;
	 name:              	 35:3
	 deviceId:          	 26
	 diskType:          	 HardDisk
	 enclosureDeviceId: 	 35
	 errMediaCount:     	 0
	 errOtherCount:     	 0
	 foreignState:      	 false
	 luns:              	 0_3
	 makeModel:         	 "HITACHI H7220AA30SUN2.0T"
	 physicalFirmware:  	 JKAOA28A
	 physicalInsertTime:	 2010-05-15T21:10:49-05:00
	 physicalInterface: 	 sata
	 physicalSerial:    	 JK11D1YAJ7JX2Z
	 physicalSize:      	 1862.6559999994934G
	 slotNumber:        	 3
	 status:            	 warning - predictive failure

Ok, the disk device was /dev/sdd, the disk name is 35:3 and the device ID is 26. And it’s a SATA disk. So I will run smartctl with the sat+megaraid device type option to query the disk SMART metrics – via the SCSI controller where the disks are attached to. Note that the ,26 in the end is the deviceId reported by the LIST PHYSICALDISK command. There’s quite a lot of output, I have highlighted the important part in red:

> smartctl  -a  /dev/sdd -d sat+megaraid,26
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-2.6.32-400.11.1.el5uek] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     HITACHI H7220AA30SUN2.0T 1016M7JX2Z
Serial Number:    JK11D1YAJ7JX2Z
LU WWN Device Id: 5 000cca 221df9d11
Firmware Version: JKAOA28A
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Thu Nov 28 06:28:13 2013 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(22330) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   028   028   016    Pre-fail  Always       -       430833663
  2 Throughput_Performance  0x0005   132   132   054    Pre-fail  Offline      -       103
  3 Spin_Up_Time            0x0007   117   117   024    Pre-fail  Always       -       614 (Average 624)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       69
  5 Reallocated_Sector_Ct   0x0033   058   058   005    Pre-fail  Always       -       743
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   112   112   020    Pre-fail  Offline      -       39
  9 Power_On_Hours          0x0012   096   096   000    Old_age   Always       -       30754
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       69
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       80
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       80
194 Temperature_Celsius     0x0002   253   253   000    Old_age   Always       -       23 (Min/Max 17/48)
196 Reallocated_Event_Count 0x0032   064   064   000    Old_age   Always       -       827
197 Current_Pending_Sector  0x0022   089   089   000    Old_age   Always       -       364
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 0
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     30754         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

So, from the highlighted output above we see that the Raw_Read_Error_Rate indicator for this hard drive is pretty close to the threshold of 16. The SMART metrics really are just “health indicators” following the defined standards (read the wiki article). The health indicator value can range from 0 to 253. For the Raw_read_Error_Rate metric (which measures the physical read success from the disk surface), bigger value is better (apparently 100 is the max with the Hitachi disks at least, most of the other disks were showing around 90-100). So whenever there are media read errors, the metric will drop, the more errors, the more it will drop.

Apparently some of the read errors are inevitable (and detected by various checks like ECC), especially in the high-density disks. The errors will be corrected/worked around, sometimes via ECC, sometimes by a re-read. So, yes, your hard drive performance can get worse as disks age or are about to fail. If the metric ever goes below the defined threshold of 16, the disk apparently (and consistently over some period of time) isn’t working so great, so it should better be replaced.

Note that the RAW_VALUE column does not neccesarily show the number of failed reads from the disk platter. It may represent the number of sectors that failed to be read or it may be just a bitmap – or both combined into the low- and high-order bytes of this value. For example, when converting the raw value of 430833663 to hex, we get 0x19ADFFFF. Perhaps the low-order FFFF is some sort of a bitmap and the high 0rder 0x19AD is the number of failed sectors or reads. There’s some more info available about Seagate disks, but in our V2 we have Hitachi ones and I couldn’t find anything about how to decode the RAW_VALUE for their disks. So, we just need to trust that the “normalized” SMART health indicators for the different metrics tell us when there’s a problem.

Even though when I ran the smartctl command, I did not see the actual value (nor the worst value) crossing the threshold, 28 is still pretty close to the threshold 16, considering that normally the indicator should be close to 100. So my guess here is that the indicator actually did cross the threshold, this is when the alert got raised. It’s just that by the time I logged in and ran my diagnostics commands, the disk worked better again. It looks like the “worst” values are not remembered properly by the disks (or it could be that some SMART tool resets these every now and then). Note that we would see SMART alerts with the actual problem metric values in the Linux /var/log/messages file _if _the smartd service were enabled in the Storage Cell Linux OS – but apparently it’s disabled and probably some Oracle’s own daemon in the cell is monitoring that.

So what does this info tell us – a low “health indicator” for the Raw_Read_Error_Rate means that there are problems with physically reading the sequences of bits from the disk platter. This means bad sectors or weak sectors (that are soon about to become bad sectors probably). So, had we seen a bad health state for UDMA_CRC_Error_Count for example, it would have indicated a data transfer issue over the SATA cable. So, it looks like the reason for the disk being in the predictive failure state is about it having just too many read errors from the physical disk platter.

If you look into the other highlighted metrics above – the Reallocated_Sector_Ct and Current_Pending_Sector, you see there are hundreds of disk sectors (743 and 364) that have had IO issues, but eventually the reads finished ok and the sectors were migrated (remapped) to a spare disk area. As these disks have 512B sector size, this means that some Oracle block-size IOs from a single logical sector range may actually have to read part of the data from the original location and seek to some other location on the disk for reading the rest (from the remapped sector). So, again, your disk performance may get worse when your disk is about to fail or is just having quality issues.

For reference, here’s an example from another, healthier disk in this Exadata storage cell:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   086   086   016    Pre-fail  Always       -       2687039
  2 Throughput_Performance  0x0005   133   133   054    Pre-fail  Offline      -       99
  3 Spin_Up_Time            0x0007   119   119   024    Pre-fail  Always       -       601 (Average 610)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       68
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   114   114   020    Pre-fail  Offline      -       38
  9 Power_On_Hours          0x0012   096   096   000    Old_age   Always       -       30778
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       68
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       81
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       81
194 Temperature_Celsius     0x0002   253   253   000    Old_age   Always       -       21 (Min/Max 16/46)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

The Raw_Read_Error_Rate indicator still shows 86 (% from ideal?), but it’s much farther away from the threshold 16. Many of the other disks showed even 99 or 100 and apparently this metric value changed as the disk behavior changed. Some disks with value of 88 jumped to 100 and an hour later they were at 95 and so on. So the VALUE column allows real time monitoring of these internal disk metrics, the only thing that doesn’t make sense right now is that why does teh WORST column get reset over time.

For the better behaving disk, the Reallocated_Sector_Ct and Current_Pending_Sector metrics show zero, so this disk doesn’t seem to have bad or weak sectors (yet).

I hope that this post is another example that it is possible to dig deeper, but only when the piece of software or hardware is properly instrumented, of course. If it didn’t have such instrumentation, it would be way harder (you would have to take a stethoscope and record the noise of the hard drive for analysis or open the drive in dustless vacuum and see what it’s doing yourself ;-)

Hard Drive Predictive Failures on Linux and Exadata

Tanel Poder

2013-11-29