» Linux Archive

Linux-archive is a website aiming to archive linux email lists and to make them easily accessible for linux users/developers.


08-15-2010, 05:05 PM
Suvayu Ali

understanding smart logs

Hi everyone,

Some background:
Recently my RAM went bad, and I realised it too late. Over the last
few days my desktop had crashed more than once. Yesterday I received
the replacement RAM from RMA. On installing it and turning on my
machine I noticed errors like these:

Device: /dev/sdb [SAT], 172 Currently unreadable (pending) sectors

And I see that the errors started around the time my desktop
started crashing, before I found the faulty RAM.

The problem:
On subsequent boots it failed to boot, with fsck complaining about disk read
errors during a forced disk check. I was dropped to a read-only shell to
troubleshoot every time, so I ran fsck on all my partitions and found
errors on my /home. The error messages said "inode has deleted or empty
entries clear", "unlinked inode entries" and so on. Since I was on a
read-only partition I couldn't save them to a file (I guess paper would
have worked :-p). When prompted by fsck to fix the errors, I answered yes.

On a reboot, my system booted properly but I had lost some very
important data. All the missing directories were the ones which fsck had
complained about. I restored whatever I could from some backups.

To confirm that this was a one-off incident and that my disk hasn't gone
bad, I ran a SMART test (this drive is only a few months old):
# smartctl -t long /dev/sdb

But after the test I can't understand the output of the logs:

> # smartctl -a /dev/sdb
> smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
> Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
>
> === START OF INFORMATION SECTION ===
> Model Family: Western Digital Caviar Black family
> Device Model: WDC WD1001FALS-00E8B0
> Serial Number: WD-WMATV5966482
> Firmware Version: 05.00K05
> User Capacity: 1,000,204,886,016 bytes
> Device is: In smartctl database [for details use: -P show]
> ATA Version is: 8
> ATA Standard is: Exact ATA specification draft version not indicated
> Local Time is: Sat Aug 14 19:37:26 2010 PDT
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status: (0x84) Offline data collection activity
> was suspended by an interrupting command from host.
> Auto Offline Data Collection: Enabled.
> Self-test execution status: ( 121) The previous self-test completed having
> the read element of the test failed.
> Total time to complete Offline
> data collection: (18000) seconds.
> Offline data collection
> capabilities: (0x7b) SMART execute Offline immediate.
> Auto Offline data collection on/off support.
> Suspend Offline collection upon new
> command.
> Offline surface scan supported.
> Self-test supported.
> Conveyance Self-test supported.
> Selective Self-test supported.
> SMART capabilities: (0x0003) Saves SMART data before entering
> power-saving mode.
> Supports SMART auto save timer.
> Error logging capability: (0x01) Error logging supported.
> General Purpose Logging supported.
> Short self-test routine
> recommended polling time: ( 2) minutes.
> Extended self-test routine
> recommended polling time: ( 208) minutes.
> Conveyance self-test routine
> recommended polling time: ( 5) minutes.
> SCT capabilities: (0x3037) SCT Status supported.
> SCT Feature Control supported.
> SCT Data Table supported.
>
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
> 1 Raw_Read_Error_Rate 0x002f 199 199 051 Pre-fail Always - 1354
> 3 Spin_Up_Time 0x0027 253 253 021 Pre-fail Always - 1158
> 4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 40
> 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
> 7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
> 9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 1403
> 10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
> 11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
> 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 38
> 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 21
> 193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 18
> 194 Temperature_Celsius 0x0022 112 107 000 Old_age Always - 38
> 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
> 197 Current_Pending_Sector 0x0032 199 199 000 Old_age Always - 172
> 198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
> 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
> 200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
>
> SMART Error Log Version: 1
> No Errors Logged
>
> SMART Self-test log structure revision number 1
> Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
> # 1 Extended offline Completed: read failure 90% 1393 1106820646
>
> SMART Selective self-test log data structure revision number 1
> SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
> 1 0 0 Not_testing
> 2 0 0 Not_testing
> 3 0 0 Not_testing
> 4 0 0 Not_testing
> 5 0 0 Not_testing
> Selective self-test flags (0x0):
> After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.

All the values in the table above seem larger than the threshold, but
the report says PASSED. I'm not clear how to interpret this. Could
someone help? Thanks a lot in advance.

--
Suvayu

Open source is the future. It sets us free.
--
users mailing list
users@lists.fedoraproject.org
To unsubscribe or change subscription options:
https://admin.fedoraproject.org/mailman/listinfo/users
Guidelines: http://fedoraproject.org/wiki/Mailing_list_guidelines
 
08-15-2010, 05:17 PM
James McKenzie

understanding smart logs

Suvayu Ali wrote:
> All the values in the table above seems larger than the threshold. But
> the report says PASSED. I'm not clear how to interpret this. Could
> someone help? Thanks a lot in advance.
>
>
Got a good backup of this drive? Looks like it needs to be retested in
a different machine and, if it fails, replaced.

I had a drive that exhibited the same behavior, and eventually it failed.

James McKenzie

08-16-2010, 01:44 AM
Suvayu Ali

understanding smart logs

On Sunday 15 August 2010 10:17 AM, James McKenzie wrote:
> Got a good backup of this drive? Looks like it needs to be retested, in
> a different machine and if it fails, replaced.
>
> I had a drive that exhibited the same behavior and eventually, it failed.
>

I downloaded the bootable ISO of the disk diagnostic suite from Western
Digital and ran it. It claimed to detect and fix the errors. After the scan
the SMART logs read like this:

> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
> 1 Raw_Read_Error_Rate 0x002f 199 199 051 Pre-fail Always - 1545
> 3 Spin_Up_Time 0x0027 253 253 021 Pre-fail Always - 1066
> 4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 42
> 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
> 7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
> 9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 1426
> 10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
> 11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
> 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 38
> 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 21
> 193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 20
> 194 Temperature_Celsius 0x0022 109 107 000 Old_age Always - 41
> 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
> 197 Current_Pending_Sector 0x0032 200 199 000 Old_age Always - 78
> 198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
> 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
> 200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
>
> SMART Error Log Version: 1
> No Errors Logged
>
> SMART Self-test log structure revision number 1
> Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
> # 1 Conveyance offline Completed: read failure 90% 1422 1106820646
> # 2 Extended offline Completed: read failure 90% 1393 1106820646
>
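
Both self-test entries above stop at the same LBA_of_first_error. The following is a minimal sketch, assuming the log layout shown above and 512-byte logical sectors (an assumption; check the drive's real sector size), for pulling that LBA out of `smartctl -l selftest` output and turning it into a byte offset on the device. The function names are hypothetical helpers, not part of smartmontools.

```shell
#!/bin/sh
# first_error_lbas: read a SMART self-test log on stdin and print the
# LBA_of_first_error (last column) of every entry that reports a read failure.
# Leading "> " quote markers, as in a mail reply, are tolerated.
first_error_lbas() {
    awk '/^[> ]*# *[0-9]+ / && /read failure/ { print $NF }'
}

# lba_to_bytes LBA [SECTOR_SIZE]: convert an LBA to a byte offset.
# SECTOR_SIZE defaults to 512, which is an assumption about the drive.
lba_to_bytes() {
    echo $(( $1 * ${2:-512} ))
}

# Example usage (needs root and a real device, so it is left commented out):
#   smartctl -l selftest /dev/sdb | first_error_lbas | sort -u
#   lba_to_bytes 1106820646
```

Knowing the offset only tells you where on the disk the first unreadable sector sits; mapping it to a partition and then to a file is a separate, filesystem-specific step.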

Is it okay to continue with this drive? I bought it a few months back,
and I am not in a position to replace it unless I can RMA the unit.

> James McKenzie

All suggestions welcome.
--
Suvayu

Open source is the future. It sets us free.
08-16-2010, 01:46 AM
Robert Nichols

understanding smart logs

On 08/15/2010 12:05 PM, Suvayu Ali wrote:
>> SMART Attributes Data Structure revision number: 16
>> Vendor Specific SMART Attributes with Thresholds:
>> ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
>> 1 Raw_Read_Error_Rate 0x002f 199 199 051 Pre-fail Always - 1354
>> 3 Spin_Up_Time 0x0027 253 253 021 Pre-fail Always - 1158
>> 4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 40
>> 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
>> 7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
>> 9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 1403
>> 10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
>> 11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
>> 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 38
>> 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 21
>> 193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 18
>> 194 Temperature_Celsius 0x0022 112 107 000 Old_age Always - 38
>> 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
>> 197 Current_Pending_Sector 0x0032 199 199 000 Old_age Always - 172
>> 198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
>> 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
>> 200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
>>
>> SMART Error Log Version: 1
>> No Errors Logged
>>
>> SMART Self-test log structure revision number 1
>> Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
>> # 1 Extended offline Completed: read failure 90% 1393 1106820646

Your problem is the 172 sectors pending reallocation. Those are sectors
that are currently unreadable and will be reallocated to spare sectors
the next time they are written. The problem is that the drive has no
way to know whether the current contents are important (part of some
file, or file system metadata) or irrelevant (part of file system free
space), so the drive _must_ continue to return an error on any attempted
read of those sectors.
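
On the original question about thresholds: as far as I understand smartctl's table, the normalized VALUE column starts high and counts down, and an attribute only counts as failed once VALUE drops to or below THRESH. That is why every value being larger than its threshold still yields PASSED, while the raw pending-sector count is the number to watch. A rough sketch that checks both from `smartctl -A` output (the helper name `check_smart_attrs` is made up for illustration):

```shell
#!/bin/sh
# check_smart_attrs: read the attribute table printed by `smartctl -A` on
# stdin. Prints a FAILING line for any attribute whose normalized VALUE is
# at or below THRESH, and a PENDING line if Current_Pending_Sector has a
# non-zero raw value.
check_smart_attrs() {
    awk '
        # Attribute rows start with a numeric ID and have at least 10 fields:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        $1 ~ /^[0-9]+$/ && NF >= 10 {
            value  = $4 + 0
            thresh = $6 + 0
            if (value <= thresh)
                printf "FAILING %s value=%d thresh=%d\n", $2, value, thresh
            if ($2 == "Current_Pending_Sector" && $10 + 0 > 0)
                printf "PENDING %d\n", $10
        }
    '
}

# Example usage (needs root and a real device, so it is left commented out):
#   smartctl -A /dev/sdb | check_smart_attrs
```

Fed the table quoted above, this would report no failing attributes but would flag the 172 pending sectors.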

The most straightforward way to recover is to back up all of the data
now on the drive while making note of any files that have read errors,
write zeros to the entire drive, then re-make the file system(s) and
restore the data, hopefully having some other source for any important
files that could not be read when backing up.
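
One possible way to "make note of any files that have read errors" before or during the backup is to try reading every file end to end and log the ones that fail. This is only a sketch: `list_unreadable` is a hypothetical helper, and it treats any failed read (including a permissions problem) as an unreadable file.

```shell
#!/bin/sh
# list_unreadable DIR: print the path of every regular file under DIR that
# cannot be read completely. -xdev keeps find on the one filesystem.
list_unreadable() {
    find "$1" -xdev -type f -exec sh -c '
        for f do
            # Read the whole file and discard it; a bad sector makes cat fail.
            cat -- "$f" >/dev/null 2>&1 || printf "%s\n" "$f"
        done
    ' sh {} +
}

# Example usage (the log path is just an illustration):
#   list_unreadable /home > /root/unreadable-files.txt
```

Reading every file this way also forces the drive to touch each allocated sector, so expect it to be slow on a disk that keeps retrying bad reads.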

Trying to use a less ham-fisted approach gets complicated in a hurry.
You need to identify every file affected by a bad sector and re-write
it, then find all of the bad sectors that are now part of free space and
re-write those (filling up the file system with a huge all-zero file
would be one way), and then hope that there are no bad sectors that are
part of file system metadata or otherwise inaccessible via normal file
I/O.

If it were my drive I'd probably make an attempt at rewriting any
affected files I could find (using dd with the "conv=notrunc" option so
that the OS won't reallocate the space) and hope I could get lucky (all
of the errors bunched in a few files that I could recover elsewhere or
simply overwrite with zeros and delete). In the end, I'd probably waste
more time than the simplistic approach would take, and with less
assurance of success.
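
For illustration, this is roughly what overwriting a single block inside an existing file with dd and "conv=notrunc" could look like. It is a sketch under assumptions: a 512-byte sector size, a filesystem that rewrites data in place, and a block index within the file that you have already worked out by other means. `zero_file_sector` is a made-up name.

```shell
#!/bin/sh
# zero_file_sector FILE INDEX [SECTOR_SIZE]
# Overwrite one block of FILE with zeros without truncating the file.
# INDEX counts blocks of SECTOR_SIZE bytes (default 512) from the start of
# the file. conv=notrunc stops dd from cutting the file off after the block.
# Note: an INDEX past the end of the file would extend it.
zero_file_sector() {
    dd if=/dev/zero of="$1" bs="${3:-512}" seek="$2" count=1 conv=notrunc 2>/dev/null
}

# Example usage (file name and index are placeholders):
#   zero_file_sector /home/user/damaged.dat 2
```

The data in the overwritten block is lost, so this only makes sense for files you can restore from elsewhere or intend to delete afterwards.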

--
Bob Nichols "NOSPAM" is really part of my email address.
Do NOT delete it.

08-16-2010, 01:55 AM
Suvayu Ali

understanding smart logs

Hi Robert,

On Sunday 15 August 2010 06:46 PM, Robert Nichols wrote:
> Your problem is the 172 sectors pending reallocation. Those are sectors
> that are currently unreadable and will be reallocated to spare sectors
> the next time they are written. The problem is that the drive has no
> way to know whether the current contents are important (part of some
> file, or file system metadata) or irrelevant (part of file system free
> space), so the drive _must_ continue to return an error on any attempted
> read of those sectors.
>
> The most straightforward way to recover is to back up all of the data
> now on the drive while making note of any files that have read errors,
> write zeros to the entire drive, then re-make the file system(s) and
> restore the data, hopefully having some other source for any important
> files that could not be read when backing up.

Thank you for the advice. I used the disk diagnostic tool provided by WD
to check and fix errors on the disk. It seemed to reduce the
Current_Pending_Sector count from 172 to 78. I will take you up on your
suggestion over the next weekend. Thanks a lot again.

--
Suvayu

Open source is the future. It sets us free.
08-16-2010, 01:57 AM
JD

understanding smart logs

On 08/15/2010 06:44 PM, Suvayu Ali wrote:
> Is it okay to continue with this drive? I bought them a few months back,
> I am not in a position to change them unless I can RMA the unit.
>
>> James McKenzie
> All suggestions welcome.

What you can do is go to the manufacturer's web site.
Sometimes you can download their diagnostic tool, which
will give more detailed information. The tool is non-destructive
if you select the non-destructive tests.
If you get errors, you can then email that error log to
support@yourdiscmaker.com
and they will advise you to go to their RMA web site and request
a return merchandise authorisation (RMA).

I just did that with a 500 GB Seagate drive which had only a few days
remaining in the warranty. I received a brand new replacement.

Good luck.

08-16-2010, 02:00 AM
James McKenzie

understanding smart logs

Robert Nichols wrote:
> Your problem is the 172 sectors pending reallocation. Those are sectors
> that are currently unreadable and will be reallocated to spare sectors
> the next time they are written. The problem is that the drive has no
> way to know whether the current contents are important (part of some
> file, or file system metadata) or irrelevant (part of file system free
> space), so the drive _must_ continue to return an error on any attempted
> read of those sectors.
>

Bob:

With 'modern' drives you should NEVER see these errors. This drive is a
time bomb waiting to explode data all over the place. Get on the phone
with WD and get these drives replaced immediately. Don't waste your
time working on them. The old Hitachi/IBM 'Deathstars' would exhibit
the same behavior shortly before they died. As I said, now is the time
to make sure, not just hope, that you have a viable backup of your
important data, as you will be exercising it soon.

James McKenzie
(And yes, I've been there, done that with backup/restore of a 20 GB
drive when it failed...)

08-16-2010, 03:00 AM
JD

understanding smart logs

On 08/15/2010 06:44 PM, Suvayu Ali wrote:
> Is it okay to continue with this drive? I bought them a few months back,
> I am not in a position to change them unless I can RMA the unit.
>
>> James McKenzie
> All suggestions welcome.

Is it possible to purge the SMART logs and reset
the counters, and then rerun the SMART tests?

08-16-2010, 03:14 AM
James McKenzie

understanding smart logs

JD wrote:
> Is it possible to purge the SMART logs and reset
> the counters, and the rerun the SMART tests?
>
>
That should be possible. Any errors should be a good reason to send the
drives back.

James McKenzie

08-16-2010, 03:27 AM
JD

understanding smart logs

On 08/15/2010 08:14 PM, James McKenzie wrote:
> That should be possible. Any errors should be a good reason to send the
> drives back.
>
> James McKenzie
>
Of course. Be sure to zero out the drive if it contains
sensitive data or private intellectual property before
sending it for replacement.

dd if=/dev/zero of=/dev/sdx bs=256M

I use 256M to reduce the total number of
calls to write(2). If you have oodles of RAM,
then by all means use a larger number (keep it sane).
The kernel will break it down into many buffers and queue
them up for I/O.
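
Since that dd command destroys everything on the target, it may be worth wrapping it in a small guard before running it. A sketch, assuming a Linux system with /proc/mounts; `print_wipe_cmd` is a hypothetical helper that only prints the command rather than executing it:

```shell
#!/bin/sh
# print_wipe_cmd DEVICE: refuse anything that is not a block device, or that
# appears (itself or one of its partitions) in /proc/mounts. Otherwise print
# the dd command from the post above so it can be reviewed before running.
print_wipe_cmd() {
    dev=$1
    if [ ! -b "$dev" ]; then
        echo "refusing: $dev is not a block device" >&2
        return 1
    fi
    # Prefix match, so /dev/sdb also catches a mounted /dev/sdb1.
    if grep -q "^$dev" /proc/mounts 2>/dev/null; then
        echo "refusing: $dev (or a partition on it) is mounted" >&2
        return 1
    fi
    echo "dd if=/dev/zero of=$dev bs=256M"
}

# Example usage (/dev/sdx is the same placeholder as in the post):
#   print_wipe_cmd /dev/sdx
```

Printing instead of executing is a deliberate choice here: it leaves one last chance to check the device name against the serial number reported by smartctl.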
