Linux Archive > Debian > Debian User
Old 06-07-2011, 08:21 AM
Ong Chin Kiat
 
is this hard disk failure?

A couple of possibilities:

1. Hard disk is failing
2. Insufficient power available for your hard disk, causing it to spin up then spin down again
3. Controller error
4. Faulty connection or SATA port

The more likely possibilities are 1 and 3.
If you can get another hard disk to test, that will narrow down the possibilities.

On Tue, Jun 7, 2011 at 3:47 PM, surreal <firewalrus@gmail.com> wrote:

> Since this morning I have been getting strange system messages when starting the computer.

I typed dmesg and found these messages


[  304.694936] ata4.00: status: { DRDY ERR }
[  304.694939] ata4.00: error: { ICRC ABRT }
[  304.694954] ata4: soft resetting link
[  304.938280] ata4.00: configured for UDMA/33
[  304.938293] ata4: EH complete
[  304.970866] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
[  304.970873] ata4.00: BMDMA stat 0x26
[  304.970884] ata4.00: cmd 25/00:38:f6:2a:94/00:00:15:00:00/e0 tag 0 dma 28672 in
[  304.970887]          res 51/84:18:16:2b:94/84:00:15:00:00/e0 Emask 0x30 (host bus error)
[  304.970891] ata4.00: status: { DRDY ERR }
[  304.970895] ata4.00: error: { ICRC ABRT }
[  304.970909] ata4: soft resetting link
[  305.218280] ata4.00: configured for UDMA/33
[  305.218296] ata4: EH complete
[  305.880378] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
[  305.880385] ata4.00: BMDMA stat 0x26
[  305.880397] ata4.00: cmd 25/00:80:fe:22:8e/00:01:15:00:00/e0 tag 0 dma 196608 in
[  305.880399]          res 51/84:60:1e:23:8e/84:01:15:00:00/e0 Emask 0x30 (host bus error)
[  305.880404] ata4.00: status: { DRDY ERR }
[  305.880408] ata4.00: error: { ICRC ABRT }
[  305.880423] ata4: soft resetting link
[  306.126281] ata4.00: configured for UDMA/33
[  306.126297] ata4: EH complete




What do these messages mean, and what can I do to stop them from appearing? Help!
--
Harshad Joshi
 
Old 06-07-2011, 10:41 AM
Ralf Mardorf
 
is this hard disk failure?

On Tue, 2011-06-07 at 16:21 +0800, Ong Chin Kiat wrote:

> If you can get another hard disk to test, that will narrow down the
> possibilities
... and before doing this, turn off the power and disconnect and reconnect
all cables for this HDD, both on the HDD (power too) and on the mobo.

-- Ralf







--
To UNSUBSCRIBE, email to debian-user-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
Archive: http://lists.debian.org/1307443287.4467.2.camel@debian
 
Old 06-07-2011, 11:46 AM
Camaleón
 
is this hard disk failure?

On Tue, 07 Jun 2011 13:17:29 +0530, surreal wrote:

> Since this morning I have been getting strange system messages when
> starting the computer.
>
> I typed dmesg and found these messages
>
> [ 304.694936] ata4.00: status: { DRDY ERR }
> [ 304.694939] ata4.00: error: { ICRC ABRT }
> [ 304.694954] ata4: soft resetting link

(...)

What do you have attached to that port (ata 4)?

> What do these messages mean? What is the solution to prevent these
> messages from appearing? Help!

It could be a bad cable -or bad connection- or even a kernel issue. I mean,
it does not have to be a hard disk failure "per se". Anyway, running a
smartctl long test won't hurt either.
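For reference, a sketch of the long-test workflow, assuming smartmontools
is installed and the drive is /dev/sda (adjust the device node to yours):

```shell
# The long test runs inside the drive's own firmware: the command returns
# immediately and the disk stays usable (if slower) while the test runs.
start_long_test() {
    smartctl -t long "$1"      # prints an estimated completion time
}
show_selftest_log() {
    smartctl -l selftest "$1"  # newest result first; look for "Completed"
}
# Usage (as root):
#   start_long_test /dev/sda
#   ...wait until the estimated completion time...
#   show_selftest_log /dev/sda
```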

Greetings,

--
Camaleón


Archive: http://lists.debian.org/pan.2011.06.07.11.46.32@gmail.com
 
Old 06-07-2011, 11:59 AM
Ralf Mardorf
 
is this hard disk failure?

On Tue, 2011-06-07 at 11:46 +0000, Camaleón wrote:
> It can be a bad cable -or bad connection-

In my experience a hard disc never breaks without click-click-click noise
before it fails, but it's very common that cables and connections fail.

A tip: if there's a warranty seal, don't break it; try to loosen it with
a hairdryer. Then disconnect the cables and reseat them.


Archive: http://lists.debian.org/1307447981.4467.36.camel@debian
 
Old 06-07-2011, 12:03 PM
Ralf Mardorf
 
is this hard disk failure?

On Tue, 2011-06-07 at 13:59 +0200, Ralf Mardorf wrote:
> On Tue, 2011-06-07 at 11:46 +0000, Camaleón wrote:
> > It can be a bad cable -or bad connection-
>
> In my experience a hard disc never breaks without click-click-click noise
> before it fails, but it's very common that cables and connections fail.
>
> A tip: if there's a warranty seal, don't break it; try to loosen it with
> a hairdryer. Then disconnect the cables and reseat them.

PS: Back in the old Atari days we kept the seals and tore out the screw
under the seal by force. Not every seal can be removed unscathed with a
hairdryer, but usually not all screws are needed.



Archive: http://lists.debian.org/1307448211.4467.40.camel@debian
 
Old 06-07-2011, 01:02 PM
Miles Fidelman
 
is this hard disk failure?

Ralf Mardorf wrote:

In my experience a hard disc never breaks without click-click-click noise
before it fails, but it's very common that cables and connections fail.




By the time a disk gets to the click-click-click phase, there has been
LOTS of warning - it's just that today's disks include lots of internal
fault-recovery mechanisms that hide things from you, unless you run
SMART diagnostics (and not just the basic "smart status" either).


For example, if you have a machine that's suddenly running VERY slowly -
it's a good sign that a drive is experiencing internal read errors (unless
it's a laptop - a shorted battery is a good suspect). Both are lessons
learned the hard way, and not forgotten.


Turns out that modern drives have onboard processors that retry reads
multiple times - good for protecting data if you only have the one copy
on that drive, at the expense of reduced disk access times. Not so good if:


a. you don't notice that it's happening (the disk will eventually fail
hard), or,


b. you're running RAID - instead of the drive dropping out of the array,
the entire array slows down as it waits for the failing drive to
(eventually) respond


In either case, you'll tear your hair out trying to figure out why your
machine is running slowly (is it a virus, a file lock that didn't
release, etc., etc., etc.).


Lessons learned:

- if your machine is running really slowly, try a reboot -- if it
reboots properly, but takes 2 times as long (or longer) to shut down and
then come back up -- get very suspicious (if your patience lasts that long)


- if it's a laptop - pull the battery and try again - if everything is
normal, buy yourself a new battery


- if it's a server - try booting from a liveCD (if you can, first
disconnect the hard drive entirely) - if normal then you could well have
a hard drive problem (or you could have a virus)


- install SMART utilities and run "smartctl -A /dev/<your drive>" -- the
first line is usually the "raw read error" rate -- if the raw value (last
entry on the line) is anything except 0, that's a sign that your drive
is failing; if it's in the 1000s, failure is imminent -- it's just that
your drive's internal software is hiding it from you - replace it!


- if you're running RAID, be sure to purchase "enterprise" drives ("desktop"
drives try very hard to read a sector, despite the delay; enterprise
drives give up quickly as they expect failure recovery to be handled by
RAID)


- you would expect software raid (md) to detect slow drives, mark them
bad, and drop them from an array -- nope, md does not keep track of delay
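The "smartctl -A" check above can be scripted. A minimal sketch, assuming
smartmontools is installed; the hard-coded sample line stands in for live
output from a real drive (in real use, pipe in `smartctl -A /dev/sda`):

```shell
# The sample line follows smartmontools' attribute-table format; the raw
# value is the last field on the line.
sample='  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0'
raw=$(printf '%s\n' "$sample" | awk '$2 == "Raw_Read_Error_Rate" { print $NF }')
if [ "$raw" -gt 0 ]; then
    echo "WARNING: $raw raw read errors -- drive may be failing"
else
    echo "no raw read errors reported"
fi
```

One caveat to the rule of thumb: some vendors (Seagate in particular) pack
other counters into this attribute's raw value, so healthy drives can report
huge numbers; compare against a known-good drive of the same model before
panicking.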


and, not really relevant for Debian, but a direct offshoot of learning
the above lessons:


- if you're running a Mac or Windows, your system may be reporting
"smart status good" - but it's not really true - it's not looking at raw
read errors


- there seems to be a bug in the smart utilities for Mac (as available
through Macports and Fink) -- the smart daemon will fail periodically,
with the only symptom being that every few minutes, your machine will
slow to a crawl (spinning beachball everywhere) for 30 seconds or so,
then recover --- a really good example of taking a pre-emptive measure
that causes a new problem (I can't tell you how long it took to track
this one down - what with downloading every performance tracking tool I
could find.)



Miles Fidelman

--
In theory, there is no difference between theory and practice.
In<fnord> practice, there is. .... Yogi Berra




Archive: http://lists.debian.org/4DEE217C.9020906@meetinghouse.net
 
Old 06-07-2011, 01:25 PM
Ralf Mardorf
 
is this hard disk failure?

On Tue, 2011-06-07 at 09:02 -0400, Miles Fidelman wrote:
> Ralf Mardorf wrote:
> > In my experience a hard disc never breaks without click-click-click noise
> > before it fails, but it's very common that cables and connections fail.
> >
> >
>
> By the time a disk gets to the click-click-click phase,

A phase everybody knows for modern HDDs, but it's possible to get data
even from a disk that won't release the heads anymore [1].
For the Atari I've got a 42MB SCSI disk connected to a Lacom adaptor; it
sometimes needs several boots, but it's unbreakable.

> there has been
> LOTS of warning - it's just that today's disks include lots of internal
> fault-recovery mechanisms that hide things from you, unless you run
> SMART diagnostics (and not just the basic "smart status" either).
>
> For example, if you have a machine that's suddenly running VERY slowly

Correct! Likewise, if voodoo seems to be affecting your machine, it
seldom is voodoo, but a broken HDD.

(...)

My Samsung SATA drives have been without failure for a suspiciously long
time now, and I very, very often turn the computer off and on.
The only bad part is the SATA connectors; a friend already planned to solder
new SATA connectors onto his mobo. Note: nobody without experience in
soldering multi-layer boards should attempt this soldering. I planned to do
it too.

[1] When the heads aren't released anymore after the final click, there
is still a chance to get them working.

- Remove the HDD from the case, keeping the power and data cables
connected.
- With a rubber-headed mallet or something similar, knock against the HDD
from several angles while rebooting again and again.
- If it doesn't work, repeat this after the HDD has rested for a week.
Dunno why this helps, but it does; perhaps different room temperatures
work like gnomes.

-- Ralf


Archive: http://lists.debian.org/1307453128.2408.19.camel@debian
 
Old 06-07-2011, 03:27 PM
Henrique de Moraes Holschuh
 
is this hard disk failure?

On Tue, 07 Jun 2011, Miles Fidelman wrote:
> b. you're running RAID - instead of the drive dropping out of the
> array, the entire array slows down as it waits for the failing drive
> to (eventually) respond

Eh, it is worse.

A failing drive _will_ drop out of the array sooner or later, and it can
be very bad if it does so 'sooner' for any reason other than an imminent
unit failure: there is a high probability of other device(s) deciding to
also time out while the array is degraded or rebuilding, and that
results in service downtime (and usually data loss).

You never want discs dropping off the array due to performance problems
unrelated to imminent failure; the chance of multiple drops causing an
array failure is too high. You want to know the disk is slow, and to
replace it under controlled conditions.
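A rough way to find the slow member of a running array, sketched here under
the assumption that the components are /dev/sda and /dev/sdb and that hdparm
is installed; a drive that is silently retrying reads will usually stand out:

```shell
# Uncached, timed sequential reads; compare the MB/sec figures across
# the array members. Run as root on an otherwise idle system.
check_speed() {
    hdparm -t "/dev/$1"
}
# Usage:
#   check_speed sda
#   check_speed sdb
```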

This problem is *common*. Don't do hardware RAID on regular consumer
crap without SCT ERC support (aka TLER/CCTL/ERC), and don't buy
expensive crap with buggy firmware that the vendor refuses to issue a
public fix for to save face (but which you can get from your RAID card
vendor if you are very lucky). Linux smartctl gives you access to the
drive's SCT ERC page if it is supported.
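For reference, a sketch of querying and setting SCT ERC with smartctl,
assuming the drive is /dev/sda and actually supports the feature (many
desktop drives do not, and many that do forget the setting at power-off,
so it must be re-applied at boot):

```shell
show_erc() {
    # Print the drive's current read/write error-recovery timeouts.
    smartctl -l scterc "$1"
}
set_erc() {
    # smartctl takes the limits in tenths of a second, so convert.
    ds=$(( $2 * 10 ))
    smartctl -l "scterc,$ds,$ds" "$1"
}
# Usage (as root):
#   show_erc /dev/sda
#   set_erc /dev/sda 7    # 7-second cap, a common choice for RAID members
```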

Also, any device model (not a SPECIFIC device) for which firmware
updates are available that reduce the effective throughput should be
avoided like the plague, as that indicates they have shipped models with
manufacturing or component issues, and you can never be sure of what
you'll get when you buy a new one.

If you have already bought such a device, with a known high rate of design
or manufacturing defects/weaknesses, it depends on your luck whether you
got something good or a lemon. If SMART finds *NO* issues (no increasing
high fly writes, no growing reallocated sector count), and throughput
tests show the expected response, you have a good one: be happy.

If either test shows any such issues, remove it from production.
Secure-erase it, apply any firmware updates if you want to use it as
throw-away backup media (make sure the data is encrypted), or send it
for recycling.

Linux software raid is much more forgiving by default (and it can tune
the timeout for each component device separately): most of the time it
will just slow down instead of kicking component devices off the array
until data loss happens. That can be helpful if you got duped by the
vendor and sold a defective drive that can only operate safely
out-of-spec, but is still of some use to you.

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh


Archive: http://lists.debian.org/20110607152700.GB1137@khazad-dum.debian.net
 
Old 06-07-2011, 03:39 PM
Miles Fidelman
 
is this hard disk failure?

Henrique de Moraes Holschuh wrote:

On Tue, 07 Jun 2011, Miles Fidelman wrote:


b. you're running RAID - instead of the drive dropping out of the
array, the entire array slows down as it waits for the failing drive
to (eventually) respond




Linux software raid is much more forgiving by default (and it can tune
the timeout for each component device separately), and will just slow
down most of the time instead of kicking component devices off the array
until dataloss happens. Could be useful if you got duped by the vendor
and sold a defective drive that can only operate safely out-of-spec, but
can still be useful to you.



Not necessarily the best strategy if you have enough drives to survive 2
drive failures. Sometimes it is better to have a drive drop out of the
array and trigger an alarm than to have the system abruptly slow to a
crawl (particularly as that makes it hard to run diagnostics to figure
out which drive is bad).


Re. tuning: how? I've tried to find ways to get md to track timeouts,
and have never been able to find any relevant parameters. Queries to the
linux-raid list have yielded some fairly definitive-sounding statements,
from folks who should know, that md doesn't have any such timeouts. If
they're there, please... more information!







--
In theory, there is no difference between theory and practice.
In<fnord> practice, there is. .... Yogi Berra




Archive: http://lists.debian.org/4DEE4643.3060503@meetinghouse.net
 
Old 06-07-2011, 04:06 PM
Henrique de Moraes Holschuh
 
is this hard disk failure?

On Tue, 07 Jun 2011, Miles Fidelman wrote:
> >Linux software raid is much more forgiving by default (and it can tune
> >the timeout for each component device separately), and will just slow
> >down most of the time instead of kicking component devices off the array
> >until dataloss happens. Could be useful if you got duped by the vendor
> >and sold a defective drive that can only operate safely out-of-spec, but
> >can still be useful to you.
>
> Not necessarily the best strategy if you have enough drives to
> survive 2 drive failures. Sometimes better to have a drive drop out
> of the array and trigger an alarm than to have a system slow to a
> crawl precipitously (particularly as that makes it hard to run
> diagnostics to figure out which drive is bad).

YMMV. I'd never do that in a RAID array with important data in it.

External events that cause non-ERC disks to time out CAN and DO happen to
the entire set of disks in the same enclosure (such as impact vibrations
from nearby equipment or from the floor). It is a known problem in
datacenters, but it can happen at home as well when a large truck passes
close by, or someone bumps into the shelf/table/rack :-)

If enough of those devices go over the timeout threshold because of the
external event (a threshold which is rather spartan by default on most
hardware RAID cards), the array goes offline and data loss can happen.

Worse, rebuilding a degraded array exercises the array at the time it
is most vulnerable; it is not a safe operation unless you're rebuilding an
already redundant array (which is one of the reasons why RAID6 or anything
N+2 or above is a good idea). This is why you have to regularly scrub the
array at off-peak hours or as a background operation.
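On Linux md the scrub is triggered through sysfs. A sketch, assuming the
array is md0; Debian's mdadm package also schedules this periodically from
cron via /usr/share/mdadm/checkarray:

```shell
start_scrub() {
    # Ask md to read and compare every sector while the array stays online.
    echo check > "/sys/block/$1/md/sync_action"
}
scrub_status() {
    cat "/sys/block/$1/md/sync_action"   # "check" while running, "idle" when done
    cat "/sys/block/$1/md/mismatch_cnt"  # nonzero => inconsistencies were found
}
# Usage (as root, at off-peak hours):
#   start_scrub md0
#   scrub_status md0
```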

> Re. tuning: How? I've tried to find ways to get md to track
> timeouts, and never been able to find any relevant parameters.

It is not in md. It is in the libata/scsi layer. Just tune the per-device
parameters, e.g. in /sys/block/sda/device/*

AFAIK, if libata doesn't time out the device, md won't drop it off the
array.
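Concretely, the knob in question is the SCSI-layer command timeout, sketched
here assuming the component device is sda; it is per-device and is not an md
parameter at all:

```shell
show_timeout() {
    # Current command timeout, in seconds, for one block device.
    cat "/sys/block/$1/device/timeout"
}
set_timeout() {
    # Must be root; the value is lost on reboot, so persist it with a
    # udev rule or boot script if you depend on it.
    echo "$2" > "/sys/block/$1/device/timeout"
}
# Usage:
#   show_timeout sda
#   set_timeout sda 60
```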

> Queries to the linux-raid list have yielded some fairly definitive
> sounding statements, from folks who should know, that md doesn't
> have any such timeouts. If they're there, please.. more
> information!

md doesn't track performance (much, if at all), and it does not do even a
decent job of scheduling reads/writes over multiple md devices whose
components share the same physical device. It is quite simple (but not to
the point of being brain-dead like dm-raid).

OTOH, md really is a separate layer on top of the component devices. You can
smart-test and performance-test the component devices, change their
libata/scsi layer parameters...

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh


Archive: http://lists.debian.org/20110607160627.GD1137@khazad-dum.debian.net
 
