07-22-2010, 11:38 AM
Gregory Seidman

Which disk is failing?

I have a RAID1 (using md) running on two USB disks. (I'm working on moving
to eSATA, but it's USB for now.) That means I don't have any insight using
SMART. Meanwhile, I've been getting occasional fail events. Unfortunately,
I don't get any information on which disk is failing.

When the system comes up, it seems to be entirely random which disk comes
up as /dev/sda and which comes up as /dev/sdb. In fact, since my root disk
is on SATA, at least one time it came up as /dev/sda and the USB drives
came up as /dev/sdb and /dev/sdc, though I think that was under a different
kernel version. When I get a failure email, it tells me that it might be
due to /dev/sda1 failing -- except when it tells me that it might be due to
/dev/sdb1 failing. When things are working, mdadm -D /dev/md0 looks like
this:


/dev/md0:
Version : 00.90
Creation Time : Wed Feb 22 20:50:29 2006
Raid Level : raid1
Array Size : 312496256 (298.02 GiB 320.00 GB)
Used Dev Size : 312496256 (298.02 GiB 320.00 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 0
Persistence : Superblock is persistent

Update Time : Thu Jul 22 07:30:46 2010
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0

UUID : e4feee4a:6b6be6d2:013f88ab:1b80cac5
Events : 0.17961786

Number   Major   Minor   RaidDevice   State
   0       8       17        0        active sync   /dev/sdb1
   1       8        1        1        active sync   /dev/sda1

When it fails, however, the device names disappear and it just tells me
it's clean, degraded and shows an active disk, a removed disk, and a faulty
spare without any device names.
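
For reference, the healthy output above also carries a Major/Minor pair for each member (8:17 and 8:1 here), and those columns may still be filled in after the /dev name has gone. The following is only a minimal Python sketch of pulling that table out of saved `mdadm -D` text so it can be kept for later comparison. It assumes the column layout shown above; the function name `parse_members` and the `SAMPLE` constant are made up for illustration.

```python
# Sketch: extract the member table from the text printed by `mdadm -D`.
# Assumes the layout "Number Major Minor RaidDevice State [device]".

SAMPLE = """\
Number Major Minor RaidDevice State
0 8 17 0 active sync /dev/sdb1
1 8 1 1 active sync /dev/sda1
"""


def parse_members(detail_text):
    """Return one dict per row of the device table, in table order."""
    members = []
    in_table = False
    for line in detail_text.splitlines():
        tokens = line.split()
        if tokens[:2] == ["Number", "Major"]:
            in_table = True  # rows follow the header line
            continue
        if not in_table or len(tokens) < 5:
            continue
        if not (tokens[1].isdigit() and tokens[2].isdigit()):
            continue  # not a table row
        # The device name is optional; it is what goes missing on failure.
        device = tokens[-1] if tokens[-1].startswith("/dev/") else None
        state = tokens[4:-1] if device else tokens[4:]
        members.append({
            "major": int(tokens[1]),
            "minor": int(tokens[2]),
            "state": " ".join(state),
            "device": device,
        })
    return members


if __name__ == "__main__":
    for member in parse_members(SAMPLE):
        print(member)
```

Saving this alongside a listing of /dev/disk/by-id while the array is healthy would give something to compare against once a member is marked faulty.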

I even tried doing dd if=/dev/md0 of=/dev/null to see if I could get the
light flickering on one and not the other, but I just get I/O errors. Once
a disk fails, the RAID seems to go into a nasty state where it reads
properly through the crypto loop and LVM I have on top of it, but the
filesystems become read-only and the block devices just give errors. Worse,
the first indication (even before the mdadm email) that something is wrong
is a message to console that an ext3 journal write failed.

What I've been doing (which makes me tremendously uncomfortable since I
know a disk is failing) is to reboot and bring everything back up. This has
been working, but I know it's just a matter of time before the failing disk
becomes a failed disk. I could wait until then, since presumably I'll then
know which is which, but who knows what data corruption is possible between
now and then?

So, um, help?

--Greg


--
To UNSUBSCRIBE, email to debian-user-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
Archive: http://lists.debian.org/20100722113802.GB10802@anthropohedron.net

07-22-2010, 12:54 PM
Michal

Which disk is failing?

On 22/07/10 12:38, Gregory Seidman wrote:
> [...]
> So, um, help?

cat /proc/mdstat can help, but you need to get the serial numbers. Do this:

~# hdparm -i /dev/sda

/dev/sda:

Model=ST31000340AS, FwRev=SD15, SerialNo=9QJ1TRWK

Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs RotSpdTol>.5% }
RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4
BuffType=unknown, BuffSize=0kB, MaxMultSect=16, MultSect=?16?
CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=1953523055
IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2
UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6
AdvancedPM=no WriteCache=enabled
Drive conforms to: unknown: ATA/ATAPI-4,5,6,7

* signifies the current active mode

You see it says SerialNo=. Each HDD has its serial number printed on it
somewhere, but it's often hard to read, so get a label machine out and
clearly label each HDD with its serial number. When one dies, do a
cat /proc/mdstat to see which drive has failed. Say /dev/sda has failed:
run hdparm -i /dev/sda to get its serial number, open the case, rip that
drive out, stick a new HDD in (making sure you label this one with its
serial number), boot up, rebuild, etc.
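
As a rough illustration of the /proc/mdstat step: the kernel lists each member as name[slot] and appends (F) to one it has marked faulty. The Python sketch below pulls those names out of the text. The helper name `failed_members` and the sample text are illustrative only and are not output from the system discussed in this thread.

```python
import re

# A member looks like "sda1[2]" optionally followed by flags such as "(F)".
MEMBER = re.compile(r"(\w+)\[(\d+)\]((?:\([A-Z]\))*)")

# Illustrative sample in the usual /proc/mdstat layout (not from this thread).
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[0] sda1[2](F)
      312496256 blocks [2/1] [U_]

unused devices: <none>
"""


def failed_members(mdstat_text):
    """Return {array name: [member names flagged (F)]}."""
    failed = {}
    for line in mdstat_text.splitlines():
        head, sep, rest = line.partition(" : ")
        if not sep or not head.startswith("md"):
            continue  # only "mdN : ..." lines list members
        bad = [name for name, _slot, flags in MEMBER.findall(rest)
               if "(F)" in flags]
        if bad:
            failed[head] = bad
    return failed
```

On a live system it would be fed the real file, for example `failed_members(open("/proc/mdstat").read())`. The name it reports is still only the kernel name for the current boot, which is why the serial-number labels matter.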



--
Archive: http://lists.debian.org/4C483F88.70705@sharescope.co.uk

07-22-2010, 01:27 PM
randall

Which disk is failing?

On 07/22/2010 02:54 PM, Michal wrote:
> [...]
> ~# hdparm -i /dev/sda
> [...]

You could also try smartctl -a /dev/sda to get the disks' serial numbers.


--
Archive: http://lists.debian.org/4C484756.60809@songshu.org

07-22-2010, 01:32 PM
Stan Hoeppner

Which disk is failing?

Gregory Seidman put forth on 7/22/2010 6:38 AM:
> I have a RAID1 (using md) running on two USB disks. (I'm working on moving
> to eSATA, but it's USB for now.) That means I don't have any insight using
> SMART. Meanwhile, I've been getting occasional fail events. Unfortunately,
> I don't get any information on which disk is failing.

Are any USB communication errors being logged along with the md and ext3 errors?

Are you sure it's a disk drive problem and not an issue with the kernel
drivers, system BIOS, USB controller, cabling, or a combination thereof?

How long (days, weeks, months, years) did this exact setup function properly
before you started seeing these problems?

Did you recently perform any major software upgrades (kernel/drivers) shortly
before this problem surfaced?

Is this a laptop? If so which make/model?

What's the make/model of the USB disk drives?

What is the age of each piece of hardware we're discussing?

--
Stan


--
Archive: http://lists.debian.org/4C48486C.7030804@hardwarefreak.com

07-22-2010, 02:27 PM
Gregory Seidman

Which disk is failing?

On Thu, Jul 22, 2010 at 03:27:50PM +0200, randall wrote:
> On 07/22/2010 02:54 PM, Michal wrote:
>> On 22/07/10 12:38, Gregory Seidman wrote:
[...]
>>> So, um, help?
>>>
>>> --Greg
>>>
>> cat /proc/mdstat can help but you need to get the serial numbers. Do
>> this;
>>
>> ~# hdparm -i /dev/sda
[...]

# hdparm -i /dev/sda
HDIO_GET_IDENTITY failed: Invalid argument

/dev/sda:

# hdparm -i /dev/sdb
HDIO_GET_IDENTITY failed: Invalid argument

/dev/sdb:

> you could also try smartctl -a /dev/sda to get the disks serial numbers

# smartctl -a /dev/sda
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: ST332062 0A Version: 3.AA
scsiModePageOffset: response length too short, resp_len=4 offset=4 bd_len=0
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

# smartctl -a /dev/sda -T permissive
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: ST332062 0A Version: 3.AA
scsiModePageOffset: response length too short, resp_len=4 offset=4 bd_len=0
>> Terminate command early due to bad response to IEC mode page

Error Counter logging not supported
scsiModePageOffset: response length too short, resp_len=4 offset=4 bd_len=0
Device does not support Self Test logging


Neither of these tools seems to be of much use here.

--Greg


--
Archive: http://lists.debian.org/20100722142710.GC10802@anthropohedron.net

07-22-2010, 02:39 PM
Miles Fidelman

Which disk is failing?

Gregory Seidman wrote:

> # hdparm -i /dev/sda
> HDIO_GET_IDENTITY failed: Invalid argument
>
> /dev/sda:
>
> # hdparm -i /dev/sdb
> HDIO_GET_IDENTITY failed: Invalid argument


You might try "lusb" - to list devices on your usb bus. That might help
you identify specific devices.


Also try nosing around in the sub-directories under /dev/disk
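
On udev-based systems the by-id sub-directory there is a set of symlinks, usually named after what the device or its USB bridge reports (often bus, model and serial), each pointing at the current kernel name. That name should stay with the physical disk even when sda and sdb swap between boots. The following is a minimal Python sketch of listing that mapping. It assumes /dev/disk/by-id exists on the machine, and the function name `stable_names` is made up for illustration.

```python
import os


def stable_names(by_id_dir="/dev/disk/by-id"):
    """Map kernel name (e.g. 'sda') -> sorted list of persistent link names."""
    mapping = {}
    if not os.path.isdir(by_id_dir):
        return mapping  # no such directory on this system
    for link in sorted(os.listdir(by_id_dir)):
        path = os.path.join(by_id_dir, link)
        if not os.path.islink(path):
            continue
        # Resolve the symlink and keep only the final component.
        kernel_name = os.path.basename(os.path.realpath(path))
        mapping.setdefault(kernel_name, []).append(link)
    return mapping


if __name__ == "__main__":
    for kernel_name, links in sorted(stable_names().items()):
        print(kernel_name, "<-", ", ".join(links))
```

Whether the names are distinct enough to tell two enclosures apart depends on what each bridge chip reports, so it is worth checking the listing once while both disks are known to be working.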

And, perhaps an obvious question, but does the drive maker provide any
device-specific drivers or utilities that might help?



--
In theory, there is no difference between theory and practice.
In<fnord> practice, there is. .... Yogi Berra



--
Archive: http://lists.debian.org/4C485832.7090009@meetinghouse.net

07-22-2010, 02:45 PM
Gregory Seidman

Which disk is failing?

On Thu, Jul 22, 2010 at 08:32:28AM -0500, Stan Hoeppner wrote:
> Gregory Seidman put forth on 7/22/2010 6:38 AM:
> > I have a RAID1 (using md) running on two USB disks. (I'm working on moving
> > to eSATA, but it's USB for now.) That means I don't have any insight using
> > SMART. Meanwhile, I've been getting occasional fail events. Unfortunately,
> > I don't get any information on which disk is failing.
>
> Are any USB communication errors being logged along with the md and ext3 errors?

I'm not seeing anything in /var/log/{dmesg,syslog,kern.log}. Is there
somewhere else I should be looking?

> Are you sure it's a disk drive problem and not an issue with the kernel
> drivers, system BIOS, USB controller, cabling, or a combination thereof?

I'm not 100% sure, but it's the most likely possibility.

> How long (days, weeks, months, years) did this exact setup function properly
> before you started seeing these problems?

Months. I switched from Mac hardware to PC hardware when the Mac
motherboard finally died after 10 years a few months ago. Most of the same
hardware (cables, drives, enclosures, etc.) had been working well for a
couple of years before that.

> Did you recently perform any major software upgrades (kernel/drivers)
> shortly before this problem surfaced?

No.

> Is this a laptop? If so which make/model?

Nope, it's a tower:
ThinkCentre M52 3.2GHz Intel Pentium IV Desktop PC

> What's the make/model of the USB disk drives?

The two USB drive enclosures are different makes and models, and I don't
have them handy to check (I'm at work right now). They aren't no-name crap,
though. Incidentally, one is an eSATA & USB, the other is IEEE1394 and USB.

The drives are either Seagate or Western Digital. I don't remember which at
this point, since it's been a while since I put them in their enclosures.

> What is the age of each piece of hardware we're discussing?

The tower was purchased refurbished, but is probably circa 2004. The
drives, cables, and enclosures are no more than two years old.

> Stan
--Greg


--
Archive: http://lists.debian.org/20100722144537.GD10802@anthropohedron.net

07-22-2010, 02:50 PM
Paul Cartwright

Which disk is failing?

On Thu July 22 2010, Miles Fidelman wrote:
> You might try "lusb" - to list devices on your usb bus. That might help
> you identify specific devices.

I think you meant lsusb ..

--
Paul Cartwright
Registered Linux user # 367800
Registered Ubuntu User #12459


--
Archive: http://lists.debian.org/201007221050.18471.debian@pcartwright.com

07-22-2010, 04:26 PM
Stan Hoeppner

Which disk is failing?

Gregory Seidman put forth on 7/22/2010 9:45 AM:

> Nope, it's a tower:
> ThinkCentre M52 3.2GHz Intel Pentium IV Desktop PC

> The tower was purchased refurbished, but is probably circa 2004. The
> drives, cables, and enclosures are no more than two years old.

Both external drives are native SATA, correct? I think your best course of
action at this point would be to purchase a $15-20 two port PCI SATA card
based on a SiI 3512 chipset, any internal SATA data/power cables you'd need,
and move the drives inside the PC. This will allow smartmontools, hdparm, and
other utils to identify the drives, and you'll likely get a nice speed boost
as well, especially if that PC has a 66MHz 32bit PCI slot, which will allow
full bandwidth to both drives simultaneously.
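
Rough numbers behind the bandwidth remark, as a sketch only: these are theoretical peaks that ignore protocol overhead, with the PCI clocks rounded to 33 and 66 MHz. The 1.5 Gbit/s link rate for the card's ports is an assumption, not something stated in this thread, and the function names are made up for illustration.

```python
def pci_peak_mb_s(bus_bits, clock_mhz):
    """Theoretical peak of a parallel PCI bus in MB/s (one transfer per clock)."""
    return bus_bits / 8 * clock_mhz


def sata_payload_mb_s(line_rate_gbit):
    """Payload rate of one SATA link in MB/s.

    8b/10b coding spends 10 line bits on every payload byte.
    """
    return line_rate_gbit * 1000 / 10


if __name__ == "__main__":
    link = sata_payload_mb_s(1.5)  # assumed link rate per port
    for mhz in (33, 66):
        bus = pci_peak_mb_s(32, mhz)
        print(f"32-bit PCI at {mhz} MHz: {bus:.0f} MB/s shared, "
              f"{bus / 2:.0f} MB/s per drive when both are busy; "
              f"one SATA link: {link:.0f} MB/s")
```

A RAID1 write goes to both members at once, so the per-drive share is the figure that matters there. A single spinning disk usually sustains well below the link rate, which is why the 66 MHz slot leaves more headroom than the raw link totals suggest.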

Newegg has everything you need. I recommend the Koutech 3512 based card. I
have one in my server and it works very well. I gave $15 for it but I think
it's up to $20 now, which is still very reasonable. You should be able to
pick up the Koutech, 2 x 3.5" to 5.25" generic drive bay adapters if you need
them, and 2 combo SATA data/power cables for $30-40 including shipping.

If that PC has motherboard-down (onboard) SATA ports, you're in business with
no cash outlay, assuming you have SATA data/power cables.

--
Stan


--
Archive: http://lists.debian.org/4C487141.70509@hardwarefreak.com

07-22-2010, 07:18 PM
Miles Fidelman

Which disk is failing?

Paul Cartwright wrote:

> On Thu July 22 2010, Miles Fidelman wrote:
>> You might try "lusb" - to list devices on your usb bus. That might help
>> you identify specific devices.
>
> I think you meant lsusb ..



yup - oops... sorry about that

--
In theory, there is no difference between theory and practice.
In<fnord> practice, there is. .... Yogi Berra



--
Archive: http://lists.debian.org/4C4899A2.9060102@meetinghouse.net