Old 10-29-2008, 03:47 PM
Hendrik Boom
 
Paranoia about DegradedArray

I got the message (via email)

This is an automatically generated mail message from mdadm
running on april

A DegradedArray event had been detected on md device /dev/md0.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1]
md0 : active raid1 hda3[0]
242219968 blocks [2/1] [U_]

unused devices: <none>


Now I gather from what I've googled that somehow I've got to get the RAID
to reestablish the failed drive by copying from the nonfailed drive.
I do believe the hardware is basically OK, and that what I've got is
probably a problem due to a power failure (We've had a lot of these
recently) or something transient.

(a) How do I do this?

(b) is hda3 the failed drive, or is it the one that's still working?

-- hendrik


 
Old 10-29-2008, 04:00 PM
Hal Vaughan
 
Paranoia about DegradedArray

On Wednesday 29 October 2008, Hendrik Boom wrote:
> I got the message (via email)
>
> This is an automatically generated mail message from mdadm
> running on april
>
> A DegradedArray event had been detected on md device /dev/md0.
>
> Faithfully yours, etc.
>
> P.S. The /proc/mdstat file currently contains the following:
>
> Personalities : [raid1]
> md0 : active raid1 hda3[0]
> 242219968 blocks [2/1] [U_]
>
> unused devices: <none>
>

You don't mention that you've checked the array with
mdadm --detail /dev/md0. Try that and it will give you some good
information.

I've never used /proc/mdstat because the --detail option gives me more
data in one shot. From what I remember, this is a raid1, right? It
looks like it has 2 devices and one is still working, but I might be
wrong. Again --detail will spell out a lot of this explicitly.

> Now I gather from what I've googled that somehow I've got to get the
> RAID to reestablish the failed drive by copying from the nonfailed
> drive. I do believe the hardware is basically OK, and that what I've
> got is probably a problem due to a power failure (We've had a lot of
> these recently) or something transient.
>
> (a) How do I do this?

If a drive has actually failed, then mdadm --remove /dev/md0 /dev/hdxx.
If the drive has not failed, then you need to fail it first with --fail
as an option/switch for mdadm.
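
For example, something along these lines (with /dev/hdxx standing in for
whatever the second member actually turns out to be -- --detail will tell
you):

mdadm --fail /dev/md0 /dev/hdxx
mdadm --remove /dev/md0 /dev/hdxx
mdadm --add /dev/md0 /dev/hdxx

That's from memory, so check the mdadm man page before running any of it.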

> (b) is hda3 the failed drive, or is it the one that's still working?

That's one of the things mdadm --detail /dev/md0 will tell you. It will
list the active drives and the failed drives.

Hal


 
Old 10-29-2008, 04:39 PM
Hendrik Boom
 
Paranoia about DegradedArray

On Wed, 29 Oct 2008 13:00:25 -0400, Hal Vaughan wrote:

> On Wednesday 29 October 2008, Hendrik Boom wrote:
>> I got the message (via email)
>>
>> This is an automatically generated mail message from mdadm running on
>> april
>>
>> A DegradedArray event had been detected on md device /dev/md0.
>>
>> Faithfully yours, etc.
>>
>> P.S. The /proc/mdstat file currently contains the following:
>>
>> Personalities : [raid1]
>> md0 : active raid1 hda3[0]
>> 242219968 blocks [2/1] [U_]
>>
>> unused devices: <none>
>>
>>
> You don't mention that you've checked the array with mdadm --detail
> /dev/md0. Try that and it will give you some good information.

april:/farhome/hendrik# mdadm --detail /dev/md0
/dev/md0:
Version : 00.90.03
Creation Time : Sun Feb 19 10:53:01 2006
Raid Level : raid1
Array Size : 242219968 (231.00 GiB 248.03 GB)
Device Size : 242219968 (231.00 GiB 248.03 GB)
Raid Devices : 2
Total Devices : 1
Preferred Minor : 0
Persistence : Superblock is persistent

Update Time : Wed Oct 29 13:23:15 2008
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0

UUID : 4dc189ba:e7a12d38:e6262cdf:db1beda2
Events : 0.5130704

Number Major Minor RaidDevice State
0 3 3 0 active sync /dev/hda3
1 0 0 1 removed
april:/farhome/hendrik#



So from this do I conclude that /dev/hda3 is still working, but that it's
the other drive (which isn't identified) that has trouble?

I'm a bit surprised that none of the messages identifies the other
drive, /dev/hdc3. Is this normal? Is that information available
somewhere besides the sysadmin's memory?
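
I'd guess something like the following would show which partitions carry md
superblocks and which array they claim to belong to, though I haven't tried
it here yet:

mdadm --examine --scan --verbose
mdadm --examine /dev/hdc3

and on Debian the array definitions are normally also listed in
/etc/mdadm/mdadm.conf.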

>
> I've never used /proc/mdstat because the --detail option gives me more
> data in one shot. From what I remember, this is a raid1, right? It
> looks like it has 2 devices and one is still working, but I might be
> wrong. Again --detail will spell out a lot of this explicitly.
>
>> Now I gather from what I've googled that somehow I've got to get the
>> RAID to reestablish the failed drive by copying from the nonfailed
>> drive. I do believe the hardware is basically OK, and that what I've
>> got is probably a problem due to a power failure (We've had a lot of
>> these recently) or something transient.
>>
>> (a) How do I do this?
>
> If a drive has actually failed, then mdadm --remove /dev/md0 /dev/hdxx.
> If the drive has not failed, then you need to fail it first with --fail
> as an option/switch for mdadm.

So presumably the thing to do is
mdadm --fail /dev/md0 /dev/hdc3
mdadm --remove /dev/md0 /dev/hdc3
and then
mdadm --add /dev/md0 /dev/hdc3

Is the --fail really needed in my case? The --detail option seems to
have given /dev/hdc3 the status of "removed" (although it failed to
mention that it was /dev/hdc3).
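
I suppose I could at least look at what the removed partition's own
superblock says before touching anything:

mdadm --examine /dev/hdc3
cat /proc/mdstat

which should confirm whether hdc3 still thinks it belongs to this array.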

>
>> (b) is hda3 the failed drive, or is it the one that's still working?
>
> That's one of the things mdadm --detail /dev/md0 will tell you. It will
> list the active drives and the failed drives.

Well. I'm glad I was paranoid enough to ask. It seems to be the drive
that's working. Glad I didn't try to remove and re-add *that* one.

Thanks,

-- hendrik


 
Old 10-29-2008, 08:58 PM
Hal Vaughan
 
Paranoia about DegradedArray

On Wednesday 29 October 2008, Hendrik Boom wrote:
> On Wed, 29 Oct 2008 13:00:25 -0400, Hal Vaughan wrote:
> > On Wednesday 29 October 2008, Hendrik Boom wrote:
> >> I got the message (via email)
> >>
> >> This is an automatically generated mail message from mdadm running
> >> on april
> >>
> >> A DegradedArray event had been detected on md device /dev/md0.
> >>
> >> Faithfully yours, etc.
> >>
> >> P.S. The /proc/mdstat file currently contains the following:
> >>
> >> Personalities : [raid1]
> >> md0 : active raid1 hda3[0]
> >> 242219968 blocks [2/1] [U_]
> >>
> >> unused devices: <none>
> >
> > You don't mention that you've checked the array with mdadm --detail
> > /dev/md0. Try that and it will give you some good information.
>
> april:/farhome/hendrik# mdadm --detail /dev/md0
> /dev/md0:
> Version : 00.90.03
> Creation Time : Sun Feb 19 10:53:01 2006
> Raid Level : raid1
> Array Size : 242219968 (231.00 GiB 248.03 GB)
> Device Size : 242219968 (231.00 GiB 248.03 GB)
> Raid Devices : 2
> Total Devices : 1
> Preferred Minor : 0
> Persistence : Superblock is persistent
>
> Update Time : Wed Oct 29 13:23:15 2008
> State : clean, degraded
> Active Devices : 1
> Working Devices : 1
> Failed Devices : 0
> Spare Devices : 0
>
> UUID : 4dc189ba:e7a12d38:e6262cdf:db1beda2
> Events : 0.5130704
>
> Number Major Minor RaidDevice State
> 0 3 3 0 active sync /dev/hda3
> 1 0 0 1 removed
> april:/farhome/hendrik#
>
>
>
> So from this do I conclude that /dev/hda3 is still working, but that
> it's the other drive (which isn't identified) that has trouble?
>
> I'm a bit surprised that none of the messages identifies the other
> drive, /dev/hdc3. Is this normal? Is that information available
> somewhere besides the sysadmin's memory?

Luckily it's been at least a couple months since I worked with a
degraded array, but I *thought* it listed the failed devices as well.
It looks like the device has not only failed but been removed -- is
there a chance you removed it after the failure, before running this
command?


> > I've never used /proc/mdstat because the --detail option gives me
> > more data in one shot. From what I remember, this is a raid1,
> > right? It looks like it has 2 devices and one is still working,
> > but I might be wrong. Again --detail will spell out a lot of this
> > explicitly.
> >
> >> Now I gather from what I've googled that somehow I've got to get
> >> the RAID to reestablish the failed drive by copying from the
> >> nonfailed drive. I do believe the hardware is basically OK, and
> >> that what I've got is probably a problem due to a power failure
> >> (We've had a lot of these recently) or something transient.
> >>
> >> (a) How do I do this?
> >
> > If a drive has actually failed, then mdadm --remove /dev/md0
> > /dev/hdxx. If the drive has not failed, then you need to fail it
> > first with --fail as an option/switch for mdadm.
>
> So presumably the thing to do is
> mdadm --fail /dev/md0 /dev/hdc3
> mdadm --remove /dev/md0 /dev/hdc3
> and then
> mdadm --add /dev/md0 /dev/hdc3

I think there's a "--re-add" that you have to use or something like that,
but I'd try --add first and see if that works. You might find that
hdc3 has already failed and, from the output above, it looks like it's
already been removed.
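
If the add does go through, you should see the array rebuilding; something
like this (again from memory) will let you watch the progress:

mdadm --add /dev/md0 /dev/hdc3
watch cat /proc/mdstat

/proc/mdstat shows a recovery line with a percentage while it copies the
data back over.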

> Is the --fail really needed in my case? The --detail option seems to
> have given /dev/hdc3 the status of "removed" (although it failed to
> mention that it was /dev/hdc3).

I've had trouble with removing drives if I didn't manually fail them.
Someone who knows the inner workings of mdadm might be able to provide
more information on that.

> >> (b) is hda3 the failed drive, or is it the one that's still
> >> working?
> >
> > That's one of the things mdadm --detail /dev/md0 will tell you. It
> > will list the active drives and the failed drives.
>
> Well. I'm glad I was paranoid enough to ask. It seems to be the
> drive that's working. Glad I didn't try to remove and re-add *that*
> one.

Yes, paranoia is a good thing in system administration. It's kept me
from severe problems previously!


Hal


 
Old 10-30-2008, 04:21 PM
Hendrik Boom
 
Paranoia about DegradedArray

On Wed, 29 Oct 2008 17:58:56 -0400, Hal Vaughan wrote:

> On Wednesday 29 October 2008, Hendrik Boom wrote:
>>
>> I'm a bit surprised that none of the messages identifies the other
>> drive, /dev/hdc3. Is this normal? Is that information available
>> somewhere besides the sysadmin's memory?
>
> Luckily it's been at least a couple months since I worked with a
> degraded array, but I *thought* it listed the failed devices as well. It
> looks like the device has not only failed but been removed -- is there a
> chance you removed it after the failure, before running this command?

No. I did not explicitly fail it or remove it. There must have been
some automatic mechanism that did.

>>
>> So presumably the thing to do is
>> mdadm --fail /dev/md0 /dev/hdc3
>> mdadm --remove /dev/md0 /dev/hdc3
>> and then
>> mdadm --add /dev/md0 /dev/hdc3
>
> I think there's a "--re-add" that you have to use or something like that,
> but I'd try --add first and see if that works. You might find that hdc3
> has already failed and, from the output above, it looks like it's already
> been removed.

In the docs, --re-add is specified as something to use if a drive has been
removed *recently*, and then it writes just the blocks that were to have
been written while it was out -- a way of doing an update instead of a
full copy. It doesn't seem relevant in this case.
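
If I've understood the docs, the difference is roughly this:

mdadm --re-add /dev/md0 /dev/hdc3
(catch up on only the recently missed writes, if mdadm decides the old
data is still usable)

mdadm --add /dev/md0 /dev/hdc3
(treat it as a fresh member and copy everything over again)

but I'm only going by the man page here, so corrections welcome.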

>
>> Is the --fail really needed in my case? The --detail option seems to
>> have given /dev/hdc3 the status of "removed" (although it failed to
>> mention that it was /dev/hdc3).
>
> I've had trouble with removing drives if I didn't manually fail them.
> Someone who knows the inner workings of mdadm might be able to provide
> more information on that.

I wonder if /dev/hdc3 still needs to be manually failed. I wonder if it
is even possible to fail a removed drive...


>
> Yes, paranoia is a good thing in system administration. It's kept me
> from severe problems previously!

And paranoia will make sure I have two complete backups before I actually
do any of this fixup.
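
Probably nothing fancier than rsyncing the mounted filesystem off to
another machine (paths invented, obviously):

rsync -aHx / othermachine:/somewhere/with/space/

done twice, to two different places, before I touch the array.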

- hendrik

>
>
> Hal



 
Old 10-30-2008, 04:43 PM
Hal Vaughan
 
Paranoia about DegradedArray

On Thursday 30 October 2008, Hendrik Boom wrote:
...
> > I've had trouble with removing drives if I didn't manually fail
> > them. Someone who knows the inner workings of mdadm might be able
> > to provide more information on that.
>
> I wonder if /dev/hdc3 still needs to be manually failed. I wonder if
> it is even possible to fail a removed drive...


Try adding it. If it works, then you're okay -- assuming the drive is
okay. If it doesn't work, you'll get an error message and it won't add
it.

> > Yes, paranoia is a good thing in system administration. It's kept
> > me from severe problems previously!
>
> And paranoia will make sure I have two complete backups before I
> actually do any of this fixup.

I've learned, among other things, not to trust RAID5 with mdadm. I've
also learned to keep a full backup elsewhere even with RAID1. I
stick with RAID1 so that if it blows, as long as one drive is still okay, I
can always mount the surviving drive as a regular disk.
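
With the old 0.90 superblock (which sits at the end of the partition) you
can usually get at a raid1 member directly in an emergency, something like:

mount -o ro /dev/hda3 /mnt

though I'd only ever do that read-only, and with the array stopped.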


Hal


 
Old 10-30-2008, 05:49 PM
Hendrik Boom
 
Paranoia about DegradedArray

On Thu, 30 Oct 2008 13:43:52 -0400, Hal Vaughan wrote:

> On Thursday 30 October 2008, Hendrik Boom wrote: ...
>> > I've had trouble with removing drives if I didn't manually fail them.
>> > Someone who knows the inner workings of mdadm might be able to
>> > provide more information on that.
>>
>> I wonder if /dev/hdc3 still needs to be manually failed. I wonder if
>> it is even possible to fail a removed drive...
>
>
> Try adding it. If it works, then you're okay -- assuming the drive is
> okay. If it doesn't work, you'll get an error message and it won't add
> it.

There have been occasional reboots; presumably the add failed on reboot.
I should perhaps check the system log.
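
Something like this ought to turn up whatever md reported at boot time,
assuming the relevant boot is still in the logs:

grep -iE 'md0|raid' /var/log/syslog
dmesg | grep -i raid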

-- hendrik

