Old 04-17-2008, 05:01 PM
Mark Hennessy
 
Default Question about RAID 5 array rebuild with mdadm

I'm using Centos 4.5 right now, and I had a RAID 5 array stop because
two drives became unavailable. After adjusting the cables on several
occasions and shutting down and restarting, I was able to see the
drives again. This is when I snatched defeat from the jaws of
victory. Please, someone with vast knowledge of how RAID 5 with mdadm
works, tell me if I have any chance at all that this array will pull
through with most or all of my data.


Background info about the machine
/dev/md0 is a RAID1 consisting of /dev/sda1 and /dev/sdb1
/dev/md1 is a RAID1 consisting of /dev/sda2 and /dev/sdb2
/dev/md2 (our special friend) is a RAID5 consisting of /dev/sd[c-j]

/dev/sdi and /dev/sdj were the drives that detached from the array and
were marked as faulty.
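For anyone in a similar spot, the state of an array like this can be inspected non-destructively before anything is changed; the device names below follow the poster's layout:

```shell
# Overall md status: a degraded RAID5 shows counters like [8/6] [UUUUUU__]
cat /proc/mdstat

# Per-array view: which members are active, faulty, or removed
mdadm --detail /dev/md2

# Per-member superblock: records which array the disk last belonged to,
# its role in that array, and its event counter
mdadm --examine /dev/sdi
```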


I did the following things that, in hindsight, were probably VERY BAD:

Step 1 (Misassign drives to wrong array):
I could probably have had things going again in a tenth of a second if
I hadn't typed this:

mdadm --manage --add /dev/md0 /dev/sdi
mdadm --manage --add /dev/md0 /dev/sdj

This clobbered the superblock and replaced it with that of /dev/md0,
yes?

Well, that's what mdadm --misc --examine /dev/sdi and /dev/sdj said, anyhow.

Ok, so what next?
Step 2 (rebuild the array but make sure the params are right!):
I wipe out the superblocks on all of the drives in the array and
rebuild with --assume-clean

for i in c d e f g h i j ; do mdadm --zero-superblock /dev/sd$i ; done
mdadm --create /dev/md2 --assume-clean --level=5 --raid-devices=8 \
      /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj
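For context, --assume-clean tells md to trust the existing data and skip the initial resync, so re-creating over live data only preserves it if the level, chunk size, member order, and metadata version all match the original exactly. A safer sequence (a sketch; the chunk value here is the 256K the poster later recalled, and the output path is illustrative) captures the geometry before destroying anything:

```shell
# Save the original geometry BEFORE zeroing any superblock
mdadm --examine /dev/sdc > /root/md2-geometry.txt

# Re-create only with parameters copied from that saved output;
# --assume-clean skips the initial resync and trusts existing parity
mdadm --create /dev/md2 --assume-clean --level=5 --chunk=256 \
      --raid-devices=8 /dev/sd[c-j]
```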


ok, now it says that the array is recovering and will take about 10
hours to rebuild.
/dev/sd[c-i] say that they are "active sync" and /dev/sdj says it's a
spare that's rebuilding.
But then I scroll back in my history and see that, oops, the chunk size
is WRONG. Not only that, but I didn't stop the array until the rebuild
was at around 8%.


Ok, I stop the array and rebuild with
mdadm --create /dev/md2 --assume-clean --level=5 --chunk \
      --raid-devices=8 /dev/sdc /dev/sdd /dev/sde /dev/sdf \
      /dev/sdg /dev/sdh /dev/sdi /dev/sdj


Now it says it's going to take another 10 hours to rebuild.

How likely is it that my data are irretrievable, and at which step
would the damage have happened?

_______________________________________________
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos
 
Old 04-17-2008, 05:06 PM
Mark Hennessy
 
Default Question about RAID 5 array rebuild with mdadm

Sorry about that: my previous e-mail had just '--chunk' toward the
bottom. It should have been '--chunk=256'. Please see the quoted
snippet for detail.
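If the array had still been running with its original superblocks, the chunk size could have been read back rather than recalled from memory:

```shell
# Chunk size as recorded by a running array
mdadm --detail /dev/md2 | grep -i 'chunk size'

# Or straight from a member disk's superblock
mdadm --examine /dev/sdc | grep -i 'chunk size'
```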


On Apr 17, 2008, at 1:01 PM, Mark Hennessy wrote:

Ok, I stop the array and rebuild with
mdadm --create /dev/md2 --assume-clean --level=5 --chunk=256 \
      --raid-devices=8 /dev/sdc /dev/sdd /dev/sde /dev/sdf \
      /dev/sdg /dev/sdh /dev/sdi /dev/sdj

 
Old 04-17-2008, 05:50 PM
"Ross S. W. Walker"
 
Default Question about RAID 5 array rebuild with mdadm

Mark Hennessy wrote:
>
> I'm using Centos 4.5 right now, and I had a RAID 5 array stop because
> two drives became unavailable. After adjusting the cables on several
> occasions and shutting down and restarting, I was able to see the
> drives again. This is when I snatched defeat from the jaws of
> victory. Please, someone with vast knowledge of how RAID 5 with mdadm
> works, tell me if I have any chance at all that this array will pull
> through with most or all of my data.

It may be possible...

> Background info about the machine
> /dev/md0 is a RAID1 consisting of /dev/sda1 and /dev/sdb1
> /dev/md1 is a RAID1 consisting of /dev/sda2 and /dev/sdb2
> /dev/md2 (our special friend) is a RAID5 consisting of /dev/sd[c-j]
>
> /dev/sdi and /dev/sdj were the drives that detached from the array and
> were marked as faulty.
>
> I did the following things that in hindsight were probably VERY BAD
>
> Step 1 (Misassign drives to wrong array):
> I could probably have had things going again in a tenth of a second if
> I hadn't typed this:
> mdadm --manage --add /dev/md0 /dev/sdi
> mdadm --manage --add /dev/md0 /dev/sdj
>
> This clobbered the superblock and replaced it with that of /dev/md0, yes?
> Well, that's what mdadm --misc --examine /dev/sdi and /dev/sdj said, anyhow.

Hmm, not good, but we will mark this drive 'sdi' as bad.

> Ok, so what next?
> Step 2 (rebuild the array but make sure the params are right!):
> I wipe out the superblocks on all of the drives in the array and
> rebuild with --assume-clean
> for i in c d e f g h i j ; do mdadm --zero-superblock /dev/sd$i ; done
> mdadm --create /dev/md2 --assume-clean --level=5 --raid-devices=8 \
>       /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj

Nooo. You needed to keep sdi marked as 'bad' and offline, assemble the
array degraded, then add sdi back as a replacement and let md rebuild
it from the parity of the other members.
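As a sketch of that recovery path (assuming sdc through sdh and sdj still carried consistent superblocks at that point):

```shell
# Start the array degraded from the seven members that agree;
# --force accepts a slightly stale but usable event count
mdadm --assemble --force /dev/md2 /dev/sd[c-h] /dev/sdj

# Add the clobbered disk back as a fresh member; md rebuilds it
# from the parity of the other seven
mdadm --manage /dev/md2 --add /dev/sdi

# Watch the resync progress
cat /proc/mdstat
```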

> ok, now it says that the array is recovering and will take about 10
> hours to rebuild.
> /dev/sd[c-i] say that they are "active sync" and /dev/sdj says it's a
> spare that's rebuilding.
> But now I scroll back in my history and see that oops, the chunk size
> is WRONG. Not only that, but I don't stop the array until the rebuild
> is at around 8%

Well, now I think it's all messed up.

> Ok, I stop the array and rebuild with
> mdadm --create /dev/md2 --assume-clean --level=5 --chunk \
>       --raid-devices=8 /dev/sdc /dev/sdd /dev/sde /dev/sdf \
>       /dev/sdg /dev/sdh /dev/sdi /dev/sdj
>
> Now it says it's going to take another 10 hours to rebuild.

It's truly hosed now.

> How likely is it that my data are irretrievable, and at which step
> would the damage have happened?

I hope you have backups, because you're going to need them.

If only you had posted to the list BEFORE trying to recover it
without knowing what to do.

-Ross


 
Old 04-17-2008, 06:13 PM
Mark Hennessy
 
Default Question about RAID 5 array rebuild with mdadm

Thanks for answering my e-mail!!

On Apr 17, 2008, at 1:50 PM, Ross S. W. Walker wrote:

> Mark Hennessy wrote:
>
>> ok, now it says that the array is recovering and will take about 10
>> hours to rebuild.
>> /dev/sd[c-i] say that they are "active sync" and /dev/sdj says it's a
>> spare that's rebuilding.
>> But now I scroll back in my history and see that oops, the chunk size
>> is WRONG. Not only that, but I don't stop the array until the rebuild
>> is at around 8%
>
> Well, now I think it's all messed up.
>
>> Ok, I stop the array and rebuild with
>> mdadm --create /dev/md2 --assume-clean --level=5 --chunk \
>>       --raid-devices=8 /dev/sdc /dev/sdd /dev/sde /dev/sdf \
>>       /dev/sdg /dev/sdh /dev/sdi /dev/sdj
>>
>> Now it says it's going to take another 10 hours to rebuild.
>
> It's truly hosed now.

I was thinking that too, but I waited until the drive was about 5%
recovered and mounted it read-only. It mounted successfully. I was
able to cat log files stored there as well as do full listings of
tarballs there without interruption. I went ahead and copied a bunch
of important things off of that array onto another one and received no
complaints from the OS.
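For anyone verifying data after a rebuild like this, a few non-destructive checks go further than spot-reading log files (mount point and file pattern here are illustrative):

```shell
# Report-only filesystem check: -n answers 'no' to every repair prompt
fsck -n /dev/md2

# Mount read-only so nothing is written while inspecting
mount -o ro /dev/md2 /mnt/recovery

# gzip -t decompresses each archive end to end, which touches every
# block of the file and surfaces silent corruption
find /mnt/recovery -name '*.tar.gz' -exec gzip -t {} \;
```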


What did I miss? I just want to learn and understand. Perhaps there
is documentation, beyond what I found via Google and Wikipedia, that
explains in more detail how this works; could you direct me to it?


Thanks for your kind assistance!


>> How likely is it that my data are irretrievable, and at which step
>> would the damage have happened?
>
> I hope you have backups, because you're going to need them.


What's the likelihood of data corruption despite the filesystem being
browsable and the files accessible as I described?




> If only you had posted to the list BEFORE trying to recover it
> without knowing what to do.


Agreed (strongly).



> -Ross

