I'm using CentOS 4.5 right now, and I had a RAID 5 array stop because
two drives became unavailable. After adjusting the cables on several
occasions and shutting down and restarting, I was able to see the
drives again. This is when I snatched defeat from the jaws of
victory. Please, someone with vast knowledge of how RAID 5 with mdadm
works, tell me if I have any chance at all that this array will pull
through with most or all of my data.
Background info about the machine
/dev/md0 is a RAID1 consisting of /dev/sda1 and /dev/sda2
/dev/md1 is a RAID1 consisting of /dev/sda2 and /dev/sdb2
/dev/md2 (our special friend) is a RAID5 consisting of /dev/sd[c-j]
/dev/sdi and /dev/sdj were the drives that detached from the array and
were marked as faulty.
I did the following things that in hindsight were probably VERY BAD
Step 1 (Misassign drives to wrong array):
I could probably have had things going again in a tenth of a second if
I hadn't typed this:
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --manage --add /dev/md0 /dev/sdi
This clobbered the superblock and replaced it with that of /dev/md0,
yes?
Well, that's what mdadm --misc --examine /dev/sdi and sdj said anyhow.
Ok, so what next?
Step 2 (rebuild the array but make sure the params are right!):
I wipe out the superblocks on all of the drives in the array and
rebuild with --assume-clean
for i in c d e f g h i j ; do mdadm --zero-superblock /dev/sd$i ; done
mdadm --create /dev/md2 --assume-clean --level=5 --raid-devices=8 \
    /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj
OK, now it says that the array is recovering and will take about 10
hours to rebuild.
/dev/sd[c-i] say that they are "active sync" and /dev/sdj says it's a
spare that's rebuilding.
But now I scroll back in my history and see that oops, the chunk size
is WRONG. Not only that, but I don't stop the array until the rebuild
is at around 8%
Ok, I stop the array and rebuild with
mdadm --create /dev/md2 --assume-clean --level=5 --chunk --raid-devices=8 \
    /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj
Now it says it's going to take another 10 hours to rebuild.
How likely are my data irretrievable/gone and at what step would it
have happened if so?
_______________________________________________
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos
04-17-2008, 05:06 PM
Mark Hennessy
Question about RAID 5 array rebuild with mdadm
Sorry about that, my previous e-mail had just '--chunk' toward the
bottom. It should have been '--chunk=256' Please see the quoted
snippet for detail.
On Apr 17, 2008, at 1:01 PM, Mark Hennessy wrote:
Ok, I stop the array and rebuild with
mdadm --create /dev/md2 --assume-clean --level=5 --chunk=256 --raid-devices=8 \
    /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj
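
To illustrate why the chunk size matters for how the assembled array is read, here is a small arithmetic sketch. mdadm's --chunk value is in kibibytes, and in an 8-device RAID 5 each stripe carries 7 data chunks plus one parity chunk. The `locate` function and the 2000 KiB offset are made up for illustration, and the 64 KiB figure is only an assumed stand-in for the wrong chunk size, which the thread never states. Which physical drive holds a given chunk depends on the parity layout and is not modelled here.

```shell
#!/bin/sh
# locate OFFSET_KIB CHUNK_KIB NDEV
# Print "stripe index": the stripe a logical offset falls in, and which
# data chunk of that stripe it lands on (0 = first data chunk).
locate() {
    off=$1; chunk=$2; ndev=$3
    per_stripe=$(( (ndev - 1) * chunk ))   # one chunk per stripe is parity
    stripe=$(( off / per_stripe ))
    index=$(( (off % per_stripe) / chunk ))
    echo "$stripe $index"
}

# Same logical offset, two different chunk sizes, 8 devices.
echo "chunk=256: $(locate 2000 256 8)"
echo "chunk=64:  $(locate 2000 64 8)"
```

The same logical offset maps to a different stripe and chunk under each setting, so a filesystem read through the wrong chunk size sees its blocks in the wrong order even when the bytes on the drives are untouched.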
04-17-2008, 05:50 PM
"Ross S. W. Walker"
Question about RAID 5 array rebuild with mdadm
Mark Hennessy wrote:
>
> I'm using CentOS 4.5 right now, and I had a RAID 5 array stop because
> two drives became unavailable. After adjusting the cables on several
> occasions and shutting down and restarting, I was able to see the
> drives again. This is when I snatched defeat from the jaws of
> victory. Please, someone with vast knowledge of how RAID 5 with mdadm
> works, tell me if I have any chance at all that this array will pull
> through with most or all of my data.
It may be possible...
> Background info about the machine
> /dev/md0 is a RAID1 consisting of /dev/sda1 and /dev/sda2
> /dev/md1 is a RAID1 consisting of /dev/sda2 and /dev/sdb2
> /dev/md2 (our special friend) is a RAID5 consisting of /dev/sd[c-j]
>
> /dev/sdi and /dev/sdj were the drives that detached from the array and
> were marked as faulty.
>
> I did the following things that in hindsight were probably VERY BAD
>
> Step 1 (Misassign drives to wrong array):
> I could probably have had things going again in a tenth of a second if
> I hadn't typed this:
> mdadm --manage --add /dev/md0 /dev/sdi
> mdadm --manage --add /dev/md0 /dev/sdi
>
> This clobbered the superblock and replaced it with that of /dev/md0, yes?
> well, that's what mdadm --misc --examine /dev/sdi and sdj said anyhow.
Hmm, not good, but we will mark this drive 'sdi' as bad.
> Ok, so what next?
> Step 2 (rebuild the array but make sure the params are right!):
> I wipe out the superblocks on all of the drives in the array and
> rebuild with --assume-clean
> for i in c d e f g h i j ; do mdadm --zero-superblock /dev/sd$i ; done
> mdadm --create /dev/md2 --assume-clean --level=5 --raid-devices=8 \
>     /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj
Nooo, you need to make sure sdi is marked as 'bad' and kept offline. You
are going to need to assemble the array degraded, then add sdi as a
replacement and let it rebuild sdi from the parity.
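
For readers following along, a rough sketch of what that sequence could look like. It was not run in this thread. It assumes the md2 superblocks on sdc through sdh and sdj were still intact at that point, and that sdi was sitting in md0 as a spare. The `run` helper and the DRY_RUN variable are hypothetical conveniences so that the script only prints the commands by default.

```shell
#!/bin/sh
# Sketch only: with DRY_RUN=1 (the default) nothing is executed, the
# commands are just printed. Device names are the ones from this thread.
DRY_RUN=${DRY_RUN:-1}

run() {   # hypothetical helper: print the command, or execute it
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

ARRAY=/dev/md2
# Members assumed to still carry a usable md2 superblock (sdj is stale).
GOOD="/dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdj"
BAD=/dev/sdi      # superblock overwritten by the accidental add to md0

recover_degraded() {
    # Take sdi back out of md0 (assumed to be sitting there as a spare)
    # and clear the md0 superblock it picked up.
    run mdadm --manage /dev/md0 --remove "$BAD"
    run mdadm --zero-superblock "$BAD"
    # Start md2 with seven of eight members. --force accepts sdj's
    # out-of-date event count, --run starts the array although degraded.
    run mdadm --assemble --force --run "$ARRAY" $GOOD
    # Add sdi as a fresh member; md rebuilds its contents from parity.
    run mdadm --manage "$ARRAY" --add "$BAD"
}

recover_degraded
```

Assembling from the existing superblocks keeps the original chunk size, layout and device order, which is why it is safer than re-creating the array by hand.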
> ok, now it says that the array is recovering and will take about 10
> hours to rebuild.
> /dev/sd[c-i] say that they are "active sync" and /dev/sdj says it's a
> spare that's rebuilding.
> But now I scroll back in my history and see that oops, the chunk size
> is WRONG. Not only that, but I don't stop the array until the rebuild
> is at around 8%
Well, now I think it's all messed up.
> Ok, I stop the array and rebuild with
> mdadm --create /dev/md2 --assume-clean --level=5 --chunk --raid-devices=8 \
>     /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj
>
> Now it says it's going to take another 10 hours to rebuild.
It's truly hosed now.
> How likely are my data irretrievable/gone and at what step would it
> have happened if so?
I hope you have backups, because you're going to need them.
If only you had posted to the list BEFORE you tried to recover it without
knowing what to do.
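
One low-risk thing that can be done before any recovery attempt is to save the md metadata of every member, so that chunk size, layout and device order are on record. A minimal sketch, assuming the device names from this thread; the `save` helper, the DRY_RUN variable and the output directory are hypothetical. By default it only prints what it would do.

```shell
#!/bin/sh
# Sketch only: DRY_RUN=1 (the default) prints the commands instead of
# running them.
DRY_RUN=${DRY_RUN:-1}
MEMBERS="/dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj"
OUTDIR=/root/md2-metadata      # hypothetical location for the copies

save() {   # save OUTFILE CMD...: run CMD and keep its output in OUTFILE
    outfile=$1; shift
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $* > $outfile"
    else
        "$@" > "$outfile" 2>&1
    fi
}

snapshot_metadata() {
    [ "$DRY_RUN" = "1" ] || mkdir -p "$OUTDIR"
    for dev in $MEMBERS; do
        # --examine only reads the md superblock of the device.
        save "$OUTDIR/${dev##*/}.examine" mdadm --examine "$dev"
    done
    # Kernel's current view of all md arrays.
    save "$OUTDIR/mdstat" cat /proc/mdstat
}

snapshot_metadata
```

With those files saved, the parameters for a later --create or --assemble can be checked against what the drives themselves recorded instead of being reconstructed from shell history.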
-Ross
______________________________________________________________________
This e-mail, and any attachments thereto, is intended only for use by
the addressee(s) named herein and may contain legally privileged
and/or confidential information. If you are not the intended recipient
of this e-mail, you are hereby notified that any dissemination,
distribution or copying of this e-mail, and any attachments thereto,
is strictly prohibited. If you have received this e-mail in error,
please immediately notify the sender and permanently delete the
original and any copy or printout thereof.
04-17-2008, 06:13 PM
Mark Hennessy
Question about RAID 5 array rebuild with mdadm
Thanks for answering my e-mail!!
On Apr 17, 2008, at 1:50 PM, Ross S. W. Walker wrote:
> Mark Hennessy wrote:
> > ok, now it says that the array is recovering and will take about 10
> > hours to rebuild.
> > /dev/sd[c-i] say that they are "active sync" and /dev/sdj says it's a
> > spare that's rebuilding.
> > But now I scroll back in my history and see that oops, the chunk size
> > is WRONG. Not only that, but I don't stop the array until the rebuild
> > is at around 8%
> Well, now I think it's all messed up.
> > Ok, I stop the array and rebuild with
> > mdadm --create /dev/md2 --assume-clean --level=5 --chunk --raid-devices=8 \
> >     /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj
> >
> > Now it says it's going to take another 10 hours to rebuild.
> It's truly hosed now.
I was thinking that too, but I waited until the drive was about 5%
recovered and mounted it read-only. It mounted successfully. I was
able to cat log files stored there as well as do full listings of
tarballs there without interruption. I went ahead and copied a bunch
of important things off of that array onto another one and received no
complaints from the OS.
What did I miss? I just want to learn and to understand. Perhaps
there is documentation that I didn't find via Google and Wikipedia
that would explain in more detail how this works that you could direct
me to.
Thanks for your kind assistance!
> > How likely are my data irretrievable/gone and at what step would it
> > have happened if so?
> I hope you have backups, because you're going to need them.
What's the likelihood of data corruption despite the fs being
browsable and the files accessible like I describe?
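
A possible explanation for why the files were readable, offered as a toy model and not as a diagnosis of this particular array: RAID 5 parity is a plain XOR of the bytes at the same offset on every member. Rebuilding one member from the others is therefore the same computation whatever the chunk size or parity rotation is. If the seven source drives still held consistent data, a rebuild onto sdj would simply write back what was already there, even while the array was created with the wrong chunk size. The devices d0 to d3, their contents and the `xor_devices` function below are all made up for illustration.

```shell
#!/bin/sh
# xor_devices DEV...: XOR the given devices byte by byte at equal offsets.
# Each argument is one toy device: a space-separated list of byte values.
xor_devices() {
    result=""
    ncols=$(set -- $1; echo $#)        # number of bytes per device
    col=1
    while [ "$col" -le "$ncols" ]; do
        x=0
        for devdata in "$@"; do
            byte=$(echo "$devdata" | cut -d' ' -f"$col")
            x=$(( x ^ byte ))
        done
        result="$result$x "
        col=$(( col + 1 ))
    done
    echo "${result% }"
}

# Two stripes with a 2-byte chunk. Parity rotates: it sits on d3 in
# stripe 0 and on d2 in stripe 1.
s0_a="10 11"; s0_b="20 21"; s0_c="30 31"
s0_p=$(xor_devices "$s0_a" "$s0_b" "$s0_c")
s1_a="40 41"; s1_b="50 51"; s1_c="60 61"
s1_p=$(xor_devices "$s1_a" "$s1_b" "$s1_c")

d0="$s0_a $s1_a"
d1="$s0_b $s1_b"
d2="$s0_c $s1_p"
d3="$s0_p $s1_c"

# Rebuild d3 from the other three. Nothing here knows the chunk size or
# where the parity lives.
rebuilt=$(xor_devices "$d0" "$d1" "$d2")
echo "original d3: $d3"
echo "rebuilt  d3: $rebuilt"
```

In this model the rebuilt device matches the original byte for byte. The wrong chunk size would still make the assembled array read as garbage until the array is stopped and re-created with the right value, but under these assumptions the data on the member drives is not altered by the rebuild.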
> If only you had posted to the list BEFORE you tried to recover it without
> knowing what to do.
Agreed (strongly).
> -Ross