Old 01-04-2010, 06:50 AM
George Chelidze
 
Default RAID1 problem - server freezes on md data-check

Hello,

I've got an HP ML110 Intel Dual-Core E2160 server with 2 HDDs:

GB0250EAFYK - HP 250GB 3G SATA 7.2K 3.5" MDL 250 GB SATA Hard Drive
GB0250C8045 - HP 250GB 7.2K SATA Hard Disk Drive

So, I use one SATA 3.0 Gb/s drive and one SATA 1.5 Gb/s drive in a RAID-1
configuration. I have configured 4 MD volumes and it runs fine for some
time, but every now and then the server freezes. When that happens I can
still ping the server from the network, but I can't ssh into it and even
the keyboard is useless, so I have to hard reset the server. Below are the
last messages from my kern.log:

Jan 3 00:57:01 barambo1 kernel: [986475.159596] md: data-check of RAID array md0
Jan 3 00:57:01 barambo1 kernel: [986475.159600] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Jan 3 00:57:01 barambo1 kernel: [986475.159602] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Jan 3 00:57:01 barambo1 kernel: [986475.159606] md: using 128k window, over a total of 3903680 blocks.
Jan 3 00:57:01 barambo1 kernel: [986475.162041] md: delaying data-check of md1 until md0 has finished (they share one or more physical units)
Jan 3 00:57:01 barambo1 kernel: [986475.164449] md: delaying data-check of md2 until md0 has finished (they share one or more physical units)
Jan 3 00:57:01 barambo1 kernel: [986475.164455] md: delaying data-check of md1 until md2 has finished (they share one or more physical units)
Jan 3 00:57:01 barambo1 kernel: [986475.166695] md: delaying data-check of md3 until md0 has finished (they share one or more physical units)
Jan 3 00:57:01 barambo1 kernel: [986475.166699] md: delaying data-check of md1 until md3 has finished (they share one or more physical units)
Jan 3 00:57:01 barambo1 kernel: [986475.166705] md: delaying data-check of md2 until md3 has finished (they share one or more physical units)
Jan 3 00:58:13 barambo1 kernel: [986547.257883] md: md0: data-check done.
Jan 3 00:58:13 barambo1 kernel: [986547.276663] md: delaying data-check of md1 until md3 has finished (they share one or more physical units)
Jan 3 00:58:13 barambo1 kernel: [986547.276668] md: data-check of RAID array md3
Jan 3 00:58:13 barambo1 kernel: [986547.276671] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Jan 3 00:58:13 barambo1 kernel: [986547.276674] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Jan 3 00:58:13 barambo1 kernel: [986547.276678] md: using 128k window, over a total of 122126016 blocks.
Jan 3 00:58:13 barambo1 kernel: [986547.276681] md: delaying data-check of md2 until md3 has finished (they share one or more physical units)
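
For reading excerpts like the one above, a small Python sketch can list the arrays whose data-check started but never logged a "done" line. In this log that is md3, the array being checked when the messages stop. The sketch assumes each kernel message sits on a single line (mail wrapping undone), and the name unfinished_checks is made up for illustration; it is not part of mdadm.

```python
import re

# Patterns for the md messages that appear in the kern.log excerpt.
START_RE = re.compile(r"md: data-check of RAID array (md\d+)")
DONE_RE = re.compile(r"md: (md\d+): data-check done\.")


def unfinished_checks(log_lines):
    """Return arrays whose data-check started but has no later 'done' line."""
    running = []
    for line in log_lines:
        started = START_RE.search(line)
        if started:
            running.append(started.group(1))
            continue
        done = DONE_RE.search(line)
        if done and done.group(1) in running:
            running.remove(done.group(1))
    return running


if __name__ == "__main__":
    sample = [
        "Jan 3 00:57:01 barambo1 kernel: [986475.159596] md: data-check of RAID array md0",
        "Jan 3 00:58:13 barambo1 kernel: [986547.257883] md: md0: data-check done.",
        "Jan 3 00:58:13 barambo1 kernel: [986547.276668] md: data-check of RAID array md3",
    ]
    print(unfinished_checks(sample))  # ['md3']
```

The "delaying data-check" lines are ignored on purpose, since they only say that a check is queued behind another array on the same disks.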

The OS is Debian 5.0.3 Lenny stable with the linux-image-2.6.30-bpo.2-686
kernel. I had the same results with the stock linux-image-2.6.26-2-686
kernel. My basic question is: can this happen because I use 2 different
drives? I have a chance to replace GB0250C8045 with GB0250EAFYK, or
GB0250EAFYK with GB0250C8045, and have 2 identical drives. Is it a good
idea, and will it solve my problem?

Thank you in advance for any input,

Best Regards,

George Chelidze



--
To UNSUBSCRIBE, email to debian-isp-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
 
Old 01-04-2010, 09:36 AM
"Ross Halliday"
 
Default RAID1 problem - server freezes on md data-check

The total lock-up sounds like a problem that someone who develops the
software might be able to help with (I am reminded of a bug in Ubuntu
where checkarray would completely freeze or reboot certain systems on
Linux 2.6.24 or so). Aside from any bugs, the checkarray function is
definitely a pain on a production system.

You can try changing the disks so that both of them run at 3.0 Gbps;
this may speed up the process. Otherwise I would suggest checking out
the help for /usr/share/mdadm/checkarray and modifying the system cron
job (see /etc/cron.d/mdadm) so the checks are timed and staggered per
array the way you like.
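
To illustrate what checking one array at a time could look like, here is a minimal Python sketch that talks to the kernel's md interface directly instead of going through checkarray. It relies on two kernel files: /sys/block/mdX/md/sync_action (writing "check" starts a data-check) and /proc/sys/dev/raid/speed_limit_max (the ceiling reported as "not more than 200000 KB/sec" in the log). The function names and the sysfs_root/proc_root parameters are assumptions made for this example, and it needs root to do anything on a real system.

```python
from pathlib import Path


def request_check(array, sysfs_root="/sys"):
    """Ask the md driver to start a data-check on a single array, e.g. 'md0'."""
    action = Path(sysfs_root) / "block" / array / "md" / "sync_action"
    action.write_text("check\n")


def cap_resync_speed(kb_per_sec, proc_root="/proc"):
    """Lower the global resync/check ceiling (KB/sec per device)."""
    limit = Path(proc_root) / "sys" / "dev" / "raid" / "speed_limit_max"
    limit.write_text(f"{int(kb_per_sec)}\n")
```

A cron entry per array could then call request_check("md0") on one night and request_check("md3") on another, with cap_resync_speed() set beforehand to leave I/O headroom for production traffic. Which value is low enough to help is something to find by experiment.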

Cheers

---
Ross Halliday
Network Operations
WTC Communications





 
Old 01-04-2010, 11:34 AM
Thomas Goirand
 
Default RAID1 problem - server freezes on md data-check

Ross Halliday wrote:
> Aside from any bugs that checkarray
> function is definitely a pain on a production system.

Well, it's even more of a pain to have no monthly check at all and have
your drive silently die without a warning. Also, my finding is that,
most of the time, such lock-ups happen only with certain kinds of
controllers, or with a defective (half-working) HDD.

Thomas


 
Old 01-05-2010, 04:22 AM
George Chelidze
 
Default RAID1 problem - server freezes on md data-check

First let me say thank you to all who shared their experience and
knowledge. It was really helpful.

Yesterday I managed to replace the 1.5 Gb/s drive with a 3.0 Gb/s drive,
and now both drives are identical. The replacement required a rebuild of
the array, which completed, but with one exception: at the end of the
reconstruction process I got "task * blocked for more than 120 seconds"
messages in my logs:

Jan 4 23:38:35 barambo1 kernel: [12517.683173] INFO: task kjournald:1088 blocked for more than 120 seconds.
Jan 4 23:38:35 barambo1 kernel: [12517.683227] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 4 23:38:35 barambo1 kernel: [12517.683310] kjournald D 0735ccb1 0 1088 2
Jan 4 23:38:35 barambo1 kernel: [12517.683313] f78ef0c0 00000046 f7817c28 0735ccb1 00000a43 f78ef24c c180bfc0 00000000
Jan 4 23:38:35 barambo1 kernel: [12517.683319] f7495edc 008e3f0c 000006b2 00000000 008e3f0c f7495edc 008e3f0c f7939f18
Jan 4 23:38:35 barambo1 kernel: [12517.683326] c180bfc0 01451000 f7939f18 c1801688 c02b8a70 f7939f10 00000000 c019098e
Jan 4 23:38:35 barambo1 kernel: [12517.683332] Call Trace:
Jan 4 23:38:35 barambo1 kernel: [12517.683339] [<c02b8a70>] io_schedule+0x49/0x80
Jan 4 23:38:35 barambo1 kernel: [12517.683343] [<c019098e>] sync_buffer+0x30/0x33
Jan 4 23:38:35 barambo1 kernel: [12517.683347] [<c02b8c5e>] __wait_on_bit+0x33/0x58
Jan 4 23:38:35 barambo1 kernel: [12517.683351] [<c019095e>] sync_buffer+0x0/0x33
Jan 4 23:38:35 barambo1 kernel: [12517.683355] [<c019095e>] sync_buffer+0x0/0x33
Jan 4 23:38:35 barambo1 kernel: [12517.683358] [<c02b8ce2>] out_of_line_wait_on_bit+0x5f/0x67
Jan 4 23:38:35 barambo1 kernel: [12517.683364] [<c01319c9>] wake_bit_function+0x0/0x3c
Jan 4 23:38:35 barambo1 kernel: [12517.683369] [<c019092a>] __wait_on_buffer+0x16/0x18
Jan 4 23:38:35 barambo1 kernel: [12517.683373] [<f894fd7a>] journal_commit_transaction+0x6cf/0xb3d [jbd]
Jan 4 23:38:35 barambo1 kernel: [12517.683386] [<c0129b2c>] lock_timer_base+0x19/0x35
Jan 4 23:38:35 barambo1 kernel: [12517.683393] [<f8952468>] kjournald+0xa5/0x1c6 [jbd]
Jan 4 23:38:35 barambo1 kernel: [12517.683402] [<c013199c>] autoremove_wake_function+0x0/0x2d
Jan 4 23:38:35 barambo1 kernel: [12517.683406] [<f89523c3>] kjournald+0x0/0x1c6 [jbd]
Jan 4 23:38:35 barambo1 kernel: [12517.683414] [<c01318db>] kthread+0x38/0x5d
Jan 4 23:38:35 barambo1 kernel: [12517.683417] [<c01318a3>] kthread+0x0/0x5d
Jan 4 23:38:35 barambo1 kernel: [12517.683421] [<c01044f3>] kernel_thread_helper+0x7/0x10
Jan 4 23:38:35 barambo1 kernel: [12517.683426] =======================
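
Since similar messages were logged for several processes, a short Python sketch can summarise which tasks the kernel reported as hung and how often. It only matches the "INFO: task NAME:PID blocked for more than N seconds" line shown above. The function name blocked_tasks is made up for this example, and whitespace is collapsed first so that mail-wrapped log lines still match.

```python
import re
from collections import Counter

# Matches the hung-task report line emitted by the kernel.
HUNG_RE = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")


def blocked_tasks(log_text):
    """Count hung-task reports per task name in a kernel log excerpt."""
    # Collapse all whitespace so a message split across lines is rejoined.
    flat = " ".join(log_text.split())
    return Counter(match.group(1) for match in HUNG_RE.finditer(flat))


if __name__ == "__main__":
    excerpt = (
        "Jan 4 23:38:35 barambo1 kernel: [12517.683173] INFO: task\n"
        "kjournald:1088 blocked for more than 120 seconds.\n"
    )
    print(blocked_tasks(excerpt))  # Counter({'kjournald': 1})
```

If most of the reported tasks are waiting in io_schedule, as kjournald is in the trace above, that points at slow or stuck block I/O rather than at one misbehaving process.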

(Please check the attached file for similar messages from other
processes.) However, after several minutes the server returned to its
normal state and has been working fine since then. It is now running the
stock linux-image-2.6.26-2-686 kernel. Any ideas?

Best Regards,

George Chelidze
 
Old 01-05-2010, 11:07 AM
Thomas Goirand
 
Default RAID1 problem - server freezes on md data-check

George Chelidze wrote:
> First let me say thank you to all who shared their experience and
> knowledge. It was really helpful.
>
> Yesterday I managed to replace 1.5Gb/s drive with 3.0Gb/s drive and now
> both drives are identical. The replacement required to rebuild an array
> and it passed but with one exception: at the end of reconstruction
> process I got "task * blocked for more than 120 seconds" messages in my
> logs:
>
> Jan 4 23:38:35 barambo1 kernel: [12517.683173] INFO: task
> kjournald:1088 blocked for more than 120 seconds.

I have never had this. However, my understanding is that it is related to
the ext3 journaling filesystem (since it is kjournald that is blocked for
2 minutes), not to RAID (which would be mdadm, mdX_raidY and the like...),
and that it shouldn't block anything apart from writing the journal.
Was the server in a frozen state when this appeared in your log?

Just my 2 cents guess here,

Thomas


 
Old 01-05-2010, 01:59 PM
micah anderson
 
Default RAID1 problem - server freezes on md data-check

On Mon, 04 Jan 2010 20:34:09 +0800, Thomas Goirand <thomas@goirand.fr> wrote:
> Ross Halliday wrote:
> > Aside from any bugs that checkarray
> > function is definitely a pain on a production system.

I have this same problem with the Lenny kernels on certain machines. I
have not yet been able to identify anything specific that the machines
where this happens have in common. Essentially, on these systems, the
monthly raid check requires a reboot, as the drive subsystem becomes so
blocked that the load goes over 500 and the raid resync never
completes. I can wait for days and it won't finish.
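
One way to notice a check that has stopped making progress, instead of waiting for days, is to sample the kernel's progress counter twice. The Python sketch below reads /sys/block/mdX/md/sync_completed, which on kernels that expose it holds "done / total" in sectors, or "none" when nothing is running. The function names, the sysfs_root parameter and the 60-second default are assumptions for illustration only.

```python
import time
from pathlib import Path


def sync_progress(array, sysfs_root="/sys"):
    """Return (done, total) sectors for a running check/resync, else None."""
    path = Path(sysfs_root) / "block" / array / "md" / "sync_completed"
    text = path.read_text().strip()
    if "/" not in text:  # e.g. "none" when the array is idle
        return None
    done, total = text.split("/")
    return int(done), int(total)


def is_stalled(array, wait_seconds=60, sysfs_root="/sys"):
    """True if a check is running but made no progress during the wait."""
    before = sync_progress(array, sysfs_root)
    time.sleep(wait_seconds)
    after = sync_progress(array, sysfs_root)
    return before is not None and before == after
```

A monitoring job could call is_stalled("md3") and raise an alert, or write "idle" to the array's sync_action file to abort the check, before the load climbs far enough to cause an outage.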

If I reboot the system and sync the raid arrays before anything starts
to use that particular partition, then everything works fine.

On these systems I disable the monthly raid check. It's not the right
solution, obviously, but it sucks to wake up on Sunday morning to find
multiple outages due to this scheduled raid check.

> Well, it's even more a pain to have no monthly check at all, and have
> your drive silently die without a warning. Also, my findings is that
> most of the time, such lock-up happens only on certain kind of
> controllers, or with defective (half working) HDD.

I agree silent drive death is bad, but in a raid mirror setup, if one of
the drives dies, won't you be fine?

I am pretty certain it's not a particular type of controller, because I
have a number of machines with duplicate hardware; some have this problem,
some do not. The 'half working' HDD was my theory as well, but SMART
tests and badblocks don't seem to turn anything up.

m
 
Old 01-05-2010, 02:49 PM
Peter Vratny
 
Default RAID1 problem - server freezes on md data-check

micah anderson wrote:
> I have this same problem with the Lenny kernels on certain machines. I
> have not been able to identify anything specific that is identical on
> the machines where this happens yet. Essentially, on these systems, the

It is the same here. The problem was introduced with Lenny. All our
machines where this happens are IBM HS20 blades with IDE and hardware RAID
disabled (using md and ext3).

> On these systems I disable the monthly raid check, its not the right
> solution obviously, but it sucks to wake up on Sunday morning to find
> multiple outages due to this scheduled raid check.

That's what we did, too (you are lucky that your monitoring lets you
sleep until the morning :-)).


>> Well, it's even more a pain to have no monthly check at all, and have
>> your drive silently die without a warning. Also, my findings is that
>> most of the time, such lock-up happens only on certain kind of
>> controllers, or with defective (half working) HDD.
>
> I agree silent drive death is bad, but in a raid mirror setup, if one of
> the drives dies, wont you be fine?
>
> I am pretty certain its not a particular type of controller, because I
> have a number of duplicate hardware machines, some have this problem,
> some do not. The 'half working' HDD was my theory as well, but smart
> tests, badblocks doesn't seem to do anything.

I second this. IMHO it's a problem in the kernel (or rather some driver). I
hoped this would end with some upgrade, but it did not (we're using the
stock kernel).

ys
Peter

--
"Whoever has nothing to hide has already lost everything"
http://klicklich.at
 
Old 01-05-2010, 03:56 PM
Thomas Goirand
 
Default RAID1 problem - server freezes on md data-check

micah anderson wrote:
> I am pretty certain its not a particular type of controller, because I
> have a number of duplicate hardware machines, some have this problem,
> some do not. The 'half working' HDD was my theory as well, but smart
> tests, badblocks doesn't seem to do anything.
>
> m

Google has published some very interesting statistics about SMART from
their server farm. The result was that SMART catches about 60% of the
failures and sees nothing for the other 40%. So it's not a very reliable
test (I'm not saying it's useless, just that it won't catch all failures).

As for the badblocks, how did you check for them?

Altogether, I really think that RAID and HDD failures are a big
issue that we, as providers, have to deal with. I wish there were some
reliable solutions out there, considering the imperfections of RAID
(hardware OR software, both have issues...).

Thomas


 
Old 01-05-2010, 04:00 PM
Thomas Goirand
 
Default RAID1 problem - server freezes on md data-check

Peter Vratny wrote:
> I second this. Imho its a problem of the kernel (resp. some driver). i
> hoped this would end with some upgrade, it did not (we're using stock
> kernel).

So you guys are saying this is in the SATA driver? If so, then which
SATA controller are you running? It would be interesting to know
whether you all have the same hardware (and are thus using the same driver).

Thomas


 
Old 01-05-2010, 09:02 PM
Peter Vratny
 
Default RAID1 problem - server freezes on md data-check

Thomas Goirand wrote:
> So you guys are saying this is in the sata driver? If so, then what's
> the SATA controler that you are running? It would be interesting to know
> if you all got the same hardware (and then using the same driver).

No, the opposite; that's why I mentioned that it's IDE (PATA) on our blades...

I found 3 of them in a quick search, all with this controller:

[ 2.258305] SvrWks CSB6: IDE controller (0x1166:0x0213 rev 0xb0) at PCI slot 0000:00:0f.1
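
To compare controllers across the affected machines, the vendor and device IDs can be pulled out of a boot message like the one above. This Python sketch assumes the message has the same shape as that line; the function name controller_id and the returned dictionary keys are invented for the example.

```python
import re

# Matches "IDE controller (0xVVVV:0xDDDD rev 0xRR) at PCI slot SLOT".
IDE_RE = re.compile(
    r"IDE controller \(0x([0-9a-fA-F]{4}):0x([0-9a-fA-F]{4}) "
    r"rev 0x([0-9a-fA-F]+)\) at PCI slot (\S+)"
)


def controller_id(log_text):
    """Extract PCI vendor/device/revision/slot from an IDE probe message."""
    # Collapse whitespace so a message wrapped over two lines still matches.
    flat = " ".join(log_text.split())
    match = IDE_RE.search(flat)
    if match is None:
        return None
    vendor, device, rev, slot = match.groups()
    return {
        "vendor": vendor.lower(),
        "device": device.lower(),
        "rev": rev.lower(),
        "slot": slot,
    }
```

Running this over the boot logs of machines that freeze and machines that do not would show quickly whether the vendor:device pair is something they have in common.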

yours
Peter
--
"Whoever has nothing to hide has already lost everything"
http://klicklich.at
 
