Haven't gotten any tips on a solution to the problem below.
It happened again this weekend.
My next test steps (order not determined):
1. Downgrade to CentOS 4
2. Swap out PERC controller with a spare
I have never had a problem with the PERC4/DC controllers on our
other machines (RHEL3/4, CentOS 4). Then again, none of those
machines has five 300G Fujitsu SCSI drives.
Any suggestions on the below, or which order on the above to
I have a 6650 with a PERC4/DC running CentOS5.
After 1 to 3 weeks of operation (running VMware Server) it
'dies' (the RAID array gets taken offline) and you get
"rejecting I/O to offline device" errors.
When this system was set up late last year, both the 6650 and
the PERC4/DC were given all the latest firmware.
Using linttylog to pull the last entries from when the system
must have 'checked out' last night, I see the data attached below.
Some time back I thought I had cured this problem by adding
noapic to the kernel boot parameters in boot.conf. It had
gone away for a long time... but is now back.
According to linttylog, the controller firmware is:
T0: Firmware version 352D build on Mar 19 2007 at 17:43:23
T0: MegaRAID Series 518 firmware version 352D
Using
  strings tty.log | grep 'MedErr on pd' | cut -c17- | sort | uniq -c | sort -n
I see:
163 REC:log MedErr on pd #retries=0
165 REC:log MedErr on pd #retries=0
168 REC:log MedErr on pd #retries=0
If I am to believe this, Patrol Read is finding media errors on
physical drives 1, 2, and 4!?
These drives are not even a year old, and an almost even
distribution of errors across 3 drives seems far-fetched (unless
Patrol Read is reading past the end of the drive? But then it
would be doing that with all 5 drives).
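
For anyone who wants the same tally without the shell pipeline, here
is a rough Python sketch of what the grep | cut | sort | uniq -c |
sort -n chain above does. The function name count_mederr and its
parameters are made up for illustration. It assumes, as cut -c17-
does, that the first 16 characters of each line are a prefix to be
discarded, and that the input is already plain text lines (the job
strings does in the pipeline).

```python
from collections import Counter


def count_mederr(lines, marker="MedErr on pd", skip=16):
    """Tally matching log lines, roughly like:
    grep marker | cut -c17- | sort | uniq -c | sort -n

    `lines` is any iterable of text lines. `skip` is the number of
    leading characters to drop (cut -c17- keeps column 17 onward).
    Returns a list of (message, count) pairs, lowest count first.
    """
    counts = Counter(
        line.rstrip("\n")[skip:]   # drop the assumed 16-character prefix
        for line in lines
        if marker in line          # same filter as the grep step
    )
    # Sort by count, then by message so ties come out in a stable order.
    return sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))


if __name__ == "__main__":
    # Hypothetical usage: tty.log is the file name from the pipeline above.
    with open("tty.log", errors="replace") as fh:
        for message, count in count_mederr(fh):
            print(f"{count:7d} {message}")
```

If the per-drive number appears in the message text, each drive gets
its own row, so a near-even spread across drives is easy to spot.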
Is the PERC busted? Or is it a driver issue?
I'm running CentOS 5 with kernel 2.6.18-53.1.13.el5PAE
From dmesg, the megaraid-related driver versions are:
megaraid cmm: 220.127.116.11 (Release Date: Sun Jul 16 00:01:03 EST 2006)
megaraid: 18.104.22.168 (Release Date: Thu Nov 16 15:32:35 EST 2006)
Anyone seen this behavior before ? Anyone have a solution ?
We have several Dells in a hosting environment with PERC4/DC
running RHEL3, RHEL4.X, and CentOS4.X. We have not had this
issue on any of them, though unlike this one, none of them has
5 300G Fujitsu SCSI drives in a RAID 5 config.
Hoping someone can shed some light on this... so far I keep
coming up short on finding a solution.
Here is the full content of the last lines recorded in the PERC
as pulled by linttylog: