09-26-2011, 08:09 PM
Benjamin Smith

Hard I/O lockup with EL6

On Monday, September 26, 2011 12:36:19 PM m.roth@5-cent.us wrote:
> a) have you checked /var/log/messages for memory or drive errors?

Looked through the logs, there's *nothing* I can find that's out of sorts. When
the IO problem happens, nothing can be written.
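
For anyone repeating this check, here is a minimal sketch of a log scan. It assumes a Python 3 interpreter (not the EL6 system Python) and uses only the error strings quoted from the console elsewhere in this thread; the pattern list and the function name are illustrative, not a complete set of kernel storage errors. Since nothing can be written once the disk goes offline, a clean result only rules out earlier warnings.

```python
import re

# Patterns taken from the console messages quoted elsewhere in this thread;
# the list is illustrative, not an exhaustive set of kernel storage errors.
PATTERNS = [
    re.compile(r"I/O error"),
    re.compile(r"rejecting I/O to offline device"),
    re.compile(r"diag reset: FAILED"),
]


def find_storage_errors(lines):
    """Return (line_number, text) pairs for lines matching any pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(p.search(line) for p in PATTERNS):
            hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # /var/log/messages is the usual syslog file on EL6; reading it
    # normally requires root.
    with open("/var/log/messages", errors="replace") as log:
        for number, text in find_storage_errors(log):
            print(f"{number}: {text}")
```

The same function can be pointed at a saved copy of the log from a working machine to confirm that the patterns do not match there either.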

> Maybe memtest86?

I replaced all the RAM from working/non-working machines. In several cases
where replacing RAM resolved the issue, memtest didn't indicate any problems,
so I'm not inclined to trust it.

> b) diffed
> dmesg between working and dying machines?

Other than the IRQ difference noted earlier, visual scan revealed no differences
involving mpt2.
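
A visual scan can miss things, so the comparison could also be scripted. The sketch below is one possible approach, assuming Python 3 and two saved dmesg captures (one per machine); the IRQ pattern and the function names are assumptions, since the exact wording of IRQ messages varies by driver. It strips boot timestamps, masks IRQ numbers (the difference already known to be benign), and diffs only the lines mentioning a keyword.

```python
import difflib
import re

# Leading "[   12.345678] " timestamps differ on every boot, so strip them.
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")
# Mask IRQ numbers so the known-benign difference does not hide anything
# else. The wording of IRQ messages varies; this pattern is an assumption.
IRQ = re.compile(r"\b(IRQ|irq)[ :=]*\d+")


def normalize(lines):
    """Drop timestamps and replace IRQ numbers with a placeholder."""
    cleaned = []
    for line in lines:
        line = TIMESTAMP.sub("", line.rstrip("\n"))
        line = IRQ.sub(r"\1 N", line)
        cleaned.append(line)
    return cleaned


def dmesg_diff(working, failing, keyword="mpt"):
    """Unified diff of the lines containing `keyword`, after normalizing."""
    a = [line for line in normalize(working) if keyword in line]
    b = [line for line in normalize(failing) if keyword in line]
    return list(
        difflib.unified_diff(a, b, fromfile="working", tofile="failing", lineterm="")
    )
```

An empty result means the keyword-matching lines are identical apart from timestamps and IRQ numbers; anything else is printed in the usual unified-diff form.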

>
> One more thing: should we assume you were trying to do things, when they
> die, from the console? I ask because I note that you're using the e1000e
> driver, which was just the subject of a thread here.

I'm familiar with the stale EL6 e1000e driver. I've been using the one from
elrepo, installed from a manually downloaded RPM so that Ethernet works before
doing a yum -y update. I've been assuming this was unrelated.

--
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.

_______________________________________________
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos
 
09-26-2011, 09:00 PM
Brian McKerr

Hard I/O lockup with EL6

Have you checked the cables you are using?


 
09-26-2011, 09:40 PM
Benjamin Smith

Hard I/O lockup with EL6

On Monday, September 26, 2011 02:00:52 PM Brian McKerr wrote:
> Have you checked the cables you are using?

There are none - it's a front-loaded hot-swap rackmount. The systems are
stable under EL5.

 
09-26-2011, 09:42 PM
Devin Reade

Hard I/O lockup with EL6

--On Monday, September 26, 2011 12:11:47 PM -0700 Benjamin Smith
<lists@benjamindsmith.com> wrote:

> I'm trying to figure out why 2 machines have a "hard I/O lock" on the HDD
> when running EL6.

I _won't_ chime in with a "check your <whatever>". Instead here's a
potentially useless datapoint:

I have an older but still usable 32-bit 686-class machine that was formerly
a production machine running Fedora Core 6. Its services were migrated
off a while back and I decided I'd use it as a test of CentOS 6. For
this test I needed a few disks in RAID6 and the motherboard only had
two SATA ports so I added a multiport PCI SATA card (a model that
has served me well in the past).

Short version: Although the install went fine, trying to run CentOS 6
on this with a four disk RAID6 (with the first 200MB of each disk in
RAID1 for /boot, the remainder as RAID6 with LVM on top) resulted in
an unstable system. After some unpredictable amount of time (anywhere
from 15 minutes to days) the system would lock up hard. Unfortunately
I don't recall if the error messages were identical to yours, but it
seems eerily familiar.

I tried the usual tricks: swapping out drive controllers and disks,
using different combinations of onboard vs. add-on SATA, memtest86,
increased power supply capacity, etc. No dice.

I eventually ended up getting new hardware for the task (an HP
MicroServer) and so far the new machine seems to be stable enough
running CentOS 6 in the RAID1 /boot + RAID6 LVM configuration. I've
not had the chance yet to go back and experiment with the old
machine under C6.

Unfortunately in trying to use C6 on the old machine I wound up with
far too many changed variables to figure out where the problem was.
Despite that, my gut tells me that it's not a hardware problem.

Devin

 
09-26-2011, 10:13 PM
Benjamin Smith

Hard I/O lockup with EL6

On Monday, September 26, 2011 02:42:18 PM Devin Reade wrote:
> --On Monday, September 26, 2011 12:11:47 PM -0700 Benjamin Smith
> Unfortunately in trying to use C6 on the old machine I wound up with
> far too many changed variables to figure out where the problem was.
> Despite that, my gut tells me that it's not a hardware problem.

Thanks for the feedback. Unfortunately, these aren't ancient 686 systems; they
are 1-ish year old 8-core Intel Xeons with 32 GB of ECC RAM apiece. I can't
justify replacing them, especially since two of the four are happily
delivering gorgeous performance!


 
09-26-2011, 10:22 PM
Scott Silva

Hard I/O lockup with EL6

on 9/26/2011 3:13 PM Benjamin Smith spake the following:
> On Monday, September 26, 2011 02:42:18 PM Devin Reade wrote:
>> --On Monday, September 26, 2011 12:11:47 PM -0700 Benjamin Smith
>> Unfortunately in trying to use C6 on the old machine I wound up with
>> far too many changed variables to figure out where the problem was.
>> Despite that, my gut tells me that it's not a hardware problem.
>
> Thanks for the feedback. Unfortunately, these aren't ancient 686 systems; they
> are 1-ish year old 8-core Intel Xeons with 32 GB of ECC RAM apiece. I can't
> justify replacing them, especially since two of the four are happily
> delivering gorgeous performance!
>
>
Came in late, but I suppose you tried the standard things, like re-seating
anything that is removable? Cards, memory, etc...


 
09-26-2011, 10:32 PM
Devin Reade

Hard I/O lockup with EL6

--On Monday, September 26, 2011 03:13:09 PM -0700 Benjamin Smith
<lists@benjamindsmith.com> wrote:

> Thanks for the feedback. Unfortunately, these aren't ancient 686 systems;
> they are 1-ish year old 8-core Intel Xeons with 32 GB of ECC RAM apiece.
> I can't justify replacing them, especially since two of the four are
> happily delivering gorgeous performance!

No doubt.

My comment about replacing it was just a statement about what happened
in my case (since fighting with old hardware is not exactly a valuable use
of my time). It wasn't a suggestion.

The post itself was not much more than "you're not the only person
seeing something wonky". Mine is unfortunately lacking in hard data.

Devin

 
09-26-2011, 10:52 PM
Ross Walker

Hard I/O lockup with EL6

On Sep 26, 2011, at 3:11 PM, Benjamin Smith <lists@benjamindsmith.com> wrote:

> I'm trying to figure out why 2 machines have a "hard I/O lock" on the HDD when
> running EL6.
>
> I have 4 identical machines; all were stable with EL5. 2 work great with EL6,
> 2 do not. I've checked motherboard BIOS versions and settings and SAS
> controller BIOS versions and settings; they are the same between the working
> and non-working systems.
>
> When booting a non-working system, it boots straight up to the boot prompt
> (runlevel 3) without issue, and everything works fine. When the machine sits
> idle for a period of time (ranging from 15 minutes or so and up) the HDD
> becomes unreadable/unwritable and the system is useless for any purpose and
> must be hard restarted with a full power cycle - it won't even shut down.
>
> Since nothing is logged, I've had precious little information to diagnose
> with. After several attempts to find out what's going on, I find the following
> emitted to the screen:
>
> mpt2sas0: diag reset: FAILED
> mpt2sas0: diag reset: FAILED
> mpt2sas0: diag reset: FAILED
> end_request: I/O error, dev sda, sector 226972349
> Buffer I/O error, device sda5, logical block 2719747
> sd 0:0:0:0rejecting I/O to offline device
> sd 0:0:0:0rejecting I/O to offline device
> sd 0:0:0:0rejecting I/O to offline device
>
> This is NOT due to a faulty HDD: I've tried new hard disks, SATA/SAS, I've
> swapped hard disks with an identical working unit and verified that the working
> unit remains working and the failing unit continues to fail. I've reformatted
> and re-installed EL6 numerous times with consistent results.
>
> Googling this error returned very little useful information: where should I go
> now? Below, please find outputs of dmesg and lspci. I've compared outputs of
> dmesg between working and nonworking systems, the output of anything with
> "mpt" at the beginning is identical except for different IRQ ports.

Tried upgrading BIOS?

Errors during idle periods might point to C-State or P-State compatibility issues.

You could try disabling power management (SpeedStep) in the BIOS and see if that makes a difference.
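
As a complementary software-side check when comparing the working and failing machines, one could confirm whether either was booted with a C-state cap. The sketch below assumes Python 3 and looks only for the `processor.max_cstate` and `intel_idle.max_cstate` kernel parameters; it does not read or change BIOS settings, and whether capping C-states helps with this lockup is untested here.

```python
# Kernel parameters that cap idle states. Both are documented kernel
# options; whether they affect this particular lockup is an open question.
CSTATE_PARAMS = ("processor.max_cstate", "intel_idle.max_cstate")


def cstate_limits(cmdline):
    """Return {param: int_value} for any C-state caps on a kernel command line."""
    limits = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        if sep and name in CSTATE_PARAMS and value.isdigit():
            limits[name] = int(value)
    return limits


if __name__ == "__main__":
    # /proc/cmdline holds the command line the running kernel was booted with.
    with open("/proc/cmdline") as f:
        found = cstate_limits(f.read())
    if found:
        for name, value in sorted(found.items()):
            print(f"{name}={value}")
    else:
        print("no C-state cap on the kernel command line")
```

Running it on each of the four machines gives a quick way to confirm that the boot-time idle settings match before attributing the difference to the BIOS.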

-Ross

 
09-27-2011, 05:16 AM
Emmanuel Noobadmin

Hard I/O lockup with EL6

On 9/27/11, Benjamin Smith <lists@benjamindsmith.com> wrote:
> When booting a non-working system, it boots straight up to the boot prompt
> (runlevel 3) without issue, and everything works fine. When the machine sits
> idle for a period of time (ranging from 15 minutes or so and up) the HDD
> becomes unreadable/unwritable and the system is useless for any purpose and
> must be hard restarted with a full power cycle - it won't even shut down.

I'm thinking I might have a similar problem with my test install of
EL6. Initially I had dismissed it as a one-off but it has apparently
locked up again. I'll be visiting it physically later and maybe have
more information to share. However, it is also a 3400-series Xeon; the
rest of the hardware is all cheaper.

What drives are you using in these servers?
 
09-27-2011, 05:55 AM
Benjamin Smith

Hard I/O lockup with EL6

On Monday, September 26, 2011 10:16:14 PM Emmanuel Noobadmin wrote:
> On 9/27/11, Benjamin Smith <lists@benjamindsmith.com> wrote:
> > When booting a non-working system, it boots straight up to the boot
> > prompt (runlevel 3) without issue, and everything works fine. When the
> > machine sits idle for a period of time (ranging from 15 minutes or so
> > and up) the HDD becomes unreadable/unwritable and the system is useless
> > for any purpose and must be hard restarted with a full power cycle - it
> > won't even shut down.
>
> I'm thinking I might have a similar problem with my test install of
> EL6. Initially I had dismissed it as a one-off but it has apparently
> locked up again. I'll be visiting it physically later and maybe have
> more information to share. However, it is also a 3400-series Xeon; the
> rest of the hardware is all cheaper.
>
> What drives are you using in these servers?

The drives are confirmed irrelevant. SATA or SAS, from different manufacturers,
makes no difference; it's not a factor.

However, I did eventually find instructions for updating the LSI firmware on
the SAS 2008 controller of the X8SI6-F motherboard, which matches the board I
have. I've (roughly) followed the instructions and have achieved a full 8
hours of uptime, already well ahead of expectations. If it lasts 48 hours,
I'll consider it a success, and will post details here so that other techies
can find them.

http://www.servethehome.com/howto-flash-supermicro-x8si6f-lsi-sas-2008-controller-lsi-firmware/
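
The 48-hour criterion above can be checked mechanically. This is a minimal sketch, assuming Python 3; the function names and the default threshold are illustrative. It parses the first field of /proc/uptime (seconds since boot). A machine whose disk has gone offline will not be able to run it, so it is only a convenience for confirming the healthy case.

```python
def uptime_hours(proc_uptime_text):
    """Parse the first field of /proc/uptime (seconds since boot) into hours."""
    seconds = float(proc_uptime_text.split()[0])
    return seconds / 3600.0


def passed_soak(proc_uptime_text, target_hours=48.0):
    """True once the machine has stayed up for at least `target_hours`."""
    return uptime_hours(proc_uptime_text) >= target_hours


if __name__ == "__main__":
    with open("/proc/uptime") as f:
        text = f.read()
    print(f"up {uptime_hours(text):.1f} h, target reached: {passed_soak(text)}")
```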

I wish you the best of luck!
