11-15-2008, 07:16 AM
"Rudi Ahlers"

how to debug hardware lockups?

Hi,

We have a server which locks up about once a week (for the past 3
weeks now), without any warning, and the only way to recover it is to
reset the server. This causes unwanted downtime, and often software
loss as well.

How do I debug the server, which runs CentOS 5.2, to see why it locks
up? The CPU is an Intel Q9300 Core 2 Quad, with 8 GB RAM, on an Intel
motherboard.

The last few log entries before the server froze are:


Nov 15 07:15:20 saturn snmpd[2527]: Connection from UDP: [127.0.0.1]:59008
Nov 15 07:15:20 saturn snmpd[2527]: Received SNMP packet(s) from UDP: [127.0.0.1]:59008
Nov 15 07:15:20 saturn snmpd[2527]: Connection from UDP: [127.0.0.1]:47729
Nov 15 07:15:20 saturn snmpd[2527]: Received SNMP packet(s) from UDP: [127.0.0.1]:47729
Nov 15 07:15:20 saturn snmpd[2527]: Connection from UDP: [127.0.0.1]:47890
Nov 15 07:15:20 saturn snmpd[2527]: Received SNMP packet(s) from UDP: [127.0.0.1]:47890
Nov 15 07:15:20 saturn snmpd[2527]: Connection from UDP: [127.0.0.1]:50023
Nov 15 07:15:20 saturn snmpd[2527]: Received SNMP packet(s) from UDP: [127.0.0.1]:50023
Nov 15 07:15:20 saturn snmpd[2527]: Connection from UDP: [127.0.0.1]:58459
Nov 15 07:15:20 saturn snmpd[2527]: Received SNMP packet(s) from UDP: [127.0.0.1]:58459
Nov 15 10:10:10 saturn syslogd 1.4.1: restart.
Nov 15 10:10:11 saturn kernel: klogd 1.4.1, log source = /proc/kmsg started.
Nov 15 10:10:11 saturn kernel: Bootdata ok (command line is ro root=/dev/System/root)
Nov 15 10:10:11 saturn kernel: Linux version 2.6.18-92.1.17.el5xen (mockbuild@builder10.centos.org) (gcc version 4.1.2 20071124 (Red Hat 4.1.2-42)) #1 SMP Tue Nov 4 14:13:09 EST 2008
Nov 15 10:10:11 saturn kernel: BIOS-provided physical RAM map:
Nov 15 10:10:11 saturn kernel: Xen: 0000000000000000 - 00000001ef958000 (usable)
Nov 15 10:10:11 saturn kernel: DMI 2.4 present.
Nov 15 10:10:11 saturn kernel: ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
Nov 15 10:10:11 saturn kernel: ACPI: LAPIC (acpi_id[0x03] lapic_id[0x02] enabled)
Nov 15 10:10:11 saturn kernel: ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
Nov 15 10:10:11 saturn kernel: ACPI: LAPIC (acpi_id[0x04] lapic_id[0x03] enabled)
Nov 15 10:10:11 saturn kernel: ACPI: LAPIC_NMI (acpi_id[0x01] dfl dfl lint[0x1])
Nov 15 10:10:11 saturn kernel: ACPI: LAPIC_NMI (acpi_id[0x02] dfl dfl lint[0x1])
Nov 15 10:10:11 saturn kernel: ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
Nov 15 10:10:11 saturn kernel: IOAPIC[0]: apic_id 2, version 32, address 0xfec00000, GSI 0-23
Nov 15 10:10:11 saturn kernel: ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
Nov 15 10:10:11 saturn kernel: ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)




--

Kind Regards
Rudi Ahlers
_______________________________________________
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos
 
11-15-2008, 01:47 PM
"Richard Karhuse"

how to debug hardware lockups?

On Sat, Nov 15, 2008 at 3:16 AM, Rudi Ahlers <rudiahlers@gmail.com> wrote:

> Hi,
>
> We have a server which locks up about once a week (for the past 3
> weeks now), without any warning, and the only way to recover it is to
> reset the server. This causes unwanted downtime, and often software
> loss as well.
>
> How do I debug the server, which runs CentOS 5.2, to see why it locks
> up? The CPU is an Intel Q9300 Core 2 Quad, with 8 GB RAM, on an Intel
> motherboard.


Attach a local console to the video port and let us know what it says;
that will (probably) be very insightful, e.g., kernel panic, MCE, ....

Next, run memtest86+ -- at least overnight. [Note: I've had less than
stellar results with memtest86 recently, but if it shows errors, you've got
a problem big time; if it doesn't show errors, you're still not 100% sure that
the memory is good :-) :-).] Is it ECC memory? If not, why not -- particularly
given it is a critical server ....

Are all the fans spinning -- particularly the CPU fan? Do you have lm-sensors
enabled? Either create a script or use something like munin to track things
and see if fans, temperatures, and voltages are all stable & within range up to death.

Can you easily swap power supplies? (Is the unit dual powered or just
one unit?)

Clearly, just a start, but you get the idea of elementary, 101 problem solving ....

-rak-
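
A minimal sketch of the "create a script" option, in Python. It assumes the lm-sensors `sensors` command is installed and actually detects the board's chips. The log path and all function names are illustrative assumptions, not anything from this thread. Each sample is flushed and fsync'd so the last reading before a lockup has a chance of surviving on disk, and a heartbeat line is written even when no sensor data comes back, which at least narrows down the time of the freeze.

```python
"""Append timestamped `sensors` readings to a log file, one sample per call."""
import os
import subprocess
import time

LOG_PATH = "/var/log/sensors-trace.log"  # hypothetical path


def format_sample(timestamp, sensors_output):
    """Prefix every non-empty line of sensors output with the timestamp."""
    lines = [ln.strip() for ln in sensors_output.splitlines() if ln.strip()]
    if not lines:
        # Still emit a heartbeat so gaps in the log mark the lockup time.
        lines = ["(no sensor data)"]
    return "".join(f"{timestamp} {ln}\n" for ln in lines)


def read_sensors():
    """Run the lm-sensors CLI and return its text output (may be empty)."""
    try:
        result = subprocess.run(
            ["sensors"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            check=False,
        )
        return result.stdout
    except FileNotFoundError:
        return "sensors command not found\n"


def log_once(path=LOG_PATH):
    """Take one sample and force it to disk before returning."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    record = format_sample(stamp, read_sensors())
    with open(path, "a") as fh:
        fh.write(record)
        fh.flush()
        os.fsync(fh.fileno())  # push the sample to disk before any lockup
```

Calling `log_once()` from cron once a minute (or from a simple loop with `time.sleep`) would give a trace of readings leading up to the freeze.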


 
11-15-2008, 03:11 PM
"Rudi Ahlers"

how to debug hardware lockups?

On Sat, Nov 15, 2008 at 4:47 PM, Richard Karhuse <rkarhuse@gmail.com> wrote:
>
> On Sat, Nov 15, 2008 at 3:16 AM, Rudi Ahlers <rudiahlers@gmail.com> wrote:
>>
>> Hi,
>>
>> We have a server which locks up about once a week (for the past 3
>> weeks now), without any warning, and the only way to recover it is to
>> reset the server. This causes unwanted downtime, and often software
>> loss as well.
>>
>> How do I debug the server, which runs CentOS 5.2, to see why it locks
>> up? The CPU is an Intel Q9300 Core 2 Quad, with 8 GB RAM, on an Intel
>> motherboard.
>
> Attach a local console to the video port and let us know what it says;
> that will (probably) be very insightful, e.g., kernel panic, MCE, ....
>
> Next, run memtest86+ -- at least overnight. [Note: I've had less than
> stellar results with memtest86 recently, but if it shows errors, you've got
> a problem big time; if it doesn't show errors, you're still not 100% sure that
> the memory is good :-) :-).] Is it ECC memory? If not, why not -- particularly
> given it is a critical server ....
>
> Are all the fans spinning -- particularly the CPU fan? Do you have lm-sensors
> enabled? Either create a script or use something like munin to track
> things and see if fans, temperatures, and voltages are all stable & within
> range up to death.
>
> Can you easily swap power supplies? (Is the unit dual powered or just
> one unit?)
>
> Clearly, just a start, but you get the idea of elementary, 101 problem
> solving ....
>
> -rak-

Unfortunately, I can't leave a monitor attached to the server all the
time. The server is in a shared cabinet @ a 3rd party ISP, and they
lock the cabinets once we're done working with it. The last lockup was
about 6 days ago, and the previous one about 8 days ago. There's no
consistency.

How can I redirect all console output to a file instead?

I have got lm-sensors installed, but it doesn't pick up the
motherboard's sensors. All fans were working when I last checked,
but it's a 1U chassis, so it's got limited airflow. I don't know if
it gets too hot or not. When I rebooted it, the temp was about 45
degrees Celsius, but the lockup only happened about 6 days later. So,
I can't even sit there 24/7 to see what happens.


--

Kind Regards
Rudi Ahlers
 
11-15-2008, 04:26 PM
Vandaman

how to debug hardware lockups?

Rudi Ahlers wrote:

> We have a server which locks up about once a week (for the past 3
> weeks now), without any warning, and the only way to recover it is to
> reset the server. This causes unwanted downtime, and often software
> loss as well.
>
> How do I debug the server, which runs CentOS 5.2, to see why it locks
> up?

Are those the only logs you've got? Normally Linux is very chatty,
and you get WARNING, PANIC etc. messages. What kernel are you using?
Does a previous kernel or the CentOSPlus kernel stop the problem?

Regards,
Vandaman.




 
11-15-2008, 05:13 PM
"Rudi Ahlers"

how to debug hardware lockups?

On Sat, Nov 15, 2008 at 7:26 PM, Vandaman <vandaman2002-sk@yahoo.co.uk> wrote:
> Rudi Ahlers wrote:
>
>> We have a server which locks up about once a week (for the past 3
>> weeks now), without any warning, and the only way to recover it is to
>> reset the server. This causes unwanted downtime, and often software
>> loss as well.
>>
>> How do I debug the server, which runs CentOS 5.2, to see why it locks
>> up?
>
> Are those the only logs you've got? Normally Linux is very chatty,
> and you get WARNING, PANIC etc. messages. What kernel are you using?
> Does a previous kernel or the CentOSPlus kernel stop the problem?
>
> Regards,
> Vandaman.

Well, on a standard CentOS 5.2 install, /var/log/messages would be the
place where problems like this get logged. Where else can I get more info?

I've upgraded the kernel to xen.gz-2.6.18-92.1.18.el5 but can only
reboot the server tomorrow, during a planned maintenance window, and
then see what it does.

--

Kind Regards
Rudi Ahlers
 
11-15-2008, 05:17 PM
"nate"

how to debug hardware lockups?

Rudi Ahlers wrote:

> Unfortunately, I can't leave a monitor attached to the server all the
> time. The server is in a shared cabinet @ a 3rd party ISP, and they
> lock the cabinets once we're done working with it. The last lockup was
> about 6 days ago, and the previous one about 8 days ago. There's no
> consistency.
>
> How can I redirect all console output to a file instead?

Configure a serial console, connect the console to another
system and use something like minicom to log the console to a file.
You can't really log to the local system in this situation, as
you likely won't capture the event (if you did, you would have
seen the error in the system logs).

In my experience most of these kinds of problems are related
to bad ram.

If you're running CentOS 4.x, configure netdump to send the kernel
dumps to another server; if you're using CentOS 5.x, configure
kdump (which replaced diskdump there) to store the dump to local disk.

Run memtest86 on the system for a few days. Replace the system
with a known working one so you can take the broken system off
site from the ISP for diagnostics.

I like running Cerberus (http://sourceforge.net/projects/va-ctcs/)
as a burn-in tool; if the system can survive that running for
a couple of days, it should be good. In running it against a hundred or
so systems, I don't recall it taking longer than a few hours
to crash the system if there was a problem.

nate

 
11-15-2008, 06:59 PM
"Rudi Ahlers"

how to debug hardware lockups?

On Sat, Nov 15, 2008 at 8:17 PM, nate <centos@linuxpowered.net> wrote:
> Rudi Ahlers wrote:
>
>> Unfortunately, I can't leave a monitor attached to the server all the
>> time. The server is in a shared cabinet @ a 3rd party ISP, and they
>> lock the cabinets once we're done working with it. The last lockup was
>> about 6 days ago, and the previous one about 8 days ago. There's no
>> consistency.
>>
>> How can I redirect all console output to a file instead?
>
> Configure a serial console, connect the console to another
> system and use something like minicom to log the console to a file.
> You can't really log to the local system in this situation, as
> you likely won't capture the event (if you did, you would have
> seen the error in the system logs).
>
> In my experience most of these kinds of problems are related
> to bad ram.
>
> If you're running CentOS 4.x, configure netdump to send the kernel
> dumps to another server; if you're using CentOS 5.x, configure
> kdump (which replaced diskdump there) to store the dump to local disk.
>
> Run memtest86 on the system for a few days. Replace the system
> with a known working one so you can take the broken system off
> site from the ISP for diagnostics.
>
> I like running Cerberus (http://sourceforge.net/projects/va-ctcs/)
> as a burn-in tool; if the system can survive that running for
> a couple of days, it should be good. In running it against a hundred or
> so systems, I don't recall it taking longer than a few hours
> to crash the system if there was a problem.
>
> nate

That machine doesn't have a serial port (why do vendors think serial
ports are obsolete????), so is there any other way to send the logs to
a different machine then?
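
One possible approach without a serial port, sketched in Python: have the problem server forward its messages over UDP, and run a small receiver on a second machine that appends every datagram to a file. The stock syslogd can forward with a line such as `*.* @loghost` in /etc/syslog.conf; the kernel's netconsole module is another possible sender, though whether it works with this Xen kernel is an untested assumption. The port number, file name, and function names below are illustrative only. As with any network-based approach, a hard lockup may still stop the final messages from leaving the machine.

```python
"""Receive UDP log datagrams and append them to a file, one per line."""
import os
import socket
import time

LISTEN_PORT = 5514               # hypothetical; classic syslog uses UDP 514
LOG_PATH = "remote-console.log"  # hypothetical output file


def format_record(sender_ip, payload, timestamp):
    """Turn one datagram into a single log line tagged with time and sender."""
    text = payload.decode("utf-8", errors="replace").rstrip("\r\n\x00")
    # Keep one record per line even if the datagram has embedded newlines.
    text = text.replace("\n", "\\n")
    return f"{timestamp} {sender_ip} {text}\n"


def serve(port=LISTEN_PORT, path=LOG_PATH):
    """Block forever, logging every datagram that arrives on the port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", port))
    with open(path, "a") as log:
        while True:
            payload, (sender_ip, _sender_port) = sock.recvfrom(65535)
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write(format_record(sender_ip, payload, stamp))
            log.flush()
            os.fsync(log.fileno())  # make sure the line is on disk
```

Running `serve()` on the second machine and pointing the sender at that host and port would leave a timestamped record of whatever the server managed to send before it froze.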

--

Kind Regards
Rudi Ahlers
 
11-15-2008, 10:14 PM
John R Pierce

how to debug hardware lockups?

Rudi Ahlers wrote:

> Well, on a standard CentOS 5.2, /var/log/messages will be the
> place to log problems like this, or where else can I get more info?



Tough to write to the disk when the kernel is crashing. Ditto the
network. That leaves VGA and serial ports, which can be written to by
self-contained emergency crash routines...


IIRC, you said this was a Q9something quad core... that's a desktop
processor... does this server have ECC memory? (I ask because few
desktop platforms do, while ECC is fairly standard on servers.)
Without ECC, the system has no way of knowing it read in bad data from
the RAM, and if the bad data happens to be code and that code happens to
be in the kernel, ka-RASH, without any detection or warning, it leaps
off into never-land, and you get a kernel fault, almost always resulting
in...


kernel panic
system halted

with no additional useful information available. With ECC memory,
single-bit errors get corrected on the fly and log an ECC error event,
while double-bit errors result in a system halt with a message
indicating such.
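
To illustrate the single-bit versus double-bit distinction, here is a toy sketch in Python of a SECDED (single error correct, double error detect) code: an extended Hamming code over 4 data bits. Real ECC DIMMs typically protect 64 data bits with 8 check bits in hardware; the 4-bit width and the function names here are simplifications chosen for illustration, not a description of any particular memory controller.

```python
"""Toy SECDED demo: extended Hamming code over 4 data bits."""


def encode(data_bits):
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    d1, d2, d3, d4 = data_bits
    word = [0] * 8          # index 0: overall parity; 1..7: Hamming(7,4)
    word[3], word[5], word[6], word[7] = d1, d2, d3, d4
    word[1] = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    word[2] = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    word[4] = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    word[0] = sum(word[1:]) % 2  # makes the total number of 1s even
    return word


def decode(word):
    """Return (status, data); status is 'ok', 'corrected' or 'uncorrectable'."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid word
    overall_ok = sum(word) % 2 == 0
    if syndrome == 0 and overall_ok:
        status = "ok"
    elif not overall_ok:
        # Odd number of flips: treat it as one flip and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None
    return status, [word[3], word[5], word[6], word[7]]


codeword = encode([1, 0, 1, 1])

one_flip = list(codeword)
one_flip[5] ^= 1
print(decode(one_flip))   # ('corrected', [1, 0, 1, 1])

two_flips = list(codeword)
two_flips[5] ^= 1
two_flips[6] ^= 1
print(decode(two_flips))  # ('uncorrectable', None)
```

Without the check bits, the flipped word would simply be handed back as if it were good data, which is the non-ECC case described above.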




 
11-15-2008, 10:21 PM
"Rudi Ahlers"

how to debug hardware lockups?

On Sun, Nov 16, 2008 at 1:14 AM, John R Pierce <pierce@hogranch.com> wrote:
> Rudi Ahlers wrote:
>>
>> Well, on a standard CentOS 5.2, /var/log/messages will be the
>> place to log problems like this, or where else can I get more info?
>>
>
> Tough to write to the disk when the kernel is crashing. Ditto the network.
> That leaves VGA and serial ports, which can be written to by self-contained
> emergency crash routines...
>
> IIRC, you said this was a Q9something quad core... that's a desktop
> processor... does this server have ECC memory? (I ask because few desktop
> platforms do, while ECC is fairly standard on servers.) Without ECC, the
> system has no way of knowing it read in bad data from the RAM, and if the
> bad data happens to be code and that code happens to be in the kernel,
> ka-RASH, without any detection or warning, it leaps off into never-land, and
> you get a kernel fault, almost always resulting in...
>
> kernel panic
> system halted
>
> with no additional useful information available. With ECC memory, single-bit
> errors get corrected on the fly and log an ECC error event, while
> double-bit errors result in a system halt with a message indicating such.
>


No, the motherboard doesn't support ECC RAM. The motherboard is an
Intel DG35EC - http://www.intel.com/products/desktop/motherboards/DG35EC/DG35EC-overview.htm



--

Kind Regards
Rudi Ahlers
 
11-15-2008, 10:32 PM
John R Pierce

how to debug hardware lockups?

Rudi Ahlers wrote:

> No, the motherboard doesn't support ECC RAM. The motherboard is an
> Intel DG35EC - http://www.intel.com/products/desktop/motherboards/DG35EC/DG35EC-overview.htm



That's a midrange business desktop board. I use a DG33TL as my desktop; same
thing.


