07-01-2012, 11:51 AM
Hans-Juergen Mauser

Bug#678443: Hard lockups due to "lockup-detector" (NMIs) on multi-Pentium-3 SMP systems on all kernel builds since 2.6.38

Hello,

The system is now reaching an uptime that was hardly possible before I
set "nowatchdog":


netfinity5000:~# uptime
13:43:12 up 10 days, 20:43, 2 users, load average: 0,01, 0,08, 0,07

Once we reach 14 days or more, we will know that it really is the
watchdog/NMI "feature" that causes these SMP systems to lock up,
intermittently but quite reliably, after an uptime of 1 to 8 days.


To avoid any side effects while testing, I changed nothing on the
system except this kernel boot parameter after the last lockup 10 days
ago. No software updates, no other changes to the kernel (this means the
current kernel produced at least one "successful" lockup, as I had tried
various configurations and versions before the hint about the
NMI/watchdog issue gained my full attention).


After months of frustration, I have quite a detailed picture of this
misbehaviour, and nothing has ever made me feel as confident in
restored reliability as setting this boot parameter.


Here is my current interrupt state:

netfinity5000:~# cat /proc/interrupts
            CPU0       CPU1
  0:          49          0   IO-APIC-edge      timer
  1:           3          0   IO-APIC-edge      i8042
  6:           3          0   IO-APIC-edge      floppy
  7:           1          0   IO-APIC-edge      parport0
  8:           0          0   IO-APIC-edge      rtc0
  9:           0          0   IO-APIC-fasteoi   acpi
 12:           1          3   IO-APIC-edge      i8042
 14:          42         74   IO-APIC-edge      ata_generic
 15:           0          0   IO-APIC-edge      ata_generic
 16:          49         48   IO-APIC-fasteoi   aic7xxx, aic7xxx
 17:    19391683   19362804   IO-APIC-fasteoi   eth0
 18:      649647     660452   IO-APIC-fasteoi   megaraid, ohci_hcd:usb2
 19:     8761472    8704241   IO-APIC-fasteoi   eth1
 22:    11804557   11924853   IO-APIC-fasteoi   ehci_hcd:usb1, ohci_hcd:usb3, ohci_hcd:usb4, eth2, eth3
NMI:           1          1   Non-maskable interrupts
LOC:    62410645   76099188   Local timer interrupts
SPU:           0          0   Spurious interrupts
PMI:           0          0   Performance monitoring interrupts
IWI:           0          0   IRQ work interrupts
RTR:           2          0   APIC ICR read retries
RES:     1628056    1619691   Rescheduling interrupts
CAL:      293382     396292   Function call interrupts
TLB:      211292     194994   TLB shootdowns
TRM:           0          0   Thermal event interrupts
THR:           0          0   Threshold APIC interrupts
MCE:           0          0   Machine check exceptions
MCP:        3129       3129   Machine check polls
ERR:           0
MIS:           0
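
The NMI row is the interesting one here: it stands at 1 per CPU, which is at least consistent with the lockup detector no longer generating NMIs. As a minimal sketch (the helper name `per_cpu_counts` is hypothetical, and the sample text is an excerpt of the output above), the per-CPU counters of one row can be pulled out like this:

```python
def per_cpu_counts(interrupts_text, label):
    """Return the per-CPU counters for one row of /proc/interrupts-style text."""
    for line in interrupts_text.splitlines():
        name, sep, rest = line.strip().partition(":")
        if sep and name == label:
            counts = []
            for field in rest.split():
                if not field.isdigit():
                    break  # first non-numeric field starts the description
                counts.append(int(field))
            return counts
    return None  # no row with that label


# Excerpt copied from the output above.
SAMPLE = """\
NMI: 1 1 Non-maskable interrupts
LOC: 62410645 76099188 Local timer interrupts
PMI: 0 0 Performance monitoring interrupts
"""

print(per_cpu_counts(SAMPLE, "NMI"))
```

On a live system the same function could be fed the contents of /proc/interrupts instead of the sample string.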

Here are my boot parameters and the reboot date since which the system
has been running flawlessly:


Jun 20 17:01:49 netfinity5000 kernel: [ 0.000000] Kernel command
line: auto BOOT_IMAGE=Linux ro
root=UUID=338417b5-b8c8-47ed-97ee-2ebc9c8afee8
aic7xxx=no_reset,allow_memio pci=bios,use_crs,routeirq
libata.force=mwdma2 reboot=warm rootdelay=30 nowatchdog
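
To confirm that a parameter such as nowatchdog really reached the kernel, the command line can be split into its parameters. The following is only a sketch: `parse_cmdline` is a hypothetical helper, the string is the logged command line above with the mail's line wrapping removed, and the simple whitespace split assumes no quoted values containing spaces.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a dict; bare flags map to True."""
    params = {}
    for token in cmdline.split():
        # Split on the first "=" only, so "root=UUID=..." keeps its value whole.
        key, sep, value = token.partition("=")
        params[key] = value if sep else True
    return params


# Command line as logged above, joined back into one string.
CMDLINE = (
    "auto BOOT_IMAGE=Linux ro "
    "root=UUID=338417b5-b8c8-47ed-97ee-2ebc9c8afee8 "
    "aic7xxx=no_reset,allow_memio pci=bios,use_crs,routeirq "
    "libata.force=mwdma2 reboot=warm rootdelay=30 nowatchdog"
)

params = parse_cmdline(CMDLINE)
print("nowatchdog" in params)
print(params["root"])
```

On a running system the string would come from /proc/cmdline rather than from a log excerpt.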


Just for comparison: before this, reboots/lockups occurred on June 4th,
June 6th, June 7th, June 8th, June 11th, June 13th, June 15th and June 20th.
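
To put those dates in perspective, the gaps between consecutive lockups can be worked out directly. This is a small sketch that assumes all dates fall in June 2012, as the post date and the log timestamp suggest:

```python
from datetime import date

# Reboot/lockup dates listed above (year 2012 assumed from the post date).
lockups = [date(2012, 6, d) for d in (4, 6, 7, 8, 11, 13, 15, 20)]

# Days between each lockup and the next one.
gaps = [(later - earlier).days for earlier, later in zip(lockups, lockups[1:])]

print(gaps)       # [2, 1, 1, 3, 2, 2, 5]
print(max(gaps))  # 5
```

Within this series the longest interval was 5 days, so the uptime of more than 10 days reported above is already about twice the longest gap seen in June.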



If you need more information like a full kernel boot log or whatever,
just ask me.



Thanks and best regards,

Hans-Juergen





--
To UNSUBSCRIBE, email to debian-kernel-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
Archive: http://lists.debian.org/4FF039C6.6090209@gmx.net
 
07-11-2012, 07:49 PM
Hans-Juergen Mauser

Hello,


Ben Hutchings wrote:
> [...]
>
> I think it's fine and has nothing to do with the problem.
>
> Since you say it has taken 1-8 days for any problem to appear, I suppose
> you will have to wait a few weeks to have some confidence that
> 'nowatchdog' makes a difference.


Well, even if you think it has nothing to do with the problem, I am now
almost sure it does. Nothing is more convincing than the uptime:


netfinity5000:~# uptime
21:39:39 up 21 days, 4:39, 2 users, load average: 0,13, 0,10, 0,07

For comparison, see the last mail I added to this bug: before the
change, the longest continuous uptime was no more than about 8 days.


It would be great if someone took care of this bug. There may be other
people who are hit by this and unable to track it down.


Would you recommend that I report this on bugzilla.kernel.org?

Thanks and best regards,

Hans-Juergen





Archive: http://lists.debian.org/4FFDD8AF.1010103@gmx.net
 
07-14-2012, 04:50 AM
Ben Hutchings

On Wed, 2012-07-11 at 21:49 +0200, Hans-Juergen Mauser wrote:
> Hello,
>
>
> Ben Hutchings wrote:
> > [...]
> >
> > I think it's fine and has nothing to do with the problem.
> >
> > Since you say it has taken 1-8 days for any problem to appear, I suppose
> > you will have to wait a few weeks to have some confidence that
> > 'nowatchdog' makes a difference.
>
>
> well, even if you think it has nothing to do with the problem, now I am
> almost sure it has. Nothing is more evident than uptime:
>
> netfinity5000:~# uptime
> 21:39:39 up 21 days, 4:39, 2 users, load average: 0,13, 0,10, 0,07
>
> For comparison, see the last mail I added to this bug, the maximal
> continuous operation time was nothing more than about 8 days.

I agree, it does sound like you were right. Sorry for being so
sceptical.

> It would be great if anyone took care of this bug, maybe there are other
> people getting hit by this and not being able to track it down.
>
> Would you recommend me to report this on bugzilla.kernel.org ?

Either there or LKML (linux-kernel@vger.kernel.org). On Bugzilla I
think it would belong under Platform Specific/Hardware, i386.

Ben.

--
Ben Hutchings
The generation of random numbers is too important to be left to chance.
- Robert Coveyou
 
08-05-2012, 03:23 PM
Hans-Juergen Mauser

Hello,

thanks for your reply. Due to a lot of work "at work", I have not yet
managed to report the bug, but I will do so soon.


Today I want to add my current uptime and interrupt state one last
time, as I might have to power down the system in a few days for
maintenance (and I want to put an end to the compelled uptime watching
related to this bug anyway). Beyond the uninterrupted uptime, the
complete system and all running tasks have proven absolutely flawless
over this period. That is what I expect from a Linux operating system
as long as no very risky software is running, but it also confirms that
the hardware really has no problems and that my problems were related
only to the "lockup detector". Even the number of shared interrupts and
their dependence on the APIC system and on correct driver
implementations do not hurt. No kernel errors have been logged since
17 July, and even those were only link down/up messages due to a switch
reboot...



netfinity5000:~$ uptime
17:14:06 up 46 days, 14 min, 2 users, load average: 0,05, 0,06, 0,05


netfinity5000:~$ cat /proc/interrupts
            CPU0       CPU1
  0:          49          0   IO-APIC-edge      timer
  1:           3          0   IO-APIC-edge      i8042
  6:           3          0   IO-APIC-edge      floppy
  7:           1          0   IO-APIC-edge      parport0
  8:           0          0   IO-APIC-edge      rtc0
  9:           0          0   IO-APIC-fasteoi   acpi
 12:           1          3   IO-APIC-edge      i8042
 14:          42         74   IO-APIC-edge      ata_generic
 15:           0          0   IO-APIC-edge      ata_generic
 16:          49         48   IO-APIC-fasteoi   aic7xxx, aic7xxx
 17:   154500925  154495377   IO-APIC-fasteoi   eth0
 18:     2657528    2728937   IO-APIC-fasteoi   megaraid, ohci_hcd:usb2
 19:    69807511   69703638   IO-APIC-fasteoi   eth1
 22:    91578533   91635430   IO-APIC-fasteoi   ehci_hcd:usb1, ohci_hcd:usb3, ohci_hcd:usb4, eth2, eth3
NMI:           1          1   Non-maskable interrupts
LOC:   262393426  323398808   Local timer interrupts
SPU:           0          0   Spurious interrupts
PMI:           0          0   Performance monitoring interrupts
IWI:           0          0   IRQ work interrupts
RTR:           2          0   APIC ICR read retries
RES:     6791711    6755464   Rescheduling interrupts
CAL:     1231644    1607457   Function call interrupts
TLB:      859984     805603   TLB shootdowns
TRM:           0          0   Thermal event interrupts
THR:           0          0   Threshold APIC interrupts
MCE:           0          0   Machine check exceptions
MCP:       13251      13251   Machine check polls
ERR:           0
MIS:           0
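
Comparing this snapshot with the one posted on 07-01 shows how the counters moved over the roughly five weeks in between. A rough sketch, using only the NMI and LOC rows from the two posts (the `delta` helper is hypothetical):

```python
# NMI and local-timer rows from the 07-01 and 08-05 snapshots in this thread.
first = {"NMI": (1, 1), "LOC": (62410645, 76099188)}
second = {"NMI": (1, 1), "LOC": (262393426, 323398808)}


def delta(before, after):
    """Per-CPU increase of every counter present in both snapshots."""
    return {
        name: tuple(a - b for b, a in zip(before[name], after[name]))
        for name in before
        if name in after
    }


changes = delta(first, second)
print(changes["NMI"])  # (0, 0)
print(changes["LOC"])  # (199982781, 247299620)
```

The NMI counters did not change at all between the two posts, while the local timer interrupts kept accumulating as expected on a running system.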


netfinity5000:~$ free
             total       used       free     shared    buffers     cached
Mem:       2074804    1340228     734576          0     294672     805404
-/+ buffers/cache:     240152    1834652
Swap:      1943860          0    1943860
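
The "-/+ buffers/cache" row can be reproduced from the "Mem:" row, which makes it easier to see how little memory the applications themselves occupy. A short worked example with the values above (in KiB, the default unit of `free`):

```python
# Values in KiB, copied from the "Mem:" row of the `free` output above.
total, used, free, buffers, cached = 2074804, 1340228, 734576, 294672, 805404

# Memory held by applications: "used" minus reclaimable buffers and page cache.
used_by_apps = used - buffers - cached
# Memory available to applications if buffers and cache were reclaimed.
free_for_apps = free + buffers + cached

print(used_by_apps)   # 240152
print(free_for_apps)  # 1834652
```

Both results match the "-/+ buffers/cache" row, and swap is untouched, so there is no sign of memory pressure after 46 days.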


Best regards,

Hans-Juergen Mauser


Archive: http://lists.debian.org/501E8FF8.7050409@gmx.net
 

All times are GMT.