Bug#678443: Hard lockups due to "lockup-detector" (NMIs) on multi-Pentium-3 SMP systems on all kernel builds since 2.6.38
Hello,
the system is currently reaching an uptime that was hardly possible before setting "nowatchdog":

netfinity5000:~# uptime
 13:43:12 up 10 days, 20:43,  2 users,  load average: 0,01, 0,08, 0,07

Once we reach 14 days or more, we will know that it really is the watchdog/NMI "feature" causing these SMP systems to lock up intermittently, but quite deterministically, after an uptime of 1 to 8 days.

To avoid any side effects while testing, I have not changed anything on the system except this kernel boot parameter since the last lockup 10 days ago. No software updates, no additional changes to the kernel (this means the current kernel produced at least one "successful" lockup, as I had tried various configurations and versions before the hint about the NMI/watchdog issue gained my full attention).

After months of frustration, I have quite a detailed impression of this misbehaviour, and nothing has ever made me feel as confident in restored reliability as setting this boot parameter.

Here is my current interrupt state:

netfinity5000:~# cat /proc/interrupts
           CPU0       CPU1
  0:         49          0   IO-APIC-edge      timer
  1:          3          0   IO-APIC-edge      i8042
  6:          3          0   IO-APIC-edge      floppy
  7:          1          0   IO-APIC-edge      parport0
  8:          0          0   IO-APIC-edge      rtc0
  9:          0          0   IO-APIC-fasteoi   acpi
 12:          1          3   IO-APIC-edge      i8042
 14:         42         74   IO-APIC-edge      ata_generic
 15:          0          0   IO-APIC-edge      ata_generic
 16:         49         48   IO-APIC-fasteoi   aic7xxx, aic7xxx
 17:   19391683   19362804   IO-APIC-fasteoi   eth0
 18:     649647     660452   IO-APIC-fasteoi   megaraid, ohci_hcd:usb2
 19:    8761472    8704241   IO-APIC-fasteoi   eth1
 22:   11804557   11924853   IO-APIC-fasteoi   ehci_hcd:usb1, ohci_hcd:usb3, ohci_hcd:usb4, eth2, eth3
NMI:          1          1   Non-maskable interrupts
LOC:   62410645   76099188   Local timer interrupts
SPU:          0          0   Spurious interrupts
PMI:          0          0   Performance monitoring interrupts
IWI:          0          0   IRQ work interrupts
RTR:          2          0   APIC ICR read retries
RES:    1628056    1619691   Rescheduling interrupts
CAL:     293382     396292   Function call interrupts
TLB:     211292     194994   TLB shootdowns
TRM:          0          0   Thermal event interrupts
THR:          0          0   Threshold APIC interrupts
MCE:          0          0   Machine check exceptions
MCP:       3129       3129   Machine check polls
ERR:          0
MIS:          0

Here are my boot parameters and the reboot date since which the system has been running flawlessly:

Jun 20 17:01:49 netfinity5000 kernel: [ 0.000000] Kernel command line: auto BOOT_IMAGE=Linux ro root=UUID=338417b5-b8c8-47ed-97ee-2ebc9c8afee8 aic7xxx=no_reset,allow_memio pci=bios,use_crs,routeirq libata.force=mwdma2 reboot=warm rootdelay=30 nowatchdog

Just for comparison: before this, reboots/lockups occurred on June 4th, June 6th, June 7th, June 8th, June 11th, June 13th, June 15th and June 20th.

If you need more information, such as a full kernel boot log, just ask me.

Thanks and best regards,
Hans-Juergen

--
To UNSUBSCRIBE, email to debian-kernel-REQUEST@lists.debian.org with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
Archive: http://lists.debian.org/4FF039C6.6090209@gmx.net
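Editor's note: the test above depends on "nowatchdog" actually being in effect on the running kernel. The following is a minimal Python sketch of one way to double-check that, not something taken from the original mail. The helper names `has_boot_param` and `read_text_or_none` are hypothetical, and reading `/proc/sys/kernel/nmi_watchdog` assumes a kernel built with the lockup detector; on other kernels the file may simply be absent.

```python
from pathlib import Path


def has_boot_param(cmdline: str, param: str) -> bool:
    """Return True if `param` appears as a whole token on the kernel command line."""
    return param in cmdline.split()


def read_text_or_none(path: str):
    """Read a small procfs file; return None if it is missing or unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    cmdline = read_text_or_none("/proc/cmdline")
    if cmdline is not None:
        print("nowatchdog on command line:", has_boot_param(cmdline, "nowatchdog"))
    # Assumption: this file only exists when the lockup detector is compiled in.
    print("kernel.nmi_watchdog:", read_text_or_none("/proc/sys/kernel/nmi_watchdog"))
```

Matching whole tokens avoids a false positive on a longer parameter that merely contains the string.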
Bug#678443: Hard lockups due to "lockup-detector" (NMIs) on multi-Pentium-3 SMP systems on all kernel builds since 2.6.38
Hello,
Ben Hutchings wrote:
> [...]
>
> I think it's fine and has nothing to do with the problem.
>
> Since you say it has taken 1-8 days for any problem to appear, I suppose
> you will have to wait a few weeks to have some confidence that
> 'nowatchdog' makes a difference.

well, even if you think it has nothing to do with the problem, I am now almost sure it has. Nothing is more evident than uptime:

netfinity5000:~# uptime
 21:39:39 up 21 days,  4:39,  2 users,  load average: 0,13, 0,10, 0,07

For comparison, see the last mail I added to this bug: the longest continuous operation time before was no more than about 8 days.

It would be great if someone took care of this bug; maybe there are other people getting hit by this who are not able to track it down.

Would you recommend that I report this on bugzilla.kernel.org?

Thanks and best regards,
Hans-Juergen

--
Archive: http://lists.debian.org/4FFDD8AF.1010103@gmx.net
Bug#678443: Hard lockups due to "lockup-detector" (NMIs) on multi-Pentium-3 SMP systems on all kernel builds since 2.6.38
On Wed, 2012-07-11 at 21:49 +0200, Hans-Juergen Mauser wrote:
> Hello,
>
> Ben Hutchings wrote:
> > [...]
> >
> > I think it's fine and has nothing to do with the problem.
> >
> > Since you say it has taken 1-8 days for any problem to appear, I suppose
> > you will have to wait a few weeks to have some confidence that
> > 'nowatchdog' makes a difference.
>
> well, even if you think it has nothing to do with the problem, now I am
> almost sure it has. Nothing is more evident than uptime:
>
> netfinity5000:~# uptime
> 21:39:39 up 21 days, 4:39, 2 users, load average: 0,13, 0,10, 0,07
>
> For comparison, see the last mail I added to this bug, the maximal
> continuous operation time was nothing more than about 8 days.

I agree, it does sound like you were right. Sorry for being so sceptical.

> It would be great if anyone took care of this bug, maybe there are other
> people getting hit by this and not being able to track it down.
>
> Would you recommend me to report this on bugzilla.kernel.org ?

Either there or LKML (linux-kernel@vger.kernel.org). On Bugzilla I think it would belong under Platform Specific/Hardware, i386.

Ben.

--
Ben Hutchings
The generation of random numbers is too important to be left to chance.
 - Robert Coveyou
Bug#678443: Hard lockups due to "lockup-detector" (NMIs) on multi-Pentium-3 SMP systems on all kernel builds since 2.6.38
Hello,
thanks for your reply. Due to a lot of work "at work", I have not yet managed to report the bug, but I will do so soon.

Today I want to add my current uptime and interrupt state one last time, as I might have to power down the system in a few days for maintenance (and in any case I want to put an end to the compelled uptime watching related to this bug).

In addition to the flawless uptime, the complete system and all running tasks have proven to be absolutely flawless over this period (well, that's what I expect from a Linux operating system as long as no very risky software is running, but it also confirms that the hardware really has no problems and that my problems were only related to the "lockup detector"). Even the number of shared interrupts and their dependence on the APIC system and correct driver implementations don't hurt.

No kernel errors have been logged since 17 July, and those were link down/up messages due to a switch reboot...

netfinity5000:~$ uptime
 17:14:06 up 46 days, 14 min,  2 users,  load average: 0,05, 0,06, 0,05

netfinity5000:~$ cat /proc/interrupts
           CPU0       CPU1
  0:         49          0   IO-APIC-edge      timer
  1:          3          0   IO-APIC-edge      i8042
  6:          3          0   IO-APIC-edge      floppy
  7:          1          0   IO-APIC-edge      parport0
  8:          0          0   IO-APIC-edge      rtc0
  9:          0          0   IO-APIC-fasteoi   acpi
 12:          1          3   IO-APIC-edge      i8042
 14:         42         74   IO-APIC-edge      ata_generic
 15:          0          0   IO-APIC-edge      ata_generic
 16:         49         48   IO-APIC-fasteoi   aic7xxx, aic7xxx
 17:  154500925  154495377   IO-APIC-fasteoi   eth0
 18:    2657528    2728937   IO-APIC-fasteoi   megaraid, ohci_hcd:usb2
 19:   69807511   69703638   IO-APIC-fasteoi   eth1
 22:   91578533   91635430   IO-APIC-fasteoi   ehci_hcd:usb1, ohci_hcd:usb3, ohci_hcd:usb4, eth2, eth3
NMI:          1          1   Non-maskable interrupts
LOC:  262393426  323398808   Local timer interrupts
SPU:          0          0   Spurious interrupts
PMI:          0          0   Performance monitoring interrupts
IWI:          0          0   IRQ work interrupts
RTR:          2          0   APIC ICR read retries
RES:    6791711    6755464   Rescheduling interrupts
CAL:    1231644    1607457   Function call interrupts
TLB:     859984     805603   TLB shootdowns
TRM:          0          0   Thermal event interrupts
THR:          0          0   Threshold APIC interrupts
MCE:          0          0   Machine check exceptions
MCP:      13251      13251   Machine check polls
ERR:          0
MIS:          0

netfinity5000:~$ free
             total       used       free     shared    buffers     cached
Mem:       2074804    1340228     734576          0     294672     805404
-/+ buffers/cache:     240152    1834652
Swap:      1943860          0    1943860

Best regards,
Hans-Juergen Mauser

--
Archive: http://lists.debian.org/501E8FF8.7050409@gmx.net
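Editor's note: in both `/proc/interrupts` snapshots posted in this thread, the NMI row reads 1 on each CPU. Assuming an active NMI watchdog would keep incrementing that row, the unchanged count is consistent with the detector being off. The Python sketch below shows one way to compare two snapshots mechanically. The function names and the shortened two-row excerpts are illustrative, not from the original mails; the numbers are copied from the posted output.

```python
from __future__ import annotations


def per_cpu_counts(interrupts_text: str, label: str) -> list[int]:
    """Return the per-CPU counters of the row whose first field is `label:`."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == label + ":":
            counts = []
            for field in fields[1:]:
                if not field.isdigit():
                    break  # counters end where the description starts
                counts.append(int(field))
            return counts
    raise KeyError(label)


def per_cpu_delta(before: str, after: str, label: str) -> list[int]:
    """Per-CPU increase of one row between two /proc/interrupts snapshots."""
    old = per_cpu_counts(before, label)
    new = per_cpu_counts(after, label)
    return [b - a for a, b in zip(old, new)]


# Excerpts of the two snapshots posted in this thread (rows 17 and NMI only).
SNAPSHOT_10_DAYS = """\
           CPU0       CPU1
 17:   19391683   19362804   IO-APIC-fasteoi   eth0
NMI:          1          1   Non-maskable interrupts
"""
SNAPSHOT_46_DAYS = """\
           CPU0       CPU1
 17:  154500925  154495377   IO-APIC-fasteoi   eth0
NMI:          1          1   Non-maskable interrupts
"""

if __name__ == "__main__":
    print("NMI delta: ", per_cpu_delta(SNAPSHOT_10_DAYS, SNAPSHOT_46_DAYS, "NMI"))
    print("eth0 delta:", per_cpu_delta(SNAPSHOT_10_DAYS, SNAPSHOT_46_DAYS, "17"))
```

On a live system the two strings would instead be read from `/proc/interrupts` at different times.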