When we reach 14 days or more, we know that it's really the watchdog/NMI
"feature" causing these SMP systems to lock up intermittently but quite
deterministically after an uptime of 1 to 8 days.
To avoid any side-effects while testing, I did not change anything on
the system except this kernel boot parameter after the last lockup those
10 days ago. No software updates, no additional change to the kernel
(this means the current kernel produced at least one "successful" lockup
as I had tried various configurations and versions before the hint to
the NMI/watchdog issue gained my full attention).
After months of frustration, I have quite a detailed impression of this
misbehaviour, and nothing has ever made me feel as confident in restored
reliability as setting this boot parameter.
Here are my boot parameters and the reboot date since which the system
has been running flawlessly:
Jun 20 17:01:49 netfinity5000 kernel: [ 0.000000] Kernel command
line: auto BOOT_IMAGE=Linux ro
root=UUID=338417b5-b8c8-47ed-97ee-2ebc9c8afee8
aic7xxx=no_reset,allow_memio pci=bios,use_crs,routeirq
libata.force=mwdma2 reboot=warm rootdelay=30 nowatchdog
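To confirm that a parameter such as nowatchdog really made it onto the kernel command line, the logged line can be split into its individual parameters. The following is only a minimal sketch: the helper name parse_cmdline is hypothetical, and the naive whitespace split assumes no quoted values containing spaces (true for the line above).

```python
def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}, value None for bare flags."""
    params = {}
    for token in cmdline.split():
        # Split on the first '=' only, so root=UUID=... keeps its full value.
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


# Command line copied from the boot log above.
CMDLINE = (
    "auto BOOT_IMAGE=Linux ro "
    "root=UUID=338417b5-b8c8-47ed-97ee-2ebc9c8afee8 "
    "aic7xxx=no_reset,allow_memio pci=bios,use_crs,routeirq "
    "libata.force=mwdma2 reboot=warm rootdelay=30 nowatchdog"
)

params = parse_cmdline(CMDLINE)
print("nowatchdog" in params)   # is the flag present at all?
print(params["rootdelay"])      # value of a key=value parameter
```

On a running system the same function could be fed the contents of /proc/cmdline instead of the hard-coded string.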
Just for comparison: before this, reboots/lockups occurred on June 4th,
June 6th, June 7th, June 8th, June 11th, June 13th, June 15th and June 20th.
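The spacing of these lockups is easy to work out. A small sketch, assuming all dates fall in 2012 (the year of the boot log and of this thread):

```python
from datetime import date

# Lockup/reboot dates listed above; the year 2012 is an assumption.
lockups = [date(2012, 6, d) for d in (4, 6, 7, 8, 11, 13, 15, 20)]

# Days between each pair of consecutive lockups.
gaps = [(later - earlier).days for earlier, later in zip(lockups, lockups[1:])]

print(gaps)                   # [2, 1, 1, 3, 2, 2, 5]
print(max(gaps))              # longest gap within this list
print(sum(gaps) / len(gaps))  # mean gap in days
```

Within this June list alone, the longest stretch between two lockups is 5 days and the mean is a little over 2 days.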
If you need more information like a full kernel boot log or whatever,
just ask me.
Thanks and best regards,
Hans-Juergen
--
To UNSUBSCRIBE, email to debian-kernel-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
Archive: http://lists.debian.org/4FF039C6.6090209@gmx.net
07-11-2012, 07:49 PM
Hans-Juergen Mauser
Bug#678443: Hard lockups due to "lockup-detector" (NMIs) on multi-Pentium-3 SMP systems on all kernel builds since 2.6.38
Hello,
Ben Hutchings wrote:
> [...]
>
> I think it's fine and has nothing to do with the problem.
>
> Since you say it has taken 1-8 days for any problem to appear, I suppose
> you will have to wait a few weeks to have some confidence that
> 'nowatchdog' makes a difference.
well, even if you think it has nothing to do with the problem, now I am
almost sure it has. Nothing is more evident than uptime:
netfinity5000:~# uptime
 21:39:39 up 21 days, 4:39, 2 users, load average: 0,13, 0,10, 0,07
For comparison, see the last mail I added to this bug, the maximal
continuous operation time was nothing more than about 8 days.
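How much confidence a 21-day uptime gives can be estimated roughly. The sketch below assumes, purely for illustration, that lockups arrive at a constant random rate (an exponential waiting-time model) and that the rate can be taken from the eight June lockup dates reported earlier in this bug; neither assumption comes from the kernel or from any measurement beyond those dates.

```python
import math
from datetime import date

# First and last of the eight reported June lockups (year 2012 assumed).
first, last, n_lockups = date(2012, 6, 4), date(2012, 6, 20), 8

# Mean number of days between consecutive lockups, and the implied rate.
mean_gap = (last - first).days / (n_lockups - 1)
rate = 1 / mean_gap  # lockups per day

# Probability of seeing no lockup at all during 21 days of uptime,
# if the old lockup rate still applied (exponential model, an assumption).
uptime_days = 21
p_survive = math.exp(-rate * uptime_days)

print(f"mean gap: {mean_gap:.2f} days")
print(f"P(no lockup in {uptime_days} days): {p_survive:.1e}")
```

Under these assumptions the chance of reaching 21 days by luck alone is on the order of one in ten thousand, which fits the impression that the boot parameter made the difference.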
It would be great if someone took care of this bug; maybe other people
are being hit by it without being able to track it down.
Would you recommend that I report this on bugzilla.kernel.org?
Thanks and best regards,
Hans-Juergen
--
Archive: http://lists.debian.org/4FFDD8AF.1010103@gmx.net
07-14-2012, 04:50 AM
Ben Hutchings
Bug#678443: Hard lockups due to "lockup-detector" (NMIs) on multi-Pentium-3 SMP systems on all kernel builds since 2.6.38
On Wed, 2012-07-11 at 21:49 +0200, Hans-Juergen Mauser wrote:
> Hello,
>
>
> Ben Hutchings wrote:
> > [...]
> >
> > I think it's fine and has nothing to do with the problem.
> >
> > Since you say it has taken 1-8 days for any problem to appear, I suppose
> > you will have to wait a few weeks to have some confidence that
> > 'nowatchdog' makes a difference.
>
>
> well, even if you think it has nothing to do with the problem, now I am
> almost sure it has. Nothing is more evident than uptime:
>
> netfinity5000:~# uptime
> 21:39:39 up 21 days, 4:39, 2 users, load average: 0,13, 0,10, 0,07
>
> For comparison, see the last mail I added to this bug, the maximal
> continuous operation time was nothing more than about 8 days.
I agree, it does sound like you were right. Sorry for being so
sceptical.
> It would be great if anyone took care of this bug, maybe there are other
> people getting hit by this and not being able to track it down.
>
> Would you recommend me to report this on bugzilla.kernel.org ?
Either there or LKML (linux-kernel@vger.kernel.org). On Bugzilla I
think it would belong under Platform Specific/Hardware, i386.
Ben.
--
Ben Hutchings
The generation of random numbers is too important to be left to chance.
- Robert Coveyou
08-05-2012, 03:23 PM
Hans-Juergen Mauser
Bug#678443: Hard lockups due to "lockup-detector" (NMIs) on multi-Pentium-3 SMP systems on all kernel builds since 2.6.38
Hello,
thanks for your reply. Due to a lot of work "at work", I have not yet
managed to report the bug, but I will do so soon.
Today I want to add my current uptime and interrupt state one last
time, as I might have to power down the system in a few days for
maintenance (and in any case want to put an end to the compelled uptime
watching related to this bug). In addition to the flawless uptime, the
complete system and all running tasks have proven to be absolutely
flawless over this period (well, that's what I expect from a Linux
operating system as long as no very risky software is running, but it
also confirms that the hardware really has no problems and that my
problems were only related to the "lockup detector"). Even the number of
shared interrupts and their dependencies on the APIC system and correct
driver implementations don't hurt. No kernel errors have been logged
since 17 July, and even those were only link down/up messages due to a
switch reboot...
--
Archive: http://lists.debian.org/501E8FF8.7050409@gmx.net