06-25-2010, 09:43 PM
"Silacci, Lucas"

infinite loop in crash due to double-NMI on x86_64 system

Below is the output of running crash (with the patch) against one of
these dumps.

-Lucas


crash 5.0.5
Copyright (C) 2002-2010 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.

GNU gdb (GDB) 7.0
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...


please wait... (determining panic task)
WARNING: Loop detected in the NMI Exception Stack!

bt: cannot transition from exception stack to current process stack:
exception stack pointer: ffffffff8046dc50
process stack pointer: ffffffff8046ddd8
current stack base: ffffffff80422000

  SYSTEM MAP: /boot/System.map-2.6.16.53-0.8.PTF.434477.9.TDC.0-smp
DEBUG KERNEL: /boot/vmlinux-2.6.16.53-0.8.PTF.434477.9.TDC.0-smp (2.6.16.53-0.8.PTF.434477.9.TDC.0-smp)
    DUMPFILE: /var/crash/lucas.save/vmcore [PARTIAL DUMP]
        CPUS: 4
        DATE: Tue May 18 12:46:07 2010
      UPTIME: 07:24:54
LOAD AVERAGE: 85.74, 82.85, 82.29
       TASKS: 2449
    NODENAME: POLO5_1-9
     RELEASE: 2.6.16.53-0.8.PTF.434477.9.TDC.0-smp
     VERSION: #1 SMP Fri Aug 31 06:07:27 PDT 2007
     MACHINE: x86_64 (2660 Mhz)
      MEMORY: 7.9 GB
       PANIC: "Kernel panic - not syncing: dumpsw: Dump switch pushed; reason: 0x20 args=0xffffffff8046df08"
         PID: 0
     COMMAND: "swapper"
        TASK: ffffffff8038c340 (1 of 4) [THREAD_INFO: ffffffff80422000]
         CPU: 0
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 0 TASK: ffffffff8038c340 CPU: 0 COMMAND: "swapper"
 #0 [ffffffff8046dc50] machine_kexec at ffffffff8011a95b
 #1 [ffffffff8046dd20] crash_kexec at ffffffff80154351
 #2 [ffffffff8046dde0] panic at ffffffff801327fa
 #3 [ffffffff8046ded0] dumpsw_notify at ffffffff8831c0c3
 #4 [ffffffff8046dee0] notifier_call_chain at ffffffff8032481f
 #5 [ffffffff8046df00] default_do_nmi at ffffffff80322fab
 #6 [ffffffff8046df40] do_nmi at ffffffff80323365
 #7 [ffffffff8046df50] nmi at ffffffff8032268f
    [exception RIP: smp_send_stop+84]
    RIP: ffffffff80116e44 RSP: ffffffff8046ddd8 RFLAGS: 00000246
    RAX: 00000000000000ff RBX: ffffffff8831c1f8 RCX: 000041049c7256e8
    RDX: 0000000000000005 RSI: 000000005238a938 RDI: 00000000002896a0
    RBP: ffffffff8046df08 R8: 00000000000040fb R9: 000000005238a7e8
    R10: 0000000000000002 R11: 0000ffff0000ffff R12: 000000000000000c
    R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
 #8 [ffffffff8046ddd8] smp_send_stop at ffffffff80116e44
bt: WARNING: Loop detected in the NMI Exception Stack!
bt: cannot transition from exception stack to current process stack:
exception stack pointer: ffffffff8046dc50
process stack pointer: ffffffff8046ddd8
current stack base: ffffffff80422000
crash>


-----Original Message-----
From: crash-utility-bounces@redhat.com
[mailto:crash-utility-bounces@redhat.com] On Behalf Of Dave Anderson
Sent: Friday, June 25, 2010 12:32 PM
To: Discussion list for crash utility usage,maintenance and development
Subject: Re: [Crash-utility] infinite loop in crash due to double-NMI on
x86_64 system


----- "Lucas Silacci" <Lucas.Silacci@teradata.com> wrote:

> Hi,
>
> I've run into an issue where crash will enter an infinite loop while
> decoding exception stacks if those stacks get corrupted.
>
> We've seen this on four different systems where the hardware generated
> multiple NMIs and the second and subsequent NMIs caused the NMI
> exception stack to be overwritten. When this condition is hit, the
> bottom rsp on the NMI exception stack (which would normally point you
> back to the kernel thread stack or possibly a different exception stack)
> points you back into the middle of the same NMI exception stack. This
> causes crash to infinitely loop when it tries to decode that exception
> stack.
>
> Now clearly the root cause of the issue is faulty hardware that
> generated multiple NMIs. However a very small change in crash can detect
> this issue and stop the infinite loop from happening thereby allowing
> you to get to a point in crash where you can actually tell that it was
> an NMI that caused the system to dump.
>
> The patch is attached to this email. For x86_64 it will detect the
> condition of any exception stack that points back at itself.
>
> Please feel free to ask me any questions on this.

Wow, that's pretty interesting -- I've certainly never seen that before.
Can you show me what the backtrace looks like with your patch applied?

Thanks,
Dave

--
Crash-utility mailing list
Crash-utility@redhat.com
https://www.redhat.com/mailman/listinfo/crash-utility
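
For readers following the thread without the attachment, the check described in the original report (an exception stack whose saved rsp points back into the same exception stack) can be sketched roughly as below. This is not the actual patch, which is not reproduced in this archive. It is a small Python model: the function name, the dictionaries, and the 4 KiB bounds assumed for the NMI stack are all made up for illustration. Only the two stack pointer values are taken from the crash output above.

```python
def walk_exception_stacks(start_rsp, stacks, saved_rsp):
    """Follow saved RSP values from one stack to the next.

    stacks:    {name: (base, top)} address ranges, top exclusive (hypothetical layout)
    saved_rsp: {name: rsp saved at the bottom of that exception stack}
    Returns (list of exception stacks visited, outcome string).
    """
    def owner(rsp):
        # Which exception stack, if any, contains this address?
        for name, (base, top) in stacks.items():
            if base <= rsp < top:
                return name
        return None

    visited = []
    rsp = start_rsp
    while True:
        name = owner(rsp)
        if name is None:
            # We left the exception stacks: normal transition to the process stack.
            return visited, "process stack"
        visited.append(name)
        next_rsp = saved_rsp[name]
        if owner(next_rsp) == name:
            # The saved rsp points back into the stack it was read from.
            # Following it would decode the same stack forever, so stop here.
            return visited, f"loop detected in the {name} exception stack"
        rsp = next_rsp


# Assumed bounds for the NMI exception stack (illustration only).
NMI_STACK = {"NMI": (0xffffffff8046d000, 0xffffffff8046e000)}

# Values from the dump: decoding starts at ffffffff8046dc50 and the saved
# rsp ffffffff8046ddd8 lies inside the same (assumed) NMI stack range.
visited, outcome = walk_exception_stacks(
    0xffffffff8046dc50, NMI_STACK, {"NMI": 0xffffffff8046ddd8})
print(visited, outcome)
```

Without the self-reference test, the `while` loop in this model would never terminate for the values above, which matches the behaviour reported for unpatched crash.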

 
06-28-2010, 08:20 PM
"Silacci, Lucas"

infinite loop in crash due to double-NMI on x86_64 system

> -----Original Message-----
> From: crash-utility-bounces@redhat.com
> [mailto:crash-utility-bounces@redhat.com] On Behalf Of Dave Anderson
> Sent: Monday, June 28, 2010 12:11 PM
> To: Discussion list for crash utility usage,maintenance and
> development
> Subject: Re: [Crash-utility] infinite loop in crash due to
> double-NMI on x86_64 system
>
>
>
> ----- "Lucas Silacci" <Lucas.Silacci@teradata.com> wrote:
>
> > Below is the output of running crash (with the patch) against one of
> > these dumps.
> >
> > -Lucas
>
> What exactly was the sequence of events? Was the system repeatedly and
> erroneously running one NMI after another for some reason, and *then* the
> "dump switch" was pressed? And the dumpsw_notify() function sends another
> NMI? And where does that dumpsw_notify() function live anyway?
>
> I'm just trying to get a grip on whether this will ever happen again, or
> whether it's fixing a one-time hardware abnormality?
>
> Dave
>

As far as I am aware, we have had three separate customers encounter
this issue. It appears from the hardware SEL log that multiple PCI
SERRs came in at the same time and somehow triggered multiple NMIs. You
can see the SEL entries from the output of the "ipmitool sel" command:

0231 11FC 02 01:53:47 12/17/09 3300 04 13 EB 6F A5 15 08 Crit. Interrupt PCI SERR (PCI Bus 15 Device 1 Function 0) was asserted
0232 1210 02 01:53:47 12/17/09 3300 04 13 EB 6F A5 16 20 Crit. Interrupt PCI SERR (PCI Bus 16 Device 4 Function 0) was asserted
0233 1224 02 01:53:47 12/17/09 3300 04 13 EB 6F A5 16 21 Crit. Interrupt PCI SERR (PCI Bus 16 Device 4 Function 1) was asserted
0234 1238 02 01:53:47 12/17/09 3300 04 13 EB 6F A5 16 30 Crit. Interrupt PCI SERR (PCI Bus 16 Device 6 Function 0) was asserted
0235 124C 02 01:53:47 12/17/09 3300 04 13 EB 6F A5 16 31 Crit. Interrupt PCI SERR (PCI Bus 16 Device 6 Function 1) was asserted

My understanding of the architecture of the system is that only one NMI
should have been asserted to the OS regardless of the number of times
there was a hardware error, but clearly that wasn't the case in these
three instances.

Also, it seemed like my patch made crash a little bit more tolerant of
"corrupted" dump images which I thought could only be a good thing.

-Lucas
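
As a quick way to see the "came in at the same time" point, the SEL records quoted above can be grouped by timestamp. The sketch below is only an illustration in Python: the regular expression assumes each record sits on a single line in the layout shown in this post, which may not hold for other ipmitool versions or vendors, and the function name is made up.

```python
import re
from collections import Counter

# The five records from this post, one per line.
SEL_OUTPUT = """\
0231 11FC 02 01:53:47 12/17/09 3300 04 13 EB 6F A5 15 08 Crit. Interrupt PCI SERR (PCI Bus 15 Device 1 Function 0) was asserted
0232 1210 02 01:53:47 12/17/09 3300 04 13 EB 6F A5 16 20 Crit. Interrupt PCI SERR (PCI Bus 16 Device 4 Function 0) was asserted
0233 1224 02 01:53:47 12/17/09 3300 04 13 EB 6F A5 16 21 Crit. Interrupt PCI SERR (PCI Bus 16 Device 4 Function 1) was asserted
0234 1238 02 01:53:47 12/17/09 3300 04 13 EB 6F A5 16 30 Crit. Interrupt PCI SERR (PCI Bus 16 Device 6 Function 0) was asserted
0235 124C 02 01:53:47 12/17/09 3300 04 13 EB 6F A5 16 31 Crit. Interrupt PCI SERR (PCI Bus 16 Device 6 Function 1) was asserted
"""

# Assumed record layout: id, two fields, time, date, raw bytes, description.
ENTRY = re.compile(
    r"^(?P<id>[0-9A-F]{4}) \S+ \S+ "
    r"(?P<time>\d\d:\d\d:\d\d) (?P<date>\d\d/\d\d/\d\d) "
    r".*PCI SERR \(PCI Bus (?P<bus>\d+) Device (?P<dev>\d+) "
    r"Function (?P<fn>\d+)\) was asserted$"
)


def serr_by_timestamp(text):
    """Count PCI SERR records per (date, time) pair."""
    counts = Counter()
    for line in text.splitlines():
        m = ENTRY.match(line.strip())
        if m:
            counts[(m["date"], m["time"])] += 1
    return counts


for (date, time), n in serr_by_timestamp(SEL_OUTPUT).items():
    print(f"{date} {time}: {n} PCI SERR record(s)")
```

All five records share one timestamp, so the log alone cannot say in which order the errors were raised, only that they fall within the same second.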

 
06-28-2010, 08:45 PM
"Silacci, Lucas"

infinite loop in crash due to double-NMI on x86_64 system

> -----Original Message-----
> From: crash-utility-bounces@redhat.com
> [mailto:crash-utility-bounces@redhat.com] On Behalf Of Dave Anderson
> Sent: Monday, June 28, 2010 1:35 PM
> To: Discussion list for crash utility usage,maintenance and
> development
> Subject: Re: [Crash-utility] infinite loop in crash due to
> double-NMI on x86_64 system
>
>
> ----- "Lucas Silacci" <Lucas.Silacci@teradata.com> wrote:
>
> > Also, it seemed like my patch made crash a little bit more tolerant of
> > "corrupted" dump images which I thought could only be a good thing.
>
> Right, I understand that...
>
> But you didn't answer my questions re: the "dump switch" procedure and
> the dumpsw_notify() function. Was the system stuck in the NMI handler,
> somebody noticed the repetitive NMIs (?), and so they hit the "dump switch"?
> (whatever that may be...)
>
> Dave
>

Sorry, guess I wasn't clear. Nobody hit the dump switch on these
systems. They simply had multiple hardware errors that apparently
triggered the NMI more than once. That's what I was trying to show with
the SEL records, that the multiple NMIs were straight from hardware with
no human intervention.

The systems went through a panic (due to multiple NMIs), a reboot, and
then crash was run on the resulting dump. In fact crash was
automatically run via a startup script and there was no human
intervention until after it was noticed that crash was filling up the
root file system with a temporary file due to the infinite loop.

-Lucas


 
06-28-2010, 09:26 PM
"Silacci, Lucas"

infinite loop in crash due to double-NMI on x86_64 system

The dumpsw_notify function is part of a driver that was added to our
systems to trigger kernel panics when an NMI occurs. In the version of
the kernel we are using (SLES 10 SP1) this was necessary to cause an
actual panic to happen and a dump to be saved when an NMI occurred
(especially due to a dump switch being pressed, hence the name).

That driver registers a callback (dumpsw_notify) into the die_chain and
calls panic() if the die code is a DIE_NMI.

-Lucas
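
To make the call path in the backtrace (default_do_nmi, notifier_call_chain, dumpsw_notify, panic) easier to picture, here is a toy Python model of the mechanism described above. It is not the driver's source, which is not shown in this thread. The names reuse the kernel identifiers mentioned in the posts, but the signatures, the list-based chain, and the exception standing in for panic() are simplifications assumed for illustration.

```python
DIE_NMI = "DIE_NMI"    # stand-in for the kernel's die code
DIE_OOPS = "DIE_OOPS"  # any other die code, used for contrast
NOTIFY_DONE = 0


class KernelPanic(Exception):
    """Stands in for panic(), which never returns in a real kernel."""


die_chain = []  # toy replacement for the kernel's die notifier chain


def register_die_notifier(callback):
    """Add a callback to the (simulated) die chain."""
    die_chain.append(callback)


def dumpsw_notify(die_code, args):
    """Model of the callback described above: panic on an NMI, else ignore."""
    if die_code == DIE_NMI:
        raise KernelPanic("dumpsw: Dump switch pushed")
    return NOTIFY_DONE


def notifier_call_chain(die_code, args):
    """Invoke every registered callback in order, as the NMI path would."""
    for callback in die_chain:
        callback(die_code, args)
```

Usage, still within the model: after `register_die_notifier(dumpsw_notify)`, calling `notifier_call_chain(DIE_NMI, 0)` raises `KernelPanic`, while `notifier_call_chain(DIE_OOPS, 0)` returns normally. The point is that every NMI reaching the chain ends in a panic, whether or not a physical dump switch was involved.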

> -----Original Message-----
> From: crash-utility-bounces@redhat.com
> [mailto:crash-utility-bounces@redhat.com] On Behalf Of Dave Anderson
> Sent: Monday, June 28, 2010 2:15 PM
> To: Discussion list for crash utility usage,maintenance and
> development
> Subject: Re: [Crash-utility] infinite loop in crash due to
> double-NMI on x86_64 system
>
>
> ----- "Lucas Silacci" <Lucas.Silacci@teradata.com> wrote:
>
>
> > Sorry, guess I wasn't clear. Nobody hit the dump switch on these
> > systems. They simply had multiple hardware errors that apparently
> > triggered the NMI more than once. That's what I was trying to show with
> > the SEL records, that the multiple NMIs were straight from hardware with
> > no human intervention.
> >
> > The systems went through a panic (due to multiple NMIs),
>
> That's what I'm trying to figure out -- when and how was it decided that
> the machine should panic instead of continuing to handle the stream of NMIs?
>
> In other words, this "dumpsw_notify" function -- why was it called?
>
> > > PID: 0 TASK: ffffffff8038c340 CPU: 0 COMMAND: "swapper"
> > > #0 [ffffffff8046dc50] machine_kexec at ffffffff8011a95b
> > > #1 [ffffffff8046dd20] crash_kexec at ffffffff80154351
> > > #2 [ffffffff8046dde0] panic at ffffffff801327fa
> > > #3 [ffffffff8046ded0] dumpsw_notify at ffffffff8831c0c3
> > > #4 [ffffffff8046dee0] notifier_call_chain at ffffffff8032481f
> > > #5 [ffffffff8046df00] default_do_nmi at ffffffff80322fab
> > > #6 [ffffffff8046df40] do_nmi at ffffffff80323365
> > > #7 [ffffffff8046df50] nmi at ffffffff8032268f
> > > [exception RIP: smp_send_stop+84]
> > > RIP: ffffffff80116e44 RSP: ffffffff8046ddd8 RFLAGS: 00000246
> > > RAX: 00000000000000ff RBX: ffffffff8831c1f8 RCX: 000041049c7256e8
> > > RDX: 0000000000000005 RSI: 000000005238a938 RDI: 00000000002896a0
> > > RBP: ffffffff8046df08 R8: 00000000000040fb R9: 000000005238a7e8
> > > R10: 0000000000000002 R11: 0000ffff0000ffff R12: 000000000000000c
> > > R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > > ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> > > --- <NMI exception stack> ---
> > > #8 [ffffffff8046ddd8] smp_send_stop at ffffffff80116e44
>
> From what you're implying, there is no physical "dump switch".
> So I'm trying to figure out where that "dumpsw_notify()" function
> comes from? Whose module is that and what is its purpose?
>
> Dave
>
>
> > a reboot, and
> > then crash was run on the resulting dump. In fact crash was
> > automatically run via a startup script and there was no human
> > intervention until after it was noticed that crash was filling up the
> > root file system with a temporary file due to the infinite loop.
>
 
06-29-2010, 04:21 PM
"Silacci, Lucas"

infinite loop in crash due to double-NMI on x86_64 system

My only guess is that there is something in the transition between the regular kernel and the kdump kernel (somewhere in the kexec path) that re-opens the door for a queued up NMI to come in just before the kdump kernel takes over. I've been digging through that code, but so far haven't come up with anything that explains it yet.

-Lucas
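
The masking rule from the AMD manual quoted below, together with the guess above, can be pictured with a tiny state model. This is a Python sketch only, not a description of any real code path in kexec. The class and method names are invented, and the one-deep "pending" latch is an assumption of the model.

```python
class Cpu:
    """Toy model: NMIs are blocked from delivery until the next IRET."""

    def __init__(self):
        self.nmi_blocked = False   # set when an NMI is delivered
        self.nmi_pending = False   # assumed one-deep latch for a blocked NMI
        self.delivered = 0         # how many times the NMI handler was entered

    def _deliver(self):
        self.nmi_blocked = True
        self.delivered += 1

    def raise_nmi(self):
        """Hardware asserts an NMI. Returns True if the handler is entered now."""
        if self.nmi_blocked:
            self.nmi_pending = True  # held back, not lost
            return False
        self._deliver()
        return True

    def iret(self):
        """Any IRET re-enables NMI recognition, not only the handler's final one."""
        self.nmi_blocked = False
        if self.nmi_pending:
            self.nmi_pending = False
            self._deliver()


cpu = Cpu()
cpu.raise_nmi()   # first NMI enters the handler
cpu.raise_nmi()   # second NMI is held while the first is being handled
cpu.iret()        # an IRET anywhere on the path lets the held NMI in
print(cpu.delivered)
```

In this model the second NMI enters the handler as soon as any IRET executes, even if the first handler has not finished. If something on the panic/kexec path executed an IRET while still running on the NMI exception stack, the second entry would reuse and overwrite that stack, which would be consistent with the corruption seen in the dump. Whether that actually happened here is not established in this thread.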

> -----Original Message-----
> From: crash-utility-bounces@redhat.com
> [mailto:crash-utility-bounces@redhat.com] On Behalf Of Dave Anderson
> Sent: Tuesday, June 29, 2010 5:58 AM
> To: Discussion list for crash utility usage,maintenance and
> development
> Subject: Re: [Crash-utility] infinite loop in crash due to
> double-NMI on x86_64 system
>
>
> ----- "Petr Tesarik" <ptesarik@suse.cz> wrote:
>
> > Silacci, Lucas wrote on Mon 28. 06. 2010 at 17:26 -0400:
> > > The dumpsw_notify function is part of a driver that was added to our
> > > systems to trigger kernel panics when an NMI occurs. In the version of
> > > the kernel we are using (SLES 10 SP1) this was necessary to cause an
> > > actual panic to happen and a dump to be saved when an NMI occurred
> > > (especially due to a dump switch being pressed, hence the name).
> > >
> > > That driver registers a callback (dumpsw_notify) into the die_chain and
> > > calls panic() if the die code is a DIE_NMI.
> >
> > Hi,
> >
> > my opinion is that a NMI is ... well, a non-maskable interrupt. Which
> > means there is nothing the kernel could possibly do to prevent the NMI
> > handler itself from being interrupted by another NMI. Whatever the
> > reason for it.
>
> Really? According to the AMD x86_64 manual -- note the "Masking" section:
>
> 8.3.3 NMI-Non-Maskable-Interrupt Exception (Vector 2)
>
> An NMI exception occurs as a result of system logic signalling a
> non-maskable interrupt to the processor.
>
> Error Code Returned: None.
>
> Program Restart: NMI is an interrupt. The processor recognizes an NMI
> at an instruction boundary. The saved instruction pointer points to the
> instruction immediately following the boundary where the NMI was recognized.
>
> Masking: NMI cannot be masked. However, when an NMI is executed by the
> processor, recognition of subsequent NMIs are disabled until an IRET
> instruction is executed.
>
> And looking at the backtrace, I'm still having a hard time understanding how
> it was possible. What am I missing?
>
> Dave
>
> > Having the crash utility loop forever on such dumps is annoying, at the
> > very least. And I imagine, such hangs could cause quite some headache to
> > Louis Bouchard.
> >
> > Just my $0.02,
> > Petr Tesarik
>
>
> > PID: 0 TASK: ffffffff8038c340 CPU: 0 COMMAND: "swapper"
> > #0 [ffffffff8046dc50] machine_kexec at ffffffff8011a95b
> > #1 [ffffffff8046dd20] crash_kexec at ffffffff80154351
> > #2 [ffffffff8046dde0] panic at ffffffff801327fa
> > #3 [ffffffff8046ded0] dumpsw_notify at ffffffff8831c0c3
> > #4 [ffffffff8046dee0] notifier_call_chain at ffffffff8032481f
> > #5 [ffffffff8046df00] default_do_nmi at ffffffff80322fab
> > #6 [ffffffff8046df40] do_nmi at ffffffff80323365
> > #7 [ffffffff8046df50] nmi at ffffffff8032268f
> > [exception RIP: smp_send_stop+84]
> > RIP: ffffffff80116e44 RSP: ffffffff8046ddd8 RFLAGS: 00000246
> > RAX: 00000000000000ff RBX: ffffffff8831c1f8 RCX: 000041049c7256e8
> > RDX: 0000000000000005 RSI: 000000005238a938 RDI: 00000000002896a0
> > RBP: ffffffff8046df08 R8: 00000000000040fb R9: 000000005238a7e8
> > R10: 0000000000000002 R11: 0000ffff0000ffff R12: 000000000000000c
> > R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> > --- <NMI exception stack> ---
> > #8 [ffffffff8046ddd8] smp_send_stop at ffffffff80116e44
>
>
>