08-02-2010, 08:00 AM
Petr Tesarik

Question on online/present/possible CPUs

Hi all,

before making a larger cleanup, I want to ask here for your opinion. It
seems that there is quite a bit of confusion about the meaning of CPU
count printed out by the crash utility.

1. Number of CPUs

Some people think that crash should always output the number of CPUs in
the system (i.e. a quad-core server should always output 'CPUS: 4'),
while other people think that only online CPUs should be counted.

2. CPU numbering

For example, if there are 4 CPUs in the system, but some of them are
taken offline (e.g. CPU 1 and CPU 3), _and_ crash outputs the number of
online CPUs, it prints 'CPUS: 2'. It's not easy to find out
that valid CPU numbers are 0 and 2 in this case.

3. Examining offline CPU

Sometimes, it may be useful to examine the state of an offline CPU. Now,
I know that the saved state is most likely stale, but it can be useful
in some cases (e.g. a crash after dropping to kdb). The crash utility
currently refuses to select an offline CPU with 'set -c #'. Are there
any concerns about allowing it?

Regards,
Petr Tesarik


--
Crash-utility mailing list
Crash-utility@redhat.com
https://www.redhat.com/mailman/listinfo/crash-utility
 
08-02-2010, 10:45 PM

Question on online/present/possible CPUS

>
> Hi all,
>
> before making a larger cleanup, I want to ask here for your opinion. It
> seems that there is quite a bit of confusion about the meaning of CPU
> count printed out by the crash utility.
>
> 1. Number of CPUs
>
> Some people think that crash should always output the number of CPUs in
> the system (i.e. a quad-core server should always output 'CPUS: 4'),
> while other people think that only online CPUs should be counted.
>
> 2. CPU numbering
>
> For example, if there are 4 CPUs in the system, but some of them are
> taken offline (e.g. CPU 1 and CPU 3), _and_ crash output the number of
> online CPUs, it would print out 'CPUS: 2'. It's not easy to find out
> that valid CPU numbers are 0 and 2 in this case.

Hi Petr,

For all but ppc64, the number shown by the initial banner and the
"sys" command is essentially "the-highest-cpu-number-plus-one".
For ppc64 (as requested and implemented by the IBM/ppc64 maintainers),
it shows the number of online cpus. There are reasons for doing it
either way, but I'm on vacation now; you can search the list archives
for the various arguments for and against each approach. Check
changelog.html for when it was changed for ppc64, and then
cross-reference the revision date with the list archives.

> 3. Examining offline CPU
>
> Sometimes, it may be useful to examine the state of an offline CPU. Now,
> I know that the saved state is most likely stale, but it can be useful
> in some cases (e.g. a crash after dropping to kdb). The crash utility
> currently refuses to select an offline CPU with 'set -c #'. Are there
> any concerns about allowing it?

I tend to agree with you, but the only useful things available from an
offline cpu are the swapper task for that cpu and the runqueue for that
cpu, and both of those are readily accessible if you really need them.
I don't know anything about kdb status, so maybe there's something of
per-cpu interest there, but I don't see why it would be necessary to
"set" that cpu.

In any case, like I said before, I'm just temporarily online while
on vacation, and will be back to work on the 9th.

Thanks,
Dave



--
Crash-utility mailing list
Crash-utility@redhat.com
https://www.redhat.com/mailman/listinfo/crash-utility
 
08-10-2010, 06:55 PM
"Hagen, Jeffrey"

Question on online/present/possible CPUS

Hi Petr and Dave,

I have a couple of comments on Petr's email regarding CPU count.

When the dump is the result of an NMI (NMI switch pressed) due to a hung
system, one often needs to analyze the state and backtrace for all the
CPUs. Since the kernel halts all but CPU0, the crash utility cannot
see the other "offline" CPUs.

This behavior has changed for the x86 architecture somewhere between
2.6.16 (SLES10) and 2.6.32 (SLES11) due to the removal of the x8664_pda
structure. The function x86_64_init (in x86_64.c) now calls
x86_64_per_cpu_init, which doesn't count the offline CPUs when
calculating the number of CPUs. Previously, x86_64_cpu_pda_init (called
if x8664_pda exists) didn't check for online/offline status.

Regarding #3 in Petr's email: it appears that the set command won't
accept a value >= kt_cpus (the number of CPUs). It doesn't check whether
the CPU is offline or not.

Thanks,

Jeff Hagen



 
08-12-2010, 01:21 PM
Dave Anderson

Question on online/present/possible CPUS

----- "Jeffrey Hagen" <Jeffrey.Hagen@teradata.com> wrote:

> Hi Petr and Dave,
>
> I have a couple of comments on Petr's email regarding CPU count.
>
> When the dump is the result of an NMI (NMI switch pressed) due to a hung
> system, one often needs to analyze the state and backtrace for all the
> CPUs. Since the kernel halts all but CPU0, the crash utility cannot
> see the other "offline" CPUs.

I've never seen that behavior before. Probably because I've never seen
an x86_64 dumpfile that was created as a result of the NMI switch being
pressed? Anyway, are you saying that the NMI switch shutdown handler
takes the other cpus offline?

> This behavior has changed for the x86 architecture somewhere between
> 2.6.16 (SLES10) and 2.6.32 (SLES11) due to the removal of the x8664_pda
> structure.
> The function x86_64_init (in x86_64.c) now calls x86_64_per_cpu_init
> which doesn't count the offline CPUs when calculating the number of
> CPUs. Previously, x86_64_cpu_pda_init (called if x8664_pda exists),
> didn't check for online/offline status.

Again -- I've never seen this behaviour before.

In any case, I'll look at any patch suggestions you guys have in mind.

Thanks,
Dave


 
09-23-2010, 08:29 PM
"Hagen, Jeffrey"

Question on online/present/possible CPUS

Hi Dave,

Attached is our suggested patch for the issue with CPU count in
an NMI switch induced coredump. Basically the change uses the
cpu_present_mask instead of the cpu_online_mask in x86_64_per_cpu_init
and x86_64_get_smp_cpus.

In answer to your earlier question, "Are you saying that the NMI
switch shutdown handler takes the other cpus offline?" --- Yes!!

Thanks,

Jeff


 
09-23-2010, 08:55 PM
Dave Anderson

Question on online/present/possible CPUS

----- "Jeffrey Hagen" <Jeffrey.Hagen@teradata.com> wrote:

> Hi Dave,
>
> Attached is our suggested patch for the issue with CPU count in
> an NMI switch induced coredump. Basically the change uses the
> cpu_present_mask instead of the cpu_online_mask in x86_64_per_cpu_init
> and x86_64_get_smp_cpus.

I understand why you need to do it that way, but a change like this
makes me a little nervous because nobody's ever reported this
situation before, and I'm somewhat paranoid that it may lead to unexpected
behavior. Plus, there are old kernels that don't even have a cpu_present_map.

> In answer to your question below: "Are you saying that the NMI
> switch shutdown handler takes the other cpus offline?" --- Yes!!

Where exactly? Can you point me to the kernel code that does that?

Dave


 
09-23-2010, 10:52 PM
"Hagen, Jeffrey"

Question on online/present/possible CPUS

Paranoia is usually a good thing in this industry, and you know this code
far better than I do...

For the older kernels that don't have cpu_present_map, if they still
have the x8664_pda structure, the code my patch changes shouldn't get
executed. It's the deprecation of the x8664_pda structure (between
SLES10 and SLES11 in our case) that exposes this issue.

The setting of the other CPUs to offline (IPI REBOOT_VECTOR) is done in
native_smp_send_stop [arch/x86/kernel/smp.c], called by panic(). Note
that the SLES11 version of the 2.6.32 kernel allows calling
crash_kexec() after calling atomic_notifier_call_chain() in panic().

The flow during an oops or keyboard-induced crash does not use this same
code. In this case crash_kexec() is called by oops_end(), which is
called by die().

Jeff


 
09-24-2010, 01:15 PM
Dave Anderson

Question on online/present/possible CPUS

----- "Jeffrey Hagen" <Jeffrey.Hagen@teradata.com> wrote:

> Paranoia is usually a good thing in this industry and you know this code
> far better than I do...
>
> For the older kernels that don't have cpu_present_map, if they still
> have the x8664_pda structure, the code my patch changes shouldn't get
> executed. It's the deprecation of the x8664_pda structure (between
> SLES10 and SLES11 in our case) that exposes this issue.

True...

>
> The setting of the other CPUs to offline (IPI REBOOT_VECTOR) is done in
> native_smp_send_stop [arch/x86/kernel/smp.c] called by panic(). Note
> that the SLES11 version of the 2.6.32 kernel allows calling
> crash_kexec() after calling atomic_notifier_call_chain() in panic().

Ah-ha! That makes sense -- I was under the impression that all of the
other distros would follow upstream with crash_kexec() being called
before, and therefore preventing, the subsequent smp_send_stop() call.

So given that this would happen whenever panic() gets called directly
in a SLES kernel, is the SLES version of the crash utility patched to do
something similar to your patch?

Petr?

> The flow during an oops or keyboard-induced crash does not use this same
> code. In this case crash_kexec() is called by oops_end() which is
> called by die().

OK, I'm going to give your patch a run-through with ~150 or so x86_64
dumpfiles I've kept as examples over the years, and see if anything
interesting happens.

Thanks Jeff,
Dave

 
09-24-2010, 03:40 PM
Dave Anderson

Question on online/present/possible CPUS

Hi Jeff,

Nothing interesting happened -- so unless I hear anything to the contrary
from the SUSE maintainers, I'm queueing your patch for the next release.

Thanks,
Dave
