Old 04-02-2008, 05:00 PM
Dave Anderson
 
crash aborts with cannot determine idle task

Chandru wrote:



Look at the crash function get_idle_threads() in task.c, which is where
you're failing. It runs through the history of the symbols that Linux
has used over the years for the run queues. For the most recent kernels,
it looks for the "per_cpu__runqueues" symbol. At least on 2.6.25-rc2,
the kernel still defines them in kernel/sched.c like this:

static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);

So if you do an "nm -Bn vmlinux | grep runqueues", you should see:

# nm -Bn vmlinux-2.6.25-rc1-ext4-1 | grep runqueues
ffffffff8082b700 d per_cpu__runqueues
#

I'm guessing that's not the problem -- so presuming that the symbol
*does* exist, find out why it's failing to increment "cnt" in this part of
get_idle_threads():

        if (symbol_exists("per_cpu__runqueues") &&
            VALID_MEMBER(runqueue_idle)) {
                runqbuf = GETBUF(SIZE(runqueue));
                for (i = 0; i < nr_cpus; i++) {
                        if ((kt->flags & SMP) && (kt->flags & PER_CPU_OFF)) {
                                runq = symbol_value("per_cpu__runqueues") +
                                        kt->__per_cpu_offset[i];
                        } else
                                runq = symbol_value("per_cpu__runqueues");

                        readmem(runq, KVADDR, runqbuf,
                                SIZE(runqueue), "runqueues entry (per_cpu)",
                                FAULT_ON_ERROR);
                        tasklist[i] = ULONG(runqbuf +
                                OFFSET(runqueue_idle));
                        if (IS_KVADDR(tasklist[i]))
                                cnt++;
                }
        }

Determine whether it even makes it to the inner for loop, whether
the pre-determined nr_cpus value makes sense, whether the SMP flag
reflects whether the kernel was compiled for SMP, whether the PER_CPU_OFF
flag was set, what address was calculated, etc...

Dave

Thanks for the reply, Dave. The code makes it to the inner for loop, and
the condition if (IS_KVADDR(tasklist[i])) fails, which is why 'cnt' doesn't
get incremented. tasklist[i] holds the value 0x3d60657870722024.

I ran gdb on the vmcore file and printed the memory contents.

(gdb) print per_cpu__runqueues
$1 = {lock = {raw_lock = {slock = 1431524419}},
  nr_running = 5283422954284598606,
  raw_weighted_load = 5064663116585906736,
  cpu_load = {2316051155752670036, 5929356451801411872, 2613857225664584019},
  nr_switches = 5644502509443686462,
  nr_uninterruptible = 2316072106569976142,
  expired_timestamp = 5142904381182533935,
  timestamp_last_tick = 7235439831918129227,
  curr = 0x5f66696c650a5243,
  idle = 0x3d60657870722024,    <<<-----
  prev_mm = 0x5243202b20243f60,
  active = 0xa247b4155535443,
  expired = 0x5352434449527d2f,



Does this mean that the kernel data was corrupted when the vmcore was
collected?


I don't know.

You cannot expect gdb to be able to handle it at all, unless
the kernel was configured without CONFIG_SMP. In that case,
the per_cpu__runqueues symbol points to the singular instance
of an rq.

However, more likely your kernel is configured with CONFIG_SMP.
In that case, a per-cpu offset has to be applied to the symbol
value of per_cpu__runqueues to calculate where each cpu's instance
of its rq structure is located. I can guarantee you that gdb
cannot do that, and that's probably why you're seeing "garbage"
data above.
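
One way to see that the values are not kernel pointers is to look at the
bytes themselves. The sketch below is only an illustration: the helper name
u64_to_ascii is hypothetical and is not part of crash or gdb, and it assumes
big-endian byte order, as on the ppc64 machine identified later in the
thread. Applied to the idle value 0x3d60657870722024 it produces the text
"=`expr $", which looks like text data rather than a task address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a crash or gdb function): render a 64-bit
 * value as the 8 bytes it would occupy in big-endian memory, replacing
 * non-printable bytes with '.'.  "out" must hold at least 9 bytes.
 */
static void u64_to_ascii(uint64_t value, char *out, size_t outlen)
{
    assert(out != NULL && outlen >= 9);
    for (int i = 0; i < 8; i++) {
        /* Most significant byte first, i.e. big-endian memory order. */
        unsigned char c = (unsigned char)(value >> (56 - 8 * i));
        out[i] = (c >= 0x20 && c < 0x7f) ? (char)c : '.';
    }
    out[8] = '\0';
}
```

The same helper applied to the curr value 0x5f66696c650a5243 gives
"_file.RC", where the '.' stands for a newline byte.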

So you can see that's what's happening in the get_idle_threads()
function where it's calculating the "runq" address each time
through the loop. If the kernel is configured CONFIG_SMP,
it adds the per-cpu offset value, otherwise it uses the
symbol value of "per_cpu__runqueues" as is.
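
The address calculation described above can be modelled outside of crash.
This is a minimal sketch under stated assumptions, not crash's code:
toy_runq_addr, TOY_SMP and TOY_PER_CPU_OFF are hypothetical stand-ins for
the logic and flags in get_idle_threads(), and the offsets are passed in as
a plain array instead of being read from kt->__per_cpu_offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SMP         UINT64_C(0x1)  /* stand-in for crash's SMP flag */
#define TOY_PER_CPU_OFF UINT64_C(0x2)  /* stand-in for crash's PER_CPU_OFF flag */

/*
 * Hypothetical model of the "runq" calculation: when both flags are set,
 * the per-cpu offset for "cpu" is added to the symbol value; otherwise
 * the symbol value is used unchanged.
 */
static uint64_t toy_runq_addr(uint64_t symval, uint64_t flags,
                              const uint64_t *per_cpu_offset,
                              size_t ncpus, size_t cpu)
{
    assert(cpu < ncpus);
    if ((flags & TOY_SMP) && (flags & TOY_PER_CPU_OFF)) {
        assert(per_cpu_offset != NULL);
        return symval + per_cpu_offset[cpu];
    }
    return symval;
}
```

Note that an offset of 0 makes the SMP case return the bare symbol value,
which is the same address gdb uses.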

As I suggested before, you're going to have to determine why
the tasklist[i] is bogus. The first things to determine are:

(1) what "nr_cpus" was calculated to be, and
(2) whether the SMP and PER_CPU_OFF flags are set in kt->flags.

If those variables/settings make sense, then presumably the
problem is in the determination of the per-cpu offset values.
That's done in a machine-specific way, so I can't help you
without knowing what architecture you're dealing with, not
to mention what kernel version, or whether it's configured
CONFIG_SMP or not, and whether you can run crash on the live
system that generated the dumpfile.

Dave


--
Crash-utility mailing list
Crash-utility@redhat.com
https://www.redhat.com/mailman/listinfo/crash-utility
 
Old 04-05-2008, 04:20 PM
Chandru
 
The machine is a ppc64 box with a RHEL5.1-based SMP kernel. nr_cpus is
equal to 2 in get_idle_threads(), but the system actually has 14 cpus,
and 12 of them were offline when the vmcore was collected.
kt->__per_cpu_offset[12] and kt->__per_cpu_offset[13] hold per-cpu offset
values, whereas kt->__per_cpu_offset[0] through [11] are 0. I changed
kt->__per_cpu_offset[i] in ppc64_paca_init() to kt->__per_cpu_offset[cpus],
and with that change crash started. But the backtrace command 'bt' exited
with a segmentation fault. Looking further, the code in
get_netdump_regs_ppc64()

        if (nd->num_prstatus_notes > 1)
        {
                note = (Elf64_Nhdr *)
                        nd->nt_prstatus_percpu[bt->tc->processor];
        }

had bt->tc->processor as 12. I changed it to 0 and that gave the
backtrace.
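
To illustrate the failure mode: with only two notes stored, an index of 12
does not select a stored note. The following is a hedged sketch, not the
crash source. struct toy_notes and toy_prstatus_for are hypothetical names,
and the sketch assumes the notes fill slots 0 .. num_prstatus_notes - 1,
which is how the snippet above appears to use them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the netdump note bookkeeping. */
struct toy_notes {
    size_t num_prstatus_notes;   /* how many NT_PRSTATUS notes were found */
    void **nt_prstatus_percpu;   /* one slot per stored note */
};

/*
 * Return the note stored for "processor", or NULL when the index does
 * not correspond to a stored note, so the caller can fall back instead
 * of dereferencing an invalid pointer.
 */
static void *toy_prstatus_for(const struct toy_notes *nd, size_t processor)
{
    assert(nd != NULL);
    if (nd->nt_prstatus_percpu == NULL ||
        processor >= nd->num_prstatus_notes)
        return NULL;
    return nd->nt_prstatus_percpu[processor];
}
```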


Regards,
Chandru

 
Old 04-05-2008, 08:06 PM
Dave Anderson
 
OK, it sounds like the kt->cpus value should have been set to 14 by
ppc64_paca_init().

And it appears that when kdump created the vmcore, it only installed two
NT_PRSTATUS sections. That being the case, the consumer of the ELF header
has to figure out which cpu each NT_PRSTATUS section belongs to. I'm not
sure how that can be determined.


And it seems that there may be other oddities that may crop up when running
other commands. Or maybe not...

I'm going to ultimately defer this back to IBM for resolution.
haren@us.ibm.com wrote the ppc64.c file, and he is on this mailing list.
But would it be possible for you to make the vmlinux/vmcore pair
available to me? (If so, you can send me the particulars off-line.)

Thanks,
Dave




 
Old 04-08-2008, 05:24 PM
Dave Anderson
 
Chandru,

I can reproduce the "idle-task" initialization-time failure on a
4-cpu ppc64 by offlining cpus 0 and 1. That can be fixed fairly trivially
in ppc64_paca_init() by checking the cpu_present_map instead of the
cpu_online_map. So this hack should get you to a prompt:

# diff ppc64.c.orig ppc64.c
2400c2400
< readmem(symbol_value("cpu_online_map"), KVADDR, &cpu_online_map[0],
---
> readmem(symbol_value("cpu_present_map"), KVADDR, &cpu_online_map[0],
#
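
Why the choice of map matters can be shown with a toy model. This is only a
sketch: it assumes cpus are counted from the bits set in a map, which may
not match what ppc64_paca_init() really does, and toy_mask_range and
toy_count_cpus are hypothetical helpers operating on a plain 64-bit mask,
not on the kernel's cpumask type.

```c
#include <assert.h>
#include <stdint.h>

/* Build a mask with bits first..last set (hypothetical helper). */
static uint64_t toy_mask_range(unsigned int first, unsigned int last)
{
    assert(first <= last && last < 64);
    uint64_t mask = 0;
    for (unsigned int cpu = first; cpu <= last; cpu++)
        mask |= UINT64_C(1) << cpu;
    return mask;
}

/* Count the bits set in a cpu mask. */
static unsigned int toy_count_cpus(uint64_t mask)
{
    unsigned int n = 0;
    while (mask) {
        mask &= mask - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}
```

For the system described in this thread, a map with only cpus 12 and 13
set counts 2 cpus, while a map covering cpus 0 through 13 counts 14.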

With respect to the "bt" failure, that will take a bit of tinkering.
When kdump collects NT_PRSTATUS segments, it only does it for
online cpus. So in your case, there would be 2 NT_PRSTATUS notes,
one each for cpu 12 and cpu 13. That being the case, get_netdump_regs_ppc64()
would have to be modified to pick the proper one, given that the
processor number will be 12 or 13, and that would have to be mapped
to the associated "online index" of NT_PRSTATUS notes. That could
get ugly. There is an elf_prstatus.ppid field that could be matched
against the incoming "bt" task pid, although there could be multiple
pid 0's, so that probably is not the best answer. I'm not sure
what is the best way to go here.
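
The "online index" mapping mentioned above could look something like the
sketch below. It rests on assumptions that the thread does not confirm:
that the NT_PRSTATUS notes are stored in ascending cpu order, and that the
online cpus are available as a plain 64-bit mask. toy_online_index is a
hypothetical name, not a crash function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map a cpu number to its position among the online cpus, i.e. the
 * number of online cpus with a lower number.  Returns -1 if the cpu
 * itself is not online and therefore has no note.
 */
static int toy_online_index(uint64_t online_mask, unsigned int cpu)
{
    assert(cpu < 64);
    if (!(online_mask & (UINT64_C(1) << cpu)))
        return -1;

    int index = 0;
    for (unsigned int i = 0; i < cpu; i++)
        if (online_mask & (UINT64_C(1) << i))
            index++;
    return index;
}
```

With only cpus 12 and 13 online, cpu 12 maps to note index 0 and cpu 13 to
note index 1, which would match the observation that changing the index to
0 produced a backtrace.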

So again, I prefer not to tinker with the ppc64-specific code base
in crash, and have always deferred it back to the author (haren).
If he is not available, can you find out who in IBM is the proper
person to run this by?

Thanks,
Dave

