Bruce Korb 02-14-2012 05:22 PM

bt: cannot determine starting stack pointer
 
Hi,

I need the stack traces of the tasks that are on-proc as well as the
tasks that are not. "bt" fails for the on-proc tasks, even though there
is a backup mechanism for finding the stack: the "stack" field of the
task structure. Even if it is a bit out-of-date, it is better than an
"I dunno" message. Perhaps augment the stack trace with a "this
might be slightly out-of-date because the task was running when
the kernel crashed" message.

Example:

crash> foreach bt
[...]
PID: 20311 TASK: ffff8803ff654140 CPU: 9 COMMAND: "xtnhc"
bt: cannot determine starting stack pointer
[...]
crash> ps | egrep '^>'
> 0 0 4 ffff880205f6b0c0 RU 0.0 0 0 [swapper]
> 0 0 5 ffff880205f77870 RU 0.0 0 0 [swapper]
> 0 0 7 ffff880205d557f0 RU 0.0 0 0 [swapper]
> 0 0 10 ffff880205d5c080 RU 0.0 0 0 [swapper]
> 2982 2 11 ffff8801fd3b07f0 RU 0.0 0 0 [ldlm_cb_00]
> 2983 2 8 ffff880205548080 RU 0.0 0 0 [ldlm_cb_01]
> 20250 20245 1 ffff880202deb0c0 RU 0.0 82388 2372 fcntl17
> 20251 20245 2 ffff88020537b7b0 RU 0.0 82388 2396 fcntl17
> 20252 20245 3 ffff8801fd3b4770 RU 0.0 82388 2376 fcntl17
> 20264 20249 0 ffff8801fd444830 RU 0.0 0 0 fcntl17
> 20290 1 6 ffff8803fe86f7b0 RU 0.0 14044 516 xtnhc
> 20311 20305 9 ffff8803ff654140 RU 0.0 14044 516 xtnhc
crash> set ffff8803ff654140
PID: 20311
COMMAND: "xtnhc"
TASK: ffff8803ff654140 [THREAD_INFO: ffff8803fd85a000]
CPU: 9
STATE: TASK_RUNNING (ACTIVE)
crash> p task->stack
p: gdb request failed: p task->stack
crash> task
PID: 20311 TASK: ffff8803ff654140 CPU: 9 COMMAND: "xtnhc"
struct task_struct {
state = 0,
stack = 0xffff8803fd85a000,
[...]
crash> bt -S 0xffff8803fd85a000
PID: 20311 TASK: ffff8803ff654140 CPU: 9 COMMAND: "xtnhc"
#0 [ffff8803fd85a000] schedule at ffffffff81297bc5
#1 [ffff8803fd85b830] ldlm_resource_get at ffffffffa0269380 [ptlrpc]
#2 [ffff8803fd85b900] ldlm_lock_match at ffffffffa0267359 [ptlrpc]
#3 [ffff8803fd85ba10] mdc_revalidate_lock at ffffffffa0423a8e [mdc]
#4 [ffff8803fd85bac0] mdc_intent_lock at ffffffffa042723f [mdc]
#5 [ffff8803fd85bbc0] __ll_inode_revalidate_it at ffffffffa04a79c2 [lustre]
#6 [ffff8803fd85bcf0] ll_inode_permission at ffffffffa04a8266 [lustre]
#7 [ffff8803fd85bd90] inode_permission at ffffffff810f0a09
#8 [ffff8803fd85bda0] may_open at ffffffff810f14d7
#9 [ffff8803fd85bdd0] do_filp_open at ffffffff810f5294
#10 [ffff8803fd85bf20] do_sys_open at ffffffff810e5850
#11 [ffff8803fd85bf70] sys_open at ffffffff810e596b
#12 [ffff8803fd85bf80] system_call_fastpath at ffffffff81002eab
RIP: 00007ffff78f2f80 RSP: 00007fffffffd818 RFLAGS: 00010202
RAX: 0000000000000002 RBX: ffffffff81002eab RCX: 00000000006130f0
RDX: 00000000000001b6 RSI: 0000000000000000 RDI: 000000000060f960
RBP: 0000000000000008 R8: 0000000000000008 R9: 0000000000000001
R10: 000000000040a261 R11: 0000000000000246 R12: ffffffff810e596b
R13: ffff8803fd85bf78 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b
crash>
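
The "ps | egrep '^>'" selection above can also be scripted when the list of on-proc tasks has to feed later commands. What follows is only a minimal Python sketch, assuming the column order shown in the "ps" output above (marker, PID, PPID, CPU, TASK, ...). The function name active_tasks is hypothetical, and the last, unmarked sample line is made up to show what gets skipped.

```python
def active_tasks(ps_output):
    """Return (pid, cpu, task_address) for each line that ps marks with '>'."""
    tasks = []
    for line in ps_output.splitlines():
        line = line.strip()
        if not line.startswith(">"):
            continue  # not an active (on-proc) task
        # Columns after the marker: PID PPID CPU TASK ST %MEM VSZ RSS COMM
        fields = line[1:].split()
        if len(fields) < 4:
            continue  # too short to hold a task address
        tasks.append((int(fields[0]), int(fields[2]), fields[3]))
    return tasks


# The first three lines are copied from the session above; the last one is
# a made-up sleeping task, included only to show that it is ignored.
SAMPLE_PS = """\
> 0 0 4 ffff880205f6b0c0 RU 0.0 0 0 [swapper]
> 20290 1 6 ffff8803fe86f7b0 RU 0.0 14044 516 xtnhc
> 20311 20305 9 ffff8803ff654140 RU 0.0 14044 516 xtnhc
  20312 1 3 ffff8803ff000000 IN 0.0 14044 516 sleeper
"""

for pid, cpu, task in active_tasks(SAMPLE_PS):
    print(pid, cpu, task)
```

Each returned task address can then be handed to "set" and "task" exactly as in the manual session above.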

--
Crash-utility mailing list
Crash-utility@redhat.com
https://www.redhat.com/mailman/listinfo/crash-utility

Dave Anderson 02-14-2012 06:07 PM

bt: cannot determine starting stack pointer
 

You could also try "bt -t" or "bt -T".

But what kind of dumpfile was this anyway? I'm wondering why you aren't
getting any stack traces at all for the active tasks?

Dave


Bruce Korb 02-14-2012 07:15 PM

bt: cannot determine starting stack pointer
 
Hi Dave,

On Tue, Feb 14, 2012 at 11:07 AM, Dave Anderson <anderson@redhat.com> wrote:
>> I need the stack traces of the tasks that are on-proc as well as the
>> tasks that are not. "bt" fails for the on-proc tasks, even though there
>> is a backup mechanism for finding the stack:

> You could also try "bt -t" or "bt -T".

That gets you too much information: you get anything on the stack
that resolves to some symbol (assuming I've understood the help text
correctly). Typically, there is a bunch of uninitialized stuff on the
stack that will often be return addresses to procedures that were on
the stack the last time it grew to that depth. Using the task
structure's stack pointer gives you a better shot at following the
stack.
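
The difference can be shown with a toy model. This is only a sketch in Python: the addresses and symbols are made up, and the frame-pointer chain is an assumption chosen for illustration, not a description of how crash actually unwinds. It shows how a scan for anything that resolves to a symbol also reports a stale return address, while following the chain does not.

```python
# Toy model only: made-up addresses and symbols, and a frame-pointer chain
# that is an assumption for illustration, not crash's actual unwinder.
SYMBOLS = {0x1000: "sys_open", 0x2000: "do_sys_open", 0x3000: "old_helper"}
END_OF_CHAIN = -1


def resolve(word):
    """Return the symbol whose 4 KiB slot contains word, or None."""
    return SYMBOLS.get(word & ~0xFFF)


def scan_text(stack, sp):
    """Report every word from sp to the top that resolves to a symbol."""
    return [resolve(w) for w in stack[sp:] if resolve(w)]


def walk_frames(stack, fp):
    """Follow saved frame pointers; slot fp + 1 holds the return address."""
    trace = []
    while fp != END_OF_CHAIN:
        trace.append(resolve(stack[fp + 1]))
        fp = stack[fp]
    return trace


STACK = [
    0,             # slot 0: local variable
    0x3010,        # slot 1: stale return address from an earlier, deeper call
    4,             # slot 2: saved frame pointer, points at slot 4
    0x2040,        # slot 3: return address into do_sys_open
    END_OF_CHAIN,  # slot 4: saved frame pointer, end of the chain
    0x1020,        # slot 5: return address into sys_open
]

print(scan_text(STACK, 0))    # includes the stale old_helper entry
print(walk_frames(STACK, 2))  # only the two live callers
```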

> But what kind of dumpfile was this anyway? I'm wondering why you aren't
> getting any stack traces at all for the active tasks?

CFS (Cluster File System, aka Lustre) appliance. As for why, I don't
exactly know. I'd have to fetch the crash sources and see what is going
on where that message gets emitted.


BTW, I've also tripped over a command parser bug. I wrote a script
intended to be used thus:

crash> !bash live-bt.sh
crash> < cmd
crash> < cmd
crash> < cmd

with the result being the back traces I'm after. For some reason, the
scanner went past the end of an input line and found leftover characters
from a previous input line, with two consequences:
1. an ugly error message saying that the garbage was not a valid crash
   command
2. a message instructing the user to type "< cmd" was interpreted as a
   command (sans quotes), resulting in only needing to type the "< cmd"
   thing twice instead of three times.
It's nice in a way, but probably not right. :)
I can send you new command line scanner/lexer code tonight that is about
half the current size. (Borrowed from my own open source hacking around.)

Bruce Korb 02-14-2012 07:23 PM

bt: cannot determine starting stack pointer
 
I see the cascading issue now. Too many distractions. Sorry.


Dave Anderson 02-14-2012 07:39 PM

bt: cannot determine starting stack pointer
 
----- Original Message -----

> > But what kind of dumpfile was this anyway? I'm wondering why you aren't
> > getting any stack traces at all for the active tasks?
>
> CFS (Cluster File System aka Lustre) appliance. As for why, I don't exactly know.
> I'd have to fetch crash sources and see what is going on where that message
> gets emitted.

No, I meant what was the dumpfile format, i.e., was it an ELF kdump,
compressed-kdump, Xen dump, kvmdump, etc?

The error message is from here, where the starting stack pointer
could not be determined, or was an address that is not accessible
for some reason:

        if (!(bt->flags & BT_USER_SPACE) && (!rsp || !accessible(rsp))) {
                error(INFO, "cannot determine starting stack pointer\n");
                if (KVMDUMP_DUMPFILE())
                        kvmdump_display_regs(bt->tc->processor, ofp);
                else if (ELF_NOTES_VALID() && DISKDUMP_DUMPFILE())
                        diskdump_display_regs(bt->tc->processor, ofp);
                else if (SADUMP_DUMPFILE())
                        sadump_display_regs(bt->tc->processor, ofp);
                return;
        }

Dave


Bruce Korb 02-14-2012 08:14 PM

bt: cannot determine starting stack pointer
 
# file *
console-20111031:   data
console.c0-0c0s5n1: ASCII Java program text
dump.000051:        data
hosts:              ASCII English text
live-bt.sh:         Bourne-Again shell script text executable
lnet_kos:           directory
lustre_kos:         directory
README:             ASCII English text
System.map:         ASCII text
vmlinux:            ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, not stripped

> No, I meant what was the dumpfile format, i.e., was it an ELF kdump,
> compressed-kdump, Xen dump, kvmdump, etc?

I don't actually know what the acquisition method was.

> The error message is from here, where the starting stack pointer
> could not be determined, or was an address that is not accessible
> for some reason:
>
>         if (!(bt->flags & BT_USER_SPACE) && (!rsp || !accessible(rsp))) {
>                 error(INFO, "cannot determine starting stack pointer\n");
>                 if (KVMDUMP_DUMPFILE())
>                         kvmdump_display_regs(bt->tc->processor, ofp);
>                 else if (ELF_NOTES_VALID() && DISKDUMP_DUMPFILE())
>                         diskdump_display_regs(bt->tc->processor, ofp);
>                 else if (SADUMP_DUMPFILE())
>                         sadump_display_regs(bt->tc->processor, ofp);
>                 return;
>         }

With the dumps we get, it happens essentially all the time.

My bizarre shell loops were the result of writing to the same file
that bash was reading from. With that fixed, I now have a template
for writing multi-pass shell scripts.


Dave Anderson 02-14-2012 08:18 PM

bt: cannot determine starting stack pointer
 
----- Original Message -----
> # file *
> console-20111031:   data
> console.c0-0c0s5n1: ASCII Java program text
> dump.000051:        data
> hosts:              ASCII English text
> live-bt.sh:         Bourne-Again shell script text executable
> lnet_kos:           directory
> lustre_kos:         directory
> README:             ASCII English text
> System.map:         ASCII text
> vmlinux:            ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, not stripped
>
> > No, I meant what was the dumpfile format, i.e., was it an ELF kdump,
> > compressed-kdump, Xen dump, kvmdump, etc?
>
> I don't actually know what the acquisition method was.

Enter "help -n"




Bruce Korb 02-14-2012 09:36 PM

bt: cannot determine starting stack pointer
 
On Tue, Feb 14, 2012 at 1:18 PM, Dave Anderson <anderson@redhat.com> wrote:
>> I don't actually know what the acquisition method was.
>
> Enter "help -n"

Here ya go. Doesn't mean much to me. Hope you didn't want 32 hash tables....

crash> help -n
total_pages: 212168
hashed: 2566
compressed: 1783 (69%)
raw: 783 (30%)
cached_reads: 50377 (90%)
hashed_reads: 2615 (4%)
total_reads: 55558 (hashed or cached: 94%)
page_hash[32]:
[......]
page_cache_hdr[16]:
INDEX PG_ADDR PG_BUFPTR PG_HIT_COUNT
[ 0] 3fd849000 1a00a30 48
[ 1] 1fd3a6000 1a01a30 1
[ 2] 2053cf000 1a02a30 1
[ 3] 2075ca000 1a03a30 64
[ 4] 2023f5000 1a04a30 16
[ 5] 3fd84e000 1a05a30 1
[ 6] 405f77000 1a06a30 31
[ 7] 3fd910000 1a07a30 1
[ 8] 405f74000 1a08a30 31
[ 9] 3fd99d000 1a09a30 1
[10] 405f6c000 1a0aa30 35
[11] 1fd456000 1a0ba30 1
[12] 204f65000 1a0ca30 15
[13] 405d7e000 1a0da30 31
[14] 3fd83d000 1a0ea30 1
[15] 405f79000 1a0fa30 31
mb_hdr_offsets: NA
num_zones: 20 / 128
zoned_offsets: 210313
dumpfile_index: (null)
ifd: -1
memory_pages: 4134481
page_offset_max: 442278774
page_index_max: 0
page_offsets: 0


Dave Anderson 02-15-2012 01:36 PM

bt: cannot determine starting stack pointer
 
----- Original Message -----
> On Tue, Feb 14, 2012 at 1:18 PM, Dave Anderson <anderson@redhat.com> wrote:
> >> I don't actually know what the acquisition method was.
> >
> > Enter "help -n"
>
> Here ya go. Doesn't mean much to me. Hope you didn't want 32 hash
> tables....

It means that it's an LKCD-generated dumpfile, or some derivative thereof.

I personally haven't done any LKCD support for many years now, given that
LKCD as a dumping mechanism has pretty much been superseded by kdump.
But every so often somebody forwards an LKCD-related patch that I take
in as long as it compiles.

That being said, it's news to me that backtraces cannot be generated
for the active tasks from LKCD dumpfiles, unless it's some kind of
"live dump" or something? Was there a panic or oops? What's the
last thing shown by the "log" command?

Dave



Dave Anderson 02-15-2012 03:41 PM

bt: cannot determine starting stack pointer
 
----- Original Message -----
> On 02/15/12 06:36, Dave Anderson wrote:

>
> I'm not too surprised. In the world of back-end clustered storage systems,
> updating systems is a massive security/stability concern. Consequently,
> newfangled stuff from less than a decade ago gets incorporated slowly. :)
>
> Analysis tools, however, can be (and are!!) updated.
>
> > That being said, it's news to me that backtraces cannot be generated
> > for the active tasks from LKCD dumpfiles, unless it's some kind of
> > "live dump" or something? Was there a panic or oops? What's the
> > last thing shown by the "log" command?
>
> Yes, it is a live dump, if that's what you mean by a crash dump.

OK, yes that's what I meant. And that's unfortunate...

> Figuring out why ptlrpc_invalidate_import() is struggling is what I signed up for
> learning how to do. Coercing crash into giving me stack traces for live/onproc
> processes is what I was hoping you would please be kind enough to help me figure out.
> My solution is the script (attached) that requires me to type four commands:
>
> > crash> ! bash live-bt.sh
> > crash> < c-cmd
> > crash> < c-cmd
> > crash> < c-cmd

That's about the best you can do. The task->stack pointer holds a
reference to the last time the task blocked in schedule(), but
the active tasks are either in user-space, or have re-entered the
kernel for another purpose. If you can find something useful in
their stacks, then go for it -- and good luck!

Dave
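
For anyone wanting to automate the manual "set" / "task" / "bt -S" sequence from the start of this thread, here is a minimal Python sketch. It assumes the "task" output looks like the excerpt quoted earlier (a "PID: ... TASK: ..." header followed by a "stack = 0x..." member line). The function name bt_commands is hypothetical, and as noted above, the resulting traces reflect the last time each task blocked in schedule(), not what it was doing when the dump was taken.

```python
import re

# Header line as printed by "task": PID: 20311 TASK: ffff8803ff654140 ...
HEADER = re.compile(r"PID:\s*(\d+)\s+TASK:\s*([0-9a-f]+)")
# Member line inside the task_struct dump: stack = 0xffff8803fd85a000,
STACK = re.compile(r"\bstack\s*=\s*(0x[0-9a-f]+)")


def bt_commands(task_output):
    """Turn saved "task" output into "set <task>" / "bt -S <stack>" pairs."""
    commands = []
    current_task = None
    for line in task_output.splitlines():
        header = HEADER.search(line)
        if header:
            current_task = header.group(2)
            continue
        stack = STACK.search(line)
        if stack and current_task:
            commands.append("set " + current_task)
            commands.append("bt -S " + stack.group(1))
            current_task = None  # emit at most one pair per task
    return commands


# Excerpt copied from the session at the start of the thread.
SAMPLE_TASK = """\
PID: 20311 TASK: ffff8803ff654140 CPU: 9 COMMAND: "xtnhc"
struct task_struct {
state = 0,
stack = 0xffff8803fd85a000,
"""

for command in bt_commands(SAMPLE_TASK):
    print(command)
```

The printed lines could be saved to a file and replayed from the crash prompt with "< file", in the same way as the "< cmd" steps quoted above.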

