
Dave Anderson 04-25-2012 02:42 PM

loop in crash
 
----- Original Message -----
>
> Hi Dave,
>
> I have a corrupt vmcore file (for ARM) that makes crash loop forever.
> The problem is in memory.c, function max_cpudata_limit. The last
> part of that function:
>
> if (VALID_MEMBER(kmem_list3_shared) &&
>     VALID_MEMBER(kmem_cache_s_lists) &&
>     readmem(kmem_cache_nodelists(cache), KVADDR, &start_address[0],
>     sizeof(ulong) * vt->kmem_cache_len_nodes, "array nodelist array",
>     RETURN_ON_ERROR)) {
>     for (i = 0; i < vt->kmem_cache_len_nodes; i++) {
>         if (start_address[i] == 0)
>             continue;
>         if (readmem(start_address[i] + OFFSET(kmem_list3_shared),
>             KVADDR, &shared, sizeof(void *),
>             "kmem_list3 shared", RETURN_ON_ERROR|QUIET)) {
>             if (!shared)
>                 break;
>         }
>         if (readmem(shared + OFFSET(array_cache_limit),
>             KVADDR, &limit, sizeof(int), "shared array_cache limit",
>             RETURN_ON_ERROR|QUIET)) {
>             if (limit > max_limit)
>                 max_limit = limit;
>             break;
>         }
>     }
> }
> FREEBUF(start_address);
> return max_limit;
>
> bail_out:
>     vt->flags |= KMEM_CACHE_UNAVAIL;
>     error(INFO, "unable to initialize kmem slab cache subsystem\n\n");
>     *cpus = 0;
>     return 0;
>
>
> The problem is that the readmem statement “if
> (readmem(start_address[i] + OFFSET(kmem_list3_shared), …..” fails,
> and then the function max_cpudata_limit is called over and over
> again. I did a patch adding “else goto bail_out;” if the readmem
> fails and then crash managed to continue. I do not know if this is
> really a good idea.
>
> As this seems only to be a problem for corrupt vmcore files I do not
> know if you want to do anything about it.

Maybe -- maybe not...

In the case of corrupted vmcores, it's preferable to avoid a cover-up,
and in fact, the crash utility is often "doing its job" by failing,
i.e., its failure points to the problem at hand.

However, in the specific case of the kmem_cache initialization, that has
been a problem area in the past when the subsystem itself is corrupted,
or perhaps in your case where the vmcore is corrupted. That's why
the "crash --no_kmem_cache" or "crash --kmem_cache_delay" options
were put in place.

Now in your case, I'm guessing that the crash session may have
quietly "hung" during initialization? And with debug turned on you
may have seen the readmem failures?

I tried to reproduce this by injecting a readmem() failure for
that particular readmem(), but it does not result in a loop.
In my test, the readmem() fails, max_cpudata_limit() eventually returns,
and kmem_cache_init() just goes on to the next kmem_cache in the chain.
Also, because that readmem() is explicitly set RETURN_ON_ERROR|QUIET, it can
conceivably fail without max_cpudata_limit() having to set KMEM_CACHE_UNAVAIL.

Anyway, if max_cpudata_limit() returns without setting KMEM_CACHE_UNAVAIL,
kmem_cache_init() should just continue to walk through the kmem_cache
chain:

[ initialize "cache" and "cache_end" ]

do {
    ... [ cut ] ...

    if ((tmp = max_cpudata_limit(cache, &tmp2)) > max_limit)
        max_limit = tmp;

    /*
     * Recognize and bail out on any max_cpudata_limit() failures.
     */
    if (vt->flags & KMEM_CACHE_UNAVAIL) {
        FREEBUF(cache_buf);
        return;
    }

    ... [ cut ] ...

    cache = ULONG(cache_buf + next_offset);

    switch (vt->flags & (PERCPU_KMALLOC_V1|PERCPU_KMALLOC_V2))
    {
    case PERCPU_KMALLOC_V1:
        cache -= next_offset;
        break;
    case PERCPU_KMALLOC_V2:
        if (cache != cache_end)
            cache -= next_offset;
        break;
    }

} while (cache != cache_end);

So I don't understand how you got into a loop unless the kmem_cache list
walk-through is the real problem. If you were to print out the "cache"
address each time through the do-while loop, does the list start repeating
itself?

And if that's true, perhaps the kmem_cache_init() should use the
hq_open()/hq_enter()/hq_close() facility on each cache address to
catch a duplicate (false) entry.
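
The duplicate check suggested above can be sketched in isolation. What follows is a minimal stand-alone model, not crash's hq_open()/hq_enter()/hq_close() code: seen_set, seen_enter(), walk_list() and the two next_* functions are hypothetical names, and the addresses are made up. It assumes a list walk that should stop at a known end address, and it reports a failure as soon as any address shows up twice.

```c
#include <stddef.h>
#include <string.h>

#define SEEN_SLOTS 1024  /* power of two; arbitrary capacity for this sketch */

/* Hypothetical open-addressing set of visited addresses. */
struct seen_set {
    unsigned long slot[SEEN_SLOTS];
    unsigned char used[SEEN_SLOTS];
    size_t count;
};

static void seen_open(struct seen_set *s)
{
    memset(s, 0, sizeof(*s));
}

/* Returns 1 if addr was newly recorded, 0 if it is a duplicate
 * (or if the table has no room left). */
static int seen_enter(struct seen_set *s, unsigned long addr)
{
    size_t i = (size_t)(addr * 2654435761UL) & (SEEN_SLOTS - 1);

    if (s->count >= SEEN_SLOTS - 1)
        return 0;
    while (s->used[i]) {
        if (s->slot[i] == addr)
            return 0;               /* address seen before */
        i = (i + 1) & (SEEN_SLOTS - 1);
    }
    s->used[i] = 1;
    s->slot[i] = addr;
    s->count++;
    return 1;
}

typedef unsigned long (*next_fn)(unsigned long addr);

/* Walks from start until end is reached. Returns the number of entries
 * visited, or -1 if an address repeats before end is reached. */
static long walk_list(unsigned long start, unsigned long end, next_fn next)
{
    struct seen_set seen;
    unsigned long cur = start;
    long n = 0;

    seen_open(&seen);
    do {
        if (!seen_enter(&seen, cur))
            return -1;              /* duplicate entry: stop the walk */
        n++;
        cur = next(cur);
    } while (cur != end);
    return n;
}

/* Made-up healthy chain: 0x1000 -> 0x2000 -> 0x3000 -> 0x9000 (end). */
static unsigned long next_good(unsigned long addr)
{
    switch (addr) {
    case 0x1000UL: return 0x2000UL;
    case 0x2000UL: return 0x3000UL;
    default:       return 0x9000UL;
    }
}

/* Made-up corrupt chain: every entry leads to 0x1, which leads to itself. */
static unsigned long next_corrupt(unsigned long addr)
{
    (void)addr;
    return 0x1UL;
}
```

With next_good, walk_list(0x1000, 0x9000, next_good) visits three entries and stops at the end address. With next_corrupt, where the chain degenerates into an address that points to itself, it returns -1 instead of spinning forever.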

Dave


--
Crash-utility mailing list
Crash-utility@redhat.com
https://www.redhat.com/mailman/listinfo/crash-utility

Dave Anderson 04-25-2012 02:57 PM

loop in crash
 
----- Original Message -----
>
> So I don't understand how you got into a loop unless the kmem_cache list
> walk-through is the real problem. If you were to print out the "cache"
> address each time through the do-while loop, does the list start repeating
> itself?
>
> And if that's true, perhaps the kmem_cache_init() should use the
> hq_open()/hq_enter()/hq_close() facility on each cache address to
> catch a duplicate (false) entry.
>
> Dave

As a side issue, you have pinpointed a potential problem
area if the first readmem() does fail, because in that case it
should "continue" instead of using the invalid "shared" value
in the second readmem():

    if (readmem(start_address[i] + OFFSET(kmem_list3_shared),
        KVADDR, &shared, sizeof(void *),
        "kmem_list3 shared", RETURN_ON_ERROR|QUIET)) {
        if (!shared)
            break;
    }
    if (readmem(shared + OFFSET(array_cache_limit),
        KVADDR, &limit, sizeof(int), "shared array_cache limit",
        RETURN_ON_ERROR|QUIET)) {
        if (limit > max_limit)
            max_limit = limit;
        break;
    }

But again, I don't see that having anything to do with your problem.
And in all practical circumstances, that first readmem() should
never fail, even though it is allowable.

I'll fix that...
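
The control-flow change described above (skip the node when the first read fails, rather than falling through with an unset "shared") can be modelled without crash itself. In this sketch, struct node_model and max_shared_limit() are hypothetical; the *_readable fields stand in for the return value of readmem() with RETURN_ON_ERROR, and the fix that actually went into crash may differ in detail.

```c
#include <stddef.h>

/* Hypothetical model of one per-node entry: whether each of the two
 * reads would succeed, and the values they would return. */
struct node_model {
    int shared_readable;    /* would the read of the "shared" pointer succeed? */
    unsigned long shared;   /* value that read would produce */
    int limit_readable;     /* would the read of the limit succeed? */
    int limit;              /* value that read would produce */
};

/* Returns the larger of max_limit and the first readable shared limit. */
static int max_shared_limit(const struct node_model *nodes, size_t n,
                            int max_limit)
{
    for (size_t i = 0; i < n; i++) {
        unsigned long shared;

        if (!nodes[i].shared_readable)
            continue;       /* first read failed: "shared" is not valid */
        shared = nodes[i].shared;
        if (!shared)
            break;          /* no shared array cache on this node */

        if (nodes[i].limit_readable) {
            if (nodes[i].limit > max_limit)
                max_limit = nodes[i].limit;
            break;
        }
    }
    return max_limit;
}
```

For a first node whose "shared" read fails and a second node that reads cleanly with a limit of 120, the function skips the first node and returns 120 rather than dereferencing whatever value happened to be left in "shared".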

Dave


Dave Anderson 04-25-2012 06:29 PM

loop in crash
 
----- Original Message -----

>
> So I don't understand how you got into a loop unless the kmem_cache list
> walk-through is the real problem. If you were to print out the "cache"
> address each time through the do-while loop, does the list start repeating
> itself?
>
> And if that's true, perhaps the kmem_cache_init() should use the
> hq_open()/hq_enter()/hq_close() facility on each cache address to
> catch a duplicate (false) entry.

And if that's true, does the attached patch help?

Thanks,
Dave





"Karlsson, Jan" 04-26-2012 06:06 AM

loop in crash
 
Hi

and thanks for your work with this problem.

As you expected, crash just loops silently, and I spotted the problem by turning on debug printouts.
If I include printouts for the "cache" address, the first value seems reasonable, but then it starts to repeat with the value 0x00000001.
Finally, your patch solves the problem nicely. I get a warning about a duplicate kmem_slab entry, and crash continues to execute and issues other warnings indicating a corrupt vmcore file.

Jan


Jan Karlsson
Senior Software Engineer
MIB
*
Sony Mobile Communications
Tel: +46703062174
sonymobile.com
*



Dave Anderson 04-26-2012 01:08 PM

loop in crash
 
----- Original Message -----
> Hi
>
> and thanks for your work with this problem.
>
> As you expected crash silently just loops and I spotted the problem
> by turning on debug printouts.
> If I include printouts for the "cache" address, the first value seems
> reasonable, but then it starts to repeat with the value 0x00000001.
> Last, your patch solves the problem nicely. I get a warning about
> duplicate kmem_slab entry and crash continues to execute and issues
> other warnings indicating a corrupt vmcore file.
>
> Jan

OK good -- I should have hq_xxx()'d that loop a long time ago.

Queued for crash-6.0.6.

Thanks,
Dave




"Karlsson, Jan" 04-27-2012 07:15 AM

loop in crash
 
Thanks Dave.

I found one more issue with a somewhat "corrupt" vmcore. In this case it is ARM-specific in unwind_arm.c, so maybe Mika will also look at it.

In the case I am investigating I get a readmem error while reading the unwind tables. As unwinding is currently implemented, Crash then stops and no further analysis is possible. When I patched Crash to continue anyway, every command I tried, including bt, worked nicely, so there is no reason to stop on this kind of problem.

When investigating further I found that the problem occurs in init_module_unwind_tables. It is in the call to do_list(&ld) that the readmem error is found. I also looked in the code for do_list and saw that it could be configured to return even if errors were found, by setting ld.flags.

/*
 * Iterate through unwind table list and store start address of each
 * table in table_list.
 */
ld.flags |= RETURN_ON_LIST_ERROR; /* added line */
hq_open();
cnt = do_list(&ld);
if (cnt == -1) { /* added if statement, 3 lines */
    return FALSE;
}
table_list = (ulong *)GETBUF(cnt * sizeof(ulong));
cnt = retrieve_list(table_list, cnt);
hq_close();

By adding the lines indicated above I get an appropriate warning that the unwind tables cannot be read, and then Crash works as usual.
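
The shape of this change (ask the list walker to return on error, then turn the error into a non-fatal FALSE) can be shown with a small stand-alone model. Everything here is hypothetical: queue_open(), queue_close(), walk_tables() and init_tables() are stand-ins, not crash functions. The sketch also assumes that an open/close pair should stay balanced on the early-return path; whether the real hq_open() needs a matching hq_close() before returning is not stated in this thread.

```c
#define LIST_ERROR (-1)

/* Models whether the (hypothetical) queue is currently open. */
static int queue_open_count;

static void queue_open(void)  { queue_open_count++; }
static void queue_close(void) { queue_open_count--; }

/* Hypothetical list walker: returns the number of entries found,
 * or LIST_ERROR when a read fails and it was asked to return on error. */
static int walk_tables(int entries, int fail)
{
    return fail ? LIST_ERROR : entries;
}

/* Returns 1 if the tables were gathered, 0 if they could not be read.
 * A 0 return lets the caller warn and carry on instead of aborting. */
static int init_tables(int entries, int fail, int *count_out)
{
    int cnt;

    queue_open();
    cnt = walk_tables(entries, fail);
    if (cnt == LIST_ERROR) {
        queue_close();      /* assumed: keep open/close balanced */
        return 0;
    }
    *count_out = cnt;
    queue_close();
    return 1;
}
```

A caller that gets 0 back can print a warning that the unwind tables are unavailable and continue initialization, which matches the behaviour described above.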

Jan


Dave Anderson 04-27-2012 01:28 PM

loop in crash
 
----- Original Message -----
> Thanks Dave.
>
> I found one more issue with a somewhat "corrupt" vmcore. In this case
> it is ARM-specific in unwind_arm.c, so maybe Mika will also look at
> it.
>
> In the case I am investigating I get a readmem error while reading
> the unwind tables. The way unwinding currently is implemented Crash
> then stops and no further analysis is possible. When I patched Crash
> to continue anyhow, every command I tried worked nicely including
> bt, so there is no reason to stop at this kind of problem.
>
> When investigating further I found that the problem occurs in
> init_module_unwind_tables. It is in the call to do_list(&ld) that
> the readmem error is found. I also looked in the code for do_list
> and saw that it could be configured to return even if errors were
> found, by setting ld.flags.
>
> /*
>  * Iterate through unwind table list and store start address of each
>  * table in table_list.
>  */
> ld.flags |= RETURN_ON_LIST_ERROR; /* added line */
> hq_open();
> cnt = do_list(&ld);
> if (cnt == -1) { /* added if statement, 3 lines */
>     return FALSE;
> }
> table_list = (ulong *)GETBUF(cnt * sizeof(ulong));
> cnt = retrieve_list(table_list, cnt);
> hq_close();
>
> By adding the lines indicated above I get an appropriate warning that
> the unwind tables cannot be read, and then Crash works as usual.
>
> Jan

Your patch makes perfect sense. Any error(FATAL, ...) call prior to
RUNTIME being set kills the whole session. But if it is possible for
the session to continue, then it should be allowed to.

I'll also add an unwind-specific warning message, and make the same
change to the x86_64 populate_local_tables() function, upon which it
appears that the ARM version was based.

Queued for crash-6.0.6. (Later today...)

Thanks,
Dave


