08-25-2010, 11:23 PM
Bob Montgomery

Missing PID 1 is crash problem with losing tasks

(Was Re: [Crash-utility] mount cmd crashes crash)

On Thu, 2010-08-19 at 12:45 +0000, Dave Anderson wrote:
> ----- "Bob Montgomery" <bob.montgomery@hp.com> wrote:

> > > Yeah, it's not important to use the context of pid 1, but it just needs
> > > some context, and I had presumed that init would always exist. I thought
> > > that the panic("Attempted to kill the idle task!") in do_exit() would
> > > prevent pid 1 from ever going away -- but apparently your kernel figured
> > > out how to do it elsewhere... ;-)
> >
> > That test is for PID 0, not PID 1 (at least on the kernel I'm
> > debugging.) However, there is this also:
> >
> > if (unlikely(tsk == child_reaper))
> > panic("Attempted to kill init!");
>
> That's the one I *meant*... ;-)
>
> >
> > And child_reaper in the dump points to a task struct for init that isn't
> > in the ps listing. Hmmm. Maybe that part *is* interesting in this dump...

Well, I've been picking at this some more. PID 1 is in the system, but
crash misses it when it's building its table of tasks in
refresh_hlist_task_table_v2(). In fact, on my particular dump, it loses
track of at least 3 processes.

The attached patch changes that behavior. The problem involves collisions
on the pid_hash table: when an early item on a chain has a NULL task
pointer, the code ignores all subsequent items on that collision chain.
I'm not sure what it means when the tasks[0].first
pointer in the struct pid is NULL, but that's what triggers the problem
and keeps crash from following the pid_chain pointer to the next struct
pid. I am not confident that this whole area is correct yet, just
closer to correct than it was.
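
To make the failure mode concrete, here is a minimal sketch in C. It is not the patch and not crash's code: struct toy_pid, both walker functions, and demo_count() are hypothetical stand-ins, and the PID values in the demo chain are made up. The fields only mirror the names used above (tasks[0].first and pid_chain). It contrasts a chain walk that stops at the first entry with a NULL task pointer against one that skips that entry and keeps following the chain.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a struct pid on a hash chain. */
struct toy_pid {
    int nr;                      /* PID number */
    const void *task_first;      /* stands in for tasks[0].first; may be NULL */
    struct toy_pid *chain_next;  /* stands in for the pid_chain link */
};

/* Behavior described as the bug: a NULL task pointer ends the walk,
 * so every later entry on the same collision chain is lost. */
size_t walk_chain_stop_on_null(const struct toy_pid *head, int *out, size_t max)
{
    size_t n = 0;
    assert(out != NULL);
    for (const struct toy_pid *p = head; p != NULL; p = p->chain_next) {
        if (p->task_first == NULL)
            break;               /* abandons the rest of the chain */
        if (n < max)
            out[n++] = p->nr;
    }
    return n;
}

/* Behavior after the fix, as described: skip the entry with no task,
 * but still follow the chain link to the next entry. */
size_t walk_chain_skip_null(const struct toy_pid *head, int *out, size_t max)
{
    size_t n = 0;
    assert(out != NULL);
    for (const struct toy_pid *p = head; p != NULL; p = p->chain_next) {
        if (p->task_first == NULL)
            continue;            /* nothing to record here, keep walking */
        if (n < max)
            out[n++] = p->nr;
    }
    return n;
}

/* Builds a three-entry chain whose first entry has a NULL task pointer
 * and returns how many PIDs the chosen walker collects. */
size_t demo_count(int skip_null)
{
    static const int dummy_task = 0;
    struct toy_pid c = { 998,  &dummy_task, NULL };
    struct toy_pid b = { 1,    &dummy_task, &c };
    struct toy_pid a = { 4242, NULL,        &b };
    int found[3];

    return skip_null ? walk_chain_skip_null(&a, found, 3)
                     : walk_chain_stop_on_null(&a, found, 3);
}
```

With this toy chain, demo_count(0) collects nothing because the walk ends at the first entry, while demo_count(1) collects the two entries behind it.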

These now appear in the ps output:

crash-5.0.6-fix2> ps 1 8144 998
   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
      1      0   1  ffff81012bd3c780  IN   0.0    6124    688  init
   8144   6257   0  ffff81011996e140  RU   0.7  108876  35016  mirrorclient
    998     11   0  ffff81012a9cd780  IN   0.0       0      0  [fc_dl_1]

where before:

crash-5.0.6-fix> ps 1 8144 998
ps: invalid task or pid value: 1

ps: invalid task or pid value: 8144

ps: invalid task or pid value: 998

This might have been some transition behavior of the pid hash design in
the kernel, because I've got two dumps based on 2.6.18 kernels that show
missing processes (this one had 3 out of 532, the other had 1 out of
146), but my new patched crash doesn't reveal any missing processes in
2.6.29 and newer dumps (I checked 4 dumps, with process counts ranging
from 362 to 926). Only my recent 2.6.18 dump was lucky enough to be
missing PID 1, with me being lucky enough to try crash's mount command,
or we'd still not know about it :-)

The patch is simple, but has lots of lines because I moved the indent.

Bob Montgomery
Working at HP




--
Crash-utility mailing list
Crash-utility@redhat.com
https://www.redhat.com/mailman/listinfo/crash-utility
 
08-26-2010, 01:31 PM
Dave Anderson

Missing PID 1 is crash problem with losing tasks

----- "Bob Montgomery" <bob.montgomery@hp.com> wrote:

> Well, I've been picking at this some more. PID 1 is in the system, but
> crash misses it when it's building its table of tasks in
> refresh_hlist_task_table_v2(). In fact, on my particular dump, it loses
> track of at least 3 processes.
>
> The attached patch changes that behavior. The problem involves collisions
> on the pid_hash table: when an early item on a chain has a NULL task
> pointer, the code ignores all subsequent items on that collision chain.
> I'm not sure what it means when the tasks[0].first
> pointer in the struct pid is NULL, but that's what triggers the problem
> and keeps crash from following the pid_chain pointer to the next struct
> pid. I am not confident that this whole area is correct yet, just
> closer to correct than it was.
>
> These now appear in the ps output:
>
> crash-5.0.6-fix2> ps 1 8144 998
>    PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
>       1      0   1  ffff81012bd3c780  IN   0.0    6124    688  init
>    8144   6257   0  ffff81011996e140  RU   0.7  108876  35016  mirrorclient
>     998     11   0  ffff81012a9cd780  IN   0.0       0      0  [fc_dl_1]
>
> where before:
>
> crash-5.0.6-fix> ps 1 8144 998
> ps: invalid task or pid value: 1
>
> ps: invalid task or pid value: 8144
>
> ps: invalid task or pid value: 998
>
> This might have been some transition behavior of the pid hash design in
> the kernel, because I've got two dumps based on 2.6.18 kernels that show
> missing processes (this one had 3 out of 532, the other had 1 out of
> 146), but my new patched crash doesn't reveal any missing processes in
> 2.6.29 and newer dumps (I checked 4 dumps, with process counts ranging
> from 362 to 926). Only my recent 2.6.18 dump was lucky enough to be
> missing PID 1, with me being lucky enough to try crash's mount command,
> or we'd still not know about it :-)

Yeah, I agree that it must be catching a kernel transition.

And it's probably not being seen in your 2.6.29-and-newer dumps because
2.6.24-and-later kernels use refresh_hlist_task_table_v3().

> The patch is simple, but has lots of lines because I moved the indent.

The patch looks reasonable and safe. I'll run it against my stable of
sample dumpfiles to see if I can find one...

Anyway, nice catch Bob -- and thanks again for tracking down yet another
gnarly issue,
Dave
