02-20-2012, 06:34 PM
Guy Streeter

crash endlessly looping on stdout error

We have a recurring problem in our crash analysis system, where remote users
get disconnected and crash starts endlessly looping trying to write to stdout.
An strace of a recent instance is looping on:

write(1, " JIFFIES\n", 10) = -1 EIO (Input/output error)

but that isn't always the output string.

This is a problem in our shared environment because the orphaned crash tasks
eat up the CPUs, and we don't have the privilege to kill each other's tasks.

thanks,
--Guy

--
Crash-utility mailing list
Crash-utility@redhat.com
https://www.redhat.com/mailman/listinfo/crash-utility
 
02-21-2012, 03:44 PM
Dave Anderson

crash endlessly looping on stdout error

----- Original Message -----
> We have a recurring problem in our crash analysis system, where remote users
> get disconnected and crash starts endlessly looping trying to write to stdout.
> An strace of a recent instance is looping on:
>
> write(1, " JIFFIES\n", 10) = -1 EIO (Input/output error)
>
> but that isn't always the output string.
>
> This is a problem in our shared environment because the orphaned crash tasks
> eat up the CPUs, and we don't have the privilege to kill each other's tasks.
>
> thanks,
> --Guy

Hmmm, upon initial glance, this seemed to be related to the crash-5.0.2
fix that you guys reported:

- Fix to prevent a crash session that is run over a network connection
that is killed/removed from going into 100% cpu-time loop. Without
the patch, the behavior of the built-in readline() library call in
gdb-7.0 has changed such that the function returns when the EOF is
encountered on /dev/tty, and the crash session goes into an endless
loop; whereas in gdb-6.1, the readline() call never returns because
the crash session gets killed while running in the library code.
(anderson@redhat.com)

But if the orphaned task is repeatedly writing the same thing, it
would never get to the next readline() call, where it would kill
itself. Taking your example, the "JIFFIES" write() is part of a "timer"
command, but I'm trying to understand how/why the command is not just
completing a series of (failed) fprintf's, and then falling into
the next readline() -- where it should kill itself? By any chance
was the remote caller doing a "repeat" command on the live system,
or something like that? (sounds doubtful since you'd have to have
root privileges to do that...)
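
As an illustration of the failure being discussed (this is not code from
crash itself): when the terminal behind stdout is gone, the underlying
write() fails, and stdio reports that through the return values of fputs()
and fflush() and through ferror(). A caller that checks them can stop
instead of carrying on. A minimal sketch in C, where emit_line() is a
hypothetical helper name chosen for this example:

```c
#include <stdio.h>

/*
 * Write one line to a stream and flush it.
 * Returns 0 if the text reached the underlying file descriptor,
 * -1 if the write or flush failed or the stream is in an error state.
 * emit_line() is a hypothetical name, not part of crash.
 */
int emit_line(FILE *out, const char *text)
{
    if (fputs(text, out) == EOF)
        return -1;

    /* Buffered output only hits write() here, so this is where
     * an error such as EIO would normally surface. */
    if (fflush(out) == EOF)
        return -1;

    return ferror(out) ? -1 : 0;
}
```

A command loop built on a helper like this could bail out on the first -1
rather than issuing further writes that are bound to fail. Whether that
would have helped in the cases reported here is not established by the
strace output alone.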

Dave

02-22-2012, 04:01 PM
Guy Streeter

crash endlessly looping on stdout error

On 02/21/2012 10:44 AM, Dave Anderson wrote:
> Hmmm, upon initial glance, this seemed to be related to the crash-5.0.2
> fix that you guys reported:
>
> - Fix to prevent a crash session that is run over a network connection
> that is killed/removed from going into 100% cpu-time loop. Without
> the patch, the behavior of the built-in readline() library call in
> gdb-7.0 has changed such that the function returns when the EOF is
> encountered on /dev/tty, and the crash session goes into an endless
> loop; whereas in gdb-6.1, the readline() call never returns because
> the crash session gets killed while running in the library code.
> (anderson@redhat.com)
>
> But if the orphaned task is repeatedly writing the same thing, it
> would never get to the next readline() call, where it would kill
> itself. Taking your example, the "JIFFIES" write() is part of a "timer"
> command, but I'm trying to understand how/why the command is not just
> completing a series of (failed) fprintf's, and then falling into
> the next readline() -- where it should kill itself? By any chance
> was the remote caller doing a "repeat" command on the live system,
> or something like that? (sounds doubtful since you'd have to have
> root privileges to do that...)
>

This is not a live system. This is the setup where we analyze vmcores sent in
by our customers.
I don't understand how it happens either, unless for some reason fprintf is
re-trying the failed write().
This is not the only failure scenario. I just saw another one repeating on
this sequence:

rt_sigaction(SIGFPE, {0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0},
{0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0}, 8) = 0
rt_sigreturn(0x8) = -1 ENETDOWN (Network is down)
--- SIGFPE (Floating point exception) @ 0 (0) ---
rt_sigaction(SIGFPE, {0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0},
{0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0}, 8) = 0
rt_sigreturn(0x8) = -1 ENETDOWN (Network is down)
--- SIGFPE (Floating point exception) @ 0 (0) ---
rt_sigaction(SIGFPE, {0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0},
{0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0}, 8) = 0
rt_sigreturn(0x8) = -1 ENETDOWN (Network is down)
--- SIGFPE (Floating point exception) @ 0 (0) ---

Perhaps it isn't a crash program issue at all. Maybe it's at the system
library level.

--Guy

02-22-2012, 06:39 PM
Dave Anderson

crash endlessly looping on stdout error

----- Original Message -----
> This is not a live system. This is the setup where we analyze vmcores sent in
> by our customers. I don't understand how it happens either, unless for some reason
> fprintf is re-trying the failed write(). This is not the only failure scenario.
> I just saw another one repeating on this sequence:
>
> rt_sigaction(SIGFPE, {0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0},
> {0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0}, 8) = 0
> rt_sigreturn(0x8) = -1 ENETDOWN (Network is down)
> --- SIGFPE (Floating point exception) @ 0 (0) ---
> rt_sigaction(SIGFPE, {0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0},
> {0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0}, 8) = 0
> rt_sigreturn(0x8) = -1 ENETDOWN (Network is down)
> --- SIGFPE (Floating point exception) @ 0 (0) ---
> rt_sigaction(SIGFPE, {0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0},
> {0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0}, 8) = 0
> rt_sigreturn(0x8) = -1 ENETDOWN (Network is down)
> --- SIGFPE (Floating point exception) @ 0 (0) ---
>
> Perhaps it isn't a crash program issue at all. Maybe it's at the
> system library level.

About the closest I can come to reproducing it so far is to run
"kmem -S" on a dumpfile I created with the snap.so extension
module, where the slab subsystem was churning underneath the
snapshot process (a live dump). Anyway, the command gets into
an endless readmem() loop because of invalid kmem slab bookkeeping
values, and if I kill the network connection I can catch it in a
readmem() loop.

Now I could check for a parent pid of 1 each time in readmem(),
and kill it there, given readmem() is so regularly called, but
since you're seeing scenarios that don't show a readmem() in
the loop, that's not going to fly. Perhaps a better plan would
be to set up prctl(PR_SET_PDEATHSIG, SIGKILL) during initialization,
and hope there are no unwanted side effects.
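
The two options weighed above can be sketched in C as follows. This is a
minimal illustration under stated assumptions, not the actual crash patch:
the function names parent_is_init() and arm_parent_death_signal() are
hypothetical, and the extra getppid() comparison is an addition to cover
the case where the parent has already exited before prctl() is called.

```c
#include <signal.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Polling approach: returns 1 if this process has been reparented
 * to pid 1, 0 otherwise. It only helps if it is called from a code
 * path that keeps running while the process is stuck.
 */
int parent_is_init(void)
{
    return getppid() == 1;
}

/*
 * Kernel-assisted approach: ask Linux to deliver SIGKILL to this
 * process when its parent exits. original_parent is the pid the
 * caller recorded from getppid() at startup.
 * Returns 0 on success, -1 if prctl() failed or the parent was
 * already gone by the time the request was registered.
 */
int arm_parent_death_signal(pid_t original_parent)
{
    if (prctl(PR_SET_PDEATHSIG, SIGKILL) == -1)
        return -1;

    /* The parent may have exited before prctl() took effect. */
    if (getppid() != original_parent)
        return -1;

    return 0;
}
```

A caller would typically record getppid() early in main() and then call
arm_parent_death_signal() once during initialization. The setting is
Linux-specific and is not inherited by children created with fork().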

Dave


02-22-2012, 07:00 PM
Dave Anderson

crash endlessly looping on stdout error

----- Original Message -----

> Now I could check for a parent pid of 1 each time in readmem(),
> and kill it there, given readmem() is so regularly called, but
> since you're seeing scenarios that don't show a readmem() in
> the loop, that's not going to fly. Perhaps a better plan would
> be to set up prctl(PR_SET_PDEATHSIG, SIGKILL) during initialization,
> and hope there are no unwanted side effects.
>
> Dave

The prctl() works both for a readmem() loop and for a repeating signal
loop that I was able to force, which is somewhat similar to your second
example.

Killing the network connection on the two scenarios showed this:

$ strace -p 18992
... [ cut ] ...
--- {si_signo=SIGTTOU, si_code=SI_KERNEL, si_value={int=1653142270, ptr=0x326288f2fe}} (Stopped (tty output)) ---
--- Stopped (tty output) by SIGTTOU ---
ioctl(10, SNDCTL_TMR_START or SNDRV_TIMER_IOCTL_TREAD or TCSETS, {B38400 opost isig -icanon -echo ...}) = ? ERESTARTSYS (To be restarted)
--- {si_signo=SIGTTOU, si_code=SI_KERNEL, si_value={int=1653142270, ptr=0x326288f2fe}} (Stopped (tty output)) ---
--- Stopped (tty output) by SIGTTOU ---
ioctl(10, SNDCTL_TMR_START or SNDRV_TIMER_IOCTL_TREAD or TCSETS, {B38400 opost isig -icanon -echo ...}) = ? ERESTARTSYS (To be restarted)
--- {si_signo=SIGTTOU, si_code=SI_KERNEL, si_value={int=1653142270, ptr=0x326288f2fe}} (Stopped (tty output)) ---
--- Stopped (tty output) by SIGTTOU ---
ioctl(10, SNDCTL_TMR_START or SNDRV_TIMER_IOCTL_TREAD or TCSETS, {B38400 opost isig -icanon -echo ...}) = ? ERESTARTSYS (To be restarted)
+++ killed by SIGKILL +++
$

$ strace -p 19607
... [ cut ] ...
lseek(3, 934634248, SEEK_SET) = 934634248
read(3, "@2739210377377", 8) = 8
lseek(3, 968508152, SEEK_SET) = 968508152
read(3, "20H048210377377", 8) = 8
lseek(3, 939739912, SEEK_SET) = 939739912
read(3, "20`2667210377377", 8) = 8
lseek(3, 934634248, SEEK_SET) = 934634248
read(3, "@2739210377377", 8) = 8
+++ killed by SIGKILL +++
$

Seems like the way to go...

Dave
