On Fri, Sep 10, 2010 at 19:11, Stephen John Smoogen <smooge@gmail.com> wrote:
> The fas servers seem to be going into a repeatable OOPS. At present
> all I can see doing is
>
> /usr/sbin/xm destroy fasXX
> /usr/sbin/xm create fasXX
>
> on their master server.
>
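For anyone scripting the workaround quoted above, here is a minimal sketch that only builds the two command lines for a given guest. It assumes the guest name doubles as the name of its Xen config, as the quoted commands imply. The function names restart_commands and as_shell are made up for illustration, and nothing is executed.

```python
import shlex

# Path to xm exactly as given in the quoted message.
XM = "/usr/sbin/xm"


def restart_commands(guest):
    """Return the destroy/create command pair for one Xen guest.

    Hypothetical helper: it only builds argument lists, so it is safe
    to call anywhere. Run the result on the guest's dom0 host.
    """
    return [[XM, "destroy", guest], [XM, "create", guest]]


def as_shell(commands):
    """Render argument lists as shell-safe command lines for review."""
    return [" ".join(shlex.quote(part) for part in cmd) for cmd in commands]


if __name__ == "__main__":
    for line in as_shell(restart_commands("fas03")):
        print(line)
```

Printing the commands first, rather than running them directly, leaves room to confirm the guest name before a hard destroy.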
--
Stephen J Smoogen.
“The core skill of innovators is error recovery, not failure avoidance.”
Randy Nelson, President of Pixar University.
"We have a strategic plan. It's called doing things."
— Herb Kelleher, founder Southwest Airlines
_______________________________________________
infrastructure mailing list
infrastructure@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/infrastructure
09-11-2010, 06:51 AM
Jon Masters
PROBLEM alert - Host fas03 is DOWN
On Fri, 2010-09-10 at 19:24 -0600, Stephen John Smoogen wrote:
> Sep 11 01:10:23 fas03 kernel: WARNING: at block/blk-core.c:338
The code at block/blk-core.c:338 contains an explicit check to ensure that
interrupts have been disabled, but this is not true since blkif_interrupt
is not registered with IRQF_DISABLED set at the time of the setup in
bind_evtchn_to_irqhandler. Thus it might be that interrupts are still on
when we get to kick_pending_request_queues. Does this always happen?
This perhaps happened because upstream removed IRQF_DISABLED and now
runs with interrupts disabled in handle_IRQ_event, so Xen won't see
this. But on 2.6.32 this change had not yet happened. It's also 2:50am
and I might be reading this wrong, but I at least suggest you open a
RHEL6 bug and try a more recent kernel build.
Jon.
09-11-2010, 07:41 AM
Jon Masters
PROBLEM alert - Host fas03 is DOWN
On Sat, 2010-09-11 at 02:51 -0400, Jon Masters wrote:
> On Fri, 2010-09-10 at 19:24 -0600, Stephen John Smoogen wrote:
>
> > Sep 11 01:10:23 fas03 kernel: WARNING: at block/blk-core.c:338
>
> > Sep 11 01:10:23 fas03 kernel: [<c044fc97>] ? warn_slowpath_common+0x77/0xb0
> > Sep 11 01:10:23 fas03 kernel: [<c05ca5dc>] ? blk_start_queue+0x6c/0x70
> > Sep 11 01:10:23 fas03 kernel: [<c044fce3>] ? warn_slowpath_null+0x13/0x20
> > Sep 11 01:10:23 fas03 kernel: [<c05ca5dc>] ? blk_start_queue+0x6c/0x70
> > Sep 11 01:10:23 fas03 kernel: [<ed63896b>] ?
> > kick_pending_request_queues+0x1b/0x30 [xen_blkfront]
> > Sep 11 01:10:23 fas03 kernel: [<ed638b80>] ?
> > blkif_interrupt+0x200/0x220 [xen_blkfront]
> > Sep 11 01:10:23 fas03 kernel: [<c04ad7c5>] ? handle_IRQ_event+0x45/0x140
>
> The code in block/blk-core:338 contains an explicit check to ensure that
> interrupts have been disabled, but this not true since blkif_interrupt
> is not registered with IRQF_DISABLED set at the time of the setup in
> bind_evtchn_to_irqhandler. Thus it might be that interrupts are still on
> when we get to kick_pending_request_queues. Does this always happen?
>
> This perhaps happened because upstream removed IRQF_DISABLED and now
> runs with interrupts disabled in handle_IRQ_event, so Xen won't see
> this. But on 2.6.32 this change had not yet happened. It's also 2:50am
> and I might be reading this wrong, but I at least suggest you open a
> RHEL6 bug and try a more recent kernel build.
Ah, of course I shouldn't email before bed. There's an obvious giant
spin_lock_irqsave/restore there, but as noted on xen-devel (when they
were mulling over moving all of the blkif_interrupt bits into a tasklet
just a couple of weeks ago): "It looks like __blk_end_request_all...is
returning with interrupts enabled sometimes". I pinged some folks.
Jon.
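To make the failure mode described above concrete, here is a toy userspace model, not kernel code. It assumes the handler takes the lock with interrupts saved and disabled, calls a completion routine, then restarts the queue, and that the restart path warns when interrupts are enabled. The Cpu class and the buggy flag are illustrative inventions. end_request stands in for __blk_end_request_all, and the direct call to blk_start_queue stands in for the path through kick_pending_request_queues.

```python
class Cpu:
    """Toy model of one CPU's interrupt-enable flag (illustrative only)."""

    def __init__(self):
        self.irqs_enabled = True
        self.warnings = []

    def spin_lock_irqsave(self):
        # Remember the previous state, then disable interrupts.
        flags = self.irqs_enabled
        self.irqs_enabled = False
        return flags

    def spin_unlock_irqrestore(self, flags):
        # Put the interrupt state back to what it was before the lock.
        self.irqs_enabled = flags


def blk_start_queue(cpu):
    # Models the check that fired: warn if interrupts are not disabled.
    if cpu.irqs_enabled:
        cpu.warnings.append("WARNING: at block/blk-core.c:338")


def end_request(cpu, buggy):
    # Hypothetical stand-in for __blk_end_request_all. When buggy, it
    # returns with interrupts switched back on, as the xen-devel quote
    # suggests can happen sometimes.
    if buggy:
        cpu.irqs_enabled = True


def blkif_interrupt(cpu, buggy):
    flags = cpu.spin_lock_irqsave()
    end_request(cpu, buggy)
    blk_start_queue(cpu)  # reached via kick_pending_request_queues
    cpu.spin_unlock_irqrestore(flags)
    return cpu.warnings


if __name__ == "__main__":
    print(blkif_interrupt(Cpu(), buggy=False))
    print(blkif_interrupt(Cpu(), buggy=True))
```

The point of the model is that the irqsave/restore pair in the handler is not enough on its own: any callee that re-enables interrupts in between will trip the check.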
09-11-2010, 02:02 PM
Mike McGrath
PROBLEM alert - Host fas03 is DOWN
On Sat, 11 Sep 2010, Jon Masters wrote:
> On Sat, 2010-09-11 at 02:51 -0400, Jon Masters wrote:
> > On Fri, 2010-09-10 at 19:24 -0600, Stephen John Smoogen wrote:
> >
> > > Sep 11 01:10:23 fas03 kernel: WARNING: at block/blk-core.c:338
> >
> > > Sep 11 01:10:23 fas03 kernel: [<c044fc97>] ? warn_slowpath_common+0x77/0xb0
> > > Sep 11 01:10:23 fas03 kernel: [<c05ca5dc>] ? blk_start_queue+0x6c/0x70
> > > Sep 11 01:10:23 fas03 kernel: [<c044fce3>] ? warn_slowpath_null+0x13/0x20
> > > Sep 11 01:10:23 fas03 kernel: [<c05ca5dc>] ? blk_start_queue+0x6c/0x70
> > > Sep 11 01:10:23 fas03 kernel: [<ed63896b>] ?
> > > kick_pending_request_queues+0x1b/0x30 [xen_blkfront]
> > > Sep 11 01:10:23 fas03 kernel: [<ed638b80>] ?
> > > blkif_interrupt+0x200/0x220 [xen_blkfront]
> > > Sep 11 01:10:23 fas03 kernel: [<c04ad7c5>] ? handle_IRQ_event+0x45/0x140
> >
> > The code in block/blk-core:338 contains an explicit check to ensure that
> > interrupts have been disabled, but this not true since blkif_interrupt
> > is not registered with IRQF_DISABLED set at the time of the setup in
> > bind_evtchn_to_irqhandler. Thus it might be that interrupts are still on
> > when we get to kick_pending_request_queues. Does this always happen?
> >
> > This perhaps happened because upstream removed IRQF_DISABLED and now
> > runs with interrupts disabled in handle_IRQ_event, so Xen won't see
> > this. But on 2.6.32 this change had not yet happened. It's also 2:50am
> > and I might be reading this wrong, but I at least suggest you open a
> > RHEL6 bug and try a more recent kernel build.
>
> Ah, of course I shouldn't email before bed. There's an obvious giant
> spin_lock_irqsave/restore there, but as noted on xen-devel (when they
> were mulling over moving all of the blkif_interrupt bits into a tasklet
> jut a couple of weeks ago): "It looks like __blk_end_request_all...is
> returning with interrupts enabled sometimes". I pinged some folks.
>
Thanks for looking into this, Jon. We happened to have 3 hosts die of this
within about 2 hours last night. Here's the bug report Smooge opened:
I'll take a look around for a more recent RHEL6 kernel.
-Mike
09-11-2010, 04:40 PM
Mike McGrath
PROBLEM alert - Host fas03 is DOWN
On Sat, 11 Sep 2010, Jon Masters wrote:
> On Sat, 2010-09-11 at 02:51 -0400, Jon Masters wrote:
> > On Fri, 2010-09-10 at 19:24 -0600, Stephen John Smoogen wrote:
> >
> > > Sep 11 01:10:23 fas03 kernel: WARNING: at block/blk-core.c:338
> >
> > > Sep 11 01:10:23 fas03 kernel: [<c044fc97>] ? warn_slowpath_common+0x77/0xb0
> > > Sep 11 01:10:23 fas03 kernel: [<c05ca5dc>] ? blk_start_queue+0x6c/0x70
> > > Sep 11 01:10:23 fas03 kernel: [<c044fce3>] ? warn_slowpath_null+0x13/0x20
> > > Sep 11 01:10:23 fas03 kernel: [<c05ca5dc>] ? blk_start_queue+0x6c/0x70
> > > Sep 11 01:10:23 fas03 kernel: [<ed63896b>] ?
> > > kick_pending_request_queues+0x1b/0x30 [xen_blkfront]
> > > Sep 11 01:10:23 fas03 kernel: [<ed638b80>] ?
> > > blkif_interrupt+0x200/0x220 [xen_blkfront]
> > > Sep 11 01:10:23 fas03 kernel: [<c04ad7c5>] ? handle_IRQ_event+0x45/0x140
> >
> > The code in block/blk-core:338 contains an explicit check to ensure that
> > interrupts have been disabled, but this not true since blkif_interrupt
> > is not registered with IRQF_DISABLED set at the time of the setup in
> > bind_evtchn_to_irqhandler. Thus it might be that interrupts are still on
> > when we get to kick_pending_request_queues. Does this always happen?
> >
> > This perhaps happened because upstream removed IRQF_DISABLED and now
> > runs with interrupts disabled in handle_IRQ_event, so Xen won't see
> > this. But on 2.6.32 this change had not yet happened. It's also 2:50am
> > and I might be reading this wrong, but I at least suggest you open a
> > RHEL6 bug and try a more recent kernel build.
>
> Ah, of course I shouldn't email before bed. There's an obvious giant
> spin_lock_irqsave/restore there, but as noted on xen-devel (when they
> were mulling over moving all of the blkif_interrupt bits into a tasklet
> jut a couple of weeks ago): "It looks like __blk_end_request_all...is
> returning with interrupts enabled sometimes". I pinged some folks.
>
Just so everyone else knows, I've set kernel.panic to 10 on these hosts so
at least they'll reboot when they panic. Hopefully we can avoid a few
wake-and-reboot issues like we had last night :-/
-Mike
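As a rough aid for reasoning about the setting mentioned above, here is a small sketch of how kernel.panic interacts with kernel.panic_on_oops. It assumes sysctl.conf-style "key = value" text as input and deliberately simplifies: only a positive kernel.panic is treated as a timed reboot, and only two event types are modeled. The function names are hypothetical. Whether this matches what the fas hosts actually hit is not established in this thread.

```python
def parse_sysctl(text):
    """Parse sysctl.conf-style text into a dict of string settings."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def reboots_after(settings, event):
    """Return seconds until an automatic reboot, or None if there is none.

    event is "panic" or "oops". Simplification for this sketch: any
    non-positive kernel.panic value is treated as "no timed reboot".
    """
    timeout = int(settings.get("kernel.panic", "0"))
    panic_on_oops = int(settings.get("kernel.panic_on_oops", "0"))
    # An oops only leads to the panic path if panic_on_oops is set.
    if event == "oops" and not panic_on_oops:
        return None
    return timeout if timeout > 0 else None


if __name__ == "__main__":
    conf = parse_sysctl("kernel.panic = 10\n")
    print(reboots_after(conf, "panic"))
    print(reboots_after(conf, "oops"))
```

One thing this makes visible: with kernel.panic set but kernel.panic_on_oops left at 0, an oops that never escalates to a panic would not trigger the timed reboot.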
09-11-2010, 05:12 PM
Jon Masters
PROBLEM alert - Host fas03 is DOWN
On Sat, 2010-09-11 at 11:40 -0500, Mike McGrath wrote:
> On Sat, 11 Sep 2010, Jon Masters wrote:
> > > The code in block/blk-core:338 contains an explicit check to ensure that
> > > interrupts have been disabled, but this not true since blkif_interrupt
> > > is not registered with IRQF_DISABLED set at the time of the setup in
> > > bind_evtchn_to_irqhandler. Thus it might be that interrupts are still on
> > > when we get to kick_pending_request_queues. Does this always happen?
> > >
> > > This perhaps happened because upstream removed IRQF_DISABLED and now
> > > runs with interrupts disabled in handle_IRQ_event, so Xen won't see
> > > this. But on 2.6.32 this change had not yet happened. It's also 2:50am
> > > and I might be reading this wrong, but I at least suggest you open a
> > > RHEL6 bug and try a more recent kernel build.
> > Ah, of course I shouldn't email before bed. There's an obvious giant
> > spin_lock_irqsave/restore there, but as noted on xen-devel (when they
> > were mulling over moving all of the blkif_interrupt bits into a tasklet
> > jut a couple of weeks ago): "It looks like __blk_end_request_all...is
> > returning with interrupts enabled sometimes". I pinged some folks.
> Just so everyone else knows, I've set kernel.panic to 10 on these hosts so
> at least they'll reboot when they panic. Hopefully we can avoid a few
> wake-and-reboot issues like we had last night :-/
I pinged some folks about it last night. I would hope there will be a
fix for that soon. I suspect it's reproducible on the 70+ kernels, but
can you check that for us and update the BZ?
Jon.
09-11-2010, 11:09 PM
Stephen John Smoogen
PROBLEM alert - Host fas03 is DOWN
On Sat, Sep 11, 2010 at 11:12, Jon Masters <jcm@redhat.com> wrote:
> On Sat, 2010-09-11 at 11:40 -0500, Mike McGrath wrote:
>> On Sat, 11 Sep 2010, Jon Masters wrote:
>
>> > > The code in block/blk-core:338 contains an explicit check to ensure that
>> > > interrupts have been disabled, but this not true since blkif_interrupt
>> > > is not registered with IRQF_DISABLED set at the time of the setup in
>> > > bind_evtchn_to_irqhandler. Thus it might be that interrupts are still on
>> > > when we get to kick_pending_request_queues. Does this always happen?
>> > >
>> > > This perhaps happened because upstream removed IRQF_DISABLED and now
>> > > runs with interrupts disabled in handle_IRQ_event, so Xen won't see
>> > > this. But on 2.6.32 this change had not yet happened. It's also 2:50am
>> > > and I might be reading this wrong, but I at least suggest you open a
>> > > RHEL6 bug and try a more recent kernel build.
>
>> > Ah, of course I shouldn't email before bed. There's an obvious giant
>> > spin_lock_irqsave/restore there, but as noted on xen-devel (when they
>> > were mulling over moving all of the blkif_interrupt bits into a tasklet
>> > jut a couple of weeks ago): "It looks like __blk_end_request_all...is
>> > returning with interrupts enabled sometimes". I pinged some folks.
>
>> Just so everyone else knows, I've set kernel.panic to 10 on these hosts so
>> at least they'll reboot when they panic. Hopefully we can avoid a few
>> wake-and-reboot issues like we had last night :-/
>
> I pinged some folks about it last night. I would hope there will be a
> fix for that soon. I suspect it's reproducible on the 70+ kernels, but
> can you check that for us and update the BZ?
>
I have fas03 on a .71 kernel. Since the crashes seem to occur at the same
time, I have kept the others at older versions to see whether the newer
kernel fixes it or not. fas02 will reboot into a .71 if it needs to. I
haven't done anything to fas01, to keep it as a prime test ground.
--
Stephen J Smoogen.
“The core skill of innovators is error recovery, not failure avoidance.”
Randy Nelson, President of Pixar University.
"We have a strategic plan. It's called doing things."
— Herb Kelleher, founder Southwest Airlines
09-12-2010, 12:14 AM
Jon Masters
PROBLEM alert - Host fas03 is DOWN
On Sat, 2010-09-11 at 17:09 -0600, Stephen John Smoogen wrote:
> On Sat, Sep 11, 2010 at 11:12, Jon Masters <jcm@redhat.com> wrote:
> > On Sat, 2010-09-11 at 11:40 -0500, Mike McGrath wrote:
> >> On Sat, 11 Sep 2010, Jon Masters wrote:
> >
> >> > > The code in block/blk-core:338 contains an explicit check to ensure that
> >> > > interrupts have been disabled, but this not true since blkif_interrupt
> >> > > is not registered with IRQF_DISABLED set at the time of the setup in
> >> > > bind_evtchn_to_irqhandler. Thus it might be that interrupts are still on
> >> > > when we get to kick_pending_request_queues. Does this always happen?
> >> > >
> >> > > This perhaps happened because upstream removed IRQF_DISABLED and now
> >> > > runs with interrupts disabled in handle_IRQ_event, so Xen won't see
> >> > > this. But on 2.6.32 this change had not yet happened. It's also 2:50am
> >> > > and I might be reading this wrong, but I at least suggest you open a
> >> > > RHEL6 bug and try a more recent kernel build.
> >
> >> > Ah, of course I shouldn't email before bed. There's an obvious giant
> >> > spin_lock_irqsave/restore there, but as noted on xen-devel (when they
> >> > were mulling over moving all of the blkif_interrupt bits into a tasklet
> >> > jut a couple of weeks ago): "It looks like __blk_end_request_all...is
> >> > returning with interrupts enabled sometimes". I pinged some folks.
> >
> >> Just so everyone else knows, I've set kernel.panic to 10 on these hosts so
> >> at least they'll reboot when they panic. Hopefully we can avoid a few
> >> wake-and-reboot issues like we had last night :-/
> >
> > I pinged some folks about it last night. I would hope there will be a
> > fix for that soon. I suspect it's reproducible on the 70+ kernels, but
> > can you check that for us and update the BZ?
> I have fas3 on a .71 kernel. Since they seem to occur at the same time
> I have kept the others at older versions to see if it fixes or misses.
> fas02 will reboot into a .71 if it needs to. I haven't done anything
> to fas01 to keep it prime test grounds.
Well, it makes sense that they'd fire at the same time. There's clearly
some underlying IO path that causes the return with interrupts still on,
perhaps an error path, who knows. I will let others poke, or find some
time to dig, perhaps next week.
Jon.
09-12-2010, 03:46 PM
Jon Masters
PROBLEM alert - Host fas03 is DOWN
On Sat, 2010-09-11 at 11:40 -0500, Mike McGrath wrote:
> On Sat, 11 Sep 2010, Jon Masters wrote:
>
> > On Sat, 2010-09-11 at 02:51 -0400, Jon Masters wrote:
> > > On Fri, 2010-09-10 at 19:24 -0600, Stephen John Smoogen wrote:
> > >
> > > > Sep 11 01:10:23 fas03 kernel: WARNING: at block/blk-core.c:338
> > >
> > > > Sep 11 01:10:23 fas03 kernel: [<c044fc97>] ? warn_slowpath_common+0x77/0xb0
> > > > Sep 11 01:10:23 fas03 kernel: [<c05ca5dc>] ? blk_start_queue+0x6c/0x70
> > > > Sep 11 01:10:23 fas03 kernel: [<c044fce3>] ? warn_slowpath_null+0x13/0x20
> > > > Sep 11 01:10:23 fas03 kernel: [<c05ca5dc>] ? blk_start_queue+0x6c/0x70
> > > > Sep 11 01:10:23 fas03 kernel: [<ed63896b>] ?
> > > > kick_pending_request_queues+0x1b/0x30 [xen_blkfront]
> > > > Sep 11 01:10:23 fas03 kernel: [<ed638b80>] ?
> > > > blkif_interrupt+0x200/0x220 [xen_blkfront]
> > > > Sep 11 01:10:23 fas03 kernel: [<c04ad7c5>] ? handle_IRQ_event+0x45/0x140
> > >
> > > The code in block/blk-core:338 contains an explicit check to ensure that
> > > interrupts have been disabled, but this not true since blkif_interrupt
> > > is not registered with IRQF_DISABLED set at the time of the setup in
> > > bind_evtchn_to_irqhandler. Thus it might be that interrupts are still on
> > > when we get to kick_pending_request_queues. Does this always happen?
> > >
> > > This perhaps happened because upstream removed IRQF_DISABLED and now
> > > runs with interrupts disabled in handle_IRQ_event, so Xen won't see
> > > this. But on 2.6.32 this change had not yet happened. It's also 2:50am
> > > and I might be reading this wrong, but I at least suggest you open a
> > > RHEL6 bug and try a more recent kernel build.
> >
> > Ah, of course I shouldn't email before bed. There's an obvious giant
> > spin_lock_irqsave/restore there, but as noted on xen-devel (when they
> > were mulling over moving all of the blkif_interrupt bits into a tasklet
> > jut a couple of weeks ago): "It looks like __blk_end_request_all...is
> > returning with interrupts enabled sometimes". I pinged some folks.
> >
>
> Just so everyone else knows, I've set kernel.panic to 10 on these hosts so
> at least they'll reboot when they panic. Hopefully we can avoid a few
> wake-and-reboot issues like we had last night :-/
Mike, is there any chance you could boot the -debug kernel on one of
these affected systems? Also, can you let us know about the host?
Jon.
09-12-2010, 04:12 PM
Stephen John Smoogen
PROBLEM alert - Host fas03 is DOWN
On Sun, Sep 12, 2010 at 09:46, Jon Masters <jonathan@jonmasters.org> wrote:
> On Sat, 2010-09-11 at 11:40 -0500, Mike McGrath wrote:
>> On Sat, 11 Sep 2010, Jon Masters wrote:
>>
>> > On Sat, 2010-09-11 at 02:51 -0400, Jon Masters wrote:
>> > > On Fri, 2010-09-10 at 19:24 -0600, Stephen John Smoogen wrote:
>> > >
>> > > > Sep 11 01:10:23 fas03 kernel: WARNING: at block/blk-core.c:338
>> > >
>> > > > Sep 11 01:10:23 fas03 kernel: [<c044fc97>] ? warn_slowpath_common+0x77/0xb0
>> > > > Sep 11 01:10:23 fas03 kernel: [<c05ca5dc>] ? blk_start_queue+0x6c/0x70
>> > > > Sep 11 01:10:23 fas03 kernel: [<c044fce3>] ? warn_slowpath_null+0x13/0x20
>> > > > Sep 11 01:10:23 fas03 kernel: [<c05ca5dc>] ? blk_start_queue+0x6c/0x70
>> > > > Sep 11 01:10:23 fas03 kernel: [<ed63896b>] ?
>> > > > kick_pending_request_queues+0x1b/0x30 [xen_blkfront]
>> > > > Sep 11 01:10:23 fas03 kernel: [<ed638b80>] ?
>> > > > blkif_interrupt+0x200/0x220 [xen_blkfront]
>> > > > Sep 11 01:10:23 fas03 kernel: [<c04ad7c5>] ? handle_IRQ_event+0x45/0x140
>> > >
>> > > The code in block/blk-core:338 contains an explicit check to ensure that
>> > > interrupts have been disabled, but this not true since blkif_interrupt
>> > > is not registered with IRQF_DISABLED set at the time of the setup in
>> > > bind_evtchn_to_irqhandler. Thus it might be that interrupts are still on
>> > > when we get to kick_pending_request_queues. Does this always happen?
>> > >
>> > > This perhaps happened because upstream removed IRQF_DISABLED and now
>> > > runs with interrupts disabled in handle_IRQ_event, so Xen won't see
>> > > this. But on 2.6.32 this change had not yet happened. It's also 2:50am
>> > > and I might be reading this wrong, but I at least suggest you open a
>> > > RHEL6 bug and try a more recent kernel build.
>> >
>> > Ah, of course I shouldn't email before bed. There's an obvious giant
>> > spin_lock_irqsave/restore there, but as noted on xen-devel (when they
>> > were mulling over moving all of the blkif_interrupt bits into a tasklet
>> > jut a couple of weeks ago): "It looks like __blk_end_request_all...is
>> > returning with interrupts enabled sometimes". I pinged some folks.
>> >
>>
>> Just so everyone else knows, I've set kernel.panic to 10 on these hosts so
>> at least they'll reboot when they panic. Hopefully we can avoid a few
>> wake-and-reboot issues like we had last night :-/
>
> Mike, is there any chance you could boot the -debug kernel on one of
> these affected systems? Also, can you let us know about the host?
>
kernel.panic set to 10 did not reboot the systems. What and where is a
debug kernel?
--
Stephen J Smoogen.
“The core skill of innovators is error recovery, not failure avoidance.”
Randy Nelson, President of Pixar University.
"We have a strategic plan. It's called doing things."
— Herb Kelleher, founder Southwest Airlines