Today I was trying to boot a new 3.1 RC kernel on one of our server
machines and ran into a weird block-related crash.
Whenever I was trying to access some filesystem on the external RAID, I
would soon run into the crash seen below, followed by a total lockup a
bit later.
The external RAID is accessed via multipath and simple logical volumes
(both device-mapper) on top. The local hard disks work just fine (LVM on
top of an Areca hardware RAID).
The bug seems to be introduced between rc2 and rc3. Up to rc2 everything
is fine, and rc3 and also the latest kernel shows this bug. From the git
log I can see that the block/ directory has seen some larger update, but
this is just a wild guess.
The crash has been seen with ext3 and GFS2, apparently when processing
write requests (writing files or on umount). All kinds of crashes are
identical up to generic_make_request.
(oh, and this machine runs as Dom0 under XEN, but I can reproduce the
issue when booting natively too, so that's unrelated)
I think you guys should know - maybe you have an idea or have something
for me to test. I would have tried to file a bug somewhere, but half of
the infrastructure is down at the moment.
(half of the stuff ended up on the console and no in the logfile, dunno why)
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
09-30-2011, 06:02 PM
Christophe Saout
Block regression since 3.1-rc3
Hello again,
> Today I was trying to boot a new 3.1 RC kernel on one of our server
> machines and ran into a weird block-related crash.
>
> Whenever I was trying to access some filesystem on the external RAID, I
> would soon run into the crash seen below, followed by a total lockup a
> bit later.
>
> The external RAID is accessed via multipath and simple logical volumes
> on top. [...]
>
> The bug seems to be introduced between rc2 and rc3. Up to rc2 everything
> is fine, and rc3 and also the latest kernel shows this bug. From the git
> log I can see that the block/ directory has seen some larger update, but
> this is just a wild guess.
Actually, I just found out that reverting
commit 4853abaae7e4a2af938115ce9071ef8684fb7af4
Author: Jeff Moyer <jmoyer@redhat.com>
Date: Mon Aug 15 21:37:25 2011 +0200
block: fix flush machinery for stacking drivers with differring flush flags
makes the problem disappear.
Let me know if I can help with further information.
Cheers,
Christophe
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
10-03-2011, 02:43 PM
Jeff Moyer
Block regression since 3.1-rc3
Christophe Saout <christophe@saout.de> writes:
> I think you guys should know - maybe you have an idea or have something
> for me to test. I would have tried to file a bug somewhere, but half of
> the infrastructure is down at the moment.
You didn't include the top few lines which tell us what went wrong. I
tried to reproduce this myself, but I ran into other issues with GFS2.
Anyway, it would help a great deal if you could retrigger the failure
and provide the full failure output. You can get that by issuing the
'dmesg' command and redirecting it to a file. And please CC me on any
future correspondences on this issue.
Thanks!
Jeff
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
10-04-2011, 12:02 PM
Christophe Saout
Block regression since 3.1-rc3
Hi Jeff,
> Anyway, it would help a great deal if you could retrigger the failure
> and provide the full failure output. You can get that by issuing the
> 'dmesg' command and redirecting it to a file.
Oh, sorry, yes, there's a line missing.
Line 323 is this one: BUG_ON(!rq->bio || rq->bio != rq->biotail);
The bug is not specific to GFS2, I also get it when using an ext3 mount
on that multipath storage pool when writing to a file.
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
10-04-2011, 01:32 PM
Jeff Moyer
Block regression since 3.1-rc3
Christophe Saout <christophe@saout.de> writes:
> Hi Jeff,
>
>> Anyway, it would help a great deal if you could retrigger the failure
>> and provide the full failure output. You can get that by issuing the
>> 'dmesg' command and redirecting it to a file.
>
> Oh, sorry, yes, there's a line missing.
>
> Line 323 is this one: BUG_ON(!rq->bio || rq->bio != rq->biotail);
>
> The bug is not specific to GFS2, I also get it when using an ext3 mount
> on that multipath storage pool when writing to a file.
Great, thanks! I tried ext3 and ext4 as well, and was unable to trigger
this problem. I setup a linear volume on top of a multipath device. Is
that the configuration you are using? Anyway, this may be enough to get
me on the right path...
Cheers,
Jeff
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
10-04-2011, 03:06 PM
Christophe Saout
Block regression since 3.1-rc3
Hi Jeff,
> > Line 323 is this one: BUG_ON(!rq->bio || rq->bio != rq->biotail);
> >
> > The bug is not specific to GFS2, I also get it when using an ext3 mount
> > on that multipath storage pool when writing to a file.
>
> Great, thanks! I tried ext3 and ext4 as well, and was unable to trigger
> this problem. I setup a linear volume on top of a multipath device. Is
> that the configuration you are using? Anyway, this may be enough to get
> me on the right path...
Yes - the device-mapper tables looks like:
vserv81:/root # dmsetup table RAID_A-test.root
0 20971520 linear 253:6 384
(this is one the FS which causes the crash when mounting via ext3)
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
10-08-2011, 11:02 AM
Shaohua Li
Block regression since 3.1-rc3
2011/10/7 Jeff Moyer <jmoyer@redhat.com>:
> Christophe Saout <christophe@saout.de> writes:
>
>> Hi Jeff,
>>
>>> Anyway, it would help a great deal if you could retrigger the failure
>>> and provide the full failure output. *You can get that by issuing the
>>> 'dmesg' command and redirecting it to a file.
>>
>> Oh, sorry, yes, there's a line missing.
>>
>> Line 323 is this one: BUG_ON(!rq->bio || rq->bio != rq->biotail);
>
> OK, it turns out my testing was incomplete. *I only tested targets that
> had a write-through cache, so I didn't hit this problem. *It reproduces
> pretty easily with just multipath involved (no linear target on top) when
> running against the right storage.
>
> So, here's a patch, but I don't have a full explanation for it just yet.
> What I observed was that, on fsync, blkdev_issue_flush was called.
> Eventually, the flush request gets cloned, and blk_insert_cloned_request
> is called. *This cloned request never actually gets issued to the
> q->requst_fn (scsi_request_fn in my case). *So, it may be that there is
> no plug list for this, so the queue isn't goosed? *I'll try to come up
> with a better explanation, or Tejun may just know off the top of his
> head what's going on.
>
> So, the patch works for me, but is very much just an RFC.
>
> Cheers,
> Jeff
>
> Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
>
> diff --git a/block/blk-flush.c b/block/blk-flush.c
> index 491eb30..7aa4736 100644
> --- a/block/blk-flush.c
> +++ b/block/blk-flush.c
> @@ -320,7 +320,7 @@ void blk_insert_flush(struct request *rq)
> * * * * * * * *return;
> * * * *}
>
> - * * * BUG_ON(!rq->bio || rq->bio != rq->biotail);
> + * * * BUG_ON(rq->bio && rq->bio != rq->biotail);
>
> * * * */*
> * * * * * If there's data but flush is not necessary, the request can be
> @@ -345,6 +345,12 @@ void blk_insert_flush(struct request *rq)
> * * * *rq->end_io = flush_data_end_io;
>
> * * * *blk_flush_complete_seq(rq, REQ_FSEQ_ACTIONS & ~policy, 0);
> +
> + * * * /*
> + * * * ** A cloned empty flush needs a queue kick to make progress.
> + * * * **/
> + * * * if (!rq->bio)
> + * * * * * * * blk_run_queue_async(q);
> *}
Looks the dm request based flush logic is broken.
saved_make_request_fn
__make_request
blk_insert_flush
but blk_insert_flush doesn't put the original request to list, instead, the
q->flush_rq is in list.
then
dm_request_fn
blk_peek_request
dm_prep_fn
clone_rq
map_request
blk_insert_cloned_request
so q->flush_rq is cloned, and get dispatched. but we can't clone q->flush_rq
and use it to do flush. map_request even could assign a different blockdev to
the cloned request.
Clone q->flush_rq is absolutely wrong. we can clear ogirinal request's flush bit
and set cloned request flush bit, so we can skip the blk_insert_flush logic for
original request.
can you please try this patch. Only compilation tested.
Thanks,
Shaohua
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
10-08-2011, 12:05 PM
Christophe Saout
Block regression since 3.1-rc3
Hi Shaohua,
> >> Line 323 is this one: BUG_ON(!rq->bio || rq->bio != rq->biotail);
> >
> > OK, it turns out my testing was incomplete. I only tested targets that
> > had a write-through cache, so I didn't hit this problem. It reproduces
> > pretty easily with just multipath involved (no linear target on top) when
> > running against the right storage.
> >
> > So, here's a patch, but I don't have a full explanation for it just yet.
> > What I observed was that, on fsync, blkdev_issue_flush was called.
> > Eventually, the flush request gets cloned, and blk_insert_cloned_request
> > is called. This cloned request never actually gets issued to the
> > q->requst_fn (scsi_request_fn in my case). So, it may be that there is
> > no plug list for this, so the queue isn't goosed? I'll try to come up
> > with a better explanation, or Tejun may just know off the top of his
> > head what's going on.
> >
> > So, the patch works for me, but is very much just an RFC.
> [...]
> Looks the dm request based flush logic is broken.
>
> saved_make_request_fn
> __make_request
> blk_insert_flush
> but blk_insert_flush doesn't put the original request to list, instead, the
> q->flush_rq is in list.
> then
> dm_request_fn
> blk_peek_request
> dm_prep_fn
> clone_rq
> map_request
> blk_insert_cloned_request
> so q->flush_rq is cloned, and get dispatched. but we can't clone q->flush_rq
> and use it to do flush. map_request even could assign a different blockdev to
> the cloned request.
>
> Clone q->flush_rq is absolutely wrong. we can clear ogirinal request's flush bit
> and set cloned request flush bit, so we can skip the blk_insert_flush logic for
> original request.
>
> can you please try this patch. Only compilation tested.
Now I get a different BUG when doing what triggered the original BUG
(e.g. unmounting the filesystem):
(followed by an oops and the machine is dead in an instant)
Christophe
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
10-08-2011, 04:14 PM
Mike Snitzer
Block regression since 3.1-rc3
On Sat, Oct 08 2011 at 7:02am -0400,
Shaohua Li <shli@kernel.org> wrote:
> Looks the dm request based flush logic is broken.
>
> saved_make_request_fn
> __make_request
> blk_insert_flush
> but blk_insert_flush doesn't put the original request to list, instead, the
> q->flush_rq is in list.
> then
> dm_request_fn
> blk_peek_request
> dm_prep_fn
> clone_rq
> map_request
> blk_insert_cloned_request
> so q->flush_rq is cloned, and get dispatched. but we can't clone q->flush_rq
> and use it to do flush. map_request even could assign a different blockdev to
> the cloned request.
You haven't explained why cloning q->flush_rq is broken. What is the
problem with map_request changing the blockdev? For the purposes of
request-based DM the flush machinery has already managed the processing
of the flush at the higher level request_queue.
By the time request-based DM is cloning a flush request it really has no
need to reenter the flush machinery (even though Tejun wants it to --
but in practice it doesn't buy us anything because we never stack
request-based DM at the moment. Instead it showcases how brittle this
path is).
> Clone q->flush_rq is absolutely wrong.
I'm still missing the _why_.
Taking a step back:
Unless others have an immediate ah-ha moment, I'd suggest we revert
commit 4853abaae7e4a2a (block: fix flush machinery for stacking drivers
with differring flush flags). Whereby avoiding unnecessarily reentering
the flush machinery.
If commit ed8b752bccf256 (dm table: set flush capability based on
underlying devices) is in place the flush gets fed directly to
scsi_request_fn, which is fine because the request-based DM's
request_queue's flush_flags reflect the flush capabilities of the
underlying device(s).
We are then covered relative to the only request-based DM use-case
people care about (e.g. dm-multipath, which doesn't use stacked
request-based DM).
We can revisit upholding the purity of the flush machinery for stacked
devices in >= 3.2.
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
10-09-2011, 04:26 AM
Shaohua Li
Block regression since 3.1-rc3
2011/10/8 Christophe Saout <christophe@saout.de>:
> Hi Shaohua,
>
>> >> Line 323 is this one: BUG_ON(!rq->bio || rq->bio != rq->biotail);
>> >
>> > OK, it turns out my testing was incomplete. *I only tested targets that
>> > had a write-through cache, so I didn't hit this problem. *It reproduces
>> > pretty easily with just multipath involved (no linear target on top) when
>> > running against the right storage.
>> >
>> > So, here's a patch, but I don't have a full explanation for it just yet.
>> > What I observed was that, on fsync, blkdev_issue_flush was called.
>> > Eventually, the flush request gets cloned, and blk_insert_cloned_request
>> > is called. *This cloned request never actually gets issued to the
>> > q->requst_fn (scsi_request_fn in my case). *So, it may be that there is
>> > no plug list for this, so the queue isn't goosed? *I'll try to come up
>> > with a better explanation, or Tejun may just know off the top of his
>> > head what's going on.
>> >
>> > So, the patch works for me, but is very much just an RFC.
>> [...]
>> Looks the dm request based flush logic is broken.
>>
>> saved_make_request_fn
>> * __make_request
>> * * blk_insert_flush
>> but blk_insert_flush doesn't put the original request to list, instead, the
>> q->flush_rq is in list.
>> then
>> dm_request_fn
>> * blk_peek_request
>> * * dm_prep_fn
>> * * * clone_rq
>> * map_request
>> * * blk_insert_cloned_request
>> so q->flush_rq is cloned, and get dispatched. but we can't clone q->flush_rq
>> and use it to do flush. map_request even could assign a different blockdev to
>> the cloned request.
>>
>> Clone q->flush_rq is absolutely wrong. we can clear ogirinal request's flush bit
>> and set cloned request flush bit, so we can skip the blk_insert_flush logic for
>> original request.
>>
>> can you please try this patch. Only compilation tested.
>
> Now I get a different BUG when doing what triggered the original BUG
> (e.g. unmounting the filesystem):
>
> ------------[ cut here ]------------
> kernel BUG at drivers/scsi/scsi_lib.c:1152!
> invalid opcode: 0000 [#1] PREEMPT SMP
> CPU 7
> Modules linked in: xenfs gfs2 fuse dlm configfs dummy bonding xt_TPROXY nf_tproxy_core xt_socket xt_owner xt_connmark iptable_mangle ipt_MASQUERADE iptable_nat ipt_ULOG ipt_REJECT xt_mark iptable_filter iptable_raw ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state xt_tcpudp xt_physdev xt_hashlimit xt_multiport ip6table_filter ip6_tables ebt_redirect ebt_ip ebt_mark_m ebtable_broute bridge stp llc ebtable_nat ext4 jbd2 dm_round_robin dm_multipath ebtable_filter ebtables x_tables nf_nat_tftp nf_conntrack_tftp nf_nat_sip nf_conntrack_sip nf_nat_irc nf_conntrack_irc nf_nat_h323 nf_conntrack_h323 nf_nat_ftp nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack_ftp nf_conntrack xen_netback xen_blkback xen_gntdev ipmi_si ipmi_devintf ipmi_msghandler loop ohci_hcd ehci_hcd e1000e tpm_tis igb qla2xxx tpm tpm_bios scsi_transport_fc pata_atiixp i2c_piix4
>
> Pid: 5056, comm: kworker/7:2 Not tainted 3.1.0-rc8 #1 Supermicro H8DG6/H8DGi/H8DG6/H8DGi
> RIP: e030:[<ffffffff812f7eae>] *[<ffffffff812f7eae>] scsi_setup_fs_cmnd+0x9e/0xe0
> RSP: e02b:ffff88009b299b70 *EFLAGS: 00010046
> RAX: 0000000000000000 RBX: ffff88009b9cc018 RCX: 0000000000000000
> RDX: ffff88009f193c00 RSI: ffff88009b9cc018 RDI: ffff88009b905800
> RBP: ffff88009b299b80 R08: 0000000000000000 R09: 0000000000000001
> R10: 0000000000000001 R11: ffff8800a0e40700 R12: ffff88009b905800
> R13: 0000000000000001 R14: ffff88009b905800 R15: 0000000000000000
> FS: *00007fc76ab77700(0000) GS:ffff8800d767f000(0000) knlGS:0000000000000000
> CS: *e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00007fc76a1d0beb CR3: 0000000001805000 CR4: 0000000000000660
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process kworker/7:2 (pid: 5056, threadinfo ffff88009b298000, task ffff88009ed68000)
> Stack:
> *ffff88009b9cc018 ffff88009e5f1a60 ffff88009b299bf0 ffffffff8130675f
> *ffff88009b299ba0 ffffffff810ceef0 ffff88009f193c00 ffffffff00000000
> *ffffffff8100ab6d ffff88009b299c60 ffff88009ed68000 ffff88009b9cc018
> Call Trace:
> *[<ffffffff8130675f>] sd_prep_fn+0x14f/0xa70
> *[<ffffffff810ceef0>] ? mempool_alloc_slab+0x10/0x20
> *[<ffffffff8100ab6d>] ? xen_force_evtchn_callback+0xd/0x10
> *[<ffffffff81215db2>] blk_peek_request+0xd2/0x270
> *[<ffffffff812f8aeb>] scsi_request_fn+0x4b/0x3c0
> *[<ffffffff8120fa46>] __blk_run_queue+0x16/0x20
> *[<ffffffff8120ef03>] __elv_add_request+0xb3/0x270
> *[<ffffffff81213494>] add_acct_request+0x34/0x40
> *[<ffffffff8121350b>] blk_insert_cloned_request+0x6b/0x90
> *[<ffffffff8138c049>] dm_dispatch_request+0x39/0x70
> *[<ffffffff8138db9c>] dm_request_fn+0x17c/0x280
> *[<ffffffff8100b322>] ? check_events+0x12/0x20
> *[<ffffffff81222db0>] ? cfq_init_queue+0x430/0x430
> *[<ffffffff8120fa46>] __blk_run_queue+0x16/0x20
> *[<ffffffff81222de4>] cfq_kick_queue+0x34/0x50
> *[<ffffffff8106e039>] process_one_work+0x129/0x4a0
> *[<ffffffff81070401>] worker_thread+0x161/0x340
> *[<ffffffff810702a0>] ? manage_workers.clone.22+0x230/0x230
> *[<ffffffff81074d36>] kthread+0x96/0xa0
> *[<ffffffff814cb7b4>] kernel_thread_helper+0x4/0x10
> *[<ffffffff814c9873>] ? int_ret_from_sys_call+0x7/0x1b
> *[<ffffffff814c2dc0>] ? retint_restore_args+0x5/0x6
> *[<ffffffff814cb7b0>] ? gs_change+0x13/0x13
> Code: ff 5b 41 5c c9 c3 0f 1f 80 00 00 00 00 4c 89 e7 be 20 00 00 00 e8 a3 a2 ff ff 48 85 c0 48 89 c7 74 36 48 89 83 e0 00 00 00 eb 9b <0f> 0b 48 8b 00 48 85 c0 0f 84 77 ff ff ff 48 8b 40 50 48 85 c0
> RIP *[<ffffffff812f7eae>] scsi_setup_fs_cmnd+0x9e/0xe0
> *RSP <ffff88009b299b70>
> ---[ end trace 35f29bc1fdb897a6 ]---
> note: kworker/7:2[5056] exited with preempt_count 1
>
> (followed by an oops and the machine is dead in an instant)
>
there is a typo there, please check the new one.
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel