03-09-2011, 11:58 PM
Mike Snitzer

block: remove per-queue plugging

On Tue, Mar 08 2011 at 5:05pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Tue, Mar 08 2011 at 3:27pm -0500,
> Jens Axboe <jaxboe@fusionio.com> wrote:
>
> > On 2011-03-08 21:21, Mike Snitzer wrote:
> > > On Tue, Mar 08 2011 at 7:16am -0500,
> > > Jens Axboe <jaxboe@fusionio.com> wrote:
> > >
> > >> On 2011-03-03 23:13, Mike Snitzer wrote:
> > >>> I'm now hitting a lockdep issue, while running a 'for-2.6.39/stack-plug'
> > >>> kernel, when I try an fsync heavy workload to a request-based mpath
> > >>> device (the kernel ultimately goes down in flames, I've yet to look at
> > >>> the crashdump I took)
> > >>
> > >> Mike, can you re-run with the current stack-plug branch? I've fixed the
> > >> !CONFIG_BLOCK and rebase issues, and also added a change for this flush
> > >> on schedule event. It's run outside of the runqueue lock now, so
> > >> hopefully that should solve this one.
> > >
> > > Works for me, thanks.
> >
> > Super, thanks! Out of curiosity, did you use dm/md?
>
> Yes, I've been using a request-based DM multipath device.

Hi Jens,

I just got to reviewing your onstack plugging DM changes (I looked at
the core block layer changes for additional context and also had a brief
look at MD).

I need to put more time into reviewing all this code, but one thing that is
immediately apparent is that after these changes DM only has one onstack
plug/unplug -- in drivers/md/dm-kcopyd.c:do_work().

You've removed a considerable amount of implicit plug/explicit unplug
code from DM (and obviously elsewhere, but I have my DM hat on).

First question: is relying on higher-level (aio, fs, read-ahead)
explicit plugging/unplugging sufficient? It seems odd not to have the
control (or the need) to unplug the DM device upon resume (after a suspend).

(this naive question/concern stems from me needing to understand the
core block layer's onstack plugging changes better)

(but if those higher-level explicit onstack plug changes make all this
code removal possible, shouldn't those commits come before changing
underlying block drivers like DM, MD, etc?)

I noticed that drivers/md/dm-raid1.c:do_mirror() seems to follow the same
pattern as drivers/md/dm-kcopyd.c:do_work(), so rather than removing
dm_table_unplug_all() shouldn't it be replaced with a
blk_start_plug/blk_finish_plug pair?
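
(For illustration, the kind of change I have in mind for do_mirror() -- just a
sketch mirroring what do_work() now does; the struct/field names here are from
memory and would need checking against the current tree:)

static void do_mirror(struct work_struct *work)
{
	struct mirror_set *ms = container_of(work, struct mirror_set,
					     kmirrord_work);
	struct blk_plug plug;

	blk_start_plug(&plug);
	/* the existing do_recovery()/do_reads()/do_writes()/do_failures() calls */
	blk_finish_plug(&plug);
}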

Also, in your MD changes, you removed all calls to md_unplug() but
didn't remove md_unplug(). Seems it should be removed along with the
'plug' member of 'struct mddev_t'? Neil?

Thanks,
Mike

 
04-05-2011, 03:05 AM
NeilBrown

block: remove per-queue plugging

On Wed, 9 Mar 2011 19:58:10 -0500 Mike Snitzer <snitzer@redhat.com> wrote:

> Also, in your MD changes, you removed all calls to md_unplug() but
> didn't remove md_unplug(). Seems it should be removed along with the
> 'plug' member of 'struct mddev_t'? Neil?

I've been distracted by other things and only just managed to have a look at
this.

The new plugging code seems to completely ignore the needs of stacked devices
- or at least my needs in md.

For RAID1 with a write-intent-bitmap, I queue all write requests and then on
an unplug I update the write-intent-bitmap to mark all the relevant blocks
and then release the writes.

With the new code there is no way for an unplug event to wake up the raid1d
thread to start the writeout - I haven't tested it but I suspect it will just
hang.

Similarly for RAID5 I gather write bios (long before they become 'struct
request' which is what the plugging code understands) and on an unplug event
I release the writes - hopefully with enough bios per stripe so that we don't
need to pre-read.

Possibly the simplest fix would be to have a second list_head in 'struct
blk_plug' which contained callbacks (a function pointer and a list_head in a
struct which is passed as an arg to the function!).
blk_finish_plug could then walk the list and call the call-backs.
It would be quite easy to hook into that.
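
(Roughly the shape I have in mind -- the struct below plus a hypothetical
md-side hook; md_plug_cb/md_unplug_cb are made-up names, just to show how a
module would use it:)

struct blk_plug_cb {
	struct list_head list;
	void (*callback)(struct blk_plug_cb *);
};

/* md side: embed the cb, add it to current->plug->cb_list when the first
 * request is put aside, and blk_finish_plug then calls back into md */
struct md_plug_cb {
	struct blk_plug_cb cb;
	mddev_t *mddev;
};

static void md_unplug_cb(struct blk_plug_cb *cb)
{
	struct md_plug_cb *mdcb = container_of(cb, struct md_plug_cb, cb);

	/* wake raid1d/raid5d so the put-aside writes get released */
	md_wakeup_thread(mdcb->mddev->thread);
}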


I suspect I also need to add blk_start_plug/blk_finish_plug around the loop
in raid1d/raid5d/raid10d, but that is pretty straightforward.

Am I missing something important?
Is there a better way to get an unplug event to md?

Thanks,
NeilBrown

 
04-11-2011, 04:50 AM
NeilBrown

block: remove per-queue plugging

On Tue, 5 Apr 2011 13:05:41 +1000 NeilBrown <neilb@suse.de> wrote:

> On Wed, 9 Mar 2011 19:58:10 -0500 Mike Snitzer <snitzer@redhat.com> wrote:
>
> > Also, in your MD changes, you removed all calls to md_unplug() but
> > didn't remove md_unplug(). Seems it should be removed along with the
> > 'plug' member of 'struct mddev_t'? Neil?
>
> I've been distracted by other things and only just managed to have a look at
> this.
>
> The new plugging code seems to completely ignore the needs of stacked devices
> - or at least my needs in md.
>
> For RAID1 with a write-intent-bitmap, I queue all write requests and then on
> an unplug I update the write-intent-bitmap to mark all the relevant blocks
> and then release the writes.
>
> With the new code there is no way for an unplug event to wake up the raid1d
> thread to start the writeout - I haven't tested it but I suspect it will just
> hang.
>
> Similarly for RAID5 I gather write bios (long before they become 'struct
> request' which is what the plugging code understands) and on an unplug event
> I release the writes - hopefully with enough bios per stripe so that we don't
> need to pre-read.
>
> Possibly the simplest fix would be to have a second list_head in 'struct
> blk_plug' which contained callbacks (a function pointer a list_head in a
> struct which is passed as an arg to the function!).
> blk_finish_plug could then walk the list and call the call-backs.
> It would be quite easy to hook into that.

I've implemented this and it seems to work.
Jens: could you please review and hopefully ack the patch below, and let
me know if you will submit it or should I?

My testing of this, combined with some other patches which make various md
personalities use it, shows up a bug somewhere.

The symptoms are crashes in various places in blk-core and sometimes in
elevator.c; list_sort appears in the stack trace fairly often, but not always.

This patch

diff --git a/block/blk-core.c b/block/blk-core.c
index 273d60b..903ce8d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2674,19 +2674,23 @@ static void flush_plug_list(struct blk_plug *plug)
 	struct request_queue *q;
 	unsigned long flags;
 	struct request *rq;
+	struct list_head head;
 
 	BUG_ON(plug->magic != PLUG_MAGIC);
 
 	if (list_empty(&plug->list))
 		return;
+	list_add(&head, &plug->list);
+	list_del_init(&plug->list);
 
 	if (plug->should_sort)
-		list_sort(NULL, &plug->list, plug_rq_cmp);
+		list_sort(NULL, &head, plug_rq_cmp);
+	plug->should_sort = 0;
 
 	q = NULL;
 	local_irq_save(flags);
-	while (!list_empty(&plug->list)) {
-		rq = list_entry_rq(plug->list.next);
+	while (!list_empty(&head)) {
+		rq = list_entry_rq(head.next);
 		list_del_init(&rq->queuelist);
 		BUG_ON(!(rq->cmd_flags & REQ_ON_PLUG));
 		BUG_ON(!rq->q);


makes the symptom go away. It simply moves the plug list onto a separate
list head before sorting and processing it.
My test was simply writing to a RAID1 with dd:
while true; do dd if=/dev/zero of=/dev/md0 bs=4k; done

Obviously all writes go to two devices so the plug list will always need
sorting.

The only explanation I can come up with is that, very occasionally, schedule
on two separate CPUs calls blk_flush_plug for the same task. I don't understand
the scheduler nearly well enough to know if or how that can happen.
However with this patch in place I can write to a RAID1 constantly for half
an hour, and without it, the write rarely lasts for 3 minutes.

If you want to reproduce my experiment, you can pull from
git://neil.brown.name/md plug-test
to get my patches for plugging in md (which are not quite ready for submission
but seem to work), create a RAID1 using e.g.
mdadm -C /dev/md0 --level=1 --raid-disks=2 /dev/device1 /dev/device2
and then run:
while true; do dd if=/dev/zero of=/dev/md0 bs=4K ; done


Thanks,
NeilBrown



From 687b189c02276887dd7d5b87a817da9f67ed3c2c Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.de>
Date: Thu, 7 Apr 2011 13:16:59 +1000
Subject: [PATCH] Enhance new plugging support to support general callbacks.

md/raid requires an unplug callback, but as it does not use
requests, the current code cannot provide one.

So allow arbitrary callbacks to be attached to the blk_plug.

Cc: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
---
 block/blk-core.c       |   13 +++++++++++++
 include/linux/blkdev.h |    7 ++++++-
 2 files changed, 19 insertions(+), 1 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 725091d..273d60b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2644,6 +2644,7 @@ void blk_start_plug(struct blk_plug *plug)
 
 	plug->magic = PLUG_MAGIC;
 	INIT_LIST_HEAD(&plug->list);
+	INIT_LIST_HEAD(&plug->cb_list);
 	plug->should_sort = 0;
 
 	/*
@@ -2717,9 +2718,21 @@ static void flush_plug_list(struct blk_plug *plug)
 	local_irq_restore(flags);
 }
 
+static void flush_plug_callbacks(struct blk_plug *plug)
+{
+	while (!list_empty(&plug->cb_list)) {
+		struct blk_plug_cb *cb = list_first_entry(&plug->cb_list,
+							  struct blk_plug_cb,
+							  list);
+		list_del(&cb->list);
+		cb->callback(cb);
+	}
+}
+
 static void __blk_finish_plug(struct task_struct *tsk, struct blk_plug *plug)
 {
 	flush_plug_list(plug);
+	flush_plug_callbacks(plug);
 
 	if (plug == tsk->plug)
 		tsk->plug = NULL;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 32176cc..3e5e604 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -857,8 +857,13 @@ extern void blk_put_queue(struct request_queue *);
 struct blk_plug {
 	unsigned long magic;
 	struct list_head list;
+	struct list_head cb_list;
 	unsigned int should_sort;
 };
+struct blk_plug_cb {
+	struct list_head list;
+	void (*callback)(struct blk_plug_cb *);
+};
 
 extern void blk_start_plug(struct blk_plug *);
 extern void blk_finish_plug(struct blk_plug *);
@@ -876,7 +881,7 @@ static inline bool blk_needs_flush_plug(struct task_struct *tsk)
 {
 	struct blk_plug *plug = tsk->plug;
 
-	return plug && !list_empty(&plug->list);
+	return plug && (!list_empty(&plug->list) || !list_empty(&plug->cb_list));
 }
 
 /*
--
1.7.3.4


 
04-11-2011, 09:19 AM
Jens Axboe

block: remove per-queue plugging

On 2011-04-11 06:50, NeilBrown wrote:
> On Tue, 5 Apr 2011 13:05:41 +1000 NeilBrown <neilb@suse.de> wrote:
>
>> On Wed, 9 Mar 2011 19:58:10 -0500 Mike Snitzer <snitzer@redhat.com> wrote:
>>
>>> Also, in your MD changes, you removed all calls to md_unplug() but
>>> didn't remove md_unplug(). Seems it should be removed along with the
>>> 'plug' member of 'struct mddev_t'? Neil?
>>
>> I've been distracted by other things and only just managed to have a look at
>> this.
>>
>> The new plugging code seems to completely ignore the needs of stacked devices
>> - or at least my needs in md.
>>
>> For RAID1 with a write-intent-bitmap, I queue all write requests and then on
>> an unplug I update the write-intent-bitmap to mark all the relevant blocks
>> and then release the writes.
>>
>> With the new code there is no way for an unplug event to wake up the raid1d
>> thread to start the writeout - I haven't tested it but I suspect it will just
>> hang.
>>
>> Similarly for RAID5 I gather write bios (long before they become 'struct
>> request' which is what the plugging code understands) and on an unplug event
>> I release the writes - hopefully with enough bios per stripe so that we don't
>> need to pre-read.
>>
>> Possibly the simplest fix would be to have a second list_head in 'struct
>> blk_plug' which contained callbacks (a function pointer a list_head in a
>> struct which is passed as an arg to the function!).
>> blk_finish_plug could then walk the list and call the call-backs.
>> It would be quite easy to hook into that.
>
> I've implemented this and it seems to work.
> Jens: could you please review and hopefully ack the patch below, and let
> me know if you will submit it or should I?
>
> My testing of this combined with some other patches which cause various md
> personalities to use it shows up a bug somewhere.
>
> The symptoms are crashes in various places in blk-core and sometimes
> elevator.c
> list_sort occurs fairly often included in the stack but not always.
>
> This patch
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 273d60b..903ce8d 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2674,19 +2674,23 @@ static void flush_plug_list(struct blk_plug *plug)
> struct request_queue *q;
> unsigned long flags;
> struct request *rq;
> + struct list_head head;
>
> BUG_ON(plug->magic != PLUG_MAGIC);
>
> if (list_empty(&plug->list))
> return;
> + list_add(&head, &plug->list);
> + list_del_init(&plug->list);
>
> if (plug->should_sort)
> - list_sort(NULL, &plug->list, plug_rq_cmp);
> + list_sort(NULL, &head, plug_rq_cmp);
> + plug->should_sort = 0;
>
> q = NULL;
> local_irq_save(flags);
> - while (!list_empty(&plug->list)) {
> - rq = list_entry_rq(plug->list.next);
> + while (!list_empty(&head)) {
> + rq = list_entry_rq(head.next);
> list_del_init(&rq->queuelist);
> BUG_ON(!(rq->cmd_flags & REQ_ON_PLUG));
> BUG_ON(!rq->q);
>
>
> makes the symptom go away. It simply moves the plug list onto a separate
> list head before sorting and processing it.
> My test was simply writing to a RAID1 with dd:
> while true; do dd if=/dev/zero of=/dev/md0 size=4k; done
>
> Obviously all writes go to two devices so the plug list will always need
> sorting.
>
> The only explanation I can come up with is that very occasionally schedule on
> 2 separate cpus calls blk_flush_plug for the same task. I don't understand
> the scheduler nearly well enough to know if or how that can happen.
> However with this patch in place I can write to a RAID1 constantly for half
> an hour, and without it, the write rarely lasts for 3 minutes.

Or perhaps if the request_fn blocks, that would be problematic. So the
patch is likely a good idea even for that case.

I'll merge it, changing it to list_splice_init() as I think that would
be more clear.
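
(i.e. something along these lines in flush_plug_list() -- just a sketch of the
merged version, not the final commit:)

	LIST_HEAD(head);

	BUG_ON(plug->magic != PLUG_MAGIC);

	if (list_empty(&plug->list))
		return;

	/* move the plugged requests onto a local list, so nothing is left
	 * on plug->list while we sort and dispatch them */
	list_splice_init(&plug->list, &head);

	if (plug->should_sort)
		list_sort(NULL, &head, plug_rq_cmp);
	plug->should_sort = 0;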

> From 687b189c02276887dd7d5b87a817da9f67ed3c2c Mon Sep 17 00:00:00 2001
> From: NeilBrown <neilb@suse.de>
> Date: Thu, 7 Apr 2011 13:16:59 +1000
> Subject: [PATCH] Enhance new plugging support to support general callbacks.
>
> md/raid requires an unplug callback, but as it does not uses
> requests the current code cannot provide one.
>
> So allow arbitrary callbacks to be attached to the blk_plug.
>
> Cc: Jens Axboe <jaxboe@fusionio.com>
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
> block/blk-core.c | 13 +++++++++++++
> include/linux/blkdev.h | 7 ++++++-
> 2 files changed, 19 insertions(+), 1 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 725091d..273d60b 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2644,6 +2644,7 @@ void blk_start_plug(struct blk_plug *plug)
>
> plug->magic = PLUG_MAGIC;
> INIT_LIST_HEAD(&plug->list);
> + INIT_LIST_HEAD(&plug->cb_list);
> plug->should_sort = 0;
>
> /*
> @@ -2717,9 +2718,21 @@ static void flush_plug_list(struct blk_plug *plug)
> local_irq_restore(flags);
> }
>
> +static void flush_plug_callbacks(struct blk_plug *plug)
> +{
> + while (!list_empty(&plug->cb_list)) {
> + struct blk_plug_cb *cb = list_first_entry(&plug->cb_list,
> + struct blk_plug_cb,
> + list);
> + list_del(&cb->list);
> + cb->callback(cb);
> + }
> +}
> +
> static void __blk_finish_plug(struct task_struct *tsk, struct blk_plug *plug)
> {
> flush_plug_list(plug);
> + flush_plug_callbacks(plug);
>
> if (plug == tsk->plug)
> tsk->plug = NULL;
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 32176cc..3e5e604 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -857,8 +857,13 @@ extern void blk_put_queue(struct request_queue *);
> struct blk_plug {
> unsigned long magic;
> struct list_head list;
> + struct list_head cb_list;
> unsigned int should_sort;
> };
> +struct blk_plug_cb {
> + struct list_head list;
> + void (*callback)(struct blk_plug_cb *);
> +};
>
> extern void blk_start_plug(struct blk_plug *);
> extern void blk_finish_plug(struct blk_plug *);
> @@ -876,7 +881,7 @@ static inline bool blk_needs_flush_plug(struct task_struct *tsk)
> {
> struct blk_plug *plug = tsk->plug;
>
> - return plug && !list_empty(&plug->list);
> + return plug && (!list_empty(&plug->list) || !list_empty(&plug->cb_list));
> }
>
> /*

Maybe I'm missing something, but why do you need those callbacks? If
it's to use plugging yourself, perhaps we can just ensure that those
don't get assigned in the task - so it would have to be used with care.

It's not that I disagree with these callbacks, I just want to ensure I
understand why you need them.

--
Jens Axboe

 
04-11-2011, 10:59 AM
NeilBrown

block: remove per-queue plugging

On Mon, 11 Apr 2011 11:19:58 +0200 Jens Axboe <jaxboe@fusionio.com> wrote:

> On 2011-04-11 06:50, NeilBrown wrote:

> > The only explanation I can come up with is that very occasionally schedule on
> > 2 separate cpus calls blk_flush_plug for the same task. I don't understand
> > the scheduler nearly well enough to know if or how that can happen.
> > However with this patch in place I can write to a RAID1 constantly for half
> > an hour, and without it, the write rarely lasts for 3 minutes.
>
> Or perhaps if the request_fn blocks, that would be problematic. So the
> patch is likely a good idea even for that case.
>
> I'll merge it, changing it to list_splice_init() as I think that would
> be more clear.

OK - though I'm not 100% sure the patch fixes the problem - just that it hides
the symptom for me.
I might try instrumenting the code a bit more and see if I can find exactly
where it is re-entering flush_plug_list - as that seems to be what is
happening.

And yeah - list_split_init is probably better. I just never remember exactly
what list_split means and have to look it up every time, whereas
list_add/list_del are very clear to me.


>
> > From 687b189c02276887dd7d5b87a817da9f67ed3c2c Mon Sep 17 00:00:00 2001
> > From: NeilBrown <neilb@suse.de>
> > Date: Thu, 7 Apr 2011 13:16:59 +1000
> > Subject: [PATCH] Enhance new plugging support to support general callbacks.
> >
> > md/raid requires an unplug callback, but as it does not uses
> > requests the current code cannot provide one.
> >
> > So allow arbitrary callbacks to be attached to the blk_plug.
> >
> > Cc: Jens Axboe <jaxboe@fusionio.com>
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> > block/blk-core.c | 13 +++++++++++++
> > include/linux/blkdev.h | 7 ++++++-
> > 2 files changed, 19 insertions(+), 1 deletions(-)
> >
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 725091d..273d60b 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -2644,6 +2644,7 @@ void blk_start_plug(struct blk_plug *plug)
> >
> > plug->magic = PLUG_MAGIC;
> > INIT_LIST_HEAD(&plug->list);
> > + INIT_LIST_HEAD(&plug->cb_list);
> > plug->should_sort = 0;
> >
> > /*
> > @@ -2717,9 +2718,21 @@ static void flush_plug_list(struct blk_plug *plug)
> > local_irq_restore(flags);
> > }
> >
> > +static void flush_plug_callbacks(struct blk_plug *plug)
> > +{
> > + while (!list_empty(&plug->cb_list)) {
> > + struct blk_plug_cb *cb = list_first_entry(&plug->cb_list,
> > + struct blk_plug_cb,
> > + list);
> > + list_del(&cb->list);
> > + cb->callback(cb);
> > + }
> > +}
> > +
> > static void __blk_finish_plug(struct task_struct *tsk, struct blk_plug *plug)
> > {
> > flush_plug_list(plug);
> > + flush_plug_callbacks(plug);
> >
> > if (plug == tsk->plug)
> > tsk->plug = NULL;
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index 32176cc..3e5e604 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -857,8 +857,13 @@ extern void blk_put_queue(struct request_queue *);
> > struct blk_plug {
> > unsigned long magic;
> > struct list_head list;
> > + struct list_head cb_list;
> > unsigned int should_sort;
> > };
> > +struct blk_plug_cb {
> > + struct list_head list;
> > + void (*callback)(struct blk_plug_cb *);
> > +};
> >
> > extern void blk_start_plug(struct blk_plug *);
> > extern void blk_finish_plug(struct blk_plug *);
> > @@ -876,7 +881,7 @@ static inline bool blk_needs_flush_plug(struct task_struct *tsk)
> > {
> > struct blk_plug *plug = tsk->plug;
> >
> > - return plug && !list_empty(&plug->list);
> > + return plug && (!list_empty(&plug->list) || !list_empty(&plug->cb_list));
> > }
> >
> > /*
>
> Maybe I'm missing something, but why do you need those callbacks? If
> it's to use plugging yourself, perhaps we can just ensure that those
> don't get assigned in the task - so it would be have to used with care.
>
> It's not that I disagree to these callbacks, I just want to ensure I
> understand why you need them.
>

I'm sure one of us is missing something (probably both) but I'm not sure what.

The callback is central.

It is simply to use plugging in md.
Just like blk-core, md will notice that a blk_plug is active and will put
requests aside. I then need something to call in to md when blk_finish_plug
is called so that put-aside requests can be released.
As md can be built as a module, that call must be a call-back of some sort.
blk-core doesn't need to register blk_plug_flush because that is never in a
module, so it can be called directly. But the md equivalent could be in a
module, so I need to be able to register a call back.

Does that help?

Thanks,
NeilBrown


 
04-11-2011, 11:04 AM
Jens Axboe

block: remove per-queue plugging

On 2011-04-11 12:59, NeilBrown wrote:
> On Mon, 11 Apr 2011 11:19:58 +0200 Jens Axboe <jaxboe@fusionio.com> wrote:
>
>> On 2011-04-11 06:50, NeilBrown wrote:
>
>>> The only explanation I can come up with is that very occasionally schedule on
>>> 2 separate cpus calls blk_flush_plug for the same task. I don't understand
>>> the scheduler nearly well enough to know if or how that can happen.
>>> However with this patch in place I can write to a RAID1 constantly for half
>>> an hour, and without it, the write rarely lasts for 3 minutes.
>>
>> Or perhaps if the request_fn blocks, that would be problematic. So the
>> patch is likely a good idea even for that case.
>>
>> I'll merge it, changing it to list_splice_init() as I think that would
>> be more clear.
>
> OK - though I'm not 100% the patch fixes the problem - just that it hides the
> symptom for me.
> I might try instrumenting the code a bit more and see if I can find exactly
> where it is re-entering flush_plug_list - as that seems to be what is
> happening.

It's definitely a good thing to add, to avoid the list fudging on
schedule. Whether it's your exact problem, I can't tell.

> And yeah - list_split_init is probably better. I just never remember exactly
> what list_split means and have to look it up every time, where as
> list_add/list_del are very clear to me.

splice, no split :-)

>>> From 687b189c02276887dd7d5b87a817da9f67ed3c2c Mon Sep 17 00:00:00 2001
>>> From: NeilBrown <neilb@suse.de>
>>> Date: Thu, 7 Apr 2011 13:16:59 +1000
>>> Subject: [PATCH] Enhance new plugging support to support general callbacks.
>>>
>>> md/raid requires an unplug callback, but as it does not uses
>>> requests the current code cannot provide one.
>>>
>>> So allow arbitrary callbacks to be attached to the blk_plug.
>>>
>>> Cc: Jens Axboe <jaxboe@fusionio.com>
>>> Signed-off-by: NeilBrown <neilb@suse.de>
>>> ---
>>> block/blk-core.c | 13 +++++++++++++
>>> include/linux/blkdev.h | 7 ++++++-
>>> 2 files changed, 19 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index 725091d..273d60b 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -2644,6 +2644,7 @@ void blk_start_plug(struct blk_plug *plug)
>>>
>>> plug->magic = PLUG_MAGIC;
>>> INIT_LIST_HEAD(&plug->list);
>>> + INIT_LIST_HEAD(&plug->cb_list);
>>> plug->should_sort = 0;
>>>
>>> /*
>>> @@ -2717,9 +2718,21 @@ static void flush_plug_list(struct blk_plug *plug)
>>> local_irq_restore(flags);
>>> }
>>>
>>> +static void flush_plug_callbacks(struct blk_plug *plug)
>>> +{
>>> + while (!list_empty(&plug->cb_list)) {
>>> + struct blk_plug_cb *cb = list_first_entry(&plug->cb_list,
>>> + struct blk_plug_cb,
>>> + list);
>>> + list_del(&cb->list);
>>> + cb->callback(cb);
>>> + }
>>> +}
>>> +
>>> static void __blk_finish_plug(struct task_struct *tsk, struct blk_plug *plug)
>>> {
>>> flush_plug_list(plug);
>>> + flush_plug_callbacks(plug);
>>>
>>> if (plug == tsk->plug)
>>> tsk->plug = NULL;
>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>> index 32176cc..3e5e604 100644
>>> --- a/include/linux/blkdev.h
>>> +++ b/include/linux/blkdev.h
>>> @@ -857,8 +857,13 @@ extern void blk_put_queue(struct request_queue *);
>>> struct blk_plug {
>>> unsigned long magic;
>>> struct list_head list;
>>> + struct list_head cb_list;
>>> unsigned int should_sort;
>>> };
>>> +struct blk_plug_cb {
>>> + struct list_head list;
>>> + void (*callback)(struct blk_plug_cb *);
>>> +};
>>>
>>> extern void blk_start_plug(struct blk_plug *);
>>> extern void blk_finish_plug(struct blk_plug *);
>>> @@ -876,7 +881,7 @@ static inline bool blk_needs_flush_plug(struct task_struct *tsk)
>>> {
>>> struct blk_plug *plug = tsk->plug;
>>>
>>> - return plug && !list_empty(&plug->list);
>>> + return plug && (!list_empty(&plug->list) || !list_empty(&plug->cb_list));
>>> }
>>>
>>> /*
>>
>> Maybe I'm missing something, but why do you need those callbacks? If
>> it's to use plugging yourself, perhaps we can just ensure that those
>> don't get assigned in the task - so it would be have to used with care.
>>
>> It's not that I disagree to these callbacks, I just want to ensure I
>> understand why you need them.
>>
>
> I'm sure one of us is missing something (probably both) but I'm not
> sure what.
>
> The callback is central.
>
> It is simply to use plugging in md.
> Just like blk-core, md will notice that a blk_plug is active and will put
> requests aside. I then need something to call in to md when blk_finish_plug

But this is done in __make_request(), so md devices should not be
affected at all. This is the part of your explanation that I do not
connect with the code.

If md itself is putting things on the plug list, why is it doing that?

> is called so that put-aside requests can be released.
> As md can be built as a module, that call must be a call-back of some sort.
> blk-core doesn't need to register blk_plug_flush because that is never in a
> module, so it can be called directly. But the md equivalent could be in a
> module, so I need to be able to register a call back.
>
> Does that help?

Not really. Is the problem that _you_ would like to stash things aside,
not the fact that __make_request() puts things on a task plug list?

--
Jens Axboe

 
04-11-2011, 11:26 AM
NeilBrown

block: remove per-queue plugging

On Mon, 11 Apr 2011 13:04:26 +0200 Jens Axboe <jaxboe@fusionio.com> wrote:

> >
> > I'm sure one of us is missing something (probably both) but I'm not
> > sure what.
> >
> > The callback is central.
> >
> > It is simply to use plugging in md.
> > Just like blk-core, md will notice that a blk_plug is active and will put
> > requests aside. I then need something to call in to md when blk_finish_plug
>
> But this is done in __make_request(), so md devices should not be
> affected at all. This is the part of your explanation that I do not
> connect with the code.
>
> If md itself is putting things on the plug list, why is it doing that?

Yes. Exactly. md itself wants to put things aside on some list.
e.g. in RAID1 when using a write-intent bitmap I want to gather as many write
requests as possible so I can update the bits for all of them at once.
So when a plug is in effect I just queue the bios somewhere and record the
bits that need to be set.
Then when the unplug happens I write out the bitmap updates in a single write
and when that completes, I write out the data (to all devices).

Also in RAID5 it is good if I can wait for lots of write requests to arrive
before committing any of them, to increase the possibility of getting a
full-stripe write.
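
(In pseudo-code, the flow I want under a plug is roughly this -- the bitmap
helpers are made-up names for illustration, only the bio_list and
generic_make_request calls are real:)

	/* while a plug is active: stash the write, remember its bitmap bits */
	if (current->plug) {
		bitmap_record_pending(mddev->bitmap, bio);
		bio_list_add(&conf->pending_bio_list, bio);
		return;
	}

	/* in the unplug callback: one bitmap write, then release the data */
	bitmap_write_pending(mddev->bitmap);
	while ((bio = bio_list_pop(&conf->pending_bio_list)))
		generic_make_request(bio);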

Previously I used ->unplug_fn to release the queued requests. Now that has
gone I need a different way to register a callback when an unplug happens.

>
> > is called so that put-aside requests can be released.
> > As md can be built as a module, that call must be a call-back of some sort.
> > blk-core doesn't need to register blk_plug_flush because that is never in a
> > module, so it can be called directly. But the md equivalent could be in a
> > module, so I need to be able to register a call back.
> >
> > Does that help?
>
> Not really. Is the problem that _you_ would like to stash things aside,
> not the fact that __make_request() puts things on a task plug list?
>

Yes, exactly. I (in md) want to stash things aside.

(I don't actually put the stashed things on the blk_plug, though it might
make sense to do that later in some cases - I'm not sure. Currently I stash
things in my own internal lists and just need a call back to say "ok, flush
those lists now").

Thanks,
NeilBrown

 
04-11-2011, 11:37 AM
Jens Axboe

block: remove per-queue plugging

On 2011-04-11 13:26, NeilBrown wrote:
> On Mon, 11 Apr 2011 13:04:26 +0200 Jens Axboe <jaxboe@fusionio.com> wrote:
>
>>>
>>> I'm sure one of us is missing something (probably both) but I'm not
>>> sure what.
>>>
>>> The callback is central.
>>>
>>> It is simply to use plugging in md.
>>> Just like blk-core, md will notice that a blk_plug is active and will put
>>> requests aside. I then need something to call in to md when blk_finish_plug
>>
>> But this is done in __make_request(), so md devices should not be
>> affected at all. This is the part of your explanation that I do not
>> connect with the code.
>>
>> If md itself is putting things on the plug list, why is it doing that?
>
> Yes. Exactly. md itself want to put things aside on some list.
> e.g. in RAID1 when using a write-intent bitmap I want to gather as many write
> requests as possible so I can update the bits for all of them at once.
> So when a plug is in effect I just queue the bios somewhere and record the
> bits that need to be set.
> Then when the unplug happens I write out the bitmap updates in a single write
> and when that completes, I write out the data (to all devices).
>
> Also in RAID5 it is good if I can wait for lots of write request to arrive
> before committing any of them to increase the possibility of getting a
> full-stripe write.
>
> Previously I used ->unplug_fn to release the queued requests. Now that has
> gone I need a different way to register a callback when an unplug happens.

Ah, so this is what I was hinting at. But why use the task->plug for
that? Seems a bit counterintuitive. Why can't you just store these
internally?

>
>>
>>> is called so that put-aside requests can be released.
>>> As md can be built as a module, that call must be a call-back of some sort.
>>> blk-core doesn't need to register blk_plug_flush because that is never in a
>>> module, so it can be called directly. But the md equivalent could be in a
>>> module, so I need to be able to register a call back.
>>>
>>> Does that help?
>>
>> Not really. Is the problem that _you_ would like to stash things aside,
>> not the fact that __make_request() puts things on a task plug list?
>>
>
> Yes, exactly. I (in md) want to stash things aside.
>
> (I don't actually put the stashed things on the blk_plug, though it might
> make sense to do that later in some cases - I'm not sure. Currently I stash
> things in my own internal lists and just need a call back to say "ok, flush
> those lists now").

So we are making some progress... The thing I then don't understand is
why you want to make it associated with the plug? Seems you don't have
any scheduling restrictions, in which case just storing them in md
seems like a much better option.

--
Jens Axboe

 
04-11-2011, 11:55 AM
NeilBrown

block: remove per-queue plugging

On Mon, 11 Apr 2011 20:59:28 +1000 NeilBrown <neilb@suse.de> wrote:

> On Mon, 11 Apr 2011 11:19:58 +0200 Jens Axboe <jaxboe@fusionio.com> wrote:
>
> > On 2011-04-11 06:50, NeilBrown wrote:
>
> > > The only explanation I can come up with is that very occasionally schedule on
> > > 2 separate cpus calls blk_flush_plug for the same task. I don't understand
> > > the scheduler nearly well enough to know if or how that can happen.
> > > However with this patch in place I can write to a RAID1 constantly for half
> > > an hour, and without it, the write rarely lasts for 3 minutes.
> >
> > Or perhaps if the request_fn blocks, that would be problematic. So the
> > patch is likely a good idea even for that case.
> >
> > I'll merge it, changing it to list_splice_init() as I think that would
> > be more clear.
>
> OK - though I'm not 100% the patch fixes the problem - just that it hides the
> symptom for me.
> I might try instrumenting the code a bit more and see if I can find exactly
> where it is re-entering flush_plug_list - as that seems to be what is
> happening.

OK, I found how it re-enters.

The request_fn doesn't exactly block, but when scsi_request_fn calls
spin_unlock_irq, this calls preempt_enable which can call schedule, which is
a recursive call.

The patch I provided will stop that from recursing again as the blk_plug.list
will be empty.

So it is almost what you suggested; however the request_fn doesn't block, it
just enables preemption.


So the comment I would put at the top of that patch would be something like:


From: NeilBrown <neilb@suse.de>

As request_fn called by __blk_run_queue is allowed to 'schedule()' (after
dropping the queue lock of course), it is possible to get a recursive call:

schedule -> blk_flush_plug -> __blk_finish_plug -> flush_plug_list
-> __blk_run_queue -> request_fn -> schedule

We must make sure that the second schedule does not call into blk_flush_plug
again. So instead of leaving the list of requests on blk_plug->list, move
them to a separate list leaving blk_plug->list empty.

Signed-off-by: NeilBrown <neilb@suse.de>

Thanks,
NeilBrown

 
04-11-2011, 12:05 PM
NeilBrown

block: remove per-queue plugging

On Mon, 11 Apr 2011 13:37:20 +0200 Jens Axboe <jaxboe@fusionio.com> wrote:

> On 2011-04-11 13:26, NeilBrown wrote:
> > On Mon, 11 Apr 2011 13:04:26 +0200 Jens Axboe <jaxboe@fusionio.com> wrote:
> >
> >>>
> >>> I'm sure one of us is missing something (probably both) but I'm not
> >>> sure what.
> >>>
> >>> The callback is central.
> >>>
> >>> It is simply to use plugging in md.
> >>> Just like blk-core, md will notice that a blk_plug is active and will put
> >>> requests aside. I then need something to call in to md when blk_finish_plug
> >>
> >> But this is done in __make_request(), so md devices should not be
> >> affected at all. This is the part of your explanation that I do not
> >> connect with the code.
> >>
> >> If md itself is putting things on the plug list, why is it doing that?
> >
> > Yes. Exactly. md itself want to put things aside on some list.
> > e.g. in RAID1 when using a write-intent bitmap I want to gather as many write
> > requests as possible so I can update the bits for all of them at once.
> > So when a plug is in effect I just queue the bios somewhere and record the
> > bits that need to be set.
> > Then when the unplug happens I write out the bitmap updates in a single write
> > and when that completes, I write out the data (to all devices).
> >
> > Also in RAID5 it is good if I can wait for lots of write request to arrive
> > before committing any of them to increase the possibility of getting a
> > full-stripe write.
> >
> > Previously I used ->unplug_fn to release the queued requests. Now that has
> > gone I need a different way to register a callback when an unplug happens.
>
> Ah, so this is what I was hinting at. But why use the task->plug for
> that? Seems a bit counter intuitive. Why can't you just store these
> internally?
>
> >
> >>
> >>> is called so that put-aside requests can be released.
> >>> As md can be built as a module, that call must be a call-back of some sort.
> >>> blk-core doesn't need to register blk_plug_flush because that is never in a
> >>> module, so it can be called directly. But the md equivalent could be in a
> >>> module, so I need to be able to register a call back.
> >>>
> >>> Does that help?
> >>
> >> Not really. Is the problem that _you_ would like to stash things aside,
> >> not the fact that __make_request() puts things on a task plug list?
> >>
> >
> > Yes, exactly. I (in md) want to stash things aside.
> >
> > (I don't actually put the stashed things on the blk_plug, though it might
> > make sense to do that later in some cases - I'm not sure. Currently I stash
> > things in my own internal lists and just need a call back to say "ok, flush
> > those lists now").
>
> So we are making some progress... The thing I then don't understand is
> why you want to make it associated with the plug? Seems you don't have
> any scheduling restrictions, and in which case just storing them in md
> seems like a much better option.
>

Yes. But I need to know when to release the requests that I have stored.
I need to know when ->writepages or ->readpages or whatever has finished
submitting a pile of pages, so that I can start processing the requests that I
have put aside. So I need a callback from blk_finish_plug.

(and I also need to know if a thread that was plugging schedules, for the same
reason that you do).

NeilBrown

