08-29-2012, 12:57 PM
Vivek Goyal

block: Avoid deadlocks with bio allocation by stacking drivers

On Tue, Aug 28, 2012 at 08:25:58PM -0700, Kent Overstreet wrote:

[..]
> Except that when thread a goes to punt those blocked bios to its
> rescuer, it punts _all_ the bios on current->bio_list. Even those
> generated by/belonging to other bio_sets.
>
> So thread 1 in device b punts bios to its rescuer, thread 2
>
> But thread 2 ends up with bios for both device a and b - because they're
> stacked.

Ok, just to add more details to the above example. Say we have device A
stacked on top of B, and B stacked on top of C. Say a bio a is submitted to
device A and is split into two bios, b1 and b2. Now b1 is submitted to
device B and is split into c1 and c2. The current bio list now has three
bios: b2, c1 and c2. If the submitter is now about to block on any bio_set,
then all three bios b2, c1 and c2 will be punted to the rescue thread, and
submission of b2 will again block, blocking the rescue thread itself.
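
Schematically:

    bio a  --> device A: split into b1 and b2
    bio b1 --> device B: split into c1 and c2

    current->bio_list: b2, c1, c2

    submitter blocks on some bio_set
        --> b2, c1 and c2 are all punted to that bio_set's rescuer
        --> the rescuer's submission of b2 blocks again
        --> the rescuer itself is deadlocked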

I would say keep all the bio splitting patches, and any fixes w.r.t.
deadlocks, in a separate series, as this is a little complicated and a lot
of it is just theoretical corner cases. If you limit this series to just
bio_set-related cleanups, it becomes more acceptable for inclusion.

Thanks
Vivek

08-29-2012, 04:24 PM
Mikulas Patocka

block: Avoid deadlocks with bio allocation by stacking drivers

Hi

This fixes the bio allocation problems, but doesn't fix a similar deadlock
in device mapper when allocating from md->io_pool or other mempools in
the target driver.

The problem is that the majority of device mapper code assumes that if we
submit a bio, that bio will be finished in a finite time. The commit
d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 in 2.6.22 broke this assumption.

I suggest - instead of writing workarounds for this current->bio_list
misbehavior, why not remove current->bio_list altogether? We could revert
d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1, allocate a per-device workqueue,
test stack usage in generic_make_request, and if it is too high (more than
half of the stack used, or so), put the bio on the target device's
workqueue.

That could be simpler than allocating a per-bioset workqueue, and it also
solves more problems (possible deadlocks in dm).
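
A minimal sketch of the idea (stack_left() and
defer_bio_to_device_workqueue() are hypothetical helpers here):

    /* Sketch only: process bios immediately unless the stack is
     * already too deep, in which case hand them to the target
     * device's workqueue instead of recursing further. */
    void generic_make_request(struct bio *bio)
    {
            struct request_queue *q = bdev_get_queue(bio->bi_bdev);

            /* stack_left(): hypothetical arch-dependent helper
             * returning the number of unused stack bytes of current */
            if (stack_left(current) < THREAD_SIZE / 2) {
                    defer_bio_to_device_workqueue(q, bio);
                    return;
            }

            q->make_request_fn(q, bio);
    }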

Mikulas



On Tue, 28 Aug 2012, Kent Overstreet wrote:

> Previously, if we ever try to allocate more than once from the same bio
> set while running under generic_make_request() (i.e. a stacking block
> driver), we risk deadlock.
>
> This is because of the code in generic_make_request() that converts
> recursion to iteration; any bios we submit won't actually be submitted
> (so they can complete and eventually be freed) until after we return -
> this means if we allocate a second bio, we're blocking the first one
> from ever being freed.
>
> Thus if enough threads call into a stacking block driver at the same
> time with bios that need multiple splits, and the bio_set's reserve gets
> used up, we deadlock.
>
> This can be worked around in the driver code - we could check if we're
> running under generic_make_request(), then mask out __GFP_WAIT when we
> go to allocate a bio, and if the allocation fails punt to workqueue and
> retry the allocation.
>
> But this is tricky and not a generic solution. This patch solves it for
> all users by inverting the previously described technique. We allocate a
> rescuer workqueue for each bio_set, and then in the allocation code if
> there are bios on current->bio_list we would be blocking, we punt them
> to the rescuer workqueue to be submitted.
>
> Tested it by forcing the rescue codepath to be taken (by disabling the
> first GFP_NOWAIT attempt), and then ran it with bcache (which does a lot
> of arbitrary bio splitting) and verified that the rescuer was being
> invoked.
>
> Signed-off-by: Kent Overstreet <koverstreet@google.com>
> CC: Jens Axboe <axboe@kernel.dk>
> ---
> fs/bio.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++---
> include/linux/bio.h | 9 +++++++
> 2 files changed, 79 insertions(+), 3 deletions(-)
>
> diff --git a/fs/bio.c b/fs/bio.c
> index 31e637a..5d46318 100644
> --- a/fs/bio.c
> +++ b/fs/bio.c
> @@ -285,6 +285,23 @@ void bio_reset(struct bio *bio)
> }
> EXPORT_SYMBOL(bio_reset);
>
> +static void bio_alloc_rescue(struct work_struct *work)
> +{
> + struct bio_set *bs = container_of(work, struct bio_set, rescue_work);
> + struct bio *bio;
> +
> + while (1) {
> + spin_lock(&bs->rescue_lock);
> + bio = bio_list_pop(&bs->rescue_list);
> + spin_unlock(&bs->rescue_lock);
> +
> + if (!bio)
> + break;
> +
> + generic_make_request(bio);
> + }
> +}
> +
> /**
> * bio_alloc_bioset - allocate a bio for I/O
> * @gfp_mask: the GFP_ mask given to the slab allocator
> @@ -307,6 +324,7 @@ EXPORT_SYMBOL(bio_reset);
> */
> struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> {
> + gfp_t saved_gfp = gfp_mask;
> unsigned front_pad;
> unsigned inline_vecs;
> unsigned long idx = BIO_POOL_NONE;
> @@ -324,13 +342,37 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> front_pad = 0;
> inline_vecs = nr_iovecs;
> } else {
> + /*
> + * generic_make_request() converts recursion to iteration; this
> + * means if we're running beneath it, any bios we allocate and
> + * submit will not be submitted (and thus freed) until after we
> + * return.
> + *
> + * This exposes us to a potential deadlock if we allocate
> + * multiple bios from the same bio_set() while running
> + * underneath generic_make_request(). If we were to allocate
> + * multiple bios (say a stacking block driver that was splitting
> + * bios), we would deadlock if we exhausted the mempool's
> + * reserve.
> + *
> + * We solve this, and guarantee forward progress, with a rescuer
> + * workqueue per bio_set. If we go to allocate and there are
> + * bios on current->bio_list, we first try the allocation
> + * without __GFP_WAIT; if that fails, we punt those bios we
> + * would be blocking to the rescuer workqueue before we retry
> + * with the original gfp_flags.
> + */
> +
> + if (current->bio_list && !bio_list_empty(current->bio_list))
> + gfp_mask &= ~__GFP_WAIT;
> +retry:
> p = mempool_alloc(bs->bio_pool, gfp_mask);
> front_pad = bs->front_pad;
> inline_vecs = BIO_INLINE_VECS;
> }
>
> if (unlikely(!p))
> - return NULL;
> + goto err;
>
> bio = p + front_pad;
> bio_init(bio);
> @@ -351,6 +393,19 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>
> err_free:
> mempool_free(p, bs->bio_pool);
> +err:
> + if (gfp_mask != saved_gfp) {
> + gfp_mask = saved_gfp;
> +
> + spin_lock(&bs->rescue_lock);
> + bio_list_merge(&bs->rescue_list, current->bio_list);
> + bio_list_init(current->bio_list);
> + spin_unlock(&bs->rescue_lock);
> +
> + queue_work(bs->rescue_workqueue, &bs->rescue_work);
> + goto retry;
> + }
> +
> return NULL;
> }
> EXPORT_SYMBOL(bio_alloc_bioset);
> @@ -1562,6 +1617,9 @@ static void biovec_free_pools(struct bio_set *bs)
>
> void bioset_free(struct bio_set *bs)
> {
> + if (bs->rescue_workqueue)
> + destroy_workqueue(bs->rescue_workqueue);
> +
> if (bs->bio_pool)
> mempool_destroy(bs->bio_pool);
>
> @@ -1597,6 +1655,10 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
>
> bs->front_pad = front_pad;
>
> + spin_lock_init(&bs->rescue_lock);
> + bio_list_init(&bs->rescue_list);
> + INIT_WORK(&bs->rescue_work, bio_alloc_rescue);
> +
> bs->bio_slab = bio_find_or_create_slab(front_pad + back_pad);
> if (!bs->bio_slab) {
> kfree(bs);
> @@ -1607,9 +1669,14 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
> if (!bs->bio_pool)
> goto bad;
>
> - if (!biovec_create_pools(bs, pool_size))
> - return bs;
> + if (biovec_create_pools(bs, pool_size))
> + goto bad;
> +
> + bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
> + if (!bs->rescue_workqueue)
> + goto bad;
>
> + return bs;
> bad:
> bioset_free(bs);
> return NULL;
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 3a8345e..84fdaac 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -492,6 +492,15 @@ struct bio_set {
> mempool_t *bio_integrity_pool;
> #endif
> mempool_t *bvec_pool;
> +
> + /*
> + * Deadlock avoidance for stacking block drivers: see comments in
> + * bio_alloc_bioset() for details
> + */
> + spinlock_t rescue_lock;
> + struct bio_list rescue_list;
> + struct work_struct rescue_work;
> + struct workqueue_struct *rescue_workqueue;
> };
>
> struct biovec_slab {
> --
> 1.7.12
>

08-29-2012, 04:26 PM
Kent Overstreet

block: Avoid deadlocks with bio allocation by stacking drivers

On Wed, Aug 29, 2012 at 03:39:14PM +0100, Alasdair G Kergon wrote:
> It's also instructive to remember why the code is the way it is: it used
> to process bios for underlying devices immediately, but this sometimes
> meant too much recursive stack growth. If a per-device rescuer thread
> is to be made available (as well as the mempool), the option of
> reinstating recursion is there too - only punting to workqueue when the
> stack actually becomes "too big". (Also bear in mind that some dm
> targets may have dependencies on their own mempools - submission can
> block there too.) I find it helpful only to consider splitting into two
> pieces - it must always be possible to process the first piece (i.e.
> process it at the next layer down in the stack) and complete it
> independently of what happens to the second piece (which might require
> further splitting and block until the first piece has completed).

I'm sure it could be made to work (and it may well be simpler), but it
seems problematic from a performance pov.

With stacked devices you'd then have to switch stacks on _every_ bio.
That could be made fast enough I'm sure, but it wouldn't be free and I
don't know of any existing code in the kernel that implements what we'd
need (though if you know how you'd go about doing that, I'd love to
know! Would be useful for other things).

The real problem is that because we'd need these extra stacks for
handling all bios we couldn't get by with just one per bio_set. We'd
only need one to make forward progress so the rest could be allocated
on demand (i.e. what the workqueue code does) but that sounds like it's
starting to get expensive.

08-29-2012, 04:50 PM
Kent Overstreet

block: Avoid deadlocks with bio allocation by stacking drivers

On Wed, Aug 29, 2012 at 12:24:43PM -0400, Mikulas Patocka wrote:
> Hi
>
> This fixes the bio allocation problems, but doesn't fix a similar deadlock
> in device mapper when allocating from md->io_pool or other mempools in
> the target driver.

Ick. That is a problem.

There are a bunch of mempools in drivers/md too.

Though honestly, bio_sets have the front_pad thing for a reason; for per-bio
state it makes sense to use that anyways - simpler code, because you're
doing fewer explicit allocations, and more efficient too.

So I wonder whether it'd still be a problem if we fixed that.

The other thing we could do is make the rescuer thread per block device
(which arguably makes more sense than per bio_set anyways), then
associate bio_sets with specific block devices/rescuers.

Then the rescuer code can be abstracted out from the bio allocation
code, and we just create a thin wrapper around mempool_alloc() that does
the same dance bio_alloc_bioset() does now.

I think that'd be pretty easy and work out really nicely.
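
Roughly (a sketch only - struct bio_rescuer and punt_bios_to_rescuer()
stand in for the hypothetical abstracted-out rescuer code):

    static void *rescuing_mempool_alloc(struct bio_rescuer *rescuer,
                                        mempool_t *pool, gfp_t gfp_mask)
    {
            gfp_t saved_gfp = gfp_mask;
            void *p;

            /* Same dance as bio_alloc_bioset(): if blocking here could
             * deadlock on bios stuck on current->bio_list, first try
             * the allocation without __GFP_WAIT. */
            if (current->bio_list && !bio_list_empty(current->bio_list))
                    gfp_mask &= ~__GFP_WAIT;
    retry:
            p = mempool_alloc(pool, gfp_mask);
            if (p || gfp_mask == saved_gfp)
                    return p;

            /* Nowait allocation failed: punt the blocked bios to the
             * rescuer, then retry with the caller's original flags. */
            punt_bios_to_rescuer(rescuer);
            gfp_mask = saved_gfp;
            goto retry;
    }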

> The problem is that the majority of device mapper code assumes that if we
> submit a bio, that bio will be finished in a finite time. The commit
> d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 in 2.6.22 broke this assumption.
>
> I suggest - instead of writing workarounds for this current->bio_list
> misbehavior, why not remove current->bio_list altogether? We could revert
> d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1, allocate a per-device workqueue,
> test stack usage in generic_make_request, and if it is too high (more than
> half of the stack used, or so), put the bio on the target device's
> workqueue.
>
> That could be simpler than allocating a per-bioset workqueue, and it also
> solves more problems (possible deadlocks in dm).

It certainly would be simpler, but honestly the potential for
performance regressions scares me (and bcache at least is used on fast
enough devices where it's going to matter). Also it's not so much the
performance overhead - we can just measure that - it's that if we're
just using the workqueue code the scheduler's getting involved and we
can't just measure what the effects of that are going to be in
production.

08-29-2012, 05:07 PM
Vivek Goyal

block: Avoid deadlocks with bio allocation by stacking drivers

On Wed, Aug 29, 2012 at 09:50:06AM -0700, Kent Overstreet wrote:

[..]
> > The problem is that the majority of device mapper code assumes that if we
> > submit a bio, that bio will be finished in a finite time. The commit
> > d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 in 2.6.22 broke this assumption.
> >
> > I suggest - instead of writing workarounds for this current->bio_list
> > misbehavior, why not remove current->bio_list altogether? We could revert
> > d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1, allocate a per-device workqueue,
> > test stack usage in generic_make_request, and if it is too high (more than
> > half of the stack used, or so), put the bio on the target device's
> > workqueue.
> >
> > That could be simpler than allocating a per-bioset workqueue, and it also
> > solves more problems (possible deadlocks in dm).
>
> It certainly would be simpler, but honestly the potential for
> performance regressions scares me (and bcache at least is used on fast
> enough devices where it's going to matter). Also it's not so much the
> performance overhead - we can just measure that - it's that if we're
> just using the workqueue code the scheduler's getting involved and we
> can't just measure what the effects of that are going to be in
> production.

Aren't workqueues already getting involved in your solution of punting
to the rescuer thread? In the proposal above, too, workers get involved
only if the stack depth is too deep, so for normal stack usage performance
should not be impacted.

Performance aside, punting submission to a per-device worker in case of
deep stack usage sounds like a cleaner solution to me.

Thanks
Vivek

08-29-2012, 05:13 PM
Kent Overstreet

block: Avoid deadlocks with bio allocation by stacking drivers

On Wed, Aug 29, 2012 at 01:07:11PM -0400, Vivek Goyal wrote:
> On Wed, Aug 29, 2012 at 09:50:06AM -0700, Kent Overstreet wrote:
>
> [..]
> > > The problem is that the majority of device mapper code assumes that if we
> > > submit a bio, that bio will be finished in a finite time. The commit
> > > d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 in 2.6.22 broke this assumption.
> > >
> > > I suggest - instead of writing workarounds for this current->bio_list
> > > misbehavior, why not remove current->bio_list altogether? We could revert
> > > d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1, allocate a per-device workqueue,
> > > test stack usage in generic_make_request, and if it is too high (more than
> > > half of the stack used, or so), put the bio on the target device's
> > > workqueue.
> > >
> > > That could be simpler than allocating a per-bioset workqueue, and it also
> > > solves more problems (possible deadlocks in dm).
> >
> > It certainly would be simpler, but honestly the potential for
> > performance regressions scares me (and bcache at least is used on fast
> > enough devices where it's going to matter). Also it's not so much the
> > performance overhead - we can just measure that - it's that if we're
> > just using the workqueue code the scheduler's getting involved and we
> > can't just measure what the effects of that are going to be in
> > production.
>
> Aren't workqueues already getting involved in your solution of punting
> to the rescuer thread?

Only on allocation failure.

> In the proposal above, too, workers get involved
> only if the stack depth is too deep, so for normal stack usage performance
> should not be impacted.
>
> Performance aside, punting submission to a per-device worker in case of
> deep stack usage sounds like a cleaner solution to me.

Agreed, but performance tends to matter in the real world. And either
way the tricky bits are going to be confined to a few functions, so I
don't think it matters that much.

If someone wants to code up the workqueue version and test it, they're
more than welcome...

08-29-2012, 05:32 PM
Kent Overstreet

block: Avoid deadlocks with bio allocation by stacking drivers

On Wed, Aug 29, 2012 at 06:23:36PM +0100, Alasdair G Kergon wrote:
> On Wed, Aug 29, 2012 at 10:13:45AM -0700, Kent Overstreet wrote:
> > Only on allocation failure.
>
> Which as you already said assumes wrapping together the other mempools in the
> submission path first...

Yes? I'm not sure what you're getting at.

08-29-2012, 09:01 PM
John Stoffel

block: Avoid deadlocks with bio allocation by stacking drivers

>>>>> "Kent" == Kent Overstreet <koverstreet@google.com> writes:

Kent> On Wed, Aug 29, 2012 at 03:39:14PM +0100, Alasdair G Kergon wrote:
>> It's also instructive to remember why the code is the way it is: it used
>> to process bios for underlying devices immediately, but this sometimes
>> meant too much recursive stack growth. If a per-device rescuer thread
>> is to be made available (as well as the mempool), the option of
>> reinstating recursion is there too - only punting to workqueue when the
>> stack actually becomes "too big". (Also bear in mind that some dm
>> targets may have dependencies on their own mempools - submission can
>> block there too.) I find it helpful only to consider splitting into two
>> pieces - it must always be possible to process the first piece (i.e.
>> process it at the next layer down in the stack) and complete it
>> independently of what happens to the second piece (which might require
>> further splitting and block until the first piece has completed).

Kent> I'm sure it could be made to work (and it may well be simpler), but it
Kent> seems problematic from a performance pov.

Kent> With stacked devices you'd then have to switch stacks on _every_ bio.
Kent> That could be made fast enough I'm sure, but it wouldn't be free and I
Kent> don't know of any existing code in the kernel that implements what we'd
Kent> need (though if you know how you'd go about doing that, I'd love to
Kent> know! Would be useful for other things).

Kent> The real problem is that because we'd need these extra stacks for
Kent> handling all bios we couldn't get by with just one per bio_set. We'd
Kent> only need one to make forward progress so the rest could be allocated
Kent> on demand (i.e. what the workqueue code does) but that sounds like it's
Kent> starting to get expensive.

Maybe we need to limit the size of BIOs to that of the bottom-most
underlying device instead? Or maybe limit BIOs to some small
multiple? As you stack up DM targets one on top of each other, they
should respect the limits of the underlying devices and pass those
limits up the chain.

Or maybe I'm speaking gibberish...

John

08-29-2012, 09:08 PM
Kent Overstreet

block: Avoid deadlocks with bio allocation by stacking drivers

On Wed, Aug 29, 2012 at 05:01:09PM -0400, John Stoffel wrote:
> >>>>> "Kent" == Kent Overstreet <koverstreet@google.com> writes:
>
> Kent> On Wed, Aug 29, 2012 at 03:39:14PM +0100, Alasdair G Kergon wrote:
> >> It's also instructive to remember why the code is the way it is: it used
> >> to process bios for underlying devices immediately, but this sometimes
> >> meant too much recursive stack growth. If a per-device rescuer thread
> >> is to be made available (as well as the mempool), the option of
> >> reinstating recursion is there too - only punting to workqueue when the
> >> stack actually becomes "too big". (Also bear in mind that some dm
> >> targets may have dependencies on their own mempools - submission can
> >> block there too.) I find it helpful only to consider splitting into two
> >> pieces - it must always be possible to process the first piece (i.e.
> >> process it at the next layer down in the stack) and complete it
> >> independently of what happens to the second piece (which might require
> >> further splitting and block until the first piece has completed).
>
> Kent> I'm sure it could be made to work (and it may well be simpler), but it
> Kent> seems problematic from a performance pov.
>
> Kent> With stacked devices you'd then have to switch stacks on _every_ bio.
> Kent> That could be made fast enough I'm sure, but it wouldn't be free and I
> Kent> don't know of any existing code in the kernel that implements what we'd
> Kent> need (though if you know how you'd go about doing that, I'd love to
> Kent> know! Would be useful for other things).
>
> Kent> The real problem is that because we'd need these extra stacks for
> Kent> handling all bios we couldn't get by with just one per bio_set. We'd
> Kent> only need one to make forward progress so the rest could be allocated
> Kent> on demand (i.e. what the workqueue code does) but that sounds like it's
> Kent> starting to get expensive.
>
> Maybe we need to limit the size of BIOs to that of the bottom-most
> underlying device instead? Or maybe limit BIOs to some small
> multiple? As you stack up DM targets one on top of each other, they
> should respect the limits of the underlying devices and pass those
> limits up the chain.

That's the approach the block layer currently tries to take. It's
brittle, tricky and inefficient, and it completely breaks down when the
limits are dynamic - so things like dm and bcache are always going to
have to split anyways.
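
(For reference, the stacking of limits looks roughly like this when a
stacked device is assembled - a sketch, with t as the top-level queue, b as
an underlying device's queue, and offset the start sector of the underlying
device:)

    /* Fold the bottom queue's limits into the top queue's, so that any
     * bio built against t's limits should also fit b - which is exactly
     * what breaks down once b's effective limits change at runtime. */
    if (blk_stack_limits(&t->limits, &b->limits, offset))
            printk(KERN_NOTICE "stacked device is misaligned
");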

08-30-2012, 10:07 PM
Vivek Goyal

block: Avoid deadlocks with bio allocation by stacking drivers

On Wed, Aug 29, 2012 at 10:13:45AM -0700, Kent Overstreet wrote:

[..]
> > Performance aside, punting submission to a per-device worker in case of
> > deep stack usage sounds like a cleaner solution to me.
>
> Agreed, but performance tends to matter in the real world. And either
> way the tricky bits are going to be confined to a few functions, so I
> don't think it matters that much.
>
> If someone wants to code up the workqueue version and test it, they're
> more than welcome...

Here is one quick-and-dirty proof-of-concept patch. It checks the stack
depth, and if the remaining space is less than 20% of the stack size, it
defers the bio submission to a per-queue worker.

Thanks
Vivek


---
block/blk-core.c | 171 ++++++++++++++++++++++++++++++++++------------
block/blk-sysfs.c | 1
include/linux/blk_types.h | 1
include/linux/blkdev.h | 8 ++
4 files changed, 138 insertions(+), 43 deletions(-)

Index: linux-2.6/include/linux/blkdev.h
===================================================================
--- linux-2.6.orig/include/linux/blkdev.h 2012-09-01 17:44:51.686485550 -0400
+++ linux-2.6/include/linux/blkdev.h 2012-09-01 18:09:58.805577658 -0400
@@ -430,6 +430,14 @@ struct request_queue {
/* Throttle data */
struct throtl_data *td;
#endif
+
+ /*
+ * Bio submission to queue can be deferred to a workqueue if stack
+ * usage of submitter is high.
+ */
+ struct bio_list deferred_bios;
+ struct work_struct deferred_bio_work;
+ struct workqueue_struct *deferred_bio_workqueue;
};

#define QUEUE_FLAG_QUEUED 1 /* uses generic tag queueing */
Index: linux-2.6/block/blk-core.c
===================================================================
--- linux-2.6.orig/block/blk-core.c 2012-09-01 17:44:51.686485550 -0400
+++ linux-2.6/block/blk-core.c 2012-09-02 00:34:55.204091269 -0400
@@ -211,6 +211,23 @@ static void blk_delay_work(struct work_s
spin_unlock_irq(q->queue_lock);
}

+static void blk_deferred_bio_work(struct work_struct *work)
+{
+ struct request_queue *q;
+ struct bio *bio = NULL;
+
+ q = container_of(work, struct request_queue, deferred_bio_work);
+
+ do {
+ spin_lock_irq(q->queue_lock);
+ bio = bio_list_pop(&q->deferred_bios);
+ spin_unlock_irq(q->queue_lock);
+ if (!bio)
+ break;
+ generic_make_request(bio);
+ } while (1);
+}
+
/**
* blk_delay_queue - restart queueing after defined interval
* @q: The &struct request_queue in question
@@ -289,6 +306,7 @@ void blk_sync_queue(struct request_queue
{
del_timer_sync(&q->timeout);
cancel_delayed_work_sync(&q->delay_work);
+ cancel_work_sync(&q->deferred_bio_work);
}
EXPORT_SYMBOL(blk_sync_queue);

@@ -351,6 +369,29 @@ void blk_put_queue(struct request_queue
EXPORT_SYMBOL(blk_put_queue);

/**
+ * blk_drain_deferred_bios - drain deferred bios
+ * @q: request_queue to drain deferred bios for
+ *
+ * Dispatch all currently deferred bios on @q through ->make_request_fn().
+ */
+static void blk_drain_deferred_bios(struct request_queue *q)
+{
+ struct bio_list bl;
+ struct bio *bio;
+ unsigned long flags;
+
+ bio_list_init(&bl);
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ bio_list_merge(&bl, &q->deferred_bios);
+ bio_list_init(&q->deferred_bios);
+ spin_unlock_irqrestore(q->queue_lock, flags);
+
+ while ((bio = bio_list_pop(&bl)))
+ generic_make_request(bio);
+}
+
+/**
* blk_drain_queue - drain requests from request_queue
* @q: queue to drain
* @drain_all: whether to drain all requests or only the ones w/ ELVPRIV
@@ -358,6 +399,10 @@ EXPORT_SYMBOL(blk_put_queue);
* Drain requests from @q. If @drain_all is set, all requests are drained.
* If not, only ELVPRIV requests are drained. The caller is responsible
* for ensuring that no new requests which need to be drained are queued.
+ *
+ * Note: It does not drain bios on q->deferred_bios list.
+ * Call blk_drain_deferred_bios() if need be.
+ *
*/
void blk_drain_queue(struct request_queue *q, bool drain_all)
{
@@ -505,6 +550,9 @@ void blk_cleanup_queue(struct request_qu
spin_unlock_irq(lock);
mutex_unlock(&q->sysfs_lock);

+ /* First drain all deferred bios. */
+ blk_drain_deferred_bios(q);
+
/* drain all requests queued before DEAD marking */
blk_drain_queue(q, true);

@@ -614,11 +662,19 @@ struct request_queue *blk_alloc_queue_no
q->bypass_depth = 1;
__set_bit(QUEUE_FLAG_BYPASS, &q->queue_flags);

- if (blkcg_init_queue(q))
+ bio_list_init(&q->deferred_bios);
+ INIT_WORK(&q->deferred_bio_work, blk_deferred_bio_work);
+ q->deferred_bio_workqueue = alloc_workqueue("kdeferbiod", WQ_MEM_RECLAIM, 0);
+ if (!q->deferred_bio_workqueue)
goto fail_id;

+ if (blkcg_init_queue(q))
+ goto fail_deferred_bio_wq;
+
return q;

+fail_deferred_bio_wq:
+ destroy_workqueue(q->deferred_bio_workqueue);
fail_id:
ida_simple_remove(&blk_queue_ida, q->id);
fail_q:
@@ -1635,8 +1691,10 @@ static inline int bio_check_eod(struct b
return 0;
}

+
+
static noinline_for_stack bool
-generic_make_request_checks(struct bio *bio)
+generic_make_request_checks_early(struct bio *bio)
{
struct request_queue *q;
int nr_sectors = bio_sectors(bio);
@@ -1715,9 +1773,6 @@ generic_make_request_checks(struct bio *
*/
create_io_context(GFP_ATOMIC, q->node);

- if (blk_throtl_bio(q, bio))
- return false; /* throttled, will be resubmitted later */
-
trace_block_bio_queue(q, bio);
return true;

@@ -1726,6 +1781,56 @@ end_io:
return false;
}

+static noinline_for_stack bool
+generic_make_request_checks_late(struct bio *bio)
+{
+ struct request_queue *q;
+
+ q = bdev_get_queue(bio->bi_bdev);
+
+ /*
+ * Various block parts want %current->io_context and lazy ioc
+ * allocation ends up trading a lot of pain for a small amount of
+ * memory. Just allocate it upfront. This may fail and block
+ * layer knows how to live with it.
+ */
+ create_io_context(GFP_ATOMIC, q->node);
+
+ if (blk_throtl_bio(q, bio))
+ return false; /* throttled, will be resubmitted later */
+
+ return true;
+}
+
+static void __generic_make_request(struct bio *bio)
+{
+ struct request_queue *q;
+
+ if (!generic_make_request_checks_late(bio))
+ return;
+ q = bdev_get_queue(bio->bi_bdev);
+ q->make_request_fn(q, bio);
+}
+
+static void generic_make_request_defer_bio(struct bio *bio)
+{
+ struct request_queue *q;
+ unsigned long flags;
+
+ q = bdev_get_queue(bio->bi_bdev);
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ if (unlikely(blk_queue_dead(q))) {
+ spin_unlock_irqrestore(q->queue_lock, flags);
+ bio_endio(bio, -ENODEV);
+ return;
+ }
+ set_bit(BIO_DEFERRED, &bio->bi_flags);
+ bio_list_add(&q->deferred_bios, bio);
+ spin_unlock_irqrestore(q->queue_lock, flags);
+ queue_work(q->deferred_bio_workqueue, &q->deferred_bio_work);
+}
+
/**
* generic_make_request - hand a buffer to its device driver for I/O
* @bio: The bio describing the location in memory and on the device.
@@ -1752,51 +1857,31 @@ end_io:
*/
void generic_make_request(struct bio *bio)
{
- struct bio_list bio_list_on_stack;
+ unsigned long sp = 0;
+ unsigned int threshold = (THREAD_SIZE * 2)/10;

- if (!generic_make_request_checks(bio))
- return;
+ BUG_ON(bio->bi_next);

- /*
- * We only want one ->make_request_fn to be active at a time, else
- * stack usage with stacked devices could be a problem. So use
- * current->bio_list to keep a list of requests submited by a
- * make_request_fn function. current->bio_list is also used as a
- * flag to say if generic_make_request is currently active in this
- * task or not. If it is NULL, then no make_request is active. If
- * it is non-NULL, then a make_request is active, and new requests
- * should be added at the tail
- */
- if (current->bio_list) {
- bio_list_add(current->bio_list, bio);
+ /* Submitting deferred bio from worker context. */
+ if (bio_flagged(bio, BIO_DEFERRED)) {
+ clear_bit(BIO_DEFERRED, &bio->bi_flags);
+ __generic_make_request(bio);
return;
}

- /* following loop may be a bit non-obvious, and so deserves some
- * explanation.
- * Before entering the loop, bio->bi_next is NULL (as all callers
- * ensure that) so we have a list with a single bio.
- * We pretend that we have just taken it off a longer list, so
- * we assign bio_list to a pointer to the bio_list_on_stack,
- * thus initialising the bio_list of new bios to be
- * added. ->make_request() may indeed add some more bios
- * through a recursive call to generic_make_request. If it
- * did, we find a non-NULL value in bio_list and re-enter the loop
- * from the top. In this case we really did just take the bio
- * of the top of the list (no pretending) and so remove it from
- * bio_list, and call into ->make_request() again.
- */
- BUG_ON(bio->bi_next);
- bio_list_init(&bio_list_on_stack);
- current->bio_list = &bio_list_on_stack;
- do {
- struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+ if (!generic_make_request_checks_early(bio))
+ return;

- q->make_request_fn(q, bio);
+ /*
+ * FIXME. Provide an arch dependent function to return left stack
+ * space for current task. This is hack for x86_64.
+ */
+ asm volatile("movq %%rsp,%0" : "=m"(sp));

- bio = bio_list_pop(current->bio_list);
- } while (bio);
- current->bio_list = NULL; /* deactivate */
+ if ((sp - (unsigned long)end_of_stack(current)) < threshold)
+ generic_make_request_defer_bio(bio);
+ else
+ __generic_make_request(bio);
}
EXPORT_SYMBOL(generic_make_request);

Index: linux-2.6/block/blk-sysfs.c
===================================================================
--- linux-2.6.orig/block/blk-sysfs.c 2012-09-01 17:44:51.686485550 -0400
+++ linux-2.6/block/blk-sysfs.c 2012-09-01 18:09:58.808577661 -0400
@@ -505,6 +505,7 @@ static void blk_release_queue(struct kob

ida_simple_remove(&blk_queue_ida, q->id);
kmem_cache_free(blk_requestq_cachep, q);
+ destroy_workqueue(q->deferred_bio_workqueue);
}

static const struct sysfs_ops queue_sysfs_ops = {
Index: linux-2.6/include/linux/blk_types.h
===================================================================
--- linux-2.6.orig/include/linux/blk_types.h 2012-09-02 00:34:17.607086696 -0400
+++ linux-2.6/include/linux/blk_types.h 2012-09-02 00:34:21.997087104 -0400
@@ -105,6 +105,7 @@ struct bio {
#define BIO_FS_INTEGRITY 9 /* fs owns integrity data, not block layer */
#define BIO_QUIET 10 /* Make BIO Quiet */
#define BIO_MAPPED_INTEGRITY 11/* integrity metadata has been remapped */
+#define BIO_DEFERRED 12 /* Bio was deferred for submission by worker */
#define bio_flagged(bio, flag) ((bio)->bi_flags & (1 << (flag)))

/*
