block: Avoid deadlocks with bio allocation by stacking drivers
On Tue, Aug 28, 2012 at 08:25:58PM -0700, Kent Overstreet wrote:
[..]
> Except that when thread a goes to punt those blocked bios to its
> rescuer, it punts _all_ the bios on current->bio_list. Even those
> generated by/belonging to other bio_sets.
>
> So thread 1 in device b punts bios to its rescuer, thread 2
>
> But thread 2 ends up with bios for both device a and b - because they're
> stacked.
Ok, just to add more details to above example. Say we have device A stacked
on top of B and B stacked on top of C. Say a bio a is submitted to device A
and is splitted in two bios b1 and b2. Now b1 is sumbitted to device B and is
splitted in c1 and c2. Now current bio list has three bios. b2, c1 and c2.
If submitter is now about to block on any bio set, then all tree bios
b2, c1, c2 will punted to rescue thread and submssion of b2 will again block
resulting in blocking rescue thread itself.
I would say keep all the bio splitting patches and any fixes w.r.t
deadlocks in a seprate series. As this is little complicated and a lot
of is just theoritical corner cases. If you limit this series to just
bio_set related cleanups, it becomes more acceptable for inclusion.
Thanks
Vivek
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
08-29-2012, 04:24 PM
Mikulas Patocka
block: Avoid deadlocks with bio allocation by stacking drivers
Hi
This fixes the bio allocation problems, but doesn't fix a similar deadlock
in device mapper when allocating from md->io_pool or other mempools in
the target driver.
The problem is that majority of device mapper code assumes that if we
submit a bio, that bio will be finished in a finite time. The commit
d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 in 2.6.22 broke this assumption.
I suggest - instead of writing workarounds for this current->bio_list
misbehavior, why not remove current->bio_list at all? We could revert
d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1, allocate a per-device workqueue,
test stack usage in generic_make_request, and if it is too high (more than
half of the stack used, or so), put the bio to the target device's
blockqueue.
That could be simpler than allocating per-bioset workqueue and it also
solves more problems (possible deadlocks in dm).
Mikulas
On Tue, 28 Aug 2012, Kent Overstreet wrote:
> Previously, if we ever try to allocate more than once from the same bio
> set while running under generic_make_request() (i.e. a stacking block
> driver), we risk deadlock.
>
> This is because of the code in generic_make_request() that converts
> recursion to iteration; any bios we submit won't actually be submitted
> (so they can complete and eventually be freed) until after we return -
> this means if we allocate a second bio, we're blocking the first one
> from ever being freed.
>
> Thus if enough threads call into a stacking block driver at the same
> time with bios that need multiple splits, and the bio_set's reserve gets
> used up, we deadlock.
>
> This can be worked around in the driver code - we could check if we're
> running under generic_make_request(), then mask out __GFP_WAIT when we
> go to allocate a bio, and if the allocation fails punt to workqueue and
> retry the allocation.
>
> But this is tricky and not a generic solution. This patch solves it for
> all users by inverting the previously described technique. We allocate a
> rescuer workqueue for each bio_set, and then in the allocation code if
> there are bios on current->bio_list we would be blocking, we punt them
> to the rescuer workqueue to be submitted.
>
> Tested it by forcing the rescue codepath to be taken (by disabling the
> first GFP_NOWAIT) attempt, and then ran it with bcache (which does a lot
> of arbitrary bio splitting) and verified that the rescuer was being
> invoked.
>
> Signed-off-by: Kent Overstreet <koverstreet@google.com>
> CC: Jens Axboe <axboe@kernel.dk>
> ---
> fs/bio.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++---
> include/linux/bio.h | 9 +++++++
> 2 files changed, 79 insertions(+), 3 deletions(-)
>
> diff --git a/fs/bio.c b/fs/bio.c
> index 31e637a..5d46318 100644
> --- a/fs/bio.c
> +++ b/fs/bio.c
> @@ -285,6 +285,23 @@ void bio_reset(struct bio *bio)
> }
> EXPORT_SYMBOL(bio_reset);
>
> +static void bio_alloc_rescue(struct work_struct *work)
> +{
> + struct bio_set *bs = container_of(work, struct bio_set, rescue_work);
> + struct bio *bio;
> +
> + while (1) {
> + spin_lock(&bs->rescue_lock);
> + bio = bio_list_pop(&bs->rescue_list);
> + spin_unlock(&bs->rescue_lock);
> +
> + if (!bio)
> + break;
> +
> + generic_make_request(bio);
> + }
> +}
> +
> /**
> * bio_alloc_bioset - allocate a bio for I/O
> * @gfp_mask: the GFP_ mask given to the slab allocator
> @@ -307,6 +324,7 @@ EXPORT_SYMBOL(bio_reset);
> */
> struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> {
> + gfp_t saved_gfp = gfp_mask;
> unsigned front_pad;
> unsigned inline_vecs;
> unsigned long idx = BIO_POOL_NONE;
> @@ -324,13 +342,37 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> front_pad = 0;
> inline_vecs = nr_iovecs;
> } else {
> + /*
> + * generic_make_request() converts recursion to iteration; this
> + * means if we're running beneath it, any bios we allocate and
> + * submit will not be submitted (and thus freed) until after we
> + * return.
> + *
> + * This exposes us to a potential deadlock if we allocate
> + * multiple bios from the same bio_set() while running
> + * underneath generic_make_request(). If we were to allocate
> + * multiple bios (say a stacking block driver that was splitting
> + * bios), we would deadlock if we exhausted the mempool's
> + * reserve.
> + *
> + * We solve this, and guarantee forward progress, with a rescuer
> + * workqueue per bio_set. If we go to allocate and there are
> + * bios on current->bio_list, we first try the allocation
> + * without __GFP_WAIT; if that fails, we punt those bios we
> + * would be blocking to the rescuer workqueue before we retry
> + * with the original gfp_flags.
> + */
> +
> + if (current->bio_list && !bio_list_empty(current->bio_list))
> + gfp_mask &= ~__GFP_WAIT;
> +retry:
> p = mempool_alloc(bs->bio_pool, gfp_mask);
> front_pad = bs->front_pad;
> inline_vecs = BIO_INLINE_VECS;
> }
>
> if (unlikely(!p))
> - return NULL;
> + goto err;
>
> bio = p + front_pad;
> bio_init(bio);
> @@ -351,6 +393,19 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>
> err_free:
> mempool_free(p, bs->bio_pool);
> +err:
> + if (gfp_mask != saved_gfp) {
> + gfp_mask = saved_gfp;
> +
> + spin_lock(&bs->rescue_lock);
> + bio_list_merge(&bs->rescue_list, current->bio_list);
> + bio_list_init(current->bio_list);
> + spin_unlock(&bs->rescue_lock);
> +
> + queue_work(bs->rescue_workqueue, &bs->rescue_work);
> + goto retry;
> + }
> +
> return NULL;
> }
> EXPORT_SYMBOL(bio_alloc_bioset);
> @@ -1562,6 +1617,9 @@ static void biovec_free_pools(struct bio_set *bs)
>
> void bioset_free(struct bio_set *bs)
> {
> + if (bs->rescue_workqueue)
> + destroy_workqueue(bs->rescue_workqueue);
> +
> if (bs->bio_pool)
> mempool_destroy(bs->bio_pool);
>
> @@ -1597,6 +1655,10 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
>
> bs->front_pad = front_pad;
>
> + spin_lock_init(&bs->rescue_lock);
> + bio_list_init(&bs->rescue_list);
> + INIT_WORK(&bs->rescue_work, bio_alloc_rescue);
> +
> bs->bio_slab = bio_find_or_create_slab(front_pad + back_pad);
> if (!bs->bio_slab) {
> kfree(bs);
> @@ -1607,9 +1669,14 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
> if (!bs->bio_pool)
> goto bad;
>
> - if (!biovec_create_pools(bs, pool_size))
> - return bs;
> + if (biovec_create_pools(bs, pool_size))
> + goto bad;
> +
> + bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
> + if (!bs->rescue_workqueue)
> + goto bad;
>
> + return bs;
> bad:
> bioset_free(bs);
> return NULL;
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 3a8345e..84fdaac 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -492,6 +492,15 @@ struct bio_set {
> mempool_t *bio_integrity_pool;
> #endif
> mempool_t *bvec_pool;
> +
> + /*
> + * Deadlock avoidance for stacking block drivers: see comments in
> + * bio_alloc_bioset() for details
> + */
> + spinlock_t rescue_lock;
> + struct bio_list rescue_list;
> + struct work_struct rescue_work;
> + struct workqueue_struct *rescue_workqueue;
> };
>
> struct biovec_slab {
> --
> 1.7.12
>
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
08-29-2012, 04:26 PM
Kent Overstreet
block: Avoid deadlocks with bio allocation by stacking drivers
On Wed, Aug 29, 2012 at 03:39:14PM +0100, Alasdair G Kergon wrote:
> It's also instructive to remember why the code is the way it is: it used
> to process bios for underlying devices immediately, but this sometimes
> meant too much recursive stack growth. If a per-device rescuer thread
> is to be made available (as well as the mempool), the option of
> reinstating recursion is there too - only punting to workqueue when the
> stack actually becomes "too big". (Also bear in mind that some dm
> targets may have dependencies on their own mempools - submission can
> block there too.) I find it helpful only to consider splitting into two
> pieces - it must always be possible to process the first piece (i.e.
> process it at the next layer down in the stack) and complete it
> independently of what happens to the second piece (which might require
> further splitting and block until the first piece has completed).
I'm sure it could be made to work (and it may well simpler), but it
seems problematic from a performance pov.
With stacked devices you'd then have to switch stacks on _every_ bio.
That could be made fast enough I'm sure, but it wouldn't be free and I
don't know of any existing code in the kernel that implements what we'd
need (though if you know how you'd go about doing that, I'd love to
know! Would be useful for other things).
The real problem is that because we'd need these extra stacks for
handling all bios we couldn't get by with just one per bio_set. We'd
only need one to make forward progress so the rest could be allocated
on demand (i.e. what the workqueue code does) but that sounds like it's
starting to get expensive.
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
08-29-2012, 04:50 PM
Kent Overstreet
block: Avoid deadlocks with bio allocation by stacking drivers
On Wed, Aug 29, 2012 at 12:24:43PM -0400, Mikulas Patocka wrote:
> Hi
>
> This fixes the bio allocation problems, but doesn't fix a similar deadlock
> in device mapper when allocating from md->io_pool or other mempools in
> the target driver.
Ick. That is a problem.
There are a bunch of mempools in drivers/md too
Though honestly bio_sets have the front_pad thing for a reason, for per
bio state it makes sense to use that anyways - simpler code because
you're doing fewer explicit allocations and more efficient too.
So I wonder if we fixed that if it'd still be a problem.
The other thing we could do is make the rescuer thread per block device
(which arguably makes more sense than per bio_set anyways), then
associate bio_sets with specific block devices/rescuers.
Then the rescuer code can be abstracted out from the bio allocation
code, and we just create a thin wrapper around mempool_alloc() that does
the same dance bio_alloc_bioset() does now.
I think that'd be pretty easy and work out really nicely.
> The problem is that majority of device mapper code assumes that if we
> submit a bio, that bio will be finished in a finite time. The commit
> d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 in 2.6.22 broke this assumption.
>
> I suggest - instead of writing workarounds for this current->bio_list
> misbehavior, why not remove current->bio_list at all? We could revert
> d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1, allocate a per-device workqueue,
> test stack usage in generic_make_request, and if it is too high (more than
> half of the stack used, or so), put the bio to the target device's
> blockqueue.
>
> That could be simpler than allocating per-bioset workqueue and it also
> solves more problems (possible deadlocks in dm).
It certainly would be simpler, but honestly the potential for
performance regressions scares me (and bcache at least is used on fast
enough devices where it's going to matter). Also it's not so much the
performance overhead - we can just measure that - it's that if we're
just using the workqueue code the scheduler's getting involved and we
can't just measure what the effects of that are going to be in
production.
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
08-29-2012, 05:07 PM
Vivek Goyal
block: Avoid deadlocks with bio allocation by stacking drivers
On Wed, Aug 29, 2012 at 09:50:06AM -0700, Kent Overstreet wrote:
[..]
> > The problem is that majority of device mapper code assumes that if we
> > submit a bio, that bio will be finished in a finite time. The commit
> > d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 in 2.6.22 broke this assumption.
> >
> > I suggest - instead of writing workarounds for this current->bio_list
> > misbehavior, why not remove current->bio_list at all? We could revert
> > d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1, allocate a per-device workqueue,
> > test stack usage in generic_make_request, and if it is too high (more than
> > half of the stack used, or so), put the bio to the target device's
> > blockqueue.
> >
> > That could be simpler than allocating per-bioset workqueue and it also
> > solves more problems (possible deadlocks in dm).
>
> It certainly would be simpler, but honestly the potential for
> performance regressions scares me (and bcache at least is used on fast
> enough devices where it's going to matter). Also it's not so much the
> performance overhead - we can just measure that - it's that if we're
> just using the workqueue code the scheduler's getting involved and we
> can't just measure what the effects of that are going to be in
> production.
Are workqueues not getting involved already in your solution of punting
to rescuer thread. In the proposal above also, workers get involved
only if stack depth is too deep. So for normal stack usage performance
should not be impacted.
Performance aside, punting submission to per device worker in case of deep
stack usage sounds cleaner solution to me.
Thanks
Vivek
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
08-29-2012, 05:13 PM
Kent Overstreet
block: Avoid deadlocks with bio allocation by stacking drivers
On Wed, Aug 29, 2012 at 01:07:11PM -0400, Vivek Goyal wrote:
> On Wed, Aug 29, 2012 at 09:50:06AM -0700, Kent Overstreet wrote:
>
> [..]
> > > The problem is that majority of device mapper code assumes that if we
> > > submit a bio, that bio will be finished in a finite time. The commit
> > > d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 in 2.6.22 broke this assumption.
> > >
> > > I suggest - instead of writing workarounds for this current->bio_list
> > > misbehavior, why not remove current->bio_list at all? We could revert
> > > d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1, allocate a per-device workqueue,
> > > test stack usage in generic_make_request, and if it is too high (more than
> > > half of the stack used, or so), put the bio to the target device's
> > > blockqueue.
> > >
> > > That could be simpler than allocating per-bioset workqueue and it also
> > > solves more problems (possible deadlocks in dm).
> >
> > It certainly would be simpler, but honestly the potential for
> > performance regressions scares me (and bcache at least is used on fast
> > enough devices where it's going to matter). Also it's not so much the
> > performance overhead - we can just measure that - it's that if we're
> > just using the workqueue code the scheduler's getting involved and we
> > can't just measure what the effects of that are going to be in
> > production.
>
> Are workqueues not getting involved already in your solution of punting
> to rescuer thread.
Only on allocation failure.
> In the proposal above also, workers get involved
> only if stack depth is too deep. So for normal stack usage performance
> should not be impacted.
>
> Performance aside, punting submission to per device worker in case of deep
> stack usage sounds cleaner solution to me.
Agreed, but performance tends to matter in the real world. And either
way the tricky bits are going to be confined to a few functions, so I
don't think it matters that much.
If someone wants to code up the workqueue version and test it, they're
more than welcome...
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
08-29-2012, 05:32 PM
Kent Overstreet
block: Avoid deadlocks with bio allocation by stacking drivers
On Wed, Aug 29, 2012 at 06:23:36PM +0100, Alasdair G Kergon wrote:
> On Wed, Aug 29, 2012 at 10:13:45AM -0700, Kent Overstreet wrote:
> > Only on allocation failure.
>
> Which as you already said assumes wrapping together the other mempools in the
> submission path first...
Yes? I'm not sure what you're getting at.
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
08-29-2012, 09:01 PM
"John Stoffel"
block: Avoid deadlocks with bio allocation by stacking drivers
>>>>> "Kent" == Kent Overstreet <koverstreet@google.com> writes:
Kent> On Wed, Aug 29, 2012 at 03:39:14PM +0100, Alasdair G Kergon wrote:
>> It's also instructive to remember why the code is the way it is: it used
>> to process bios for underlying devices immediately, but this sometimes
>> meant too much recursive stack growth. If a per-device rescuer thread
>> is to be made available (as well as the mempool), the option of
>> reinstating recursion is there too - only punting to workqueue when the
>> stack actually becomes "too big". (Also bear in mind that some dm
>> targets may have dependencies on their own mempools - submission can
>> block there too.) I find it helpful only to consider splitting into two
>> pieces - it must always be possible to process the first piece (i.e.
>> process it at the next layer down in the stack) and complete it
>> independently of what happens to the second piece (which might require
>> further splitting and block until the first piece has completed).
Kent> I'm sure it could be made to work (and it may well simpler), but it
Kent> seems problematic from a performance pov.
Kent> With stacked devices you'd then have to switch stacks on _every_ bio.
Kent> That could be made fast enough I'm sure, but it wouldn't be free and I
Kent> don't know of any existing code in the kernel that implements what we'd
Kent> need (though if you know how you'd go about doing that, I'd love to
Kent> know! Would be useful for other things).
Kent> The real problem is that because we'd need these extra stacks for
Kent> handling all bios we couldn't get by with just one per bio_set. We'd
Kent> only need one to make forward progress so the rest could be allocated
Kent> on demand (i.e. what the workqueue code does) but that sounds like it's
Kent> starting to get expensive.
Maybe we need to limit the size of BIOs to that of the bottom-most
underlying device instead? Or maybe limit BIOs to some small
multiple? As you stack up DM targets one on top of each other, they
should respect the limits of the underlying devices and pass those
limits up the chain.
Or maybe I'm speaking giberish...
John
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
08-29-2012, 09:08 PM
Kent Overstreet
block: Avoid deadlocks with bio allocation by stacking drivers
On Wed, Aug 29, 2012 at 05:01:09PM -0400, John Stoffel wrote:
> >>>>> "Kent" == Kent Overstreet <koverstreet@google.com> writes:
>
> Kent> On Wed, Aug 29, 2012 at 03:39:14PM +0100, Alasdair G Kergon wrote:
> >> It's also instructive to remember why the code is the way it is: it used
> >> to process bios for underlying devices immediately, but this sometimes
> >> meant too much recursive stack growth. If a per-device rescuer thread
> >> is to be made available (as well as the mempool), the option of
> >> reinstating recursion is there too - only punting to workqueue when the
> >> stack actually becomes "too big". (Also bear in mind that some dm
> >> targets may have dependencies on their own mempools - submission can
> >> block there too.) I find it helpful only to consider splitting into two
> >> pieces - it must always be possible to process the first piece (i.e.
> >> process it at the next layer down in the stack) and complete it
> >> independently of what happens to the second piece (which might require
> >> further splitting and block until the first piece has completed).
>
> Kent> I'm sure it could be made to work (and it may well simpler), but it
> Kent> seems problematic from a performance pov.
>
> Kent> With stacked devices you'd then have to switch stacks on _every_ bio.
> Kent> That could be made fast enough I'm sure, but it wouldn't be free and I
> Kent> don't know of any existing code in the kernel that implements what we'd
> Kent> need (though if you know how you'd go about doing that, I'd love to
> Kent> know! Would be useful for other things).
>
> Kent> The real problem is that because we'd need these extra stacks for
> Kent> handling all bios we couldn't get by with just one per bio_set. We'd
> Kent> only need one to make forward progress so the rest could be allocated
> Kent> on demand (i.e. what the workqueue code does) but that sounds like it's
> Kent> starting to get expensive.
>
> Maybe we need to limit the size of BIOs to that of the bottom-most
> underlying device instead? Or maybe limit BIOs to some small
> multiple? As you stack up DM targets one on top of each other, they
> should respect the limits of the underlying devices and pass those
> limits up the chain.
That's the approach the block layer currently tries to take. It's
brittle, tricky and inefficient, and it completely breaks down when the
limits are dynamic - so things like dm and bcache are always going to
have to split anyways.
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
08-30-2012, 10:07 PM
Vivek Goyal
block: Avoid deadlocks with bio allocation by stacking drivers
On Wed, Aug 29, 2012 at 10:13:45AM -0700, Kent Overstreet wrote:
[..]
> > Performance aside, punting submission to per device worker in case of deep
> > stack usage sounds cleaner solution to me.
>
> Agreed, but performance tends to matter in the real world. And either
> way the tricky bits are going to be confined to a few functions, so I
> don't think it matters that much.
>
> If someone wants to code up the workqueue version and test it, they're
> more than welcome...
Here is one quick and dirty proof of concept patch. It checks for stack
depth and if remaining space is less than 20% of stack size, then it
defers the bio submission to per queue worker.
/**
+ * blk_drain_deferred_bios - drain deferred bios
+ * @q: request_queue to drain deferred bios for
+ *
+ * Dispatch all currently deferred bios on @q through ->make_request_fn().
+ */
+static void blk_drain_deferred_bios(struct request_queue *q)
+{
+ struct bio_list bl;
+ struct bio *bio;
+ unsigned long flags;
+
+ bio_list_init(&bl);
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ bio_list_merge(&bl, &q->deferred_bios);
+ bio_list_init(&q->deferred_bios);
+ spin_unlock_irqrestore(q->queue_lock, flags);
+
+ while ((bio = bio_list_pop(&bl)))
+ generic_make_request(bio);
+}
+
+/**
* blk_drain_queue - drain requests from request_queue
* @q: queue to drain
* @drain_all: whether to drain all requests or only the ones w/ ELVPRIV
@@ -358,6 +399,10 @@ EXPORT_SYMBOL(blk_put_queue);
* Drain requests from @q. If @drain_all is set, all requests are drained.
* If not, only ELVPRIV requests are drained. The caller is responsible
* for ensuring that no new requests which need to be drained are queued.
+ *
+ * Note: It does not drain bios on q->deferred_bios list.
+ * Call blk_drain_deferred_bios() if need be.
+ *
*/
void blk_drain_queue(struct request_queue *q, bool drain_all)
{
@@ -505,6 +550,9 @@ void blk_cleanup_queue(struct request_qu
spin_unlock_irq(lock);
mutex_unlock(&q->sysfs_lock);
+ /* First drain all deferred bios. */
+ blk_drain_deferred_bios(q);
+
/* drain all requests queued before DEAD marking */
blk_drain_queue(q, true);
- if (blkcg_init_queue(q))
+ bio_list_init(&q->deferred_bios);
+ INIT_WORK(&q->deferred_bio_work, blk_deferred_bio_work);
+ q->deferred_bio_workqueue = alloc_workqueue("kdeferbiod", WQ_MEM_RECLAIM, 0);
+ if (!q->deferred_bio_workqueue)
goto fail_id;
+ if (blkcg_init_queue(q))
+ goto fail_deferred_bio_wq;
+
return q;
+fail_deferred_bio_wq:
+ destroy_workqueue(q->deferred_bio_workqueue);
fail_id:
ida_simple_remove(&blk_queue_ida, q->id);
fail_q:
@@ -1635,8 +1691,10 @@ static inline int bio_check_eod(struct b
return 0;
}
+
+
static noinline_for_stack bool
-generic_make_request_checks(struct bio *bio)
+generic_make_request_checks_early(struct bio *bio)
{
struct request_queue *q;
int nr_sectors = bio_sectors(bio);
@@ -1715,9 +1773,6 @@ generic_make_request_checks(struct bio *
*/
create_io_context(GFP_ATOMIC, q->node);
- if (blk_throtl_bio(q, bio))
- return false; /* throttled, will be resubmitted later */
-
trace_block_bio_queue(q, bio);
return true;
@@ -1726,6 +1781,56 @@ end_io:
return false;
}
+static noinline_for_stack bool
+generic_make_request_checks_late(struct bio *bio)
+{
+ struct request_queue *q;
+
+ q = bdev_get_queue(bio->bi_bdev);
+
+ /*
+ * Various block parts want %current->io_context and lazy ioc
+ * allocation ends up trading a lot of pain for a small amount of
+ * memory. Just allocate it upfront. This may fail and block
+ * layer knows how to live with it.
+ */
+ create_io_context(GFP_ATOMIC, q->node);
+
+ if (blk_throtl_bio(q, bio))
+ return false; /* throttled, will be resubmitted later */
+
+ return true;
+}
+
+static void __generic_make_request(struct bio *bio)
+{
+ struct request_queue *q;
+
+ if (!generic_make_request_checks_late(bio))
+ return;
+ q = bdev_get_queue(bio->bi_bdev);
+ q->make_request_fn(q, bio);
+}
+
+static void generic_make_request_defer_bio(struct bio *bio)
+{
+ struct request_queue *q;
+ unsigned long flags;
+
+ q = bdev_get_queue(bio->bi_bdev);
+
+ spin_lock_irqsave(q->queue_lock, flags);
+ if (unlikely(blk_queue_dead(q))) {
+ spin_unlock_irqrestore(q->queue_lock, flags);
+ bio_endio(bio, -ENODEV);
+ return;
+ }
+ set_bit(BIO_DEFERRED, &bio->bi_flags);
+ bio_list_add(&q->deferred_bios, bio);
+ spin_unlock_irqrestore(q->queue_lock, flags);
+ queue_work(q->deferred_bio_workqueue, &q->deferred_bio_work);
+}
+
/**
* generic_make_request - hand a buffer to its device driver for I/O
* @bio: The bio describing the location in memory and on the device.
@@ -1752,51 +1857,31 @@ end_io:
*/
void generic_make_request(struct bio *bio)
{
- struct bio_list bio_list_on_stack;
+ unsigned long sp = 0;
+ unsigned int threshold = (THREAD_SIZE * 2)/10;
- if (!generic_make_request_checks(bio))
- return;
+ BUG_ON(bio->bi_next);
- /*
- * We only want one ->make_request_fn to be active at a time, else
- * stack usage with stacked devices could be a problem. So use
- * current->bio_list to keep a list of requests submited by a
- * make_request_fn function. current->bio_list is also used as a
- * flag to say if generic_make_request is currently active in this
- * task or not. If it is NULL, then no make_request is active. If
- * it is non-NULL, then a make_request is active, and new requests
- * should be added at the tail
- */
- if (current->bio_list) {
- bio_list_add(current->bio_list, bio);
+ /* Submitteing deferred bio from worker context. */
+ if (bio_flagged(bio, BIO_DEFERRED)) {
+ clear_bit(BIO_DEFERRED, &bio->bi_flags);
+ __generic_make_request(bio);
return;
}
- /* following loop may be a bit non-obvious, and so deserves some
- * explanation.
- * Before entering the loop, bio->bi_next is NULL (as all callers
- * ensure that) so we have a list with a single bio.
- * We pretend that we have just taken it off a longer list, so
- * we assign bio_list to a pointer to the bio_list_on_stack,
- * thus initialising the bio_list of new bios to be
- * added. ->make_request() may indeed add some more bios
- * through a recursive call to generic_make_request. If it
- * did, we find a non-NULL value in bio_list and re-enter the loop
- * from the top. In this case we really did just take the bio
- * of the top of the list (no pretending) and so remove it from
- * bio_list, and call into ->make_request() again.
- */
- BUG_ON(bio->bi_next);
- bio_list_init(&bio_list_on_stack);
- current->bio_list = &bio_list_on_stack;
- do {
- struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+ if (!generic_make_request_checks_early(bio))
+ return;
- q->make_request_fn(q, bio);
+ /*
+ * FIXME. Provide an arch dependent function to return left stack
+ * space for current task. This is hack for x86_64.
+ */
+ asm volatile("movq %%rsp,%0" : "=m"(sp));