Linux Archive


Tejun Heo 08-22-2012 08:30 PM

block: Avoid deadlocks with bio allocation by stacking drivers
 
Hello, Kent.

On Wed, Aug 22, 2012 at 10:04:04AM -0700, Kent Overstreet wrote:
> Previously, if we ever try to allocate more than once from the same bio
> set while running under generic_make_request(), we risk deadlock.
>
> This would happen if e.g. a bio ever needed to be split more than once,
> and it's difficult to handle correctly in the drivers - so in practice
> it isn't handled.
>
> This patch fixes this issue by allocating a rescuer workqueue for each
> bio_set, and punting queued bios to said rescuer when necessary:

This description needs to be several times more descriptive than it
currently is. The deadlock issue at hand is very unobvious. Please
explain it so that someone who isn't familiar with the problem can
understand it by reading it.

Also, please describe how the patch was tested.

> @@ -306,16 +324,39 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> p = kmalloc(sizeof(struct bio) +
> nr_iovecs * sizeof(struct bio_vec),
> gfp_mask);
> +
> front_pad = 0;
> inline_vecs = nr_iovecs;
> } else {
> - p = mempool_alloc(bs->bio_pool, gfp_mask);
> + /*
> + * If we're running under generic_make_request()
> + * (current->bio_list != NULL), we risk deadlock if we sleep on
> + * allocation and there's already bios on current->bio_list that
> + * were allocated from the same bio_set; they won't be submitted
> + * (and thus freed) as long as we're blocked here.
> + *
> + * To deal with this, we first try the allocation without using
> + * the mempool; if that fails, we punt all the bios on
> + * current->bio_list to a different thread and then retry the
> + * allocation with the original gfp mask.
> + */

Ditto here.

> + if (current->bio_list &&
> + !bio_list_empty(current->bio_list) &&
> + (gfp_mask & __GFP_WAIT))
> + gfp_mask &= GFP_ATOMIC;

Why aren't we turning off __GFP_WAIT instead? e.g. What if the caller
has one of NUMA flags or __GFP_COLD specified?
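
To make the difference concrete - a sketch with made-up caller flags,
assuming the 3.x-era definitions where GFP_ATOMIC is just __GFP_HIGH and
GFP_NOIO is just __GFP_WAIT:

	gfp_t mask = GFP_NOIO | __GFP_COLD | __GFP_THISNODE;

	/* Masking *to* GFP_ATOMIC discards every modifier except __GFP_HIGH: */
	gfp_t a = mask & GFP_ATOMIC;	/* == 0; __GFP_COLD and __GFP_THISNODE are lost */

	/* Clearing __GFP_WAIT keeps the caller's other modifiers intact: */
	gfp_t b = mask & ~__GFP_WAIT;	/* == __GFP_COLD | __GFP_THISNODE */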

> +retry:
> + if (gfp_mask & __GFP_WAIT)
> + p = mempool_alloc(bs->bio_pool, gfp_mask);
> + else
> + p = kmem_cache_alloc(bs->bio_slab, gfp_mask);

Why is it necessary to avoid using the mempool if __GFP_WAIT is
already cleared?

> front_pad = bs->front_pad;
> inline_vecs = BIO_INLINE_VECS;
> }
>
> if (unlikely(!p))
> - return NULL;
> + goto err;
>
> bio = p + front_pad;
> bio_init(bio);
> @@ -336,6 +377,19 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>
> err_free:
> mempool_free(p, bs->bio_pool);
> +err:
> + if (gfp_mask != saved_gfp) {
> + gfp_mask = saved_gfp;
> +
> + spin_lock(&bs->rescue_lock);
> + bio_list_merge(&bs->rescue_list, current->bio_list);
> + bio_list_init(current->bio_list);
> + spin_unlock(&bs->rescue_lock);
> +
> + queue_work(bs->rescue_workqueue, &bs->rescue_work);
> + goto retry;
> + }
> +
> return NULL;
> }
> EXPORT_SYMBOL(bio_alloc_bioset);
> @@ -1544,6 +1598,9 @@ static void biovec_free_pools(struct bio_set *bs)
>
> void bioset_free(struct bio_set *bs)
> {
> + if (bs->rescue_workqueue)
> + destroy_workqueue(bs->rescue_workqueue);
> +
> if (bs->bio_pool)
> mempool_destroy(bs->bio_pool);
>
> @@ -1579,6 +1636,10 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
>
> bs->front_pad = front_pad;
>
> + spin_lock_init(&bs->rescue_lock);
> + bio_list_init(&bs->rescue_list);
> + INIT_WORK(&bs->rescue_work, bio_alloc_rescue);
> +
> bs->bio_slab = bio_find_or_create_slab(front_pad + back_pad);
> if (!bs->bio_slab) {
> kfree(bs);
> @@ -1589,9 +1650,14 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
> if (!bs->bio_pool)
> goto bad;
>
> - if (!biovec_create_pools(bs, pool_size))
> - return bs;
> + if (biovec_create_pools(bs, pool_size))
> + goto bad;
> +
> + bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
> + if (!bs->rescue_workqueue)
> + goto bad;
>
> + return bs;
> bad:
> bioset_free(bs);
> return NULL;
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index b22c22b..ba5b52e 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -290,39 +290,6 @@ static inline int bio_associate_current(struct bio *bio) { return -ENOENT; }
> static inline void bio_disassociate_task(struct bio *bio) { }
> #endif /* CONFIG_BLK_CGROUP */
>
> -/*
> - * bio_set is used to allow other portions of the IO system to
> - * allocate their own private memory pools for bio and iovec structures.
> - * These memory pools in turn all allocate from the bio_slab
> - * and the bvec_slabs[].
> - */
> -#define BIO_POOL_SIZE 2
> -#define BIOVEC_NR_POOLS 6
> -#define BIOVEC_MAX_IDX (BIOVEC_NR_POOLS - 1)
> -
> -struct bio_set {
> - struct kmem_cache *bio_slab;
> - unsigned int front_pad;
> -
> - mempool_t *bio_pool;
> -#if defined(CONFIG_BLK_DEV_INTEGRITY)
> - mempool_t *bio_integrity_pool;
> -#endif
> - mempool_t *bvec_pool;
> -};
> -
> -struct biovec_slab {
> - int nr_vecs;
> - char *name;
> - struct kmem_cache *slab;
> -};

Please don't mix struct definition relocation (or any relocation
really) with actual changes. It buries the actual changes and makes
reviewing difficult.

Thanks.

--
tejun


Kent Overstreet 08-24-2012 05:55 AM

block: Avoid deadlocks with bio allocation by stacking drivers
 
On Wed, Aug 22, 2012 at 01:30:03PM -0700, Tejun Heo wrote:
> This description needs to be several times more descriptive than it
> currently is. The deadlock issue at hand is very unobvious. Please
> explain it so that someone who isn't familiar with the problem can
> understand it by reading it.
>
> Also, please describe how the patch was tested.

New patch below, tell me how the new description reads?

> > @@ -306,16 +324,39 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> > p = kmalloc(sizeof(struct bio) +
> > nr_iovecs * sizeof(struct bio_vec),
> > gfp_mask);
> > +
> > front_pad = 0;
> > inline_vecs = nr_iovecs;
> > } else {
> > - p = mempool_alloc(bs->bio_pool, gfp_mask);
> > + /*
> > + * If we're running under generic_make_request()
> > + * (current->bio_list != NULL), we risk deadlock if we sleep on
> > + * allocation and there's already bios on current->bio_list that
> > + * were allocated from the same bio_set; they won't be submitted
> > + * (and thus freed) as long as we're blocked here.
> > + *
> > + * To deal with this, we first try the allocation without using
> > + * the mempool; if that fails, we punt all the bios on
> > + * current->bio_list to a different thread and then retry the
> > + * allocation with the original gfp mask.
> > + */
>
> Ditto here.

Ok, tell me if the new version below is better.

>
> > + if (current->bio_list &&
> > + !bio_list_empty(current->bio_list) &&
> > + (gfp_mask & __GFP_WAIT))
> > + gfp_mask &= GFP_ATOMIC;
>
> Why aren't we turning off __GFP_WAIT instead? e.g. What if the caller
> has one of NUMA flags or __GFP_COLD specified?

Didn't think of that. The reason I did it that way is I wasn't sure if
just doing &= ~__GFP_WAIT would be correct, since that would leave
__GFP_IO|__GFP_FS set.

Looking at how the mempool code uses the gfp flags, it's not clear to me
what happens in general if __GFP_WAIT isn't set but __GFP_IO is... but I
_think_ masking out __GFP_WAIT is sufficient and correct.

(The mempool code masks out __GFP_IO where it masks out __GFP_WAIT, but
it doesn't mask out __GFP_FS. Why would masking out __GFP_IO matter? And
if it does, why isn't __GFP_FS masked out? Inquiring minds want to know.)

But at least for my purposes, masking out __GFP_WAIT looks correct.
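
For reference, the flag handling at the top of mempool_alloc() - roughly
its 3.x-era shape, abridged:

	void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
	{
		void *element;
		gfp_t gfp_temp;

		might_sleep_if(gfp_mask & __GFP_WAIT);

		gfp_mask |= __GFP_NOMEMALLOC;	/* don't dip into emergency reserves */
		gfp_mask |= __GFP_NORETRY;	/* don't loop in __alloc_pages */
		gfp_mask |= __GFP_NOWARN;	/* failures fall back to the pool */

		/* first attempt: no sleeping, no I/O - fall back to the pool */
		gfp_temp = gfp_mask & ~(__GFP_WAIT|__GFP_IO);
	repeat_alloc:
		element = pool->alloc(gfp_temp, pool->pool_data);
		...
	}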

> > +retry:
> > + if (gfp_mask & __GFP_WAIT)
> > + p = mempool_alloc(bs->bio_pool, gfp_mask);
> > + else
> > + p = kmem_cache_alloc(bs->bio_slab, gfp_mask);
>
> Why is it necessary to avoid using the mempool if __GFP_WAIT is
> already cleared?

Now that I think about it, avoiding the mempool shouldn't be necessary
at all (and it's better not to since the mempool code adds various
flags, so allocating without those flags is at least inconsistent).

> Please don't mix struct definition relocation (or any relocation
> really) with actual changes. It buries the actual changes and makes
> reviewing difficult.

Make a new patch that does nothing more than reorder the definitions,
then?


commit d4351a8fc634d3ed95c27f633184f3f87ba5c754
Author: Kent Overstreet <koverstreet@google.com>
Date: Thu Aug 23 22:54:17 2012 -0700

block: Avoid deadlocks with bio allocation by stacking drivers

Previously, if we ever try to allocate more than once from the same bio
set while running under generic_make_request() (i.e. a stacking block
driver), we risk deadlock.

This is because of the code in generic_make_request() that converts
recursion to iteration; any bios we submit won't actually be submitted
(so they can complete and eventually be freed) until after we return -
this means if we allocate a second bio, we're blocking the first one
from ever being freed.

Thus if enough threads call into a stacking block driver at the same
time with bios that need multiple splits, and the bio_set's reserve gets
used up, we deadlock.

This can be worked around in the driver code - we could check if we're
running under generic_make_request(), then mask out __GFP_WAIT when we
go to allocate a bio, and if the allocation fails punt to workqueue and
retry the allocation.

But this is tricky and not a generic solution. This patch solves it for
all users by inverting the previously described technique. We allocate a
rescuer workqueue for each bio_set, and then in the allocation code if
there are bios on current->bio_list we would be blocking, we punt them
to the rescuer workqueue to be submitted.

Tested it by forcing the rescue codepath to be taken (by disabling the
first GFP_NOWAIT attempt), and then ran it with bcache (which does a lot
of arbitrary bio splitting) and verified that the rescuer was being
invoked.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>

diff --git a/fs/bio.c b/fs/bio.c
index ed34526..f9f87d7 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -282,6 +282,23 @@ void bio_reset(struct bio *bio)
}
EXPORT_SYMBOL(bio_reset);

+static void bio_alloc_rescue(struct work_struct *work)
+{
+ struct bio_set *bs = container_of(work, struct bio_set, rescue_work);
+ struct bio *bio;
+
+ while (1) {
+ spin_lock(&bs->rescue_lock);
+ bio = bio_list_pop(&bs->rescue_list);
+ spin_unlock(&bs->rescue_lock);
+
+ if (!bio)
+ break;
+
+ generic_make_request(bio);
+ }
+}
+
/**
* bio_alloc_bioset - allocate a bio for I/O
* @gfp_mask: the GFP_ mask given to the slab allocator
@@ -304,6 +321,7 @@ EXPORT_SYMBOL(bio_reset);
*/
struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
{
+ gfp_t saved_gfp = gfp_mask;
unsigned front_pad;
unsigned inline_vecs;
unsigned long idx = BIO_POOL_NONE;
@@ -318,16 +336,44 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
p = kmalloc(sizeof(struct bio) +
nr_iovecs * sizeof(struct bio_vec),
gfp_mask);
+
front_pad = 0;
inline_vecs = nr_iovecs;
} else {
+ /*
+ * generic_make_request() converts recursion to iteration; this
+ * means if we're running beneath it, any bios we allocate and
+ * submit will not be submitted (and thus freed) until after we
+ * return.
+ *
+ * This exposes us to a potential deadlock if we allocate
+ * multiple bios from the same bio_set() while running
+ * underneath generic_make_request(). If we were to allocate
+ * multiple bios (say a stacking block driver that was splitting
+ * bios), we would deadlock if we exhausted the mempool's
+ * reserve.
+ *
+ * We solve this, and guarantee forward progress, with a rescuer
+ * workqueue per bio_set. If we go to allocate and there are
+ * bios on current->bio_list, we first try the allocation
+ * without __GFP_WAIT; if that fails, we punt those bios we
+ * would be blocking to the rescuer workqueue before we retry
+ * with the original gfp_flags.
+ */
+
+ if (current->bio_list &&
+ !bio_list_empty(current->bio_list) &&
+ (gfp_mask & __GFP_WAIT))
+ gfp_mask &= ~__GFP_WAIT;
+retry:
p = mempool_alloc(bs->bio_pool, gfp_mask);
+
front_pad = bs->front_pad;
inline_vecs = BIO_INLINE_VECS;
}

if (unlikely(!p))
- return NULL;
+ goto err;

bio = p + front_pad;
bio_init(bio);
@@ -348,6 +394,19 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)

err_free:
mempool_free(p, bs->bio_pool);
+err:
+ if (gfp_mask != saved_gfp) {
+ gfp_mask = saved_gfp;
+
+ spin_lock(&bs->rescue_lock);
+ bio_list_merge(&bs->rescue_list, current->bio_list);
+ bio_list_init(current->bio_list);
+ spin_unlock(&bs->rescue_lock);
+
+ queue_work(bs->rescue_workqueue, &bs->rescue_work);
+ goto retry;
+ }
+
return NULL;
}
EXPORT_SYMBOL(bio_alloc_bioset);
@@ -1556,6 +1615,9 @@ static void biovec_free_pools(struct bio_set *bs)

void bioset_free(struct bio_set *bs)
{
+ if (bs->rescue_workqueue)
+ destroy_workqueue(bs->rescue_workqueue);
+
if (bs->bio_pool)
mempool_destroy(bs->bio_pool);

@@ -1591,6 +1653,10 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)

bs->front_pad = front_pad;

+ spin_lock_init(&bs->rescue_lock);
+ bio_list_init(&bs->rescue_list);
+ INIT_WORK(&bs->rescue_work, bio_alloc_rescue);
+
bs->bio_slab = bio_find_or_create_slab(front_pad + back_pad);
if (!bs->bio_slab) {
kfree(bs);
@@ -1601,9 +1667,14 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
if (!bs->bio_pool)
goto bad;

- if (!biovec_create_pools(bs, pool_size))
- return bs;
+ if (biovec_create_pools(bs, pool_size))
+ goto bad;
+
+ bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
+ if (!bs->rescue_workqueue)
+ goto bad;

+ return bs;
bad:
bioset_free(bs);
return NULL;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index b22c22b..ba5b52e 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -290,39 +290,6 @@ static inline int bio_associate_current(struct bio *bio) { return -ENOENT; }
static inline void bio_disassociate_task(struct bio *bio) { }
#endif /* CONFIG_BLK_CGROUP */

-/*
- * bio_set is used to allow other portions of the IO system to
- * allocate their own private memory pools for bio and iovec structures.
- * These memory pools in turn all allocate from the bio_slab
- * and the bvec_slabs[].
- */
-#define BIO_POOL_SIZE 2
-#define BIOVEC_NR_POOLS 6
-#define BIOVEC_MAX_IDX (BIOVEC_NR_POOLS - 1)
-
-struct bio_set {
- struct kmem_cache *bio_slab;
- unsigned int front_pad;
-
- mempool_t *bio_pool;
-#if defined(CONFIG_BLK_DEV_INTEGRITY)
- mempool_t *bio_integrity_pool;
-#endif
- mempool_t *bvec_pool;
-};
-
-struct biovec_slab {
- int nr_vecs;
- char *name;
- struct kmem_cache *slab;
-};
-
-/*
- * a small number of entries is fine, not going to be performance critical.
- * basically we just need to survive
- */
-#define BIO_SPLIT_ENTRIES 2
-
#ifdef CONFIG_HIGHMEM
/*
* remember never ever reenable interrupts between a bvec_kmap_irq and
@@ -497,6 +464,48 @@ static inline struct bio *bio_list_get(struct bio_list *bl)
return bio;
}

+/*
+ * bio_set is used to allow other portions of the IO system to
+ * allocate their own private memory pools for bio and iovec structures.
+ * These memory pools in turn all allocate from the bio_slab
+ * and the bvec_slabs[].
+ */
+#define BIO_POOL_SIZE 2
+#define BIOVEC_NR_POOLS 6
+#define BIOVEC_MAX_IDX (BIOVEC_NR_POOLS - 1)
+
+struct bio_set {
+ struct kmem_cache *bio_slab;
+ unsigned int front_pad;
+
+ mempool_t *bio_pool;
+#if defined(CONFIG_BLK_DEV_INTEGRITY)
+ mempool_t *bio_integrity_pool;
+#endif
+ mempool_t *bvec_pool;
+
+ /*
+ * Deadlock avoidance for stacking block drivers: see comments in
+ * bio_alloc_bioset() for details
+ */
+ spinlock_t rescue_lock;
+ struct bio_list rescue_list;
+ struct work_struct rescue_work;
+ struct workqueue_struct *rescue_workqueue;
+};
+
+struct biovec_slab {
+ int nr_vecs;
+ char *name;
+ struct kmem_cache *slab;
+};
+
+/*
+ * a small number of entries is fine, not going to be performance critical.
+ * basically we just need to survive
+ */
+#define BIO_SPLIT_ENTRIES 2
+
#if defined(CONFIG_BLK_DEV_INTEGRITY)

#define bip_vec_idx(bip, idx) (&(bip->bip_vec[(idx)]))


Tejun Heo 08-24-2012 08:28 PM

block: Avoid deadlocks with bio allocation by stacking drivers
 
Hello,

On Thu, Aug 23, 2012 at 10:55:54PM -0700, Kent Overstreet wrote:
> > Why aren't we turning off __GFP_WAIT instead? e.g. What if the caller
> > has one of NUMA flags or __GFP_COLD specified?
>
> Didn't think of that. The reason I did it that way is I wasn't sure if
> just doing &= ~__GFP_WAIT would be correct, since that would leave
> __GFP_IO|__GFP_FS set.

Using the appropriate __GFP_IO/FS flags is the caller's
responsibility. The only thing bioset needs to worry and take action
about here is __GFP_WAIT causing indefinite wait in mempool.

> Please don't mix struct definition relocation (or any relocation
> > really) with actual changes. It buries the actual changes and makes
> > reviewing difficult.
>
> Make a new patch that does nothing more than reorder the definitions,
> then?

Yeap, with justification.

> block: Avoid deadlocks with bio allocation by stacking drivers
>
> Previously, if we ever try to allocate more than once from the same bio
> set while running under generic_make_request() (i.e. a stacking block
> driver), we risk deadlock.
>
> This is because of the code in generic_make_request() that converts
> recursion to iteration; any bios we submit won't actually be submitted
> (so they can complete and eventually be freed) until after we return -
> this means if we allocate a second bio, we're blocking the first one
> from ever being freed.
>
> Thus if enough threads call into a stacking block driver at the same
> time with bios that need multiple splits, and the bio_set's reserve gets
> used up, we deadlock.
>
> This can be worked around in the driver code - we could check if we're
> running under generic_make_request(), then mask out __GFP_WAIT when we
> go to allocate a bio, and if the allocation fails punt to workqueue and
> retry the allocation.
>
> But this is tricky and not a generic solution. This patch solves it for
> all users by inverting the previously described technique. We allocate a
> rescuer workqueue for each bio_set, and then in the allocation code if
> there are bios on current->bio_list we would be blocking, we punt them
> to the rescuer workqueue to be submitted.
>
> Tested it by forcing the rescue codepath to be taken (by disabling the
> first GFP_NOWAIT attempt), and then ran it with bcache (which does a lot
> of arbitrary bio splitting) and verified that the rescuer was being
> invoked.

Yeah, the description looks good to me.

> struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> {
> + gfp_t saved_gfp = gfp_mask;
> unsigned front_pad;
> unsigned inline_vecs;
> unsigned long idx = BIO_POOL_NONE;
> @@ -318,16 +336,44 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> p = kmalloc(sizeof(struct bio) +
> nr_iovecs * sizeof(struct bio_vec),
> gfp_mask);
> +
> front_pad = 0;
> inline_vecs = nr_iovecs;
> } else {
> + /*
> + * generic_make_request() converts recursion to iteration; this
> + * means if we're running beneath it, any bios we allocate and
> + * submit will not be submitted (and thus freed) until after we
> + * return.
> + *
> + * This exposes us to a potential deadlock if we allocate
> + * multiple bios from the same bio_set() while running
> + * underneath generic_make_request(). If we were to allocate
> + * multiple bios (say a stacking block driver that was splitting
> + * bios), we would deadlock if we exhausted the mempool's
> + * reserve.
> + *
> + * We solve this, and guarantee forward progress, with a rescuer
> + * workqueue per bio_set. If we go to allocate and there are
> + * bios on current->bio_list, we first try the allocation
> + * without __GFP_WAIT; if that fails, we punt those bios we
> + * would be blocking to the rescuer workqueue before we retry
> + * with the original gfp_flags.
> + */

Can you please add a comment in generic_make_request() to describe the
issue briefly and link back here?
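
Something like this, say (sketch only - the wording is made up; the
surrounding code is roughly the 3.x shape of generic_make_request()):

	void generic_make_request(struct bio *bio)
	{
		...
		if (current->bio_list) {
			/*
			 * We're already inside generic_make_request() on this
			 * task, so queue the bio instead of recursing. Note
			 * that bios queued here won't be submitted until the
			 * outermost call unwinds, which is why allocating from
			 * a bio_set while bios from that set sit on this list
			 * can deadlock - see the rescuer logic in
			 * bio_alloc_bioset().
			 */
			bio_list_add(current->bio_list, bio);
			return;
		}
		...
	}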

> void bioset_free(struct bio_set *bs)
> {
> + if (bs->rescue_workqueue)

Why is the conditional necessary? Is it possible to have a bioset w/o
rescue_workqueue?

> + destroy_workqueue(bs->rescue_workqueue);
> +
> if (bs->bio_pool)
> mempool_destroy(bs->bio_pool);

This makes bioset_free() require process context, which probably is
okay for bioset but still isn't nice. Might be worth noting in the patch
description.
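
That is, with the workqueue in place the teardown path looks like this
(sketch; the might_sleep() annotation is illustrative, not in the patch):

	void bioset_free(struct bio_set *bs)
	{
		might_sleep();	/* destroy_workqueue() flushes pending work and sleeps */

		if (bs->rescue_workqueue)
			destroy_workqueue(bs->rescue_workqueue);
		...
	}

so callers can no longer free a bioset from atomic context.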

Thanks.

--
tejun


Tejun Heo 08-28-2012 08:49 PM

block: Avoid deadlocks with bio allocation by stacking drivers
 
Hello,

On Tue, Aug 28, 2012 at 10:37:36AM -0700, Kent Overstreet wrote:
> @@ -324,13 +342,37 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> front_pad = 0;
> inline_vecs = nr_iovecs;
> } else {
> + /*
> + * generic_make_request() converts recursion to iteration; this
> + * means if we're running beneath it, any bios we allocate and
> + * submit will not be submitted (and thus freed) until after we
> + * return.
> + *
> + * This exposes us to a potential deadlock if we allocate
> + * multiple bios from the same bio_set() while running
> + * underneath generic_make_request(). If we were to allocate
> + * multiple bios (say a stacking block driver that was splitting
> + * bios), we would deadlock if we exhausted the mempool's
> + * reserve.
> + *
> + * We solve this, and guarantee forward progress, with a rescuer
> + * workqueue per bio_set. If we go to allocate and there are
> + * bios on current->bio_list, we first try the allocation
> + * without __GFP_WAIT; if that fails, we punt those bios we
> + * would be blocking to the rescuer workqueue before we retry
> + * with the original gfp_flags.
> + */
> +
> + if (current->bio_list && !bio_list_empty(current->bio_list))
> + gfp_mask &= ~__GFP_WAIT;
> +retry:
> p = mempool_alloc(bs->bio_pool, gfp_mask);
> front_pad = bs->front_pad;
> inline_vecs = BIO_INLINE_VECS;
> }
>
> if (unlikely(!p))
> - return NULL;
> + goto err;
>
> bio = p + front_pad;
> bio_init(bio);
> @@ -351,6 +393,19 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>
> err_free:
> mempool_free(p, bs->bio_pool);
> +err:
> + if (gfp_mask != saved_gfp) {
> + gfp_mask = saved_gfp;
> +
> + spin_lock(&bs->rescue_lock);
> + bio_list_merge(&bs->rescue_list, current->bio_list);
> + bio_list_init(current->bio_list);
> + spin_unlock(&bs->rescue_lock);
> +
> + queue_work(bs->rescue_workqueue, &bs->rescue_work);
> + goto retry;
> + }

I wonder whether it would be easier to follow if this part is in-line
where retry: is. All that needs to be duplicated is a single
mempool_alloc() call, right?
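
Something along these lines (sketch only):

	p = mempool_alloc(bs->bio_pool, gfp_mask);
	if (unlikely(!p) && gfp_mask != saved_gfp) {
		/*
		 * Punt the bios blocking us to the rescuer, then retry
		 * with the caller's original gfp mask.
		 */
		spin_lock(&bs->rescue_lock);
		bio_list_merge(&bs->rescue_list, current->bio_list);
		bio_list_init(current->bio_list);
		spin_unlock(&bs->rescue_lock);
		queue_work(bs->rescue_workqueue, &bs->rescue_work);

		p = mempool_alloc(bs->bio_pool, saved_gfp);
	}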

Overall, I *think* this is correct but need to think more about it to
be sure.

Thanks!

--
tejun


Vivek Goyal 08-28-2012 10:06 PM

block: Avoid deadlocks with bio allocation by stacking drivers
 
On Tue, Aug 28, 2012 at 10:37:36AM -0700, Kent Overstreet wrote:
> Previously, if we ever try to allocate more than once from the same bio
> set while running under generic_make_request() (i.e. a stacking block
> driver), we risk deadlock.
>
> This is because of the code in generic_make_request() that converts
> recursion to iteration; any bios we submit won't actually be submitted
> (so they can complete and eventually be freed) until after we return -
> this means if we allocate a second bio, we're blocking the first one
> from ever being freed.
>
> Thus if enough threads call into a stacking block driver at the same
> time with bios that need multiple splits, and the bio_set's reserve gets
> used up, we deadlock.

Hi Kent,

So above deadlock possibility arises only if multiple threads are doing
multiple splits from same pool and all the threads get blocked on mempool
and don't return from ->make_request function.

Is there any existing driver in the tree which can cause this deadlock or
it becomes a possibility only when splitting and bcache code shows up?

Thanks
Vivek


Kent Overstreet 08-28-2012 10:23 PM

block: Avoid deadlocks with bio allocation by stacking drivers
 
On Tue, Aug 28, 2012 at 06:06:10PM -0400, Vivek Goyal wrote:
> On Tue, Aug 28, 2012 at 10:37:36AM -0700, Kent Overstreet wrote:
> > Previously, if we ever try to allocate more than once from the same bio
> > set while running under generic_make_request() (i.e. a stacking block
> > driver), we risk deadlock.
> >
> > This is because of the code in generic_make_request() that converts
> > recursion to iteration; any bios we submit won't actually be submitted
> > (so they can complete and eventually be freed) until after we return -
> > this means if we allocate a second bio, we're blocking the first one
> > from ever being freed.
> >
> > Thus if enough threads call into a stacking block driver at the same
> > time with bios that need multiple splits, and the bio_set's reserve gets
> > used up, we deadlock.
>
> Hi Kent,
>
> So above deadlock possibility arises only if multiple threads are doing
> multiple splits from same pool and all the threads get blocked on mempool
> and don't return from ->make_request function.
>
> Is there any existing driver in the tree which can cause this deadlock or
> it becomes a possibility only when splitting and bcache code shows up?

It is emphatically possible with dm, though you would probably need a
pathological configuration of your targets for it to be possible in
practice - at least with the targets commonly in use today.

With most of the newer dm targets coming down the pipe I expect it
should be possible under real world configurations, with the caveat that
under that kind of memory pressure in practice many other things will
fall over first.


Kent Overstreet 08-28-2012 10:28 PM

block: Avoid deadlocks with bio allocation by stacking drivers
 
On Tue, Aug 28, 2012 at 01:49:10PM -0700, Tejun Heo wrote:
> Hello,
>
> On Tue, Aug 28, 2012 at 10:37:36AM -0700, Kent Overstreet wrote:
> > @@ -324,13 +342,37 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> > front_pad = 0;
> > inline_vecs = nr_iovecs;
> > } else {
> > + /*
> > + * generic_make_request() converts recursion to iteration; this
> > + * means if we're running beneath it, any bios we allocate and
> > + * submit will not be submitted (and thus freed) until after we
> > + * return.
> > + *
> > + * This exposes us to a potential deadlock if we allocate
> > + * multiple bios from the same bio_set() while running
> > + * underneath generic_make_request(). If we were to allocate
> > + * multiple bios (say a stacking block driver that was splitting
> > + * bios), we would deadlock if we exhausted the mempool's
> > + * reserve.
> > + *
> > + * We solve this, and guarantee forward progress, with a rescuer
> > + * workqueue per bio_set. If we go to allocate and there are
> > + * bios on current->bio_list, we first try the allocation
> > + * without __GFP_WAIT; if that fails, we punt those bios we
> > + * would be blocking to the rescuer workqueue before we retry
> > + * with the original gfp_flags.
> > + */
> > +
> > + if (current->bio_list && !bio_list_empty(current->bio_list))
> > + gfp_mask &= ~__GFP_WAIT;
> > +retry:
> > p = mempool_alloc(bs->bio_pool, gfp_mask);
> > front_pad = bs->front_pad;
> > inline_vecs = BIO_INLINE_VECS;
> > }
> >
> > if (unlikely(!p))
> > - return NULL;
> > + goto err;
> >
> > bio = p + front_pad;
> > bio_init(bio);
> > @@ -351,6 +393,19 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> >
> > err_free:
> > mempool_free(p, bs->bio_pool);
> > +err:
> > + if (gfp_mask != saved_gfp) {
> > + gfp_mask = saved_gfp;
> > +
> > + spin_lock(&bs->rescue_lock);
> > + bio_list_merge(&bs->rescue_list, current->bio_list);
> > + bio_list_init(current->bio_list);
> > + spin_unlock(&bs->rescue_lock);
> > +
> > + queue_work(bs->rescue_workqueue, &bs->rescue_work);
> > + goto retry;
> > + }
>
> I wonder whether it would be easier to follow if this part is in-line
> where retry: is. All that needs to be duplicated is single
> mempool_alloc() call, right?

Actually, what might be better than both of those approaches is shoving
that code into another function. Then it's just:

	if (gfp_mask != saved_gfp) {
		gfp_mask = saved_gfp;
		shovel_bios_to_rescuer();
		goto retry;
	}
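
With the helper just being the existing err-path body pulled out - a
sketch, taking the bio_set as its argument (the name is made up):

	static void shovel_bios_to_rescuer(struct bio_set *bs)
	{
		spin_lock(&bs->rescue_lock);
		bio_list_merge(&bs->rescue_list, current->bio_list);
		bio_list_init(current->bio_list);
		spin_unlock(&bs->rescue_lock);

		queue_work(bs->rescue_workqueue, &bs->rescue_work);
	}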

> Overall, I *think* this is correct but need to think more about it to
> be sure.

Please do. As much time as I've spent staring at this kind of stuff,
I'm pretty sure I've got it correct but it still makes my head hurt to
work out all the various possible deadlocks.


Kent Overstreet 08-28-2012 11:01 PM

block: Avoid deadlocks with bio allocation by stacking drivers
 
On Tue, Aug 28, 2012 at 03:28:00PM -0700, Kent Overstreet wrote:
> On Tue, Aug 28, 2012 at 01:49:10PM -0700, Tejun Heo wrote:
> > Overall, I *think* this is correct but need to think more about it to
> > be sure.
>
> Please do. As much time as I've spent staring at this kind of stuff,
> I'm pretty sure I've got it correct but it still makes my head hurt to
> work out all the various possible deadlocks.

Hilarious thought: We're punting bios to a rescuer thread that's
specific to a certain bio_set, right? What if we happen to punt bios
from a different bio_set? And then the rescuer goes to resubmit those
bios, and in the process they happen to have dependencies on the
original bio_set...

I think it's actually necessary to punt only the bios from the current
bio_set to its rescuer.
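
One way to do that filtering - a sketch, assuming each bio records the
bio_set it was allocated from in bio->bi_pool, as arranged elsewhere in
this series:

	static void punt_bios_to_rescuer(struct bio_set *bs)
	{
		struct bio_list punt, nopunt;
		struct bio *bio;

		bio_list_init(&punt);
		bio_list_init(&nopunt);

		/* punt only our own bios; leave other bio_sets' bios alone */
		while ((bio = bio_list_pop(current->bio_list)))
			bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);

		*current->bio_list = nopunt;

		spin_lock(&bs->rescue_lock);
		bio_list_merge(&bs->rescue_list, &punt);
		spin_unlock(&bs->rescue_lock);

		queue_work(bs->rescue_workqueue, &bs->rescue_work);
	}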

God I love the block layer sometimes.


Vivek Goyal 08-29-2012 01:31 AM

block: Avoid deadlocks with bio allocation by stacking drivers
 
On Tue, Aug 28, 2012 at 04:01:08PM -0700, Kent Overstreet wrote:
> On Tue, Aug 28, 2012 at 03:28:00PM -0700, Kent Overstreet wrote:
> > On Tue, Aug 28, 2012 at 01:49:10PM -0700, Tejun Heo wrote:
> > > Overall, I *think* this is correct but need to think more about it to
> > > be sure.
> >
> > Please do. As much time as I've spent staring at this kind of stuff,
> > I'm pretty sure I've got it correct but it still makes my head hurt to
> > work out all the various possible deadlocks.
>
> Hilarious thought: We're punting bios to a rescuer thread that's
> specific to a certain bio_set, right? What if we happen to punt bios
> from a different bio_set? And then the rescuer goes to resubmit those
> bios, and in the process they happen to have dependencies on the
> original bio_set...

Are these not fully allocated bios? When you submit them to the
underlying device, ideally we should not be sharing a memory pool across
different layers of the stack; otherwise we will deadlock anyway as the
stack depth increases. So there should be no dependency on the original
bio_set?

Or, am I missing something. May be an example will help.

Thanks
Vivek


Kent Overstreet 08-29-2012 03:25 AM

block: Avoid deadlocks with bio allocation by stacking drivers
 
On Tue, Aug 28, 2012 at 09:31:50PM -0400, Vivek Goyal wrote:
> On Tue, Aug 28, 2012 at 04:01:08PM -0700, Kent Overstreet wrote:
> > On Tue, Aug 28, 2012 at 03:28:00PM -0700, Kent Overstreet wrote:
> > > On Tue, Aug 28, 2012 at 01:49:10PM -0700, Tejun Heo wrote:
> > > > Overall, I *think* this is correct but need to think more about it to
> > > > be sure.
> > >
> > > Please do. As much time as I've spent staring at this kind of stuff,
> > > I'm pretty sure I've got it correct but it still makes my head hurt to
> > > work out all the various possible deadlocks.
> >
> > Hilarious thought: We're punting bios to a rescuer thread that's
> > specific to a certain bio_set, right? What if we happen to punt bios
> > from a different bio_set? And then the rescuer goes to resubmit those
> > bios, and in the process they happen to have dependencies on the
> > original bio_set...
>
> Are they not fully allocated bios and when you submit these to underlying
> device, ideally we should not be sharing memory pool at different layers
> of stack otherwise we will deadlock any way as stack depth increases. So
> there should not be a dependency on original bio_set?
>
> Or, am I missing something. May be an example will help.

Uh, it's more complicated than that. My brain is too fried right now to
walk through it in detail, but the problem (if it is a problem; I can't
convince myself one way or the other) is roughly:

one virt block device stacked on top of another - they both do arbitrary
splitting:

So once they've submitted a bio, that bio needs to make forward progress
even if the thread goes to allocate another bio and blocks before it
returns from its make_request fn.
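
For reference, the iteration in question - roughly the 3.x shape of
generic_make_request(), simplified. Bios submitted from inside a
make_request fn just get queued on current->bio_list, and are only
processed after that fn returns:

	bio_list_init(&bio_list_on_stack);
	current->bio_list = &bio_list_on_stack;
	do {
		struct request_queue *q = bdev_get_queue(bio->bi_bdev);

		q->make_request_fn(q, bio);	/* may add more bios to the list */

		bio = bio_list_pop(current->bio_list);
	} while (bio);
	current->bio_list = NULL;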

That much my patch solves, with the rescuer thread; if the thread goes
to block, it punts those blocked bios off to a rescuer thread - and we
create one rescuer per bio set.

So going back to the stacked block devices, if you've got say dm on top
of md (or something else since md doesn't really do much splitting) -
each block device will have its own rescuer and everything should be
hunky dory.

Except that when thread 1 goes to punt those blocked bios to its
rescuer, it punts _all_ the bios on current->bio_list - even those
generated by/belonging to other bio_sets.

So thread 1, running under device b, punts bios to b's rescuer - call it
thread 2.

But thread 2 ends up with bios for both devices a and b, because they're
stacked.

Thread 2 starts on bios for device a before it gets to those for device
b. But a is stacked on top of b, so in the process it generates more
bios for b.

So now it's uhm...

yeah, I'm gonna sleep on this. I'm pretty sure that to be rigorously
correct, filtering out only the right bios when we punt them to the
rescuer is needed, though.


