Old 08-13-2010, 02:38 PM
Christoph Hellwig
 
Default block: replace barrier with sequenced flush

On Fri, Aug 13, 2010 at 03:48:59PM +0200, Tejun Heo wrote:
> There are two reasons to avoid changing the meaning of REQ_HARDBARRIER
> and just deprecate it. One is to avoid breaking filesystems'
> expectations underneath it. Please note that there are out-of-tree
> filesystems too. I think it would be too dangerous to relax
> REQ_HARDBARRIER.

Note that the renaming patch would include a move from REQ_HARDBARRIER
to REQ_FLUSH_FUA, so things just using REQ_HARDBARRIER will fail to
compile. And while out of tree filesystems do exist, it's their
problem to keep up with kernel changes. They decided not to be part
of the Linux kernel, so it'll be their job to keep up with it.

> Another is that pseudo block layer drivers (loop, virtio_blk,
> md/dm...) have assumptions about REQ_HARDBARRIER behavior and things
> would be broken in obscure ways between REQ_HARDBARRIER semantics
> change and updates to each of those drivers, so I don't really think
> changing the semantics while the mechanism is online is a good idea.

I don't think doing those changes in a separate commit is a good idea.

> > Then we can do patches to disable the reiserfs barrier "optimization" as
> > the very first one, and DM/MD support which I'm currently working on
> > as the last one and we can start doing the heavy testing.
>
> Oops, I've already converted loop, virtio_blk/lguest and am working on
> md/dm right now too. I'm almost done with md and now doing dm. :-)
> Maybe we should post them right now so that we don't waste too much
> time trying to solve the same problems?

Here's the dm patch. It only handles normal bio based dm so far, which
I understand and can test. Request based dm (multipath) still needs
work.


Index: linux-2.6/drivers/md/dm-crypt.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-crypt.c 2010-08-13 16:11:04.207010218 +0200
+++ linux-2.6/drivers/md/dm-crypt.c 2010-08-13 16:11:10.048003862 +0200
@@ -1249,7 +1249,7 @@ static int crypt_map(struct dm_target *t
struct dm_crypt_io *io;
struct crypt_config *cc;

- if (unlikely(bio_empty_barrier(bio))) {
+ if (bio_empty_flush(bio)) {
cc = ti->private;
bio->bi_bdev = cc->dev->bdev;
return DM_MAPIO_REMAPPED;
Index: linux-2.6/drivers/md/dm-io.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-io.c 2010-08-13 16:11:04.213011894 +0200
+++ linux-2.6/drivers/md/dm-io.c 2010-08-13 16:11:10.049003792 +0200
@@ -364,7 +364,7 @@ static void dispatch_io(int rw, unsigned
*/
for (i = 0; i < num_regions; i++) {
*dp = old_pages;
- if (where[i].count || (rw & REQ_HARDBARRIER))
+ if (where[i].count || (rw & REQ_FLUSH))
do_region(rw, i, where + i, dp, io);
}

@@ -412,8 +412,8 @@ retry:
}
set_current_state(TASK_RUNNING);

- if (io->eopnotsupp_bits && (rw & REQ_HARDBARRIER)) {
- rw &= ~REQ_HARDBARRIER;
+ if (io->eopnotsupp_bits && (rw & REQ_FLUSH)) {
+ rw &= ~REQ_FLUSH;
goto retry;
}

Index: linux-2.6/drivers/md/dm-raid1.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-raid1.c 2010-08-13 16:11:04.220013431 +0200
+++ linux-2.6/drivers/md/dm-raid1.c 2010-08-13 16:11:10.054018319 +0200
@@ -670,7 +670,7 @@ static void do_writes(struct mirror_set
bio_list_init(&requeue);

while ((bio = bio_list_pop(writes))) {
- if (unlikely(bio_empty_barrier(bio))) {
+ if (bio_empty_flush(bio)) {
bio_list_add(&sync, bio);
continue;
}
@@ -1203,7 +1203,7 @@ static int mirror_end_io(struct dm_targe
* We need to dec pending if this was a write.
*/
if (rw == WRITE) {
- if (likely(!bio_empty_barrier(bio)))
+ if (!bio_empty_flush(bio))
dm_rh_dec(ms->rh, map_context->ll);
return error;
}
Index: linux-2.6/drivers/md/dm-region-hash.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-region-hash.c 2010-08-13 16:11:04.228004631 +0200
+++ linux-2.6/drivers/md/dm-region-hash.c 2010-08-13 16:11:10.060003932 +0200
@@ -399,7 +399,7 @@ void dm_rh_mark_nosync(struct dm_region_
region_t region = dm_rh_bio_to_region(rh, bio);
int recovering = 0;

- if (bio_empty_barrier(bio)) {
+ if (bio_empty_flush(bio)) {
rh->barrier_failure = 1;
return;
}
@@ -524,7 +524,7 @@ void dm_rh_inc_pending(struct dm_region_
struct bio *bio;

for (bio = bios->head; bio; bio = bio->bi_next) {
- if (bio_empty_barrier(bio))
+ if (bio_empty_flush(bio))
continue;
rh_inc(rh, dm_rh_bio_to_region(rh, bio));
}
Index: linux-2.6/drivers/md/dm-snap.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-snap.c 2010-08-13 16:11:04.238004701 +0200
+++ linux-2.6/drivers/md/dm-snap.c 2010-08-13 16:11:10.067005677 +0200
@@ -1581,7 +1581,7 @@ static int snapshot_map(struct dm_target
chunk_t chunk;
struct dm_snap_pending_exception *pe = NULL;

- if (unlikely(bio_empty_barrier(bio))) {
+ if (bio_empty_flush(bio)) {
bio->bi_bdev = s->cow->bdev;
return DM_MAPIO_REMAPPED;
}
@@ -1685,7 +1685,7 @@ static int snapshot_merge_map(struct dm_
int r = DM_MAPIO_REMAPPED;
chunk_t chunk;

- if (unlikely(bio_empty_barrier(bio))) {
+ if (bio_empty_flush(bio)) {
if (!map_context->flush_request)
bio->bi_bdev = s->origin->bdev;
else
@@ -2123,7 +2123,7 @@ static int origin_map(struct dm_target *
struct dm_dev *dev = ti->private;
bio->bi_bdev = dev->bdev;

- if (unlikely(bio_empty_barrier(bio)))
+ if (bio_empty_flush(bio))
return DM_MAPIO_REMAPPED;

/* Only tell snapshots if this is a write */
Index: linux-2.6/drivers/md/dm-stripe.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-stripe.c 2010-08-13 16:11:04.247011266 +0200
+++ linux-2.6/drivers/md/dm-stripe.c 2010-08-13 16:11:10.072026629 +0200
@@ -214,7 +214,7 @@ static int stripe_map(struct dm_target *
sector_t offset, chunk;
uint32_t stripe;

- if (unlikely(bio_empty_barrier(bio))) {
+ if (bio_empty_flush(bio)) {
BUG_ON(map_context->flush_request >= sc->stripes);
bio->bi_bdev = sc->stripe[map_context->flush_request].dev->bdev;
return DM_MAPIO_REMAPPED;
Index: linux-2.6/drivers/md/dm.c
===================================================================
--- linux-2.6.orig/drivers/md/dm.c 2010-08-13 16:11:04.256004631 +0200
+++ linux-2.6/drivers/md/dm.c 2010-08-13 16:11:37.152005462 +0200
@@ -139,17 +139,6 @@ struct mapped_device {
spinlock_t deferred_lock;

/*
- * An error from the barrier request currently being processed.
- */
- int barrier_error;
-
- /*
- * Protect barrier_error from concurrent endio processing
- * in request-based dm.
- */
- spinlock_t barrier_error_lock;
-
- /*
* Processing queue (flush/barriers)
*/
struct workqueue_struct *wq;
@@ -194,9 +183,6 @@ struct mapped_device {

/* sysfs handle */
struct kobject kobj;
-
- /* zero-length barrier that will be cloned and submitted to targets */
- struct bio barrier_bio;
};

/*
@@ -505,10 +491,6 @@ static void end_io_acct(struct dm_io *io
part_stat_add(cpu, &dm_disk(md)->part0, ticks[rw], duration);
part_stat_unlock();

- /*
- * After this is decremented the bio must not be touched if it is
- * a barrier.
- */
dm_disk(md)->part0.in_flight[rw] = pending =
atomic_dec_return(&md->pending[rw]);
pending += atomic_read(&md->pending[rw^0x1]);
@@ -621,7 +603,7 @@ static void dec_pending(struct dm_io *io
*/
spin_lock_irqsave(&md->deferred_lock, flags);
if (__noflush_suspending(md)) {
- if (!(io->bio->bi_rw & REQ_HARDBARRIER))
+ if (!(io->bio->bi_rw & (REQ_FLUSH|REQ_FUA)))
bio_list_add_head(&md->deferred,
io->bio);
} else
@@ -633,25 +615,13 @@ static void dec_pending(struct dm_io *io
io_error = io->error;
bio = io->bio;

- if (bio->bi_rw & REQ_HARDBARRIER) {
- /*
- * There can be just one barrier request so we use
- * a per-device variable for error reporting.
- * Note that you can't touch the bio after end_io_acct
- */
- if (!md->barrier_error && io_error != -EOPNOTSUPP)
- md->barrier_error = io_error;
- end_io_acct(io);
- free_io(md, io);
- } else {
- end_io_acct(io);
- free_io(md, io);
+ end_io_acct(io);
+ free_io(md, io);

- if (io_error != DM_ENDIO_REQUEUE) {
- trace_block_bio_complete(md->queue, bio);
+ if (io_error != DM_ENDIO_REQUEUE) {
+ trace_block_bio_complete(md->queue, bio);

- bio_endio(bio, io_error);
- }
+ bio_endio(bio, io_error);
}
}
}
@@ -744,23 +714,6 @@ static void end_clone_bio(struct bio *cl
blk_update_request(tio->orig, 0, nr_bytes);
}

-static void store_barrier_error(struct mapped_device *md, int error)
-{
- unsigned long flags;
-
- spin_lock_irqsave(&md->barrier_error_lock, flags);
- /*
- * Basically, the first error is taken, but:
- * -EOPNOTSUPP supersedes any I/O error.
- * Requeue request supersedes any I/O error but -EOPNOTSUPP.
- */
- if (!md->barrier_error || error == -EOPNOTSUPP ||
- (md->barrier_error != -EOPNOTSUPP &&
- error == DM_ENDIO_REQUEUE))
- md->barrier_error = error;
- spin_unlock_irqrestore(&md->barrier_error_lock, flags);
-}
-
/*
* Don't touch any member of the md after calling this function because
* the md may be freed in dm_put() at the end of this function.
@@ -798,13 +751,11 @@ static void free_rq_clone(struct request
static void dm_end_request(struct request *clone, int error)
{
int rw = rq_data_dir(clone);
- int run_queue = 1;
- bool is_barrier = clone->cmd_flags & REQ_HARDBARRIER;
struct dm_rq_target_io *tio = clone->end_io_data;
struct mapped_device *md = tio->md;
struct request *rq = tio->orig;

- if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_barrier) {
+ if (rq->cmd_type == REQ_TYPE_BLOCK_PC) {
rq->errors = clone->errors;
rq->resid_len = clone->resid_len;

@@ -818,15 +769,8 @@ static void dm_end_request(struct reques
}

free_rq_clone(clone);
-
- if (unlikely(is_barrier)) {
- if (unlikely(error))
- store_barrier_error(md, error);
- run_queue = 0;
- } else
- blk_end_request_all(rq, error);
-
- rq_completed(md, rw, run_queue);
+ blk_end_request_all(rq, error);
+ rq_completed(md, rw, 1);
}

static void dm_unprep_request(struct request *rq)
@@ -1113,7 +1057,7 @@ static struct bio *split_bvec(struct bio

clone->bi_sector = sector;
clone->bi_bdev = bio->bi_bdev;
- clone->bi_rw = bio->bi_rw & ~REQ_HARDBARRIER;
+ clone->bi_rw = bio->bi_rw;
clone->bi_vcnt = 1;
clone->bi_size = to_bytes(len);
clone->bi_io_vec->bv_offset = offset;
@@ -1140,7 +1084,6 @@ static struct bio *clone_bio(struct bio

clone = bio_alloc_bioset(GFP_NOIO, bio->bi_max_vecs, bs);
__bio_clone(clone, bio);
- clone->bi_rw &= ~REQ_HARDBARRIER;
clone->bi_destructor = dm_bio_destructor;
clone->bi_sector = sector;
clone->bi_idx = idx;
@@ -1186,7 +1129,7 @@ static void __flush_target(struct clone_
__map_bio(ti, clone, tio);
}

-static int __clone_and_map_empty_barrier(struct clone_info *ci)
+static int __clone_and_map_empty_flush(struct clone_info *ci)
{
unsigned target_nr = 0, flush_nr;
struct dm_target *ti;
@@ -1208,8 +1151,8 @@ static int __clone_and_map(struct clone_
sector_t len = 0, max;
struct dm_target_io *tio;

- if (unlikely(bio_empty_barrier(bio)))
- return __clone_and_map_empty_barrier(ci);
+ if (bio_empty_flush(bio))
+ return __clone_and_map_empty_flush(ci);

ti = dm_table_find_target(ci->map, ci->sector);
if (!dm_target_is_valid(ti))
@@ -1308,11 +1251,7 @@ static void __split_and_process_bio(stru

ci.map = dm_get_live_table(md);
if (unlikely(!ci.map)) {
- if (!(bio->bi_rw & REQ_HARDBARRIER))
- bio_io_error(bio);
- else
- if (!md->barrier_error)
- md->barrier_error = -EIO;
+ bio_io_error(bio);
return;
}

@@ -1326,7 +1265,7 @@ static void __split_and_process_bio(stru
spin_lock_init(&ci.io->endio_lock);
ci.sector = bio->bi_sector;
ci.sector_count = bio_sectors(bio);
- if (unlikely(bio_empty_barrier(bio)))
+ if (bio_empty_flush(bio))
ci.sector_count = 1;
ci.idx = bio->bi_idx;

@@ -1420,8 +1359,7 @@ static int _dm_request(struct request_qu
* If we're suspended or the thread is processing barriers
* we have to queue this io for later.
*/
- if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags)) ||
- unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
+ if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags))) {
up_read(&md->io_lock);

if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) &&
@@ -1873,7 +1811,6 @@ static struct mapped_device *alloc_dev(i
init_rwsem(&md->io_lock);
mutex_init(&md->suspend_lock);
spin_lock_init(&md->deferred_lock);
- spin_lock_init(&md->barrier_error_lock);
rwlock_init(&md->map_lock);
atomic_set(&md->holders, 1);
atomic_set(&md->open_count, 0);
@@ -2233,38 +2170,6 @@ static int dm_wait_for_completion(struct
return r;
}

-static void dm_flush(struct mapped_device *md)
-{
- dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-
- bio_init(&md->barrier_bio);
- md->barrier_bio.bi_bdev = md->bdev;
- md->barrier_bio.bi_rw = WRITE_BARRIER;
- __split_and_process_bio(md, &md->barrier_bio);
-
- dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-}
-
-static void process_barrier(struct mapped_device *md, struct bio *bio)
-{
- md->barrier_error = 0;
-
- dm_flush(md);
-
- if (!bio_empty_barrier(bio)) {
- __split_and_process_bio(md, bio);
- dm_flush(md);
- }
-
- if (md->barrier_error != DM_ENDIO_REQUEUE)
- bio_endio(bio, md->barrier_error);
- else {
- spin_lock_irq(&md->deferred_lock);
- bio_list_add_head(&md->deferred, bio);
- spin_unlock_irq(&md->deferred_lock);
- }
-}
-
/*
* Process the deferred bios
*/
@@ -2290,12 +2195,8 @@ static void dm_wq_work(struct work_struc

if (dm_request_based(md))
generic_make_request(c);
- else {
- if (c->bi_rw & REQ_HARDBARRIER)
- process_barrier(md, c);
- else
- __split_and_process_bio(md, c);
- }
+ else
+ __split_and_process_bio(md, c);

down_write(&md->io_lock);
}
@@ -2326,8 +2227,6 @@ static int dm_rq_barrier(struct mapped_d
struct dm_target *ti;
struct request *clone;

- md->barrier_error = 0;
-
for (i = 0; i < num_targets; i++) {
ti = dm_table_get_target(map, i);
for (j = 0; j < ti->num_flush_requests; j++) {
@@ -2341,7 +2240,7 @@ static int dm_rq_barrier(struct mapped_d
dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
dm_table_put(map);

- return md->barrier_error;
+ return 0;
}

static void dm_rq_barrier_work(struct work_struct *work)
Index: linux-2.6/include/linux/bio.h
===================================================================
--- linux-2.6.orig/include/linux/bio.h 2010-08-13 16:11:04.268004351 +0200
+++ linux-2.6/include/linux/bio.h 2010-08-13 16:11:10.082005677 +0200
@@ -66,8 +66,8 @@
#define bio_offset(bio) bio_iovec((bio))->bv_offset
#define bio_segments(bio) ((bio)->bi_vcnt - (bio)->bi_idx)
#define bio_sectors(bio) ((bio)->bi_size >> 9)
-#define bio_empty_barrier(bio) \
- ((bio->bi_rw & REQ_HARDBARRIER) && \
+#define bio_empty_flush(bio) \
+ ((bio->bi_rw & REQ_FLUSH) && \
 !bio_has_data(bio) && \
 !(bio->bi_rw & REQ_DISCARD))


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 08-14-2010, 10:36 AM
Christoph Hellwig
 
Default block: replace barrier with sequenced flush

On Fri, Aug 13, 2010 at 04:51:17PM +0200, Tejun Heo wrote:
> Do you want to change the whole thing in a single commit? That would
> be a pretty big invasive patch touching multiple subsystems.

We can just stop draining in the block layer in the first patch, then
stop doing the stuff in md/dm/etc in the following ones, and then do the
final renaming patches. It would still be fewer patches than now, but it
would keep things working through the whole transition, which would really
help bisecting any problems.

> + if (req->cmd_flags & REQ_FUA)
> + vbr->out_hdr.type |= VIRTIO_BLK_T_FUA;

I'd suggest not adding FUA support to virtio yet. Just using the flush
feature gives you a fully working barrier implementation.

Eventually we might want to add a flag in the block queue to send
REQ_FLUSH|REQ_FUA requests through to virtio directly so that we can
avoid separate pre- and post flushes, but I really want to benchmark if
it makes an impact on real life setups first.

> Index: block/drivers/md/linear.c
> ===================================================================
> --- block.orig/drivers/md/linear.c
> +++ block/drivers/md/linear.c
> @@ -294,8 +294,8 @@ static int linear_make_request (mddev_t
> dev_info_t *tmp_dev;
> sector_t start_sector;
>
> - if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
> - md_barrier_request(mddev, bio);
> + if (unlikely(bio->bi_rw & REQ_FLUSH)) {
> + md_flush_request(mddev, bio);

We only need the special md_flush_request handling for
empty REQ_FLUSH requests. REQ_WRITE | REQ_FLUSH just need the
flag propagated to the underlying devices.

> +static void md_end_flush(struct bio *bio, int err)
> {
> mdk_rdev_t *rdev = bio->bi_private;
> mddev_t *mddev = rdev->mddev;
>
> rdev_dec_pending(rdev, mddev);
>
> if (atomic_dec_and_test(&mddev->flush_pending)) {
> + /* The pre-request flush has finished */
> + schedule_work(&mddev->flush_work);

Once we only handle empty barriers here we can directly call bio_endio
and the super wakeup instead of first scheduling a work queue.

> while ((bio = bio_list_pop(writes))) {
> - if (unlikely(bio_empty_barrier(bio))) {
> + if ((bio->bi_rw & REQ_FLUSH) && !bio_has_data(bio)) {

I kept bio_empty_barrier as bio_empty_flush, which actually is quite a
useful macro for the bio based drivers.

> @@ -621,7 +621,7 @@ static void dec_pending(struct dm_io *io
> */
> spin_lock_irqsave(&md->deferred_lock, flags);
> if (__noflush_suspending(md)) {
> - if (!(io->bio->bi_rw & REQ_HARDBARRIER))
> + if (!(io->bio->bi_rw & REQ_FLUSH))

I suspect we don't actually need to special case flushes here anymore.


> @@ -633,14 +633,14 @@ static void dec_pending(struct dm_io *io
> io_error = io->error;
> bio = io->bio;
>
> - if (bio->bi_rw & REQ_HARDBARRIER) {
> + if (bio->bi_rw & REQ_FLUSH) {
> /*
> - * There can be just one barrier request so we use
> + * There can be just one flush request so we use
> * a per-device variable for error reporting.
> * Note that you can't touch the bio after end_io_acct
> */
> - if (!md->barrier_error && io_error != -EOPNOTSUPP)
> - md->barrier_error = io_error;
> + if (!md->flush_error)
> + md->flush_error = io_error;

And we certainly do not need any special casing here. See my patch.

> {
> int rw = rq_data_dir(clone);
> int run_queue = 1;
> - bool is_barrier = clone->cmd_flags & REQ_HARDBARRIER;
> + bool is_flush = clone->cmd_flags & REQ_FLUSH;
> struct dm_rq_target_io *tio = clone->end_io_data;
> struct mapped_device *md = tio->md;
> struct request *rq = tio->orig;
>
> - if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_barrier) {
> + if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_flush) {

We never send flush requests as REQ_TYPE_BLOCK_PC anymore, so no need
for the second half of this conditional.

> + if (!is_flush)
> + blk_end_request_all(rq, error);
> + else {
> if (unlikely(error))
> - store_barrier_error(md, error);
> + store_flush_error(md, error);
> run_queue = 0;
> - } else
> - blk_end_request_all(rq, error);
> + }

Flush requests can now be completed normally.

> @@ -1308,11 +1302,11 @@ static void __split_and_process_bio(stru
>
> ci.map = dm_get_live_table(md);
> if (unlikely(!ci.map)) {
> - if (!(bio->bi_rw & REQ_HARDBARRIER))
> + if (!(bio->bi_rw & REQ_FLUSH))
> bio_io_error(bio);
> else
> - if (!md->barrier_error)
> - md->barrier_error = -EIO;
> + if (!md->flush_error)
> + md->flush_error = -EIO;

No need for the special error handling here, flush requests can now
be completed normally.

> @@ -1417,11 +1419,11 @@ static int _dm_request(struct request_qu
> part_stat_unlock();
>
> /*
> - * If we're suspended or the thread is processing barriers
> + * If we're suspended or the thread is processing flushes
> * we have to queue this io for later.
> */
> if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags)) ||
> - unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
> + (bio->bi_rw & REQ_FLUSH)) {
> up_read(&md->io_lock);

AFAICS this is only needed for the old barrier code, no need for this
for pure flushes.

> @@ -1464,10 +1466,7 @@ static int dm_request(struct request_que
>
> static bool dm_rq_is_flush_request(struct request *rq)
> {
> - if (rq->cmd_flags & REQ_FLUSH)
> - return true;
> - else
> - return false;
> + return rq->cmd_flags & REQ_FLUSH;
> }

It's probably worth just killing this wrapper.


> void dm_dispatch_request(struct request *rq)
> @@ -1520,7 +1519,7 @@ static int setup_clone(struct request *c
> if (dm_rq_is_flush_request(rq)) {
> blk_rq_init(NULL, clone);
> clone->cmd_type = REQ_TYPE_FS;
> - clone->cmd_flags |= (REQ_HARDBARRIER | WRITE);
> + clone->cmd_flags |= (REQ_FLUSH | WRITE);
> } else {
> r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
> dm_rq_bio_constructor, tio);

My suspicion is that we can get rid of all that special casing here
and just use blk_rq_prep_clone once it's been updated to propagate
REQ_FLUSH, similar to the DISCARD flag.

I also suspect that there is absolutely no need for the barrier work
queue once we stop waiting for outstanding requests. But then again
the request based dm code still somewhat confuses me.

> +static void process_flush(struct mapped_device *md, struct bio *bio)
> {
> + md->flush_error = 0;
> +
> + /* handle REQ_FLUSH */
> dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
>
> - bio_init(&md->barrier_bio);
> - md->barrier_bio.bi_bdev = md->bdev;
> - md->barrier_bio.bi_rw = WRITE_BARRIER;
> - __split_and_process_bio(md, &md->barrier_bio);
> + bio_init(&md->flush_bio);
> + md->flush_bio.bi_bdev = md->bdev;
> + md->flush_bio.bi_rw = WRITE_FLUSH;
> + __split_and_process_bio(md, &md->flush_bio);

There's no need to use a separate flush_bio here.
__split_and_process_bio does the right thing for empty REQ_FLUSH
requests. See my patch for how to do this differently. And yeah,
my version has been tested.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 08-17-2010, 01:19 PM
Christoph Hellwig
 
Default block: replace barrier with sequenced flush

On Tue, Aug 17, 2010 at 11:59:38AM +0200, Tejun Heo wrote:
> I'm not really convinced that would help much. If bisecting can point
> to the conversion as the culprit for whatever kind of failure,
> wouldn't that be enough? No matter what we do the conversion will be
> a single step thing. If we make the filesystems enforce the ordering
> first and then relax ordering in the block layer, bisection would
> still just point at the later patch. The same goes for md/dm, the
> best we can find out would be whether the conversion is correct or not
> anyway.

The filesystems already enforce the ordering, except reiserfs which
opts out if the barrier option is set.

> I'm not against restructuring the patchset if it makes more sense but
> it just feels like it would be a bit pointless effort (and one which
> would require much tighter coordination among different trees) at this
> point. Am I missing something?

What other trees do you mean? The conversions of the 8 filesystems
that actually support barriers need to go through this tree anyway
if we want to be able to test it. Also the changes in the filesystems
are absolutely minimal - it's basically just
s/WRITE_BARRIER/WRITE_FUA_FLUSH/ after my initial patch to kill BH_Ordered,
and removing about 10 lines of code in reiserfs.

> > We only need the special md_flush_request handling for
> > empty REQ_FLUSH requests. REQ_WRITE | REQ_FLUSH just need the
> > flag propagated to the underlying devices.
>
> Hmm, not really, the WRITE should happen after all the data in cache
> are committed to NV media, meaning that empty FLUSH should already
> have finished by the time the WRITE starts.

You're right.

> >> while ((bio = bio_list_pop(writes))) {
> >> - if (unlikely(bio_empty_barrier(bio))) {
> >> + if ((bio->bi_rw & REQ_FLUSH) && !bio_has_data(bio)) {
> >
> > I kept bio_empty_barrier as bio_empty_flush, which actually is a quite
> > useful macro for the bio based drivers.
>
> Hmm... maybe. The reason why I removed bio_empty_flush() was that
> except for the front-most sequencer (block layer for all the request
> based ones and the front-most make_request for bio based ones), it
> doesn't make sense to see REQ_FLUSH + data bios. They should be
> sequenced at the front-most stage anyway, so I didn't have much use
> for them. Those code paths couldn't deal with REQ_FLUSH + data bios
> anyway.

The current bio_empty_barrier is only used in dm, and indeed only makes
sense for make_request-based drivers. But I think it's a rather useful
helper for them. Either way, it's not a big issue and either way is
fine with me.

> >> + if (bio->bi_rw & REQ_FLUSH) {
> >> /*
> >> - * There can be just one barrier request so we use
> >> + * There can be just one flush request so we use
> >> * a per-device variable for error reporting.
> >> * Note that you can't touch the bio after end_io_acct
> >> */
> >> - if (!md->barrier_error && io_error != -EOPNOTSUPP)
> >> - md->barrier_error = io_error;
> >> + if (!md->flush_error)
> >> + md->flush_error = io_error;
> >
> > And we certainly do not need any special casing here. See my patch.
>
> I wasn't sure about that part. You removed store_flush_error(), but
> DM_ENDIO_REQUEUE should still have higher priority than other
> failures, no?

Which priority?

> >> +static void process_flush(struct mapped_device *md, struct bio *bio)
> >> {
> >> + md->flush_error = 0;
> >> +
> >> + /* handle REQ_FLUSH */
> >> dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
> >>
> >> - bio_init(&md->barrier_bio);
> >> - md->barrier_bio.bi_bdev = md->bdev;
> >> - md->barrier_bio.bi_rw = WRITE_BARRIER;
> >> - __split_and_process_bio(md, &md->barrier_bio);
> >> + bio_init(&md->flush_bio);
> >> + md->flush_bio.bi_bdev = md->bdev;
> >> + md->flush_bio.bi_rw = WRITE_FLUSH;
> >> + __split_and_process_bio(md, &md->flush_bio);
> >
> > There's not need to use a separate flush_bio here.
> > __split_and_process_bio does the right thing for empty REQ_FLUSH
> > requests. See my patch for how to do this differenty. And yeah,
> > my version has been tested.
>
> But how do you make sure REQ_FLUSHes for preflush finish before
> starting the write?

Hmm, okay. I see how the special flush_bio makes the waiting easier,
let's see if Mike or others in the DM team have a better idea.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 08-17-2010, 04:59 PM
Christoph Hellwig
 
Default block: replace barrier with sequenced flush

On Tue, Aug 17, 2010 at 06:41:47PM +0200, Tejun Heo wrote:
> > What other trees do you mean?
>
> I was mostly thinking about dm/md, drdb and stuff, but you're talking
> about filesystem conversion patches being routed through block tree,
> right?

I think we really need all the conversions in one tree, block layer,
remapping drivers and filesystems.

Btw, I've done the conversion for all filesystems and I'm running tests
over them now. Expect the series late today or tomorrow.

> I might just resequence it to finish this part of discussion but what
> does that really buy us? It's not really gonna help bisection.
> Bisection won't be able to tell anything in higher resolution than
> "the new implementation doesn't work". If you show me how it would
> actually help, I'll happily reshuffle the patches.

It's not about bisecting to find bugs in the barrier conversion. We can't
easily bisect it down anyway. The problem is when we try to bisect
other problems and get into the middle of the series, barriers suddenly
are gone. Which is not very helpful for things like data integrity
problems in filesystems.

> >> I wasn't sure about that part. You removed store_flush_error(), but
> >> DM_ENDIO_REQUEUE should still have higher priority than other
> >> failures, no?
> >
> > Which priority?
>
> IIUC, when any of flushes get DM_ENDIO_REQUEUE (which tells the dm
> core layer to retry the whole bio later), it trumps all other failures
> and the bio is retried later. That was why DM_ENDIO_REQUEUE was
> prioritized over other error codes, which actually is sort of
> incorrect in that once a FLUSH fails, it _MUST_ be reported to upper
> layers as FLUSH failure implies data already lost. So,
> DM_ENDIO_REQUEUE actually should have lower priority than other
> failures. But, then again, the error codes still need to be
> prioritized.

I think that's something we better leave to the DM team.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 08-18-2010, 09:46 AM
Christoph Hellwig
 
Default block: replace barrier with sequenced flush

FYI: One issue with this series is that make_request based drivers now
have to accept all REQ_FLUSH and REQ_FUA requests. We'll either need to
add handling for empty REQ_FLUSH requests to all of them or figure out a
way to prevent them from getting sent. That is assuming they'll simply
ignore REQ_FLUSH/REQ_FUA on normal writes.
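
To make the expected handling concrete, here is a minimal sketch of what a
make_request based driver without a volatile cache would have to do (the
ramdisk_* names are made up for illustration, and the current make_request_fn
and bio_endio interfaces are assumed):

static int ramdisk_make_request(struct request_queue *q, struct bio *bio)
{
        /* Empty REQ_FLUSH: nothing is cached, so complete it immediately. */
        if ((bio->bi_rw & REQ_FLUSH) && !bio_has_data(bio)) {
                bio_endio(bio, 0);
                return 0;
        }

        /*
         * REQ_FLUSH/REQ_FUA on normal writes can simply be ignored here,
         * since data goes straight to (simulated) non-volatile storage.
         */
        ramdisk_do_io(q->queuedata, bio);       /* hypothetical I/O helper */
        bio_endio(bio, 0);
        return 0;
}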

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 08-19-2010, 10:20 AM
Christoph Hellwig
 
Default block: replace barrier with sequenced flush

On Thu, Aug 19, 2010 at 11:57:53AM +0200, Tejun Heo wrote:
> On 08/18/2010 11:46 AM, Christoph Hellwig wrote:
> > FYI: One issue with this series is that make_request based drivers now
> > have to accept all REQ_FLUSH and REQ_FUA requests. We'll either need to
> > add handling for empty REQ_FLUSH requests to all of them or figure out a
> > way to prevent them from getting sent. That is assuming they'll simply
> > ignore REQ_FLUSH/REQ_FUA on normal writes.
>
> Can you be a bit more specific? In most cases, request based drivers
> should be fine. They sit behind the front most request_queue which
> would decompose REQ_FLUSH/FUAs into the appropriate command sequence.

I said make_request based drivers, that is drivers taking bios. These
get bios directly from __generic_make_request and need to deal with
REQ_FLUSH/FUA themselves. We have quite a few more than just dm/md of
this kind:

arch/powerpc/sysdev/axonram.c: blk_queue_make_request(bank->disk->queue, axon_ram_make_request);
drivers/block/aoe/aoeblk.c: blk_queue_make_request(d->blkq, aoeblk_make_request);
drivers/block/brd.c: blk_queue_make_request(brd->brd_queue, brd_make_request);
drivers/block/drbd/drbd_main.c: blk_queue_make_request(q, drbd_make_request_26);
drivers/block/loop.c: blk_queue_make_request(lo->lo_queue, loop_make_request);
drivers/block/pktcdvd.c: blk_queue_make_request(q, pkt_make_request);
drivers/block/ps3vram.c: blk_queue_make_request(queue, ps3vram_make_request);
drivers/block/umem.c: blk_queue_make_request(card->queue, mm_make_request);
drivers/s390/block/dcssblk.c: blk_queue_make_request(dev_info->dcssblk_queue, dcssblk_make_request);
drivers/s390/block/xpram.c: blk_queue_make_request(xpram_queues[i], xpram_make_request);
drivers/staging/zram/zram_drv.c:blk_queue_make_request(zram->queue, zram_make_request);


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 08-20-2010, 01:22 PM
Christoph Hellwig
 
Default block: replace barrier with sequenced flush

FYI: here's a little writeup to document the new cache flushing scheme,
intended to replace Documentation/block/barriers.txt. Any good
suggestion for a filename in the kernel tree?

---

Explicit volatile write cache control
=====================================

Introduction
------------

Many storage devices, especially in the consumer market, come with volatile
write back caches. That means the devices signal I/O completion to the
operating system before data actually has hit the physical medium. This
behavior obviously speeds up various workloads, but it means the operating
system needs to force data out to the physical medium when it performs
a data integrity operation like fsync, sync or an unmount.

The Linux block layer provides two simple mechanisms that let filesystems
control the caching behavior of the storage device. These mechanisms are
a forced cache flush, and the Force Unit Access (FUA) flag for requests.


Explicit cache flushes
----------------------

The REQ_FLUSH flag can be OR ed into the r/w flags of a bio submitted from the
filesystem and will make sure the volatile cache of the storage device
has been flushed before the actual I/O operation is started. The explicit
guarantees write requests that have completed before the bio was submitted
actually are on the physical medium before this request has started.
In addition the REQ_FLUSH flag can be set on an otherwise empty bio
structure, which causes only an explicit cache flush without any dependent
I/O. It is recommend to use the blkdev_issue_flush() helper for a pure
cache flush.


Forced Unit Access
-----------------

The REQ_FUA flag can be OR ed into the r/w flags of a bio submitted from the
filesystem and will make sure that I/O completion for this request is not
signaled before the data has made it to non-volatile storage on the
physical medium.


Implementation details for filesystems
--------------------------------------

Filesystems can simply set the REQ_FLUSH and REQ_FUA bits and do not have to
worry if the underlying devices need any explicit cache flushing and how
the Forced Unit Access is implemented. The REQ_FLUSH and REQ_FUA flags
may both be set on a single bio.


Implementation details for make_request_fn based block drivers
--------------------------------------------------------------

These drivers will always see the REQ_FLUSH and REQ_FUA bits as they sit
directly below the submit_bio interface. For remapping drivers the REQ_FUA
bits need to be propagated to underlying devices, and a global flush needs
to be implemented for bios with the REQ_FLUSH bit set. For real device
drivers that do not have a volatile cache the REQ_FLUSH and REQ_FUA bits
on non-empty bios can simply be ignored, and REQ_FLUSH requests without
data can be completed successfully without doing any work. Drivers for
devices with volatile caches need to implement the support for these
flags themselves without any help from the block layer.


Implementation details for request_fn based block drivers
--------------------------------------------------------------

For devices that do not support volatile write caches there is no driver
support required, the block layer completes empty REQ_FLUSH requests before
entering the driver and strips off the REQ_FLUSH and REQ_FUA bits from
requests that have a payload. For devices with volatile write caches the
driver needs to tell the block layer that it supports flushing caches by
doing:

blk_queue_flush(sdkp->disk->queue, REQ_FLUSH);

and handle empty REQ_FLUSH requests in its prep_fn/request_fn. Note that
REQ_FLUSH requests with a payload are automatically turned into a sequence
of empty REQ_FLUSH and the actual write by the block layer. For devices
that also support the FUA bit the block layer needs to be told to pass
through that bit using:

blk_queue_flush(sdkp->disk->queue, REQ_FLUSH | REQ_FUA);

and handle write requests that have the REQ_FUA bit set properly in its
prep_fn/request_fn. If the FUA bit is not natively supported the block
layer turns it into an empty REQ_FLUSH request after the actual write.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 08-23-2010, 12:48 PM
Christoph Hellwig
 
Default block: replace barrier with sequenced flush

On Mon, Aug 23, 2010 at 02:30:33PM +0200, Tejun Heo wrote:
> It might be useful to give several example configurations with
> different cache configurations. I don't have much experience with
> battery backed arrays but aren't they supposed to report write through
> cache automatically?

They usually do. I have one that doesn't, but SYNCHRONIZE CACHE on
it is so fast that it effectively must be a no-op.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 08-23-2010, 02:05 PM
Christoph Hellwig
 
Default block: replace barrier with sequenced flush

Below is an updated version of the documentation. It fixes several
typos Zach Brown noticed and replaces all references to a physical
medium with the term non-volatile storage. I haven't added any examples
yet as I need to figure out how they fit into the rest of the document.

---

Explicit volatile write cache control
=====================================

Introduction
------------

Many storage devices, especially in the consumer market, come with volatile
write back caches. That means the devices signal I/O completion to the
operating system before data actually has hit the non-volatile storage. This
behavior obviously speeds up various workloads, but it means the operating
system needs to force data out to the non-volatile storage when it performs
a data integrity operation like fsync, sync or an unmount.

The Linux block layer provides two simple mechanisms that let filesystems
control the caching behavior of the storage device. These mechanisms are
a forced cache flush, and the Force Unit Access (FUA) flag for requests.


Explicit cache flushes
----------------------

The REQ_FLUSH flag can be OR ed into the r/w flags of a bio submitted from
the filesystem and will make sure the volatile cache of the storage device
has been flushed before the actual I/O operation is started. This explicitly
guarantees that previously completed write requests are on non-volatile
storage before the flagged bio starts. In addition the REQ_FLUSH flag can be
set on an otherwise empty bio structure, which causes only an explicit cache
flush without any dependent I/O. It is recommended to use
the blkdev_issue_flush() helper for a pure cache flush.
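
As a rough illustration (a sketch, not part of the interface description
above; the issue_cache_flush() and flush_done() names are made up), a pure
cache flush boils down to submitting an empty bio flagged with WRITE_FLUSH
and waiting for it, which is roughly what blkdev_issue_flush() does for you:

static void flush_done(struct bio *bio, int error)
{
        complete(bio->bi_private);      /* wake up the waiting caller */
        bio_put(bio);
}

static void issue_cache_flush(struct block_device *bdev)
{
        DECLARE_COMPLETION_ONSTACK(done);
        struct bio *bio = bio_alloc(GFP_KERNEL, 0);     /* no data pages */

        bio->bi_bdev = bdev;
        bio->bi_end_io = flush_done;
        bio->bi_private = &done;

        submit_bio(WRITE_FLUSH, bio);   /* empty bio + REQ_FLUSH */
        wait_for_completion(&done);     /* error handling omitted */
}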


Forced Unit Access
-----------------

The REQ_FUA flag can be OR ed into the r/w flags of a bio submitted from the
filesystem and will make sure that I/O completion for this request is only
signaled after the data has been committed to non-volatile storage.


Implementation details for filesystems
--------------------------------------

Filesystems can simply set the REQ_FLUSH and REQ_FUA bits and do not have to
worry if the underlying devices need any explicit cache flushing and how
the Forced Unit Access is implemented. The REQ_FLUSH and REQ_FUA flags
may both be set on a single bio.
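
For example (a hedged sketch of a possible caller, not code from this series;
the submit_commit_block() and commit_block_end_io() names are made up), a
journaling filesystem could issue its commit block so that all previously
completed journal writes are flushed out first and the commit block itself is
on non-volatile storage before completion is signaled:

static void commit_block_end_io(struct bio *bio, int error)
{
        complete(bio->bi_private);      /* the journal code waits for this */
        bio_put(bio);
}

static void submit_commit_block(struct buffer_head *bh, struct completion *done)
{
        struct bio *bio = bio_alloc(GFP_NOFS, 1);

        bio->bi_bdev = bh->b_bdev;
        bio->bi_sector = bh->b_blocknr * (bh->b_size >> 9);
        bio_add_page(bio, bh->b_page, bh->b_size, bh_offset(bh));
        bio->bi_end_io = commit_block_end_io;
        bio->bi_private = done;

        /* flush previously completed writes, then force this write to media */
        submit_bio(WRITE | REQ_FLUSH | REQ_FUA, bio);
}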


Implementation details for make_request_fn based block drivers
--------------------------------------------------------------

These drivers will always see the REQ_FLUSH and REQ_FUA bits as they sit
directly below the submit_bio interface. For remapping drivers the REQ_FUA
bits need to be propagated to underlying devices, and a global flush needs
to be implemented for bios with the REQ_FLUSH bit set. For real device
drivers that do not have a volatile cache the REQ_FLUSH and REQ_FUA bits
on non-empty bios can simply be ignored, and REQ_FLUSH requests without
data can be completed successfully without doing any work. Drivers for
devices with volatile caches need to implement the support for these
flags themselves without any help from the block layer.


Implementation details for request_fn based block drivers
--------------------------------------------------------------

For devices that do not support volatile write caches there is no driver
support required, the block layer completes empty REQ_FLUSH requests before
entering the driver and strips off the REQ_FLUSH and REQ_FUA bits from
requests that have a payload. For devices with volatile write caches the
driver needs to tell the block layer that it supports flushing caches by
doing:

blk_queue_flush(sdkp->disk->queue, REQ_FLUSH);

and handle empty REQ_FLUSH requests in its prep_fn/request_fn. Note that
REQ_FLUSH requests with a payload are automatically turned into a sequence
of empty REQ_FLUSH and the actual write by the block layer. For devices
that also support the FUA bit the block layer needs to be told to pass
through that bit using:

blk_queue_flush(sdkp->disk->queue, REQ_FLUSH | REQ_FUA);

and handle write requests that have the REQ_FUA bit set properly in its
prep_fn/request_fn. If the FUA bit is not natively supported the block
layer turns it into an empty REQ_FLUSH request after the actual write.
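
As a sketch (the mydrv_* names are hypothetical, and the flush is completed
synchronously only for brevity; a real driver would typically finish it from
its completion path), the request_fn side looks roughly like this:

static void mydrv_request_fn(struct request_queue *q)
{
        struct request *rq;

        while ((rq = blk_fetch_request(q)) != NULL) {
                if (rq->cmd_flags & REQ_FLUSH) {
                        /* empty flush request: write out the volatile cache */
                        mydrv_flush_cache(q->queuedata);
                        /* queue lock is held here, so use the __blk_* variant */
                        __blk_end_request_all(rq, 0);
                        continue;
                }

                /* REQ_FUA writes must bypass or write through the cache */
                mydrv_queue_io(q->queuedata, rq, rq->cmd_flags & REQ_FUA);
        }
}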

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 08-23-2010, 02:08 PM
Christoph Hellwig
 
Default block: replace barrier with sequenced flush

On Mon, Aug 23, 2010 at 04:01:15PM +0200, Jens Axboe wrote:
> The problem purely exists on arrays that report write back cache enabled
> AND don't implement SYNC_CACHE as a noop. Do any of them exist, or are
> they purely urban legend?

I haven't seen it. I don't care particularly about this case, but once
in a while people want to disable flushing for testing or because they
really don't care.

What about adding a sysfs attribute to every request_queue that allows
disabling the cache flushing feature? Compared to the barrier option
this controls the feature at the right level and makes it available
to everyone instead of being duplicated. After a while we can then
simply ignore the barrier/nobarrier options.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
