08-13-2010, 01:21 PM
Tejun Heo

block: replace barrier with sequenced flush

Hello,

On 08/13/2010 02:55 PM, Vladislav Bolkhovitin wrote:
> If requested, I can develop the interface further.

I still think the benefit of ordering by tag would be marginal at
best, and what have you guys measured there? Under the current
framework, there's no easy way to measure a full ordered-by-tag
implementation. The mechanism for filesystems to communicate the
ordering information (which would be a partially ordered graph) just
isn't there, and the current usage of ordering-by-tag only for the
barrier sequence can't achieve anything close to that level of
difference.

Ripping out the original ordering-by-tag mechanism doesn't amount to
much; its use there was pretty half-assed anyway. If you think
exporting full ordering information from the filesystem to the lower
layers is worthwhile, please go ahead. It would be very interesting
to see how much actual difference it can make compared to
ordering-by-filesystem, and if it's actually better and the added
complexity is manageable, there's no reason not to do it.

Thank you.

--
tejun

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
08-13-2010, 01:48 PM
Tejun Heo

block: replace barrier with sequenced flush

Hello, Christoph.

On 08/13/2010 01:48 PM, Christoph Hellwig wrote:
> The patchset looks functionally correct to me, and with a small patch
> to make use of WRITE_FUA_FLUSH it survives xfstests; instrumenting the
> underlying qemu shows that we actually get the flush requests where we
> should.

Great.

> No performance or power fail testing done yet.
>
> But I do not like the transition very much. The new WRITE_FUA_FLUSH
> request is exactly what filesystems expect from a current barrier
> request, so I'd rather move to that functionality without breaking
> stuff in between.
>
> So if it were up to me I'd keep patches 1, 2, 4 and 5 from your
> series, then a main one to relax barrier semantics, then have the
> renaming patches 7 and 8, possibly keep patch 11 separate from the
> main implementation change, and if absolutely necessary also a
> separate one to introduce REQ_FUA and REQ_FLUSH in the bio interface,
> but keep things working while doing this.

There are two reasons to avoid changing the meaning of REQ_HARDBARRIER
and just deprecate it. One is to avoid breaking filesystems'
expectations underneath it. Please note that there are out-of-tree
filesystems too. I think it would be too dangerous to relax
REQ_HARDBARRIER.

Another is that pseudo block layer drivers (loop, virtio_blk,
md/dm...) have assumptions about REQ_HARDBARRIER behavior and things
would be broken in obscure ways between REQ_HARDBARRIER semantics
change and updates to each of those drivers, so I don't really think
changing the semantics while the mechanism is online is a good idea.

> Then we can have patches to disable the reiserfs barrier "optimization"
> as the very first one, and the DM/MD support which I'm currently working
> on as the last one, and we can start doing the heavy testing.

Oops, I've already converted loop, virtio_blk/lguest and am working on
md/dm right now too. I'm almost done with md and now doing dm. :-)
Maybe we should post them right now so that we don't waste too much
time trying to solve the same problems?

Thanks.

--
tejun

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
08-13-2010, 02:51 PM
Tejun Heo

block: replace barrier with sequenced flush

Hello,

On 08/13/2010 04:38 PM, Christoph Hellwig wrote:
> On Fri, Aug 13, 2010 at 03:48:59PM +0200, Tejun Heo wrote:
>> There are two reasons to avoid changing the meaning of REQ_HARDBARRIER
>> and just deprecate it. One is to avoid breaking filesystems'
>> expectations underneath it. Please note that there are out-of-tree
>> filesystems too. I think it would be too dangerous to relax
>> REQ_HARDBARRIER.
>
> Note that the renaming patch would include a move from REQ_HARDBARRIER
> to REQ_FLUSH_FUA, so things just using REQ_HARDBARRIER will fail to
> compile. And while out-of-tree filesystems do exist, it's their
> problem to keep up with kernel changes. They decided not to be part
> of the Linux kernel, so it'll be their job to keep up with it.

Oh, right, we can simply remove REQ_HARDBARRIER completely.

>> Another is that pseudo block layer drivers (loop, virtio_blk,
>> md/dm...) have assumptions about REQ_HARDBARRIER behavior and things
>> would be broken in obscure ways between REQ_HARDBARRIER semantics
>> change and updates to each of those drivers, so I don't really think
>> changing the semantics while the mechanism is online is a good idea.
>
> I don't think doing those changes in a separate commit is a good idea.

Do you want to change the whole thing in a single commit? That would
be a pretty big invasive patch touching multiple subsystems. Also, I
don't know what to do about drbd and would like to leave its
conversion to the maintainer (in separate patches).

Eh, well, this is mostly logistics. Jens, what do you think?

>>> Then we can have patches to disable the reiserfs barrier "optimization"
>>> as the very first one, and the DM/MD support which I'm currently working
>>> on as the last one, and we can start doing the heavy testing.
>>
>> Oops, I've already converted loop, virtio_blk/lguest and am working on
>> md/dm right now too. I'm almost done with md and now doing dm. :-)
>> Maybe we should post them right now so that we don't waste too much
>> time trying to solve the same problems?
>
> Here's the dm patch. It only handles normal bio-based dm yet, which
> I understand and can test. Request-based dm (multipath) still needs
> work.

Here's the combined patch I've been working on. I've verified loop
and virtio_blk/lguest. I just (like five minutes ago) got the md/dm
conversions compiling, so I'm sure they're still broken. The neat
part is that, thanks to the separation between REQ_FLUSH and FUA
handling, bio-mangling drivers only have to sequence the pre-flush
and can pass FUA directly to the lower layers, which in many cases
saves an array-wide cache flush cycle.
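
For illustration (an editorial sketch, not part of the patch; the
example_* names are made up), the sequencing a bio-mangling driver
needs under this scheme looks roughly like this: issue an empty
pre-flush, and once it completes resubmit the data bio with REQ_FLUSH
cleared so that REQ_FUA goes straight down to the lower device. It
loosely parallels the md_flush_request()/submit_flushes() pattern in
the md hunks of the patch below.

#include <linux/bio.h>
#include <linux/blkdev.h>

/* Pre-flush completed: forward the data bio, keeping REQ_FUA intact. */
static void example_preflush_end_io(struct bio *flush_bio, int err)
{
	struct bio *data_bio = flush_bio->bi_private;

	bio_put(flush_bio);
	if (err) {
		bio_endio(data_bio, err);
		return;
	}
	data_bio->bi_rw &= ~REQ_FLUSH;
	generic_make_request(data_bio);
}

/* Remap a write that may carry REQ_FLUSH and/or REQ_FUA. */
static void example_handle_write(struct block_device *lower, struct bio *bio)
{
	struct bio *flush_bio;

	bio->bi_bdev = lower;

	if (!(bio->bi_rw & REQ_FLUSH)) {
		/* REQ_FUA alone needs no sequencing; the device handles it. */
		generic_make_request(bio);
		return;
	}

	/* Sequence an empty pre-flush ahead of the data. */
	flush_bio = bio_alloc(GFP_NOIO, 0);
	flush_bio->bi_bdev = lower;
	flush_bio->bi_private = bio;
	flush_bio->bi_end_io = example_preflush_end_io;
	submit_bio(WRITE_FLUSH, flush_bio);
}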

After getting this patch working, the only remaining bits would be
blktrace and drbd.

Thanks.

Documentation/lguest/lguest.c | 36 +++-----
drivers/block/loop.c | 18 ++--
drivers/block/virtio_blk.c | 26 ++---
drivers/md/dm-io.c | 20 ----
drivers/md/dm-log.c | 2
drivers/md/dm-raid1.c | 8 -
drivers/md/dm-snap-persistent.c | 2
drivers/md/dm.c | 176 +++++++++++++++++++--------------------
drivers/md/linear.c | 4
drivers/md/md.c | 117 +++++---------------------
drivers/md/md.h | 23 +----
drivers/md/multipath.c | 4
drivers/md/raid0.c | 4
drivers/md/raid1.c | 178 +++++++++++++---------------------------
drivers/md/raid1.h | 2
drivers/md/raid10.c | 6 -
drivers/md/raid5.c | 18 +---
include/linux/virtio_blk.h | 6 +
18 files changed, 244 insertions(+), 406 deletions(-)

Index: block/drivers/block/loop.c
===================================================================
--- block.orig/drivers/block/loop.c
+++ block/drivers/block/loop.c
@@ -477,17 +477,17 @@ static int do_bio_filebacked(struct loop
pos = ((loff_t) bio->bi_sector << 9) + lo->lo_offset;

if (bio_rw(bio) == WRITE) {
- bool barrier = (bio->bi_rw & REQ_HARDBARRIER);
struct file *file = lo->lo_backing_file;

- if (barrier) {
- if (unlikely(!file->f_op->fsync)) {
- ret = -EOPNOTSUPP;
- goto out;
- }
+ /* REQ_HARDBARRIER is deprecated */
+ if (bio->bi_rw & REQ_HARDBARRIER) {
+ ret = -EOPNOTSUPP;
+ goto out;
+ }

+ if (bio->bi_rw & REQ_FLUSH) {
ret = vfs_fsync(file, 0);
- if (unlikely(ret)) {
+ if (unlikely(ret && ret != -EINVAL)) {
ret = -EIO;
goto out;
}
@@ -495,9 +495,9 @@ static int do_bio_filebacked(struct loop

ret = lo_send(lo, bio, pos);

- if (barrier && !ret) {
+ if ((bio->bi_rw & REQ_FUA) && !ret) {
ret = vfs_fsync(file, 0);
- if (unlikely(ret))
+ if (unlikely(ret && ret != -EINVAL))
ret = -EIO;
}
} else
Index: block/drivers/block/virtio_blk.c
===================================================================
--- block.orig/drivers/block/virtio_blk.c
+++ block/drivers/block/virtio_blk.c
@@ -128,9 +128,6 @@ static bool do_req(struct request_queue
}
}

- if (vbr->req->cmd_flags & REQ_HARDBARRIER)
- vbr->out_hdr.type |= VIRTIO_BLK_T_BARRIER;
-
sg_set_buf(&vblk->sg[out++], &vbr->out_hdr, sizeof(vbr->out_hdr));

/*
@@ -157,6 +154,8 @@ static bool do_req(struct request_queue
if (rq_data_dir(vbr->req) == WRITE) {
vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
out += num;
+ if (req->cmd_flags & REQ_FUA)
+ vbr->out_hdr.type |= VIRTIO_BLK_T_FUA;
} else {
vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
in += num;
@@ -307,6 +306,7 @@ static int __devinit virtblk_probe(struc
{
struct virtio_blk *vblk;
struct request_queue *q;
+ unsigned int flush;
int err;
u64 cap;
u32 v, blk_size, sg_elems, opt_io_size;
@@ -388,15 +388,13 @@ static int __devinit virtblk_probe(struc
vblk->disk->driverfs_dev = &vdev->dev;
index++;

- /*
- * If the FLUSH feature is supported we do have support for
- * flushing a volatile write cache on the host. Use that to
- * implement write barrier support; otherwise, we must assume
- * that the host does not perform any kind of volatile write
- * caching.
- */
+ /* configure queue flush support */
+ flush = 0;
if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH))
- blk_queue_flush(q, REQ_FLUSH);
+ flush |= REQ_FLUSH;
+ if (virtio_has_feature(vdev, VIRTIO_BLK_F_FUA))
+ flush |= REQ_FUA;
+ blk_queue_flush(q, flush);

/* If disk is read-only in the host, the guest should obey */
if (virtio_has_feature(vdev, VIRTIO_BLK_F_RO))
@@ -515,9 +513,9 @@ static const struct virtio_device_id id_
};

static unsigned int features[] = {
- VIRTIO_BLK_F_BARRIER, VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX,
- VIRTIO_BLK_F_GEOMETRY, VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE,
- VIRTIO_BLK_F_SCSI, VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY
+ VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX, VIRTIO_BLK_F_GEOMETRY,
+ VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE, VIRTIO_BLK_F_SCSI,
+ VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_FUA,
};

/*
Index: block/include/linux/virtio_blk.h
===================================================================
--- block.orig/include/linux/virtio_blk.h
+++ block/include/linux/virtio_blk.h
@@ -16,6 +16,7 @@
#define VIRTIO_BLK_F_SCSI 7 /* Supports scsi command passthru */
#define VIRTIO_BLK_F_FLUSH 9 /* Cache flush command support */
#define VIRTIO_BLK_F_TOPOLOGY 10 /* Topology information is available */
+#define VIRTIO_BLK_F_FUA 11 /* Forced Unit Access write support */

#define VIRTIO_BLK_ID_BYTES 20 /* ID string length */

@@ -70,7 +71,10 @@ struct virtio_blk_config {
#define VIRTIO_BLK_T_FLUSH 4

/* Get device ID command */
-#define VIRTIO_BLK_T_GET_ID 8
+#define VIRTIO_BLK_T_GET_ID 8
+
+/* FUA command */
+#define VIRTIO_BLK_T_FUA 16

/* Barrier before this op. */
#define VIRTIO_BLK_T_BARRIER 0x80000000
Index: block/Documentation/lguest/lguest.c
===================================================================
--- block.orig/Documentation/lguest/lguest.c
+++ block/Documentation/lguest/lguest.c
@@ -1639,15 +1639,6 @@ static void blk_request(struct virtqueue
off = out->sector * 512;

/*
- * The block device implements "barriers", where the Guest indicates
- * that it wants all previous writes to occur before this write. We
- * don't have a way of asking our kernel to do a barrier, so we just
- * synchronize all the data in the file. Pretty poor, no?
- */
- if (out->type & VIRTIO_BLK_T_BARRIER)
- fdatasync(vblk->fd);
-
- /*
* In general the virtio block driver is allowed to try SCSI commands.
* It'd be nice if we supported eject, for example, but we don't.
*/
@@ -1679,6 +1670,19 @@ static void blk_request(struct virtqueue
/* Die, bad Guest, die. */
errx(1, "Write past end %llu+%u", off, ret);
}
+
+ /* Honor FUA by syncing everything. */
+ if (ret >= 0 && (out->type & VIRTIO_BLK_T_FUA)) {
+ ret = fdatasync(vblk->fd);
+ verbose("FUA fdatasync: %i
", ret);
+ }
+
+ wlen = sizeof(*in);
+ *in = (ret >= 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR);
+ } else if (out->type & VIRTIO_BLK_T_FLUSH) {
+ /* Flush */
+ ret = fdatasync(vblk->fd);
+ verbose("FLUSH fdatasync: %i
", ret);
wlen = sizeof(*in);
*in = (ret >= 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR);
} else {
@@ -1702,15 +1706,6 @@ static void blk_request(struct virtqueue
}
}

- /*
- * OK, so we noted that it was pretty poor to use an fdatasync as a
- * barrier. But Christoph Hellwig points out that we need a sync
- * *afterwards* as well: "Barriers specify no reordering to the front
- * or the back." And Jens Axboe confirmed it, so here we are:
- */
- if (out->type & VIRTIO_BLK_T_BARRIER)
- fdatasync(vblk->fd);
-
/* Finished that request. */
add_used(vq, head, wlen);
}
@@ -1735,8 +1730,9 @@ static void setup_block_file(const char
vblk->fd = open_or_die(filename, O_RDWR|O_LARGEFILE);
vblk->len = lseek64(vblk->fd, 0, SEEK_END);

- /* We support barriers. */
- add_feature(dev, VIRTIO_BLK_F_BARRIER);
+ /* We support FLUSH and FUA. */
+ add_feature(dev, VIRTIO_BLK_F_FLUSH);
+ add_feature(dev, VIRTIO_BLK_F_FUA);

/* Tell Guest how many sectors this device has. */
conf.capacity = cpu_to_le64(vblk->len / 512);
Index: block/drivers/md/linear.c
===================================================================
--- block.orig/drivers/md/linear.c
+++ block/drivers/md/linear.c
@@ -294,8 +294,8 @@ static int linear_make_request (mddev_t
dev_info_t *tmp_dev;
sector_t start_sector;

- if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
- md_barrier_request(mddev, bio);
+ if (unlikely(bio->bi_rw & REQ_FLUSH)) {
+ md_flush_request(mddev, bio);
return 0;
}

Index: block/drivers/md/md.c
===================================================================
--- block.orig/drivers/md/md.c
+++ block/drivers/md/md.c
@@ -226,12 +226,12 @@ static int md_make_request(struct reques
return 0;
}
rcu_read_lock();
- if (mddev->suspended || mddev->barrier) {
+ if (mddev->suspended) {
DEFINE_WAIT(__wait);
for (;;) {
prepare_to_wait(&mddev->sb_wait, &__wait,
TASK_UNINTERRUPTIBLE);
- if (!mddev->suspended && !mddev->barrier)
+ if (!mddev->suspended)
break;
rcu_read_unlock();
schedule();
@@ -280,40 +280,29 @@ static void mddev_resume(mddev_t *mddev)

int mddev_congested(mddev_t *mddev, int bits)
{
- if (mddev->barrier)
- return 1;
return mddev->suspended;
}
EXPORT_SYMBOL(mddev_congested);

/*
- * Generic barrier handling for md
+ * Generic flush handling for md
*/

-#define POST_REQUEST_BARRIER ((void*)1)
-
-static void md_end_barrier(struct bio *bio, int err)
+static void md_end_flush(struct bio *bio, int err)
{
mdk_rdev_t *rdev = bio->bi_private;
mddev_t *mddev = rdev->mddev;
- if (err == -EOPNOTSUPP && mddev->barrier != POST_REQUEST_BARRIER)
- set_bit(BIO_EOPNOTSUPP, &mddev->barrier->bi_flags);

rdev_dec_pending(rdev, mddev);

if (atomic_dec_and_test(&mddev->flush_pending)) {
- if (mddev->barrier == POST_REQUEST_BARRIER) {
- /* This was a post-request barrier */
- mddev->barrier = NULL;
- wake_up(&mddev->sb_wait);
- } else
- /* The pre-request barrier has finished */
- schedule_work(&mddev->barrier_work);
+ /* The pre-request flush has finished */
+ schedule_work(&mddev->flush_work);
}
bio_put(bio);
}

-static void submit_barriers(mddev_t *mddev)
+static void submit_flushes(mddev_t *mddev)
{
mdk_rdev_t *rdev;

@@ -330,60 +319,56 @@ static void submit_barriers(mddev_t *mdd
atomic_inc(&rdev->nr_pending);
rcu_read_unlock();
bi = bio_alloc(GFP_KERNEL, 0);
- bi->bi_end_io = md_end_barrier;
+ bi->bi_end_io = md_end_flush;
bi->bi_private = rdev;
bi->bi_bdev = rdev->bdev;
atomic_inc(&mddev->flush_pending);
- submit_bio(WRITE_BARRIER, bi);
+ submit_bio(WRITE_FLUSH, bi);
rcu_read_lock();
rdev_dec_pending(rdev, mddev);
}
rcu_read_unlock();
}

-static void md_submit_barrier(struct work_struct *ws)
+static void md_submit_flush_data(struct work_struct *ws)
{
- mddev_t *mddev = container_of(ws, mddev_t, barrier_work);
- struct bio *bio = mddev->barrier;
+ mddev_t *mddev = container_of(ws, mddev_t, flush_work);
+ struct bio *bio = mddev->flush_bio;

atomic_set(&mddev->flush_pending, 1);

- if (test_bit(BIO_EOPNOTSUPP, &bio->bi_flags))
- bio_endio(bio, -EOPNOTSUPP);
- else if (bio->bi_size == 0)
+ if (bio->bi_size == 0)
/* an empty barrier - all done */
bio_endio(bio, 0);
else {
- bio->bi_rw &= ~REQ_HARDBARRIER;
+ bio->bi_rw &= ~REQ_FLUSH;
if (mddev->pers->make_request(mddev, bio))
generic_make_request(bio);
- mddev->barrier = POST_REQUEST_BARRIER;
- submit_barriers(mddev);
}
if (atomic_dec_and_test(&mddev->flush_pending)) {
- mddev->barrier = NULL;
+ mddev->flush_bio = NULL;
wake_up(&mddev->sb_wait);
}
}

-void md_barrier_request(mddev_t *mddev, struct bio *bio)
+void md_flush_request(mddev_t *mddev, struct bio *bio)
{
spin_lock_irq(&mddev->write_lock);
wait_event_lock_irq(mddev->sb_wait,
- !mddev->barrier,
+ !mddev->flush_bio,
mddev->write_lock, /*nothing*/);
- mddev->barrier = bio;
+ mddev->flush_bio = bio;
spin_unlock_irq(&mddev->write_lock);

atomic_set(&mddev->flush_pending, 1);
- INIT_WORK(&mddev->barrier_work, md_submit_barrier);
+ INIT_WORK(&mddev->flush_work, md_submit_flush_data);

- submit_barriers(mddev);
+ submit_flushes(mddev);

if (atomic_dec_and_test(&mddev->flush_pending))
- schedule_work(&mddev->barrier_work);
+ schedule_work(&mddev->flush_work);
}
-EXPORT_SYMBOL(md_barrier_request);
+EXPORT_SYMBOL(md_flush_request);

static inline mddev_t *mddev_get(mddev_t *mddev)
{
@@ -642,31 +627,6 @@ static void super_written(struct bio *bi
bio_put(bio);
}

-static void super_written_barrier(struct bio *bio, int error)
-{
- struct bio *bio2 = bio->bi_private;
- mdk_rdev_t *rdev = bio2->bi_private;
- mddev_t *mddev = rdev->mddev;
-
- if (!test_bit(BIO_UPTODATE, &bio->bi_flags) &&
- error == -EOPNOTSUPP) {
- unsigned long flags;
- /* barriers don't appear to be supported :-( */
- set_bit(BarriersNotsupp, &rdev->flags);
- mddev->barriers_work = 0;
- spin_lock_irqsave(&mddev->write_lock, flags);
- bio2->bi_next = mddev->biolist;
- mddev->biolist = bio2;
- spin_unlock_irqrestore(&mddev->write_lock, flags);
- wake_up(&mddev->sb_wait);
- bio_put(bio);
- } else {
- bio_put(bio2);
- bio->bi_private = rdev;
- super_written(bio, error);
- }
-}
-
void md_super_write(mddev_t *mddev, mdk_rdev_t *rdev,
sector_t sector, int size, struct page *page)
{
@@ -675,51 +635,28 @@ void md_super_write(mddev_t *mddev, mdk_
* and decrement it on completion, waking up sb_wait
* if zero is reached.
* If an error occurred, call md_error
- *
- * As we might need to resubmit the request if REQ_HARDBARRIER
- * causes ENOTSUPP, we allocate a spare bio...
*/
struct bio *bio = bio_alloc(GFP_NOIO, 1);
- int rw = REQ_WRITE | REQ_SYNC | REQ_UNPLUG;

bio->bi_bdev = rdev->bdev;
bio->bi_sector = sector;
bio_add_page(bio, page, size, 0);
bio->bi_private = rdev;
bio->bi_end_io = super_written;
- bio->bi_rw = rw;

atomic_inc(&mddev->pending_writes);
- if (!test_bit(BarriersNotsupp, &rdev->flags)) {
- struct bio *rbio;
- rw |= REQ_HARDBARRIER;
- rbio = bio_clone(bio, GFP_NOIO);
- rbio->bi_private = bio;
- rbio->bi_end_io = super_written_barrier;
- submit_bio(rw, rbio);
- } else
- submit_bio(rw, bio);
+ submit_bio(REQ_WRITE | REQ_SYNC | REQ_UNPLUG | REQ_FLUSH | REQ_FUA,
+ bio);
}

void md_super_wait(mddev_t *mddev)
{
- /* wait for all superblock writes that were scheduled to complete.
- * if any had to be retried (due to BARRIER problems), retry them
- */
+ /* wait for all superblock writes that were scheduled to complete */
DEFINE_WAIT(wq);
for(;;) {
prepare_to_wait(&mddev->sb_wait, &wq, TASK_UNINTERRUPTIBLE);
if (atomic_read(&mddev->pending_writes)==0)
break;
- while (mddev->biolist) {
- struct bio *bio;
- spin_lock_irq(&mddev->write_lock);
- bio = mddev->biolist;
- mddev->biolist = bio->bi_next ;
- bio->bi_next = NULL;
- spin_unlock_irq(&mddev->write_lock);
- submit_bio(bio->bi_rw, bio);
- }
schedule();
}
finish_wait(&mddev->sb_wait, &wq);
@@ -1016,7 +953,6 @@ static int super_90_validate(mddev_t *md
clear_bit(Faulty, &rdev->flags);
clear_bit(In_sync, &rdev->flags);
clear_bit(WriteMostly, &rdev->flags);
- clear_bit(BarriersNotsupp, &rdev->flags);

if (mddev->raid_disks == 0) {
mddev->major_version = 0;
@@ -1431,7 +1367,6 @@ static int super_1_validate(mddev_t *mdd
clear_bit(Faulty, &rdev->flags);
clear_bit(In_sync, &rdev->flags);
clear_bit(WriteMostly, &rdev->flags);
- clear_bit(BarriersNotsupp, &rdev->flags);

if (mddev->raid_disks == 0) {
mddev->major_version = 1;
@@ -4463,7 +4398,6 @@ static int md_run(mddev_t *mddev)
/* may be over-ridden by personality */
mddev->resync_max_sectors = mddev->dev_sectors;

- mddev->barriers_work = 1;
mddev->ok_start_degraded = start_dirty_degraded;

if (start_readonly && mddev->ro == 0)
@@ -4638,7 +4572,6 @@ static void md_clean(mddev_t *mddev)
mddev->recovery = 0;
mddev->in_sync = 0;
mddev->degraded = 0;
- mddev->barriers_work = 0;
mddev->safemode = 0;
mddev->bitmap_info.offset = 0;
mddev->bitmap_info.default_offset = 0;
Index: block/drivers/md/md.h
===================================================================
--- block.orig/drivers/md/md.h
+++ block/drivers/md/md.h
@@ -67,7 +67,6 @@ struct mdk_rdev_s
#define Faulty 1 /* device is known to have a fault */
#define In_sync 2 /* device is in_sync with rest of array */
#define WriteMostly 4 /* Avoid reading if at all possible */
-#define BarriersNotsupp 5 /* REQ_HARDBARRIER is not supported */
#define AllReserved 6 /* If whole device is reserved for
* one array */
#define AutoDetected 7 /* added by auto-detect */
@@ -249,13 +248,6 @@ struct mddev_s
int degraded; /* whether md should consider
* adding a spare
*/
- int barriers_work; /* initialised to true, cleared as soon
- * as a barrier request to slave
- * fails. Only supported
- */
- struct bio *biolist; /* bios that need to be retried
- * because REQ_HARDBARRIER is not supported
- */

atomic_t recovery_active; /* blocks scheduled, but not written */
wait_queue_head_t recovery_wait;
@@ -308,16 +300,13 @@ struct mddev_s
struct list_head all_mddevs;

struct attribute_group *to_remove;
- /* Generic barrier handling.
- * If there is a pending barrier request, all other
- * writes are blocked while the devices are flushed.
- * The last to finish a flush schedules a worker to
- * submit the barrier request (without the barrier flag),
- * then submit more flush requests.
+ /* Generic flush handling.
+ * The last to finish preflush schedules a worker to submit
+ * the rest of the request (without the REQ_FLUSH flag).
*/
- struct bio *barrier;
+ struct bio *flush_bio;
atomic_t flush_pending;
- struct work_struct barrier_work;
+ struct work_struct flush_work;
};


@@ -458,7 +447,7 @@ extern void md_done_sync(mddev_t *mddev,
extern void md_error(mddev_t *mddev, mdk_rdev_t *rdev);

extern int mddev_congested(mddev_t *mddev, int bits);
-extern void md_barrier_request(mddev_t *mddev, struct bio *bio);
+extern void md_flush_request(mddev_t *mddev, struct bio *bio);
extern void md_super_write(mddev_t *mddev, mdk_rdev_t *rdev,
sector_t sector, int size, struct page *page);
extern void md_super_wait(mddev_t *mddev);
Index: block/drivers/md/raid0.c
===================================================================
--- block.orig/drivers/md/raid0.c
+++ block/drivers/md/raid0.c
@@ -483,8 +483,8 @@ static int raid0_make_request(mddev_t *m
struct strip_zone *zone;
mdk_rdev_t *tmp_dev;

- if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
- md_barrier_request(mddev, bio);
+ if (unlikely(bio->bi_rw & REQ_FLUSH)) {
+ md_flush_request(mddev, bio);
return 0;
}

Index: block/drivers/md/raid1.c
===================================================================
--- block.orig/drivers/md/raid1.c
+++ block/drivers/md/raid1.c
@@ -319,83 +319,74 @@ static void raid1_end_write_request(stru
if (r1_bio->bios[mirror] == bio)
break;

- if (error == -EOPNOTSUPP && test_bit(R1BIO_Barrier, &r1_bio->state)) {
- set_bit(BarriersNotsupp, &conf->mirrors[mirror].rdev->flags);
- set_bit(R1BIO_BarrierRetry, &r1_bio->state);
- r1_bio->mddev->barriers_work = 0;
- /* Don't rdev_dec_pending in this branch - keep it for the retry */
- } else {
+ /*
+ * 'one mirror IO has finished' event handler:
+ */
+ r1_bio->bios[mirror] = NULL;
+ to_put = bio;
+ if (!uptodate) {
+ md_error(r1_bio->mddev, conf->mirrors[mirror].rdev);
+ /* an I/O failed, we can't clear the bitmap */
+ set_bit(R1BIO_Degraded, &r1_bio->state);
+ } else
/*
- * this branch is our 'one mirror IO has finished' event handler:
+ * Set R1BIO_Uptodate in our master bio, so that we
+ * will return a good error code for to the higher
+ * levels even if IO on some other mirrored buffer
+ * fails.
+ *
+ * The 'master' represents the composite IO operation
+ * to user-side. So if something waits for IO, then it
+ * will wait for the 'master' bio.
*/
- r1_bio->bios[mirror] = NULL;
- to_put = bio;
- if (!uptodate) {
- md_error(r1_bio->mddev, conf->mirrors[mirror].rdev);
- /* an I/O failed, we can't clear the bitmap */
- set_bit(R1BIO_Degraded, &r1_bio->state);
- } else
- /*
- * Set R1BIO_Uptodate in our master bio, so that
- * we will return a good error code for to the higher
- * levels even if IO on some other mirrored buffer fails.
- *
- * The 'master' represents the composite IO operation to
- * user-side. So if something waits for IO, then it will
- * wait for the 'master' bio.
- */
- set_bit(R1BIO_Uptodate, &r1_bio->state);
+ set_bit(R1BIO_Uptodate, &r1_bio->state);
+
+ update_head_pos(mirror, r1_bio);

- update_head_pos(mirror, r1_bio);
+ if (behind) {
+ if (test_bit(WriteMostly, &conf->mirrors[mirror].rdev->flags))
+ atomic_dec(&r1_bio->behind_remaining);

- if (behind) {
- if (test_bit(WriteMostly, &conf->mirrors[mirror].rdev->flags))
- atomic_dec(&r1_bio->behind_remaining);
-
- /* In behind mode, we ACK the master bio once the I/O has safely
- * reached all non-writemostly disks. Setting the Returned bit
- * ensures that this gets done only once -- we don't ever want to
- * return -EIO here, instead we'll wait */
-
- if (atomic_read(&r1_bio->behind_remaining) >= (atomic_read(&r1_bio->remaining)-1) &&
- test_bit(R1BIO_Uptodate, &r1_bio->state)) {
- /* Maybe we can return now */
- if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
- struct bio *mbio = r1_bio->master_bio;
- PRINTK(KERN_DEBUG "raid1: behind end write sectors %llu-%llu\n",
- (unsigned long long) mbio->bi_sector,
- (unsigned long long) mbio->bi_sector +
- (mbio->bi_size >> 9) - 1);
- bio_endio(mbio, 0);
- }
+ /*
+ * In behind mode, we ACK the master bio once the I/O
+ * has safely reached all non-writemostly
+ * disks. Setting the Returned bit ensures that this
+ * gets done only once -- we don't ever want to return
+ * -EIO here, instead we'll wait
+ */
+ if (atomic_read(&r1_bio->behind_remaining) >= (atomic_read(&r1_bio->remaining)-1) &&
+ test_bit(R1BIO_Uptodate, &r1_bio->state)) {
+ /* Maybe we can return now */
+ if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
+ struct bio *mbio = r1_bio->master_bio;
+ PRINTK(KERN_DEBUG "raid1: behind end write sectors %llu-%llu\n",
+ (unsigned long long) mbio->bi_sector,
+ (unsigned long long) mbio->bi_sector +
+ (mbio->bi_size >> 9) - 1);
+ bio_endio(mbio, 0);
}
}
- rdev_dec_pending(conf->mirrors[mirror].rdev, conf->mddev);
}
+ rdev_dec_pending(conf->mirrors[mirror].rdev, conf->mddev);
+
/*
- *
* Let's see if all mirrored write operations have finished
* already.
*/
if (atomic_dec_and_test(&r1_bio->remaining)) {
- if (test_bit(R1BIO_BarrierRetry, &r1_bio->state))
- reschedule_retry(r1_bio);
- else {
- /* it really is the end of this request */
- if (test_bit(R1BIO_BehindIO, &r1_bio->state)) {
- /* free extra copy of the data pages */
- int i = bio->bi_vcnt;
- while (i--)
- safe_put_page(bio->bi_io_vec[i].bv_page);
- }
- /* clear the bitmap if all writes complete successfully */
- bitmap_endwrite(r1_bio->mddev->bitmap, r1_bio->sector,
- r1_bio->sectors,
- !test_bit(R1BIO_Degraded, &r1_bio->state),
- behind);
- md_write_end(r1_bio->mddev);
- raid_end_bio_io(r1_bio);
- }
+ if (test_bit(R1BIO_BehindIO, &r1_bio->state)) {
+ /* free extra copy of the data pages */
+ int i = bio->bi_vcnt;
+ while (i--)
+ safe_put_page(bio->bi_io_vec[i].bv_page);
+ }
+ /* clear the bitmap if all writes complete successfully */
+ bitmap_endwrite(r1_bio->mddev->bitmap, r1_bio->sector,
+ r1_bio->sectors,
+ !test_bit(R1BIO_Degraded, &r1_bio->state),
+ behind);
+ md_write_end(r1_bio->mddev);
+ raid_end_bio_io(r1_bio);
}

if (to_put)
@@ -787,17 +778,14 @@ static int make_request(mddev_t *mddev,
struct bio_list bl;
struct page **behind_pages = NULL;
const int rw = bio_data_dir(bio);
- const bool do_sync = (bio->bi_rw & REQ_SYNC);
- bool do_barriers;
+ const unsigned int do_sync = (bio->bi_rw & REQ_SYNC);
+ const unsigned int do_flush_fua = (bio->bi_rw & (REQ_FLUSH | REQ_FUA));
mdk_rdev_t *blocked_rdev;

/*
* Register the new request and wait if the reconstruction
* thread has put up a bar for new requests.
* Continue immediately if no resync is active currently.
- * We test barriers_work *after* md_write_start as md_write_start
- * may cause the first superblock write, and that will check out
- * if barriers work.
*/

md_write_start(mddev, bio); /* wait on superblock update early */
@@ -821,13 +809,6 @@ static int make_request(mddev_t *mddev,
}
finish_wait(&conf->wait_barrier, &w);
}
- if (unlikely(!mddev->barriers_work &&
- (bio->bi_rw & REQ_HARDBARRIER))) {
- if (rw == WRITE)
- md_write_end(mddev);
- bio_endio(bio, -EOPNOTSUPP);
- return 0;
- }

wait_barrier(conf);

@@ -877,7 +858,7 @@ static int make_request(mddev_t *mddev,
read_bio->bi_sector = r1_bio->sector + mirror->rdev->data_offset;
read_bio->bi_bdev = mirror->rdev->bdev;
read_bio->bi_end_io = raid1_end_read_request;
- read_bio->bi_rw = READ | do_sync;
+ read_bio->bi_rw = READ | do_sync | do_flush_fua;
read_bio->bi_private = r1_bio;

generic_make_request(read_bio);
@@ -959,10 +940,6 @@ static int make_request(mddev_t *mddev,
atomic_set(&r1_bio->remaining, 0);
atomic_set(&r1_bio->behind_remaining, 0);

- do_barriers = bio->bi_rw & REQ_HARDBARRIER;
- if (do_barriers)
- set_bit(R1BIO_Barrier, &r1_bio->state);
-
bio_list_init(&bl);
for (i = 0; i < disks; i++) {
struct bio *mbio;
@@ -975,7 +952,7 @@ static int make_request(mddev_t *mddev,
mbio->bi_sector = r1_bio->sector + conf->mirrors[i].rdev->data_offset;
mbio->bi_bdev = conf->mirrors[i].rdev->bdev;
mbio->bi_end_io = raid1_end_write_request;
- mbio->bi_rw = WRITE | do_barriers | do_sync;
+ mbio->bi_rw = WRITE | do_sync;
mbio->bi_private = r1_bio;

if (behind_pages) {
@@ -1631,41 +1608,6 @@ static void raid1d(mddev_t *mddev)
if (test_bit(R1BIO_IsSync, &r1_bio->state)) {
sync_request_write(mddev, r1_bio);
unplug = 1;
- } else if (test_bit(R1BIO_BarrierRetry, &r1_bio->state)) {
- /* some requests in the r1bio were REQ_HARDBARRIER
- * requests which failed with -EOPNOTSUPP. Hohumm..
- * Better resubmit without the barrier.
- * We know which devices to resubmit for, because
- * all others have had their bios[] entry cleared.
- * We already have a nr_pending reference on these rdevs.
- */
- int i;
- const bool do_sync = (r1_bio->master_bio->bi_rw & REQ_SYNC);
- clear_bit(R1BIO_BarrierRetry, &r1_bio->state);
- clear_bit(R1BIO_Barrier, &r1_bio->state);
- for (i=0; i < conf->raid_disks; i++)
- if (r1_bio->bios[i])
- atomic_inc(&r1_bio->remaining);
- for (i=0; i < conf->raid_disks; i++)
- if (r1_bio->bios[i]) {
- struct bio_vec *bvec;
- int j;
-
- bio = bio_clone(r1_bio->master_bio, GFP_NOIO);
- /* copy pages from the failed bio, as
- * this might be a write-behind device */
- __bio_for_each_segment(bvec, bio, j, 0)
- bvec->bv_page = bio_iovec_idx(r1_bio->bios[i], j)->bv_page;
- bio_put(r1_bio->bios[i]);
- bio->bi_sector = r1_bio->sector +
- conf->mirrors[i].rdev->data_offset;
- bio->bi_bdev = conf->mirrors[i].rdev->bdev;
- bio->bi_end_io = raid1_end_write_request;
- bio->bi_rw = WRITE | do_sync;
- bio->bi_private = r1_bio;
- r1_bio->bios[i] = bio;
- generic_make_request(bio);
- }
} else {
int disk;

Index: block/drivers/md/raid1.h
===================================================================
--- block.orig/drivers/md/raid1.h
+++ block/drivers/md/raid1.h
@@ -117,8 +117,6 @@ struct r1bio_s {
#define R1BIO_IsSync 1
#define R1BIO_Degraded 2
#define R1BIO_BehindIO 3
-#define R1BIO_Barrier 4
-#define R1BIO_BarrierRetry 5
/* For write-behind requests, we call bi_end_io when
* the last non-write-behind device completes, providing
* any write was successful. Otherwise we call when
Index: block/drivers/md/raid5.c
===================================================================
--- block.orig/drivers/md/raid5.c
+++ block/drivers/md/raid5.c
@@ -3278,7 +3278,7 @@ static void handle_stripe5(struct stripe

if (dec_preread_active) {
/* We delay this until after ops_run_io so that if make_request
- * is waiting on a barrier, it won't continue until the writes
+ * is waiting on a flush, it won't continue until the writes
* have actually been submitted.
*/
atomic_dec(&conf->preread_active_stripes);
@@ -3580,7 +3580,7 @@ static void handle_stripe6(struct stripe

if (dec_preread_active) {
/* We delay this until after ops_run_io so that if make_request
- * is waiting on a barrier, it won't continue until the writes
+ * is waiting on a flush, it won't continue until the writes
* have actually been submitted.
*/
atomic_dec(&conf->preread_active_stripes);
@@ -3958,14 +3958,8 @@ static int make_request(mddev_t *mddev,
const int rw = bio_data_dir(bi);
int remaining;

- if (unlikely(bi->bi_rw & REQ_HARDBARRIER)) {
- /* Drain all pending writes. We only really need
- * to ensure they have been submitted, but this is
- * easier.
- */
- mddev->pers->quiesce(mddev, 1);
- mddev->pers->quiesce(mddev, 0);
- md_barrier_request(mddev, bi);
+ if (unlikely(bi->bi_rw & REQ_FLUSH)) {
+ md_flush_request(mddev, bi);
return 0;
}

@@ -4083,7 +4077,7 @@ static int make_request(mddev_t *mddev,
finish_wait(&conf->wait_for_overlap, &w);
set_bit(STRIPE_HANDLE, &sh->state);
clear_bit(STRIPE_DELAYED, &sh->state);
- if (mddev->barrier &&
+ if (mddev->flush_bio &&
!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
atomic_inc(&conf->preread_active_stripes);
release_stripe(sh);
@@ -4106,7 +4100,7 @@ static int make_request(mddev_t *mddev,
bio_endio(bi, 0);
}

- if (mddev->barrier) {
+ if (mddev->flush_bio) {
/* We need to wait for the stripes to all be handled.
* So: wait for preread_active_stripes to drop to 0.
*/
Index: block/drivers/md/multipath.c
===================================================================
--- block.orig/drivers/md/multipath.c
+++ block/drivers/md/multipath.c
@@ -142,8 +142,8 @@ static int multipath_make_request(mddev_
struct multipath_bh * mp_bh;
struct multipath_info *multipath;

- if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
- md_barrier_request(mddev, bio);
+ if (unlikely(bio->bi_rw & REQ_FLUSH)) {
+ md_flush_request(mddev, bio);
return 0;
}

Index: block/drivers/md/raid10.c
===================================================================
--- block.orig/drivers/md/raid10.c
+++ block/drivers/md/raid10.c
@@ -799,13 +799,13 @@ static int make_request(mddev_t *mddev,
int i;
int chunk_sects = conf->chunk_mask + 1;
const int rw = bio_data_dir(bio);
- const bool do_sync = (bio->bi_rw & REQ_SYNC);
+ const unsigned int do_sync = (bio->bi_rw & REQ_SYNC);
struct bio_list bl;
unsigned long flags;
mdk_rdev_t *blocked_rdev;

- if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
- md_barrier_request(mddev, bio);
+ if (unlikely(bio->bi_rw & REQ_FLUSH)) {
+ md_flush_request(mddev, bio);
return 0;
}

Index: block/drivers/md/dm-io.c
===================================================================
--- block.orig/drivers/md/dm-io.c
+++ block/drivers/md/dm-io.c
@@ -31,7 +31,6 @@ struct dm_io_client {
*/
struct io {
unsigned long error_bits;
- unsigned long eopnotsupp_bits;
atomic_t count;
struct task_struct *sleeper;
struct dm_io_client *client;
@@ -130,11 +129,8 @@ static void retrieve_io_and_region_from_
*---------------------------------------------------------------*/
static void dec_count(struct io *io, unsigned int region, int error)
{
- if (error) {
+ if (error)
set_bit(region, &io->error_bits);
- if (error == -EOPNOTSUPP)
- set_bit(region, &io->eopnotsupp_bits);
- }

if (atomic_dec_and_test(&io->count)) {
if (io->sleeper)
@@ -310,8 +306,8 @@ static void do_region(int rw, unsigned r
sector_t remaining = where->count;

/*
- * where->count may be zero if rw holds a write barrier and we
- * need to send a zero-sized barrier.
+ * where->count may be zero if rw holds a flush and we need to
+ * send a zero-sized flush.
*/
do {
/*
@@ -364,7 +360,7 @@ static void dispatch_io(int rw, unsigned
*/
for (i = 0; i < num_regions; i++) {
*dp = old_pages;
- if (where[i].count || (rw & REQ_HARDBARRIER))
+ if (where[i].count || (rw & REQ_FLUSH))
do_region(rw, i, where + i, dp, io);
}

@@ -393,9 +389,7 @@ static int sync_io(struct dm_io_client *
return -EIO;
}

-retry:
io->error_bits = 0;
- io->eopnotsupp_bits = 0;
atomic_set(&io->count, 1); /* see dispatch_io() */
io->sleeper = current;
io->client = client;
@@ -412,11 +406,6 @@ retry:
}
set_current_state(TASK_RUNNING);

- if (io->eopnotsupp_bits && (rw & REQ_HARDBARRIER)) {
- rw &= ~REQ_HARDBARRIER;
- goto retry;
- }
-
if (error_bits)
*error_bits = io->error_bits;

@@ -437,7 +426,6 @@ static int async_io(struct dm_io_client

io = mempool_alloc(client->pool, GFP_NOIO);
io->error_bits = 0;
- io->eopnotsupp_bits = 0;
atomic_set(&io->count, 1); /* see dispatch_io() */
io->sleeper = NULL;
io->client = client;
Index: block/drivers/md/dm-raid1.c
===================================================================
--- block.orig/drivers/md/dm-raid1.c
+++ block/drivers/md/dm-raid1.c
@@ -259,7 +259,7 @@ static int mirror_flush(struct dm_target
struct dm_io_region io[ms->nr_mirrors];
struct mirror *m;
struct dm_io_request io_req = {
- .bi_rw = WRITE_BARRIER,
+ .bi_rw = WRITE_FLUSH,
.mem.type = DM_IO_KMEM,
.mem.ptr.bvec = NULL,
.client = ms->io_client,
@@ -629,7 +629,7 @@ static void do_write(struct mirror_set *
struct dm_io_region io[ms->nr_mirrors], *dest = io;
struct mirror *m;
struct dm_io_request io_req = {
- .bi_rw = WRITE | (bio->bi_rw & WRITE_BARRIER),
+ .bi_rw = WRITE | (bio->bi_rw & (WRITE_FLUSH | WRITE_FUA)),
.mem.type = DM_IO_BVEC,
.mem.ptr.bvec = bio->bi_io_vec + bio->bi_idx,
.notify.fn = write_callback,
@@ -670,7 +670,7 @@ static void do_writes(struct mirror_set
bio_list_init(&requeue);

while ((bio = bio_list_pop(writes))) {
- if (unlikely(bio_empty_barrier(bio))) {
+ if ((bio->bi_rw & REQ_FLUSH) && !bio_has_data(bio)) {
bio_list_add(&sync, bio);
continue;
}
@@ -1203,7 +1203,7 @@ static int mirror_end_io(struct dm_targe
* We need to dec pending if this was a write.
*/
if (rw == WRITE) {
- if (likely(!bio_empty_barrier(bio)))
+ if (!(bio->bi_rw & REQ_FLUSH) || bio_has_data(bio))
dm_rh_dec(ms->rh, map_context->ll);
return error;
}
Index: block/drivers/md/dm.c
===================================================================
--- block.orig/drivers/md/dm.c
+++ block/drivers/md/dm.c
@@ -139,21 +139,21 @@ struct mapped_device {
spinlock_t deferred_lock;

/*
- * An error from the barrier request currently being processed.
+ * An error from the flush request currently being processed.
*/
- int barrier_error;
+ int flush_error;

/*
- * Protect barrier_error from concurrent endio processing
+ * Protect flush_error from concurrent endio processing
* in request-based dm.
*/
- spinlock_t barrier_error_lock;
+ spinlock_t flush_error_lock;

/*
- * Processing queue (flush/barriers)
+ * Processing queue (flush)
*/
struct workqueue_struct *wq;
- struct work_struct barrier_work;
+ struct work_struct flush_work;

/* A pointer to the currently processing pre/post flush request */
struct request *flush_request;
@@ -195,8 +195,8 @@ struct mapped_device {
/* sysfs handle */
struct kobject kobj;

- /* zero-length barrier that will be cloned and submitted to targets */
- struct bio barrier_bio;
+ /* zero-length flush that will be cloned and submitted to targets */
+ struct bio flush_bio;
};

/*
@@ -507,7 +507,7 @@ static void end_io_acct(struct dm_io *io

/*
* After this is decremented the bio must not be touched if it is
- * a barrier.
+ * a flush.
*/
dm_disk(md)->part0.in_flight[rw] = pending =
atomic_dec_return(&md->pending[rw]);
@@ -621,7 +621,7 @@ static void dec_pending(struct dm_io *io
*/
spin_lock_irqsave(&md->deferred_lock, flags);
if (__noflush_suspending(md)) {
- if (!(io->bio->bi_rw & REQ_HARDBARRIER))
+ if (!(io->bio->bi_rw & REQ_FLUSH))
bio_list_add_head(&md->deferred,
io->bio);
} else
@@ -633,14 +633,14 @@ static void dec_pending(struct dm_io *io
io_error = io->error;
bio = io->bio;

- if (bio->bi_rw & REQ_HARDBARRIER) {
+ if (bio->bi_rw & REQ_FLUSH) {
/*
- * There can be just one barrier request so we use
+ * There can be just one flush request so we use
* a per-device variable for error reporting.
* Note that you can't touch the bio after end_io_acct
*/
- if (!md->barrier_error && io_error != -EOPNOTSUPP)
- md->barrier_error = io_error;
+ if (!md->flush_error)
+ md->flush_error = io_error;
end_io_acct(io);
free_io(md, io);
} else {
@@ -744,21 +744,18 @@ static void end_clone_bio(struct bio *cl
blk_update_request(tio->orig, 0, nr_bytes);
}

-static void store_barrier_error(struct mapped_device *md, int error)
+static void store_flush_error(struct mapped_device *md, int error)
{
unsigned long flags;

- spin_lock_irqsave(&md->barrier_error_lock, flags);
+ spin_lock_irqsave(&md->flush_error_lock, flags);
/*
- * Basically, the first error is taken, but:
- * -EOPNOTSUPP supersedes any I/O error.
- * Requeue request supersedes any I/O error but -EOPNOTSUPP.
- */
- if (!md->barrier_error || error == -EOPNOTSUPP ||
- (md->barrier_error != -EOPNOTSUPP &&
- error == DM_ENDIO_REQUEUE))
- md->barrier_error = error;
- spin_unlock_irqrestore(&md->barrier_error_lock, flags);
+ * Basically, the first error is taken, but requeue request
+ * supersedes any I/O error.
+ */
+ if (!md->flush_error || error == DM_ENDIO_REQUEUE)
+ md->flush_error = error;
+ spin_unlock_irqrestore(&md->flush_error_lock, flags);
}

/*
@@ -799,12 +796,12 @@ static void dm_end_request(struct reques
{
int rw = rq_data_dir(clone);
int run_queue = 1;
- bool is_barrier = clone->cmd_flags & REQ_HARDBARRIER;
+ bool is_flush = clone->cmd_flags & REQ_FLUSH;
struct dm_rq_target_io *tio = clone->end_io_data;
struct mapped_device *md = tio->md;
struct request *rq = tio->orig;

- if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_barrier) {
+ if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_flush) {
rq->errors = clone->errors;
rq->resid_len = clone->resid_len;

@@ -819,12 +816,13 @@ static void dm_end_request(struct reques

free_rq_clone(clone);

- if (unlikely(is_barrier)) {
+ if (!is_flush)
+ blk_end_request_all(rq, error);
+ else {
if (unlikely(error))
- store_barrier_error(md, error);
+ store_flush_error(md, error);
run_queue = 0;
- } else
- blk_end_request_all(rq, error);
+ }

rq_completed(md, rw, run_queue);
}
@@ -851,9 +849,9 @@ void dm_requeue_unmapped_request(struct
struct request_queue *q = rq->q;
unsigned long flags;

- if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
+ if (clone->cmd_flags & REQ_FLUSH) {
/*
- * Barrier clones share an original request.
+ * Flush clones share an original request.
* Leave it to dm_end_request(), which handles this special
* case.
*/
@@ -950,14 +948,14 @@ static void dm_complete_request(struct r
struct dm_rq_target_io *tio = clone->end_io_data;
struct request *rq = tio->orig;

- if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
+ if (clone->cmd_flags & REQ_FLUSH) {
/*
- * Barrier clones share an original request. So can't use
+ * Flush clones share an original request. So can't use
* softirq_done with the original.
* Pass the clone to dm_done() directly in this special case.
* It is safe (even if clone->q->queue_lock is held here)
* because there is no I/O dispatching during the completion
- * of barrier clone.
+ * of flush clone.
*/
dm_done(clone, error, true);
return;
@@ -979,9 +977,9 @@ void dm_kill_unmapped_request(struct req
struct dm_rq_target_io *tio = clone->end_io_data;
struct request *rq = tio->orig;

- if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
+ if (clone->cmd_flags & REQ_FLUSH) {
/*
- * Barrier clones share an original request.
+ * Flush clones share an original request.
* Leave it to dm_end_request(), which handles this special
* case.
*/
@@ -1098,7 +1096,7 @@ static void dm_bio_destructor(struct bio
}

/*
- * Creates a little bio that is just does part of a bvec.
+ * Creates a little bio that is just a part of a bvec.
*/
static struct bio *split_bvec(struct bio *bio, sector_t sector,
unsigned short idx, unsigned int offset,
@@ -1113,7 +1111,7 @@ static struct bio *split_bvec(struct bio

clone->bi_sector = sector;
clone->bi_bdev = bio->bi_bdev;
- clone->bi_rw = bio->bi_rw & ~REQ_HARDBARRIER;
+ clone->bi_rw = bio->bi_rw;
clone->bi_vcnt = 1;
clone->bi_size = to_bytes(len);
clone->bi_io_vec->bv_offset = offset;
@@ -1140,7 +1138,6 @@ static struct bio *clone_bio(struct bio

clone = bio_alloc_bioset(GFP_NOIO, bio->bi_max_vecs, bs);
__bio_clone(clone, bio);
- clone->bi_rw &= ~REQ_HARDBARRIER;
clone->bi_destructor = dm_bio_destructor;
clone->bi_sector = sector;
clone->bi_idx = idx;
@@ -1186,7 +1183,7 @@ static void __flush_target(struct clone_
__map_bio(ti, clone, tio);
}

-static int __clone_and_map_empty_barrier(struct clone_info *ci)
+static int __clone_and_map_flush(struct clone_info *ci)
{
unsigned target_nr = 0, flush_nr;
struct dm_target *ti;
@@ -1208,9 +1205,6 @@ static int __clone_and_map(struct clone_
sector_t len = 0, max;
struct dm_target_io *tio;

- if (unlikely(bio_empty_barrier(bio)))
- return __clone_and_map_empty_barrier(ci);
-
ti = dm_table_find_target(ci->map, ci->sector);
if (!dm_target_is_valid(ti))
return -EIO;
@@ -1308,11 +1302,11 @@ static void __split_and_process_bio(stru

ci.map = dm_get_live_table(md);
if (unlikely(!ci.map)) {
- if (!(bio->bi_rw & REQ_HARDBARRIER))
+ if (!(bio->bi_rw & REQ_FLUSH))
bio_io_error(bio);
else
- if (!md->barrier_error)
- md->barrier_error = -EIO;
+ if (!md->flush_error)
+ md->flush_error = -EIO;
return;
}

@@ -1325,14 +1319,22 @@ static void __split_and_process_bio(stru
ci.io->md = md;
spin_lock_init(&ci.io->endio_lock);
ci.sector = bio->bi_sector;
- ci.sector_count = bio_sectors(bio);
- if (unlikely(bio_empty_barrier(bio)))
+ if (!(bio->bi_rw & REQ_FLUSH))
+ ci.sector_count = bio_sectors(bio);
+ else {
+ /* FLUSH bio reaching here should all be empty */
+ WARN_ON_ONCE(bio_has_data(bio));
ci.sector_count = 1;
+ }
ci.idx = bio->bi_idx;

start_io_acct(ci.io);
- while (ci.sector_count && !error)
- error = __clone_and_map(&ci);
+ while (ci.sector_count && !error) {
+ if (!(bio->bi_rw & REQ_FLUSH))
+ error = __clone_and_map(&ci);
+ else
+ error = __clone_and_map_flush(&ci);
+ }

/* drop the extra reference count */
dec_pending(ci.io, error);
@@ -1417,11 +1419,11 @@ static int _dm_request(struct request_qu
part_stat_unlock();

/*
- * If we're suspended or the thread is processing barriers
+ * If we're suspended or the thread is processing flushes
* we have to queue this io for later.
*/
if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags)) ||
- unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
+ (bio->bi_rw & REQ_FLUSH)) {
up_read(&md->io_lock);

if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) &&
@@ -1464,10 +1466,7 @@ static int dm_request(struct request_que

static bool dm_rq_is_flush_request(struct request *rq)
{
- if (rq->cmd_flags & REQ_FLUSH)
- return true;
- else
- return false;
+ return rq->cmd_flags & REQ_FLUSH;
}

void dm_dispatch_request(struct request *rq)
@@ -1520,7 +1519,7 @@ static int setup_clone(struct request *c
if (dm_rq_is_flush_request(rq)) {
blk_rq_init(NULL, clone);
clone->cmd_type = REQ_TYPE_FS;
- clone->cmd_flags |= (REQ_HARDBARRIER | WRITE);
+ clone->cmd_flags |= (REQ_FLUSH | WRITE);
} else {
r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
dm_rq_bio_constructor, tio);
@@ -1668,7 +1667,7 @@ static void dm_request_fn(struct request
BUG_ON(md->flush_request);
md->flush_request = rq;
blk_start_request(rq);
- queue_work(md->wq, &md->barrier_work);
+ queue_work(md->wq, &md->flush_work);
goto out;
}

@@ -1843,7 +1842,7 @@ out:
static const struct block_device_operations dm_blk_dops;

static void dm_wq_work(struct work_struct *work);
-static void dm_rq_barrier_work(struct work_struct *work);
+static void dm_rq_flush_work(struct work_struct *work);

/*
* Allocate and initialise a blank device with a given minor.
@@ -1873,7 +1872,7 @@ static struct mapped_device *alloc_dev(i
init_rwsem(&md->io_lock);
mutex_init(&md->suspend_lock);
spin_lock_init(&md->deferred_lock);
- spin_lock_init(&md->barrier_error_lock);
+ spin_lock_init(&md->flush_error_lock);
rwlock_init(&md->map_lock);
atomic_set(&md->holders, 1);
atomic_set(&md->open_count, 0);
@@ -1918,7 +1917,7 @@ static struct mapped_device *alloc_dev(i
atomic_set(&md->pending[1], 0);
init_waitqueue_head(&md->wait);
INIT_WORK(&md->work, dm_wq_work);
- INIT_WORK(&md->barrier_work, dm_rq_barrier_work);
+ INIT_WORK(&md->flush_work, dm_rq_flush_work);
init_waitqueue_head(&md->eventq);

md->disk->major = _major;
@@ -2233,31 +2232,28 @@ static int dm_wait_for_completion(struct
return r;
}

-static void dm_flush(struct mapped_device *md)
+static void process_flush(struct mapped_device *md, struct bio *bio)
{
+ md->flush_error = 0;
+
+ /* handle REQ_FLUSH */
dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);

- bio_init(&md->barrier_bio);
- md->barrier_bio.bi_bdev = md->bdev;
- md->barrier_bio.bi_rw = WRITE_BARRIER;
- __split_and_process_bio(md, &md->barrier_bio);
+ bio_init(&md->flush_bio);
+ md->flush_bio.bi_bdev = md->bdev;
+ md->flush_bio.bi_rw = WRITE_FLUSH;
+ __split_and_process_bio(md, &md->flush_bio);

dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-}
-
-static void process_barrier(struct mapped_device *md, struct bio *bio)
-{
- md->barrier_error = 0;

- dm_flush(md);
+ bio->bi_rw &= ~REQ_FLUSH;

- if (!bio_empty_barrier(bio)) {
+ /* handle data + REQ_FUA */
+ if (bio_has_data(bio))
__split_and_process_bio(md, bio);
- dm_flush(md);
- }

- if (md->barrier_error != DM_ENDIO_REQUEUE)
- bio_endio(bio, md->barrier_error);
+ if (md->flush_error != DM_ENDIO_REQUEUE)
+ bio_endio(bio, md->flush_error);
else {
spin_lock_irq(&md->deferred_lock);
bio_list_add_head(&md->deferred, bio);
@@ -2291,8 +2287,8 @@ static void dm_wq_work(struct work_struc
if (dm_request_based(md))
generic_make_request(c);
else {
- if (c->bi_rw & REQ_HARDBARRIER)
- process_barrier(md, c);
+ if (c->bi_rw & REQ_FLUSH)
+ process_flush(md, c);
else
__split_and_process_bio(md, c);
}
@@ -2317,8 +2313,8 @@ static void dm_rq_set_flush_nr(struct re
tio->info.flush_request = flush_nr;
}

-/* Issue barrier requests to targets and wait for their completion. */
-static int dm_rq_barrier(struct mapped_device *md)
+/* Issue flush requests to targets and wait for their completion. */
+static int dm_rq_flush(struct mapped_device *md)
{
int i, j;
struct dm_table *map = dm_get_live_table(md);
@@ -2326,7 +2322,7 @@ static int dm_rq_barrier(struct mapped_d
struct dm_target *ti;
struct request *clone;

- md->barrier_error = 0;
+ md->flush_error = 0;

for (i = 0; i < num_targets; i++) {
ti = dm_table_get_target(map, i);
@@ -2341,26 +2337,26 @@ static int dm_rq_barrier(struct mapped_d
dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
dm_table_put(map);

- return md->barrier_error;
+ return md->flush_error;
}

-static void dm_rq_barrier_work(struct work_struct *work)
+static void dm_rq_flush_work(struct work_struct *work)
{
int error;
struct mapped_device *md = container_of(work, struct mapped_device,
- barrier_work);
+ flush_work);
struct request_queue *q = md->queue;
struct request *rq;
unsigned long flags;

/*
* Hold the md reference here and leave it at the last part so that
- * the md can't be deleted by device opener when the barrier request
+ * the md can't be deleted by device opener when the flush request
* completes.
*/
dm_get(md);

- error = dm_rq_barrier(md);
+ error = dm_rq_flush(md);

rq = md->flush_request;
md->flush_request = NULL;
@@ -2520,7 +2516,7 @@ int dm_suspend(struct mapped_device *md,
up_write(&md->io_lock);

/*
- * Request-based dm uses md->wq for barrier (dm_rq_barrier_work) which
+ * Request-based dm uses md->wq for flush (dm_rq_flush_work) which
* can be kicked until md->queue is stopped. So stop md->queue before
* flushing md->wq.
*/
Index: block/drivers/md/dm-log.c
===================================================================
--- block.orig/drivers/md/dm-log.c
+++ block/drivers/md/dm-log.c
@@ -300,7 +300,7 @@ static int flush_header(struct log_c *lc
.count = 0,
};

- lc->io_req.bi_rw = WRITE_BARRIER;
+ lc->io_req.bi_rw = WRITE_FLUSH;

return dm_io(&lc->io_req, 1, &null_location, NULL);
}
Index: block/drivers/md/dm-snap-persistent.c
===================================================================
--- block.orig/drivers/md/dm-snap-persistent.c
+++ block/drivers/md/dm-snap-persistent.c
@@ -687,7 +687,7 @@ static void persistent_commit_exception(
/*
* Commit exceptions to disk.
*/
- if (ps->valid && area_io(ps, WRITE_BARRIER))
+ if (ps->valid && area_io(ps, WRITE_FLUSH_FUA))
ps->valid = 0;

/*

--
tejun

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
08-17-2010, 09:59 AM
Tejun Heo

block: replace barrier with sequenced flush

Hello, Christoph.

On 08/14/2010 12:36 PM, Christoph Hellwig wrote:
> On Fri, Aug 13, 2010 at 04:51:17PM +0200, Tejun Heo wrote:
>> Do you want to change the whole thing in a single commit? That would
>> be a pretty big invasive patch touching multiple subsystems.
>
> We can just stop draining in the block layer in the first patch, then
> stop doing the stuff in md/dm/etc in the following ones and then do
> the final renaming patches. It would still be fewer patches than now,
> but keep things working through the whole transition, which would
> really help bisecting any problems.

I'm not really convinced that would help much. If bisecting can point
to the conversion as the culprit for whatever kind of failure,
wouldn't that be enough? No matter what we do, the conversion will be
a single-step thing. If we make the filesystems enforce the ordering
first and then relax ordering in the block layer, bisection would
still just point at the later patch. The same goes for md/dm; the
best we can find out would be whether the conversion is correct or
not anyway.

I'm not against restructuring the patchset if it makes more sense,
but it just feels like it would be a somewhat pointless effort (and
one which would require much tighter coordination among different
trees) at this point. Am I missing something?

>> + if (req->cmd_flags & REQ_FUA)
>> + vbr->out_hdr.type |= VIRTIO_BLK_T_FUA;
>
> I'd suggest not adding FUA support to virtio yet. Just using the flush
> feature gives you a fully working barrier implementation.
>
> Eventually we might want to add a flag in the block queue to send
> REQ_FLUSH|REQ_FUA requests through to virtio directly so that we can
> avoid separate pre- and post flushes, but I really want to benchmark if
> it makes an impact on real life setups first.

I wrote this in the other mail, but I think it would make a difference
if the backend storage is md/dm, especially if it's shared by multiple
VMs. It cuts down on one array-wide cache flush.

>> Index: block/drivers/md/linear.c
>> ===================================================================
>> --- block.orig/drivers/md/linear.c
>> +++ block/drivers/md/linear.c
>> @@ -294,8 +294,8 @@ static int linear_make_request (mddev_t
>> dev_info_t *tmp_dev;
>> sector_t start_sector;
>>
>> - if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
>> - md_barrier_request(mddev, bio);
>> + if (unlikely(bio->bi_rw & REQ_FLUSH)) {
>> + md_flush_request(mddev, bio);
>
> We only need the special md_flush_request handling for
> empty REQ_FLUSH requests. REQ_WRITE | REQ_FLUSH just need the
> flag propagated to the underlying devices.

Hmm, not really: the WRITE should happen only after all the data in
the cache has been committed to NV media, meaning that the empty FLUSH
should already have finished by the time the WRITE starts.
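
As a user-space analogue of that ordering (editorial sketch only, not
kernel code and not from the patch), this is essentially what the
loop.c hunk above does with vfs_fsync(): everything already cached
must be stable before the new write is issued, and with FUA the new
write itself must be stable before completion is reported.

#include <sys/types.h>
#include <unistd.h>

/* Editorial sketch: FLUSH+FUA write ordering against a file-backed backend. */
static ssize_t flush_write_fua(int fd, const void *buf, size_t len, off_t pos)
{
	ssize_t ret;

	if (fdatasync(fd) != 0)		/* pre-flush: commit previously cached data */
		return -1;

	ret = pwrite(fd, buf, len, pos);	/* the write itself */
	if (ret < 0)
		return ret;

	if (fdatasync(fd) != 0)		/* FUA: this write must be stable too */
		return -1;
	return ret;
}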

>> +static void md_end_flush(struct bio *bio, int err)
>> {
>> mdk_rdev_t *rdev = bio->bi_private;
>> mddev_t *mddev = rdev->mddev;
>>
>> rdev_dec_pending(rdev, mddev);
>>
>> if (atomic_dec_and_test(&mddev->flush_pending)) {
>> + /* The pre-request flush has finished */
>> + schedule_work(&mddev->flush_work);
>
> Once we only handle empty barriers here we can directly call bio_endio
> and the super wakeup instead of first scheduling a work queue.

Yeap, right. That would be a nice optimization.
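
For reference, a rough sketch (illustrative names only, not the real
md_end_flush) of the two completion paths being discussed: the last
member completion can finish an empty flush directly, while a flush
that still has a data write attached has to punt to process context:

#include <stdio.h>
#include <stdbool.h>
#include <stdatomic.h>

static atomic_int flush_pending = 2;	/* one count per member device */

static void end_upper_bio(void)       { printf("finish the empty flush right here\n"); }
static void schedule_flush_work(void) { printf("schedule work to submit the data write\n"); }

static void member_flush_done(bool bio_has_data)
{
	if (atomic_fetch_sub(&flush_pending, 1) != 1)
		return;				/* not the last completion yet */

	if (!bio_has_data)
		end_upper_bio();		/* no work item needed */
	else
		schedule_flush_work();		/* data write is issued from process context */
}

int main(void)
{
	member_flush_done(false);
	member_flush_done(false);		/* second call is the last one */
	return 0;
}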

>> while ((bio = bio_list_pop(writes))) {
>> - if (unlikely(bio_empty_barrier(bio))) {
>> + if ((bio->bi_rw & REQ_FLUSH) && !bio_has_data(bio)) {
>
> I kept bio_empty_barrier as bio_empty_flush, which actually is quite
> a useful macro for the bio-based drivers.

Hmm... maybe. The reason why I removed bio_empty_flush() was that,
except for the front-most sequencer (the block layer for all the
request-based drivers and the front-most make_request for the
bio-based ones), it doesn't make sense to see REQ_FLUSH + data bios.
They should be sequenced at the front-most stage anyway, so I didn't
have much use for the macro. Those code paths couldn't deal with
REQ_FLUSH + data bios anyway.

>> @@ -621,7 +621,7 @@ static void dec_pending(struct dm_io *io
>> */
>> spin_lock_irqsave(&md->deferred_lock, flags);
>> if (__noflush_suspending(md)) {
>> - if (!(io->bio->bi_rw & REQ_HARDBARRIER))
>> + if (!(io->bio->bi_rw & REQ_FLUSH))
>
> I suspect we don't actually need to special case flushes here anymore.

Oh, I'm not sure about this part at all. I'll ask Mike.

>> @@ -633,14 +633,14 @@ static void dec_pending(struct dm_io *io
>> io_error = io->error;
>> bio = io->bio;
>>
>> - if (bio->bi_rw & REQ_HARDBARRIER) {
>> + if (bio->bi_rw & REQ_FLUSH) {
>> /*
>> - * There can be just one barrier request so we use
>> + * There can be just one flush request so we use
>> * a per-device variable for error reporting.
>> * Note that you can't touch the bio after end_io_acct
>> */
>> - if (!md->barrier_error && io_error != -EOPNOTSUPP)
>> - md->barrier_error = io_error;
>> + if (!md->flush_error)
>> + md->flush_error = io_error;
>
> And we certainly do not need any special casing here. See my patch.

I wasn't sure about that part. You removed store_flush_error(), but
DM_ENDIO_REQUEUE should still have higher priority than other
failures, no?

>> {
>> int rw = rq_data_dir(clone);
>> int run_queue = 1;
>> - bool is_barrier = clone->cmd_flags & REQ_HARDBARRIER;
>> + bool is_flush = clone->cmd_flags & REQ_FLUSH;
>> struct dm_rq_target_io *tio = clone->end_io_data;
>> struct mapped_device *md = tio->md;
>> struct request *rq = tio->orig;
>>
>> - if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_barrier) {
>> + if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_flush) {
>
> We never send flush requests as REQ_TYPE_BLOCK_PC anymore, so no need
> for the second half of this conditional.

I see.

>> + if (!is_flush)
>> + blk_end_request_all(rq, error);
>> + else {
>> if (unlikely(error))
>> - store_barrier_error(md, error);
>> + store_flush_error(md, error);
>> run_queue = 0;
>> - } else
>> - blk_end_request_all(rq, error);
>> + }
>
> Flush requests can now be completed normally.

The same question as before. I think we still need to prioritize
DM_ENDIO_REQUEUE failures.

>> @@ -1417,11 +1419,11 @@ static int _dm_request(struct request_qu
>> part_stat_unlock();
>>
>> /*
>> - * If we're suspended or the thread is processing barriers
>> + * If we're suspended or the thread is processing flushes
>> * we have to queue this io for later.
>> */
>> if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags)) ||
>> - unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
>> + (bio->bi_rw & REQ_FLUSH)) {
>> up_read(&md->io_lock);
>
> AFAICS this is only needed for the old barrier code, no need for this
> for pure flushes.

I'll ask Mike.

>> @@ -1464,10 +1466,7 @@ static int dm_request(struct request_que
>>
>> static bool dm_rq_is_flush_request(struct request *rq)
>> {
>> - if (rq->cmd_flags & REQ_FLUSH)
>> - return true;
>> - else
>> - return false;
>> + return rq->cmd_flags & REQ_FLUSH;
>> }
>
> It's probably worth just killing this wrapper.

Yeah, probably. It was an accidental edit to begin with and I left
this part out in the new patch.

>> +static void process_flush(struct mapped_device *md, struct bio *bio)
>> {
>> + md->flush_error = 0;
>> +
>> + /* handle REQ_FLUSH */
>> dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
>>
>> - bio_init(&md->barrier_bio);
>> - md->barrier_bio.bi_bdev = md->bdev;
>> - md->barrier_bio.bi_rw = WRITE_BARRIER;
>> - __split_and_process_bio(md, &md->barrier_bio);
>> + bio_init(&md->flush_bio);
>> + md->flush_bio.bi_bdev = md->bdev;
>> + md->flush_bio.bi_rw = WRITE_FLUSH;
>> + __split_and_process_bio(md, &md->flush_bio);
>
> There's no need to use a separate flush_bio here.
> __split_and_process_bio does the right thing for empty REQ_FLUSH
> requests. See my patch for how to do this differently. And yeah,
> my version has been tested.

But how do you make sure the REQ_FLUSHes for the preflush finish
before the write starts?

Thanks.

--
tejun

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 08-17-2010, 04:41 PM
Tejun Heo
 
Default block: replace barrier with sequenced flush

Hi,

On 08/17/2010 03:19 PM, Christoph Hellwig wrote:
> On Tue, Aug 17, 2010 at 11:59:38AM +0200, Tejun Heo wrote:
>> I'm not against restructuring the patchset if it makes more sense but
>> it just feels like it would be a bit pointless effort (and one which
>> would require much tighter coordination among different trees) at this
>> point. Am I missing something?
>
> What other trees do you mean?

I was mostly thinking about dm/md, drbd and the like, but you're
talking about the filesystem conversion patches being routed through
the block tree, right?

> The conversions of the 8 filesystems that actually support barriers
> need to go through this tree anyway if we want to be able to test
> it. Also the changes in the filesystems are absolutely minimal -
> it's basically just s/WRITE_BARRIER/WRITE_FUA_FLUSH/ after my
> initial patch to kill BH_Ordered, and removing about 10 lines of
> code in reiserfs.

I might just resequence it to finish this part of the discussion, but
what does that really buy us? It's not really gonna help bisection.
Bisection won't be able to tell anything in higher resolution than
"the new implementation doesn't work". If you show me how it would
actually help, I'll happily reshuffle the patches.

>> I wasn't sure about that part. You removed store_flush_error(), but
>> DM_ENDIO_REQUEUE should still have higher priority than other
>> failures, no?
>
> Which priority?

IIUC, when any of the flushes gets DM_ENDIO_REQUEUE (which tells the
dm core layer to retry the whole bio later), it trumps all other
failures and the bio is retried later. That was why DM_ENDIO_REQUEUE
was prioritized over other error codes, which is actually sort of
incorrect in that once a FLUSH fails, it _MUST_ be reported to the
upper layers, as a FLUSH failure implies data has already been lost.
So, DM_ENDIO_REQUEUE actually should have lower priority than other
failures. But, then again, the error codes still need to be
prioritized.
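
As a sketch of what that prioritization could look like (purely
illustrative; these are not dm's actual result codes or helpers), the
merge of per-clone completion results would rank a hard error above a
requeue request for flushes:

#include <stdio.h>

#define RES_OK      0
#define RES_REQUEUE 1	/* target asked for the whole bio to be retried */
#define RES_EIO     2	/* hard error from the device */

/* higher return value == higher priority when merging results */
static int flush_result_prio(int res)
{
	switch (res) {
	case RES_OK:      return 0;
	case RES_REQUEUE: return 1;	/* must lose to a hard flush error */
	default:          return 2;
	}
}

static int merge_flush_result(int cur, int res)
{
	return flush_result_prio(res) > flush_result_prio(cur) ? res : cur;
}

int main(void)
{
	int final = RES_OK;

	final = merge_flush_result(final, RES_REQUEUE);
	final = merge_flush_result(final, RES_EIO);	/* the hard error sticks */
	printf("merged result: %d\n", final);		/* prints 2 */
	return 0;
}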

>> But how do you make sure REQ_FLUSHes for preflush finish before
>> starting the write?
>
> Hmm, okay. I see how the special flush_bio makes the waiting easier;
> let's see if Mike or others in the DM team have a better idea.

Yeah, it would be better if it could be sequenced without using a work
item, but let's leave that for later.

Thanks.

--
tejun

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 08-18-2010, 06:35 AM
Tejun Heo
 
Default block: replace barrier with sequenced flush

Hello,

On 08/17/2010 06:59 PM, Christoph Hellwig wrote:
> I think we really need all the conversions in one tree: block layer,
> remapping drivers and filesystems.

I don't know. If the filesystem changes are really trivial, maybe, but
the md/dm changes seem a bit too invasive to go through the block tree.

> Btw, I've done the conversion for all filesystems and I'm running tests
> over them now. Expect the series late today or tomorrow.

Cool. :-)

>> I might just resequence it to finish this part of discussion but what
>> does that really buy us? It's not really gonna help bisection.
>> Bisection won't be able to tell anything in higher resolution than
>> "the new implementation doesn't work". If you show me how it would
>> actually help, I'll happily reshuffle the patches.
>
> It's not about bisecting to find bugs in the barrier conversion; we
> can't easily bisect that down anyway. The problem is that when we try
> to bisect other problems and get into the middle of the series,
> barriers are suddenly gone, which is not very helpful for things like
> data integrity problems in filesystems.

Ah, okay, hmmm.... alright, I'll resequence the patches. If the
filesystem changes can be put into a single tree somehow, we can keep
things mostly working at least for direct devices.

>> IIUC, when any of the flushes gets DM_ENDIO_REQUEUE (which tells the
>> dm core layer to retry the whole bio later), it trumps all other
>> failures and the bio is retried later. That was why DM_ENDIO_REQUEUE
>> was prioritized over other error codes, which is actually sort of
>> incorrect in that once a FLUSH fails, it _MUST_ be reported to the
>> upper layers, as a FLUSH failure implies data has already been lost.
>> So, DM_ENDIO_REQUEUE actually should have lower priority than other
>> failures. But, then again, the error codes still need to be
>> prioritized.
>
> I think that's something we'd better leave to the DM team.

Sure, but we shouldn't be ripping out the code to do that.

Thanks.

--
tejun

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 08-18-2010, 08:11 AM
Tejun Heo
 
Default block: replace barrier with sequenced flush

Hello,

On 08/18/2010 08:35 AM, Tejun Heo wrote:
>> It's not about bisecting to find bugs in the barrier conversion; we
>> can't easily bisect that down anyway. The problem is that when we try
>> to bisect other problems and get into the middle of the series,
>> barriers are suddenly gone, which is not very helpful for things like
>> data integrity problems in filesystems.
>
> Ah, okay, hmmm.... alright, I'll resequence the patches. If the
> filesystem changes can be put into a single tree somehow, we can keep
> things mostly working at least for direct devices.

Sorry, but I don't think I can do it that way. It just doesn't make
much sense. I can't
relax the ordering for REQ_HARDBARRIER without breaking the remapping
drivers. So, to keep things working, I'd have to 1. relax the
ordering, 2. implement the new REQ_FLUSH/FUA based interface and
3. use it in the filesystems, all in the same patch. That's just
wrong. And I don't think the md/dm changes can or should go through
the block tree. They're way too invasive for that. It's a new
implementation, and barriers won't work (they will fail gracefully)
for several commits during the transition. I don't think there's a
better way around it.

Thanks.

--
tejun

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 08-18-2010, 07:29 PM
Vladislav Bolkhovitin
 
Default block: replace barrier with sequenced flush

Christoph Hellwig, on 08/13/2010 05:17 PM wrote:

As far as playing with ordered tags goes, it's just adding a new flag
for it on the bio that gets passed down to the driver. For a final
version you'd need a queue-level feature to indicate whether it's
supported, but you don't even need that for the initial work. Then you
can implement a variant of blk_do_flush that does away with queueing
additional requests as each one finishes and instead queues all two or
three at the same time with your new ordered flag set, at which point
you are back to the level of ordered tag usage that the old code
allows. You're still left with all the hard problems of actually
implementing error handling for it and using it higher up in the
filesystem and generic page cache code.


But how about file systems doing internal local order-by-drain? Without
converting them to use ordered commands it would be impossible to show
their full potential, and to make that conversion one would need deep
internal FS knowledge. That's my point. But if there's a trivial way to
find all such places in the filesystem code and convert them, then OK,
I agree.



I'd really love to see your results, up to the point of just trying it
myself once I get a little spare time. But my theory is that it won't
help us - the problem with ordered tags is that they enforce global
ordering while we currently have local ordering. While it will reduce
the latency for the process waiting for an fsync or similar, it will
affect other I/O going on in the background and reduce the device's
ability to reorder that I/O.


The local ordering vs global ordering distinction is relevant only if
you have a load from several applications/threads. But what about a
single application/thread?


Another point, for which, AFAIU, the ORDERED commands were invented, is
that they enforce ordering on the _other_ side of the link, _after_ all
link/transfer latencies. This is why it's hard to see an advantage from
them on local disks.


Vlad

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 08-18-2010, 07:30 PM
Vladislav Bolkhovitin
 
Default block: replace barrier with sequenced flush

Hello,

Tejun Heo, on 08/13/2010 05:21 PM wrote:

If requested, I can develop the interface further.


I still think the benefit of ordering by tag would be marginal at
best, and what have you guys measured there? Under the current
framework, there's no easy way to measure full ordered-by-tag
implementation. The mechanism for filesystems to communicate the
ordering information (which would be a partially ordered graph) just
isn't there and there is no way the current usage of ordering-by-tag
only for barrier sequence can achieve anything close to that level of
difference.


Basically, I measured how iSCSI link utilization depends on the number
of queued commands and the queued data size. This is why I presented it
as a table. From it you can see what improvement you would get from
removing queue draining after 1, 2, 4, etc. commands, depending on the
command sizes.


For instance, in my previous XFS rm example, where rm of 4 files took
3.5 minutes with the nobarrier option, I could see that XFS was sending
1-3 32K commands in a row. From my table you can see that if it sent
them all at once without draining, it would get about a 150-200% speed
increase.


Vlad

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 08-19-2010, 09:51 AM
Tejun Heo
 
Default block: replace barrier with sequenced flush

Hello,

On 08/18/2010 09:30 PM, Vladislav Bolkhovitin wrote:
> Basically, I measured how iSCSI link utilization depends on the number
> of queued commands and the queued data size. This is why I presented
> it as a table. From it you can see what improvement you would get from
> removing queue draining after 1, 2, 4, etc. commands, depending on the
> command sizes.
>
> For instance, in my previous XFS rm example, where rm of 4 files took
> 3.5 minutes with the nobarrier option, I could see that XFS was
> sending 1-3 32K commands in a row. From my table you can see that if
> it sent them all at once without draining, it would get about a
> 150-200% speed increase.

You compared barrier off/on. Of course, that will make a big
difference. I think a good part of that gain should be realized by the
currently proposed patchset, which removes draining. What needs to be
demonstrated is the difference between ordered-by-waiting and
ordered-by-tag. We've never had code to do that properly.

The original ordered-by-tag implementation only applied tag ordering to
the two or three command sequence inside a barrier, which doesn't
amount to much (and could even be harmful, as it imposes draining of
all simple commands inside the device only to reduce issue latencies
for a few commands). You'll need to hook into the filesystem and
somehow export the ordering information down to the driver so that
whatever needs ordering is sent out as ordered commands.

As I've written multiple times, I'm pretty skeptical it will bring
much. Ordered tags mandate draining inside the device just like the
original barrier implementation. Sure, it's done at a lower layer and
command issue latencies will be reduced thanks to that, but
ordered-by-waiting doesn't require _any_ draining at all. The whole
pipeline can be kept full all the time. I'm often wrong though, so
please feel free to go ahead and prove me wrong. :-)
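
To spell out the contrast (a purely conceptual sketch; nothing below is
real block-layer or SCSI code), the difference is where the stall
lives: at the issuer in one case, inside the device's queue in the
other:

#include <stdio.h>

static void ordered_by_waiting(void)
{
	printf("  issue A (simple)\n");
	printf("  ... issuer waits for A's completion ...\n");	/* device stays free to reorder unrelated I/O */
	printf("  issue B (simple)\n");
}

static void ordered_by_tag(void)
{
	printf("  issue A (simple)\n");
	printf("  issue B (ORDERED)\n");	/* no issue-side wait... */
	printf("  device drains everything queued before B, then runs B\n");
}

int main(void)
{
	printf("ordered-by-waiting:\n");
	ordered_by_waiting();
	printf("ordered-by-tag:\n");
	ordered_by_tag();
	return 0;
}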

Thanks.

--
tejun

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
