Linux Archive > Redhat > Device-mapper Development

 
 
 
05-18-2012, 03:00 AM

Gut bio_add_page()

From: Kent Overstreet <koverstreet@google.com>

Since generic_make_request() can now handle arbitrary size bios, all we
have to do is make sure the bvec array doesn't overflow.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
---
fs/bio.c | 133 ++++++++++++-------------------------------------------------
1 files changed, 26 insertions(+), 107 deletions(-)

diff --git a/fs/bio.c b/fs/bio.c
index 360ac93..e3276bd 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -568,12 +568,22 @@ int bio_get_nr_vecs(struct block_device *bdev)
}
EXPORT_SYMBOL(bio_get_nr_vecs);

-static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
- *page, unsigned int len, unsigned int offset,
- unsigned short max_sectors)
+/**
+ * bio_add_page - attempt to add page to bio
+ * @bio: destination bio
+ * @page: page to add
+ * @len: vec entry length
+ * @offset: vec entry offset
+ *
+ * Attempt to add a page to the bio_vec maplist. This can fail for a
+ * number of reasons, such as the bio being full or target block device
+ * limitations. The target block device must allow bio's up to PAGE_SIZE,
+ * so it is always possible to add a single page to an empty bio.
+ */
+int bio_add_page(struct bio *bio, struct page *page,
+ unsigned int len, unsigned int offset)
{
- int retried_segments = 0;
- struct bio_vec *bvec;
+ struct bio_vec *bv;

/*
* cloned bio must not modify vec list
@@ -581,40 +591,17 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
if (unlikely(bio_flagged(bio, BIO_CLONED)))
return 0;

- if (((bio->bi_size + len) >> 9) > max_sectors)
- return 0;
-
/*
* For filesystems with a blocksize smaller than the pagesize
* we will often be called with the same page as last time and
* a consecutive offset. Optimize this special case.
*/
if (bio->bi_vcnt > 0) {
- struct bio_vec *prev = &bio->bi_io_vec[bio->bi_vcnt - 1];
-
- if (page == prev->bv_page &&
- offset == prev->bv_offset + prev->bv_len) {
- unsigned int prev_bv_len = prev->bv_len;
- prev->bv_len += len;
-
- if (q->merge_bvec_fn) {
- struct bvec_merge_data bvm = {
- /* prev_bvec is already charged in
- bi_size, discharge it in order to
- simulate merging updated prev_bvec
- as new bvec. */
- .bi_bdev = bio->bi_bdev,
- .bi_sector = bio->bi_sector,
- .bi_size = bio->bi_size - prev_bv_len,
- .bi_rw = bio->bi_rw,
- };
-
- if (q->merge_bvec_fn(q, &bvm, prev) < prev->bv_len) {
- prev->bv_len -= len;
- return 0;
- }
- }
+ bv = bio_iovec_idx(bio, bio->bi_vcnt - 1);

+ if (page == bv->bv_page &&
+ offset == bv->bv_offset + bv->bv_len) {
+ bv->bv_len += len;
goto done;
}
}
@@ -622,64 +609,17 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
if (bio->bi_vcnt >= bio->bi_max_vecs)
return 0;

- /*
- * we might lose a segment or two here, but rather that than
- * make this too complex.
- */
-
- while (bio->bi_phys_segments >= queue_max_segments(q)) {
-
- if (retried_segments)
- return 0;
-
- retried_segments = 1;
- blk_recount_segments(q, bio);
- }
-
- /*
- * setup the new entry, we might clear it again later if we
- * cannot add the page
- */
- bvec = &bio->bi_io_vec[bio->bi_vcnt];
- bvec->bv_page = page;
- bvec->bv_len = len;
- bvec->bv_offset = offset;
-
- /*
- * if queue has other restrictions (eg varying max sector size
- * depending on offset), it can specify a merge_bvec_fn in the
- * queue to get further control
- */
- if (q->merge_bvec_fn) {
- struct bvec_merge_data bvm = {
- .bi_bdev = bio->bi_bdev,
- .bi_sector = bio->bi_sector,
- .bi_size = bio->bi_size,
- .bi_rw = bio->bi_rw,
- };
-
- /*
- * merge_bvec_fn() returns number of bytes it can accept
- * at this offset
- */
- if (q->merge_bvec_fn(q, &bvm, bvec) < bvec->bv_len) {
- bvec->bv_page = NULL;
- bvec->bv_len = 0;
- bvec->bv_offset = 0;
- return 0;
- }
- }
-
- /* If we may be able to merge these biovecs, force a recount */
- if (bio->bi_vcnt && (BIOVEC_PHYS_MERGEABLE(bvec-1, bvec)))
- bio->bi_flags &= ~(1 << BIO_SEG_VALID);
+ bv = bio_iovec_idx(bio, bio->bi_vcnt);
+ bv->bv_page = page;
+ bv->bv_len = len;
+ bv->bv_offset = offset;

bio->bi_vcnt++;
- bio->bi_phys_segments++;
- done:
+done:
bio->bi_size += len;
return len;
}
+EXPORT_SYMBOL(bio_add_page);

/**
* bio_add_pc_page - attempt to add page to bio
@@ -699,31 +639,10 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page *page,
unsigned int len, unsigned int offset)
{
- return __bio_add_page(q, bio, page, len, offset,
- queue_max_hw_sectors(q));
+ return bio_add_page(bio, page, len, offset);
}
EXPORT_SYMBOL(bio_add_pc_page);

-/**
- * bio_add_page - attempt to add page to bio
- * @bio: destination bio
- * @page: page to add
- * @len: vec entry length
- * @offset: vec entry offset
- *
- * Attempt to add a page to the bio_vec maplist. This can fail for a
- * number of reasons, such as the bio being full or target block device
- * limitations. The target block device must allow bio's up to PAGE_SIZE,
- * so it is always possible to add a single page to an empty bio.
- */
-int bio_add_page(struct bio *bio, struct page *page, unsigned int len,
- unsigned int offset)
-{
- struct request_queue *q = bdev_get_queue(bio->bi_bdev);
- return __bio_add_page(q, bio, page, len, offset, queue_max_sectors(q));
-}
-EXPORT_SYMBOL(bio_add_page);
-
struct bio_map_data {
struct bio_vec *iovecs;
struct sg_iovec *sgvecs;
--
1.7.9.rc2

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
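
Read outside the kernel, the gutted bio_add_page() above reduces to a merge-or-append on the bvec array. The following is a minimal userspace C sketch of that logic with mock structures; all names here (`bio_add_page_sketch`, `MAX_VECS`, the trimmed struct fields) are simplified stand-ins for the kernel's, not the real API:

```c
#include <assert.h>
#include <string.h>

struct page;                              /* opaque stand-in for struct page */

struct bio_vec {
	struct page *bv_page;
	unsigned int bv_len;
	unsigned int bv_offset;
};

#define MAX_VECS 4

struct bio {
	struct bio_vec bi_io_vec[MAX_VECS];   /* the bvec array */
	unsigned short bi_vcnt;               /* entries in use */
	unsigned short bi_max_vecs;           /* entries allocated */
	unsigned int bi_size;                 /* total bytes in the bio */
};

/*
 * Merge-or-append, as in the patch: if the new page is contiguous with
 * the last bvec, extend it; otherwise append a new bvec unless the
 * array would overflow.  Returns len on success and 0 on failure,
 * following the kernel convention.
 */
static int bio_add_page_sketch(struct bio *bio, struct page *page,
			       unsigned int len, unsigned int offset)
{
	struct bio_vec *bv;

	if (bio->bi_vcnt > 0) {
		bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
		if (page == bv->bv_page &&
		    offset == bv->bv_offset + bv->bv_len) {
			bv->bv_len += len;
			goto done;
		}
	}

	if (bio->bi_vcnt >= bio->bi_max_vecs)
		return 0;                     /* bvec array would overflow */

	bv = &bio->bi_io_vec[bio->bi_vcnt];
	bv->bv_page = page;
	bv->bv_len = len;
	bv->bv_offset = offset;
	bio->bi_vcnt++;
done:
	bio->bi_size += len;
	return len;
}
```

Note how every queue-limit check from the old __bio_add_page() is gone: the only failure modes left are a cloned bio and a full bvec array, which is exactly what the commit message claims.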
 
05-24-2012, 12:02 AM
Kent Overstreet

Gut bio_add_page()

Since generic_make_request() can now handle arbitrary size bios, all we
have to do is make sure the bvec array doesn't overflow.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
 
05-25-2012, 08:25 PM
Kent Overstreet

Gut bio_add_page()

Since generic_make_request() can now handle arbitrary size bios, all we
have to do is make sure the bvec array doesn't overflow.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
 
05-25-2012, 08:46 PM
Mike Snitzer

Gut bio_add_page()

On Fri, May 25 2012 at 4:25pm -0400,
Kent Overstreet <koverstreet@google.com> wrote:

> Since generic_make_request() can now handle arbitrary size bios, all we
> have to do is make sure the bvec array doesn't overflow.

I'd love to see the merge_bvec stuff go away but it does serve a
purpose: filesystems benefit from accurately building up much larger
bios (based on underlying device limits). XFS has leveraged this for
some time and ext4 adopted this (commit bd2d0210cf) because of the
performance advantage.

So if you don't have a mechanism for the filesystem's IO to have an
accurate understanding of the limits of the device the filesystem is
built on (merge_bvec was that mechanism), and are leaning on late
splitting instead, does filesystem performance suffer?

Would be nice to see before and after XFS and ext4 benchmarks against a
RAID device (level 5 or 6). I'm especially interested to get Dave
Chinner's and Ted's insight here.

Mike

 
05-25-2012, 09:09 PM
Kent Overstreet

Gut bio_add_page()

On Fri, May 25, 2012 at 04:46:51PM -0400, Mike Snitzer wrote:
> I'd love to see the merge_bvec stuff go away but it does serve a
> purpose: filesystems benefit from accurately building up much larger
> bios (based on underlying device limits). XFS has leveraged this for
> some time and ext4 adopted this (commit bd2d0210cf) because of the
> performance advantage.

That commit only talks about skipping buffer heads; from the patch
description I don't see how merge_bvec_fn would have anything to do with
what it's after.

> So if you don't have a mechanism for the filesystem's IO to have
> accurate understanding of the limits of the device the filesystem is
> built on (merge_bvec was the mechanism) and are leaning on late
> splitting does filesystem performance suffer?

So is the issue that it may take longer for an IO to complete, or is it
CPU utilization/scalability?

If it's the former, we've got a real problem. If it's the latter - it
might be a problem in the interim (I don't expect generic_make_request()
to be splitting bios in the common case long term), but I doubt it's
going to be much of an issue.

> Would be nice to see before and after XFS and ext4 benchmarks against a
> RAID device (level 5 or 6). I'm especially interested to get Dave
> Chinner's and Ted's insight here.

Yeah.

I can't remember who it was, but Ted knows someone who was able to
benchmark on a 48-core system. I don't think we need numbers from a
48-core machine for these patches, but whatever workloads they were
testing that were problematic CPU-wise would be useful to test.

 
05-25-2012, 10:39 PM
Alasdair G Kergon

Gut bio_add_page()

Where's the urge to remove merge_bvec coming from?

I think it's premature to touch this, and that the other changes, if
fixed and integrated, should be allowed to bed themselves down first.


Ideally every bio would be the best size on submission and no bio would
ever need to be split.

But there is a cost involved in calculating the best size - we use
merge_bvec for this, which gives a (probable) maximum size. It's
usually very cheap to calculate - but not always. [In dm, we permit
some situations where the answer we give will turn out to be wrong, but
ensure dm will always fix up those particular cases itself later and
still process the over-sized bio correctly.]

Similarly there is a performance penalty incurred when the size is wrong
- the bio has to be split, requiring memory, potential delays etc.

There is a trade-off between those two, and our experience with the current
code has that tilted strongly in favour of using merge_bvec all the time.
The wasted overhead in cases where it is of no benefit seems to be
outweighed by the benefit where it does avoid lots of splitting and helps
filesystems optimise their behaviour.


If the splitting mechanism is changed as proposed, then that balance
might shift. My gut feeling though is that any shift would strengthen
the case for merge_bvec.

Alasdair

 
05-28-2012, 04:07 PM
Mikulas Patocka

Gut bio_add_page()

Hi

The general problem with the bio_add_page simplification is this:

Suppose that you have an old ATA disk that can read or write at most 256
sectors per command. Suppose that you are reading from the disk and
readahead for 512 sectors is used:

With accurately sized bios, you send one bio for 256 sectors (it is sent
immediately to the disk) and a second bio for another 256 sectors (it is
put on the block device queue). The first bio finishes, pages are marked
as uptodate, the second bio is sent to the disk. While the disk is
processing the second bio, the kernel already knows that the first 256
sectors are finished - so it copies the data to userspace and lets the
userspace process them - while the disk is processing the second bio. So,
disk transfer and data processing are overlapped.

Now, with your patch, you send just one 512-sector bio. The bio is split
into two bios; the first one is sent to the disk and you wait. The disk
finishes the first bio, you send the second bio to the disk and wait. The
disk finishes the second bio. You complete the master bio, mark all 512
sectors as uptodate in the pagecache, start copying data to the userspace
and processing them. Disk transfer and data processing are not overlapped.
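
The contrast in the two paragraphs above can be put in a toy timeline model (entirely hypothetical: one 256-sector command per time unit, no command queueing, `first_data_ready` is an illustrative name, not kernel code):

```c
#include <assert.h>

/*
 * Toy model of the readahead example: a request of total_sectors on a
 * disk limited to max_sectors per command, each command taking one
 * time unit.  Returns the number of time units before the first data
 * is available for the CPU to process.
 */
static unsigned int first_data_ready(unsigned int total_sectors,
				     unsigned int max_sectors,
				     int oversized_bio)
{
	/* number of commands the request breaks into */
	unsigned int ncmds = (total_sectors + max_sectors - 1) / max_sectors;

	/*
	 * Accurately sized bios complete one by one, so the first
	 * max_sectors are usable after the first command; a split
	 * oversized bio only completes once every fragment has.
	 */
	return oversized_bio ? ncmds : 1;
}
```

For the 512-sector readahead on a 256-sector disk, the accurately sized case yields data after one command and the oversized case only after both, which is the overlap Mikulas is arguing is lost.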

The same problem arises with raid-0, raid-5 or raid-10: if you send
accurately-sized bios (that don't span stripe boundaries), each bio waits
just for one disk to seek to the requested position. If you send an
oversized bio that spans several stripes, that bio will wait until all the
disks seek to the requested position.

In general, you can send oversized bios if the user is waiting for all the
data requested (for example O_DIRECT read or write). You shouldn't send
oversized bios if the user is waiting just for a small part of data and
the kernel is doing readahead - in this case, an oversized bio will result
in additional delay.


I think bio_add_page should be simplified in such a way that in the most
common cases it doesn't create oversized bios, but it can create them in
uncommon cases. We could retain a limit on the maximum number of sectors
(this limit is most commonly hit on disks), put a stripe boundary in
queue_limits (the stripe-boundary limit is most commonly hit on RAID),
ignore the rest of the limits in bio_add_page, and remove merge_bvec.
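
A rough sketch of that proposal - keep only the two cheap, commonly hit limits and leave everything else to late splitting. Both `queue_limits_sketch` and `clamp_sectors` are hypothetical names for illustration, not an existing kernel interface:

```c
#include <assert.h>

/* Hypothetical subset of queue_limits holding only the two limits
 * suggested above. */
struct queue_limits_sketch {
	unsigned int max_sectors;     /* max transfer size per command */
	unsigned int stripe_sectors;  /* RAID stripe size, 0 = no stripes */
};

/*
 * Clamp a request so it honours the max-sectors limit and never
 * crosses a stripe boundary; every other limit would be left to late
 * splitting.  Returns how many sectors may go into one bio.
 */
static unsigned int clamp_sectors(const struct queue_limits_sketch *lim,
				  unsigned long long start_sector,
				  unsigned int nr_sectors)
{
	unsigned int allowed = nr_sectors;

	if (allowed > lim->max_sectors)
		allowed = lim->max_sectors;

	if (lim->stripe_sectors) {
		/* sectors remaining before the next stripe boundary */
		unsigned int to_boundary = lim->stripe_sectors -
			(unsigned int)(start_sector % lim->stripe_sectors);
		if (allowed > to_boundary)
			allowed = to_boundary;
	}
	return allowed;
}
```

With only these two checks, the common disk and RAID cases submit accurately sized bios, while any remaining limit violations fall through to the splitting path in generic_make_request().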

Mikulas



On Fri, 25 May 2012, Alasdair G Kergon wrote:

> Where's the urge to remove merge_bvec coming from?
>
> I think it's premature to touch this, and that the other changes, if
> fixed and integrated, should be allowed to bed themselves down first.
>
>
> Ideally every bio would be the best size on submission and no bio would
> ever need to be split.
>
> But there is a cost involved in calculating the best size - we use
> merge_bvec for this, which gives a (probable) maximum size. It's
> usually very cheap to calculate - but not always. [In dm, we permit
> some situations where the answer we give will turn out to be wrong, but
> ensure dm will always fix up those particular cases itself later and
> still process the over-sized bio correctly.]
>
> Similarly there is a performance penalty incurred when the size is wrong
> - the bio has to be split, requiring memory, potential delays etc.
>
> There is a trade-off between those two, and our experience with the current
> code has that tilted strongly in favour of using merge_bvec all the time.
> The wasted overhead in cases where it is of no benefit seems to be
> outweighed by the benefit where it does avoid lots of splitting and helps
> filesystems optimise their behaviour.
>
>
> If the splitting mechanism is changed as proposed, then that balance
> might shift. My gut feeling though is that any shift would strengthen
> the case for merge_bvec.
>
> Alasdair
>

 
05-28-2012, 08:28 PM
Tejun Heo

Gut bio_add_page()

Hello,

On Mon, May 28, 2012 at 12:07:14PM -0400, Mikulas Patocka wrote:
> With accurately sized bios, you send one bio for 256 sectors (it is sent
> immediately to the disk) and a second bio for another 256 sectors (it is
> put to the block device queue). The first bio finishes, pages are marked
> as uptodate, the second bio is sent to the disk. While the disk is

They're split and made in-flight together.

> processing the second bio, the kernel already knows that the first 256
> sectors are finished - so it copies the data to userspace and lets the
> userspace process them - while the disk is processing the second bio. So,
> disk transfer and data processing are overlapped.
>
> Now, with your patch, you send just one 512-sector bio. The bio is split
> to two bios, the first one is sent to the disk and you wait. The disk
> finishes the first bio, you send the second bio to the disk and wait. The
> disk finishes the second bio. You complete the master bio, mark all 512
> sectors as uptodate in the pagecache, start copying data to the userspace
> and processing them. Disk transfer and data processing are not overlapped.

The disk will most likely seek to the sector, read all of them into its
buffer at once, and then serve the two consecutive commands back-to-back
without much inter-command delay.

> accurately-sized bios (that don't span stripe boundaries), each bio waits
> just for one disk to seek to the requested position. If you send oversized
> bio that spans several stripes, that bio will wait until all the disks
> seek to the requested position.
>
> In general, you can send oversized bios if the user is waiting for all the
> data requested (for example O_DIRECT read or write). You shouldn't send
> oversized bios if the user is waiting just for a small part of data and
> the kernel is doing readahead - in this case, oversized bio will result in
> additional delay.

Isn't it more that you shouldn't be sending the read requested by the user
and the readahead in the same bio?

> I think bio_add_page should be simplified in such a way that in the most
> common cases it doesn't create oversized bio, but it can create oversized
> bios in uncommon cases. We could retain a limit on a maximum number of
> sectors (this limit is most commonly hit on disks), put a stripe boundary
> to queue_limits (the stripe boundary limit is most commonly hit on raid),
> ignore the rest of the limits in bio_add_page and remove merge_bvec.

If exposing segmenting limit upwards is a must (I'm kinda skeptical),
let's have proper hints (or dynamic hinting interface) instead.

Thanks.

--
tejun

 
05-28-2012, 09:27 PM
Mikulas Patocka

Gut bio_add_page()

On Tue, 29 May 2012, Tejun Heo wrote:

> Hello,
>
> On Mon, May 28, 2012 at 12:07:14PM -0400, Mikulas Patocka wrote:
> > With accurately sized bios, you send one bio for 256 sectors (it is sent
> > immediately to the disk) and a second bio for another 256 sectors (it is
> > put to the block device queue). The first bio finishes, pages are marked
> > as uptodate, the second bio is sent to the disk. While the disk is
>
> They're split and made in-flight together.

I was talking about an old ATA disk (without command queueing), so the
requests are not sent together. USB 2 may be a similar case: it has a
limited transfer size and no command queueing either.

> > processing the second bio, the kernel already knows that the first 256
> > sectors are finished - so it copies the data to userspace and lets the
> > userspace process them - while the disk is processing the second bio. So,
> > disk transfer and data processing are overlapped.
> >
> > Now, with your patch, you send just one 512-sector bio. The bio is split
> > to two bios, the first one is sent to the disk and you wait. The disk
> > finishes the first bio, you send the second bio to the disk and wait. The
> > disk finishes the second bio. You complete the master bio, mark all 512
> > sectors as uptodate in the pagecache, start copying data to the userspace
> > and processing them. Disk transfer and data processing are not overlapped.
>
> The disk will most likely seek to the sector, read all of them into its
> buffer at once, and then serve the two consecutive commands back-to-back
> without much inter-command delay.

Without command queueing, the disk will serve the first request, then
receive the second request, and then serve the second request (hopefully
the data would be already prefetched after the first request).

The point is that while the disk is processing the second request, the CPU
can already process data from the first request.

> > accurately-sized bios (that don't span stripe boundaries), each bio waits
> > just for one disk to seek to the requested position. If you send oversized
> > bio that spans several stripes, that bio will wait until all the disks
> > seek to the requested position.
> >
> > In general, you can send oversized bios if the user is waiting for all the
> > data requested (for example O_DIRECT read or write). You shouldn't send
> > oversized bios if the user is waiting just for a small part of data and
> > the kernel is doing readahead - in this case, oversized bio will result in
> > additional delay.
>
> Isn't it more like you shouldn't be sending read requested by user and
> read ahead in the same bio?

If the user calls read with 512 bytes, you would send a bio for just one
sector. That's too small, and you'd get worse performance because of the
higher command overhead. You need to send larger bios.

AHCI can interrupt after a partial transfer (so, for example, you can send
a command to read 1M but signal an interrupt after the first 4k has been
transferred), but no one has really written code that could use this
feature. It is questionable whether it would improve performance, because
it would double the interrupt load.

> > I think bio_add_page should be simplified in such a way that in the most
> > common cases it doesn't create oversized bio, but it can create oversized
> > bios in uncommon cases. We could retain a limit on a maximum number of
> > sectors (this limit is most commonly hit on disks), put a stripe boundary
> > to queue_limits (the stripe boundary limit is most commonly hit on raid),
> > ignore the rest of the limits in bio_add_page and remove merge_bvec.
>
> If exposing segmenting limit upwards is a must (I'm kinda skeptical),
> let's have proper hints (or dynamic hinting interface) instead.

With this patchset, you don't have to expose all the limits. You can
expose just the few most useful ones to avoid splitting bios in the cases
described above.

> Thanks.
>
> --
> tejun

Mikulas

 
05-28-2012, 09:38 PM
Tejun Heo

Gut bio_add_page()

Hello,

On Mon, May 28, 2012 at 05:27:33PM -0400, Mikulas Patocka wrote:
> > They're split and made in-flight together.
>
> I was talking about old ATA disk (without command queueing). So the
> requests are not sent together. USB 2 may be a similar case, it has
> limited transfer size and it doesn't have command queueing too.

I meant in the block layer. For consecutive commands, queueing
doesn't really matter.

> > Disk will most likely seek to the sector read all of them into buffer
> > at once and then serve the two consecutive commands back-to-back
> > without much inter-command delay.
>
> Without command queueing, the disk will serve the first request, then
> receive the second request, and then serve the second request (hopefully
> the data would be already prefetched after the first request).
>
> The point is that while the disk is processing the second request, the CPU
> can already process data from the first request.

Those are transfer latencies - multiple orders of magnitude shorter
than IO latencies. It would be surprising if they were actually
noticeable with any kind of disk-bound workload.

> > Isn't it more like you shouldn't be sending read requested by user and
> > read ahead in the same bio?
>
> If the user calls read with 512 bytes, you would send bio for just one
> sector. That's too small and you'd get worse performance because of higher
> command overhead. You need to send larger bios.

All modern FSes are page-granular, so the granularity would be per-page.
Also, RAHEAD is treated differently in terms of error handling. Do
filesystems implement their own readahead (independent of the common
logic in the VFS layer)?

> AHCI can interrupt after partial transfer (so for example you can send a
> command to read 1M, but signal interrupt after the first 4k was
> transferred), but no one really wrote code that could use this feature. It
> is questionable if this would improve performance because it would double
> interrupt load.

The feature is pointless for disks anyway. Think about the scales of
latencies of different phases of command processing. The difference
is multiple orders of magnitude.

> > If exposing segmenting limit upwards is a must (I'm kinda skeptical),
> > let's have proper hints (or dynamic hinting interface) instead.
>
> With this patchset, you don't have to expose all the limits. You can
> expose just a few most useful limits to avoid bio split in the cases
> described above.

Yeah, if that actually helps, sure. From what I read, dm is already
(ab)using merge_bvec_fn() like that anyway.

Thanks.

--
tejun

 
