Old 03-04-2010, 10:47 AM
Mikulas Patocka
 
blkdev: fix merge_bvec_fn return value checks

On Wed, 3 Mar 2010, Jens Axboe wrote:

> On Wed, Mar 03 2010, Dmitry Monakhov wrote:
> > Mike Snitzer <snitzer@redhat.com> writes:
> >
> > > Linux has all sorts of internal interfaces that are "odd"... the current
> > > 'q->merge_bvec_fn' interface included. But odd is not a problem (nor is
> > > it "broken") unless you make changes that don't consider how the current
> > > interface is defined.
> > Ok, then can't you please explain a couple of historical questions:
> > 1) Why can bio_add_page() not add less data than requested?
> > It seems that it wouldn't make the caller's code much more complicated.
> > Of course a barrier bio is a special case; I don't consider it here.
>
> Because the caller may not expect that, a partial add may not make any
> sense to the caller. The bio code obviously doesn't care. And it
> certainly could complicate the caller a lot, if they now need to issue
> and wait for several bios instead of just a single one. A single
> completion queue and wait_for_completion() would no longer be enough.

The whole thing is lame by design.

As a consequence, the code for splitting these page-sized bios is being
duplicated in md, dm and request-based devices (and maybe in other drivers).

And there is a long-standing unfixable bug --- adding a device to raid1
may fail if the device has a smaller request size than the other raid1 leg.
Think about this:

* thread 1:
bio_add_page() -> returns ok
...
bio_add_page() -> returns ok
* scheduling to thread 2 *
* thread 2:
add a raid1 leg to the block device
* scheduling to thread 1 *
* thread 1:
submit_bio() --- the bio is already oversized for the new raid1 leg and there
is no way to shrink it.
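
A minimal sketch of the thread-1 pattern, assuming a 2.6.3x-era caller with an
array of pages to write (the function name and parameters are made up for
illustration; completion handling omitted):

static void build_and_submit(struct block_device *bdev, sector_t sector,
                             struct page **pages, int nr_pages, int rw)
{
        /* every bio_add_page() checks the limits of the stack as it is *now*;
         * nothing re-checks them when the bio is finally submitted */
        struct bio *bio = bio_alloc(GFP_NOIO, nr_pages);
        int i;

        bio->bi_bdev   = bdev;
        bio->bi_sector = sector;

        for (i = 0; i < nr_pages; i++)
                if (!bio_add_page(bio, pages[i], PAGE_SIZE, 0))
                        break;          /* accepted against the old limits */

        /* <-- window: a raid1 leg with a smaller request size is added here */

        submit_bio(rw, bio);            /* possibly oversized for the new leg;
                                         * there is no way to shrink it */
}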


I think it should be redesigned in this way:
* the bio should be allowed to have arbitrary size, up to BIO_MAX_PAGES
* the code for splitting bios should be just at one place --- in the
generic block layer and it should work with all drivers. I.e.
q->make_request_fn will return that the request is too big for the given
driver, and the generic code will allocate a few more bios from a
per-request_queue mempool and split the bio (a rough sketch follows below).
* then, we would fix the raid1 bug and we would also simplify md and dm
--- remove the splitting code.
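
A rough sketch of that flow, purely for illustration --- the -EAGAIN-style
return from ->make_request_fn, the q->split_bs bioset and the blk_split_bio()
helper do not exist; they are invented names for the proposed mechanism:

/* Hypothetical generic splitting in the submission path.  blk_split_bio()
 * is assumed to truncate 'bio' to max_sectors and return a new bio,
 * allocated from the per-queue pool, covering the remainder. */
static void generic_make_request_split(struct request_queue *q, struct bio *bio)
{
        while (q->make_request_fn(q, bio) == -EAGAIN) {  /* "too big for me" */
                struct bio *rest = blk_split_bio(q->split_bs, bio,
                                                 queue_max_sectors(q));
                /* a per-queue mempool means stacked devices never
                 * contend for one shared pool */
                generic_make_request(rest);
        }
}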

Mikulas

 
Old 03-04-2010, 04:59 PM
Lars Ellenberg
 
blkdev: fix merge_bvec_fn return value checks

On Wed, Mar 03, 2010 at 09:07:34PM +0100, Jens Axboe wrote:
> > 2) What statement "bio_add_page() must accept at least one page"
> > exactly means?
> > IMHO this means that bio_add_page() must accept at least
> > one page with len (PAGE_SIZE - offset). Or the more restrictive
> > statement that the first bio_add_page() must always be successful.
>
> It's really 'first add must succeed', the restriction being that you
> cannot rely on that first add being more than a single page. So the rule
> is that you must accept at least a page at any offset if the bio is
> currently empty, since we know that a page is typically our IO
> granularity.
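
Put differently, a caller may only count on the very first add into an empty
bio succeeding; a sketch of the usual pattern (function name and parameters
are illustrative, completion handling and error paths omitted):

static void write_pages(struct block_device *bdev, sector_t sector,
                        struct page **pages, int nr_pages)
{
        struct bio *bio = bio_alloc(GFP_NOIO, nr_pages);
        int i;

        bio->bi_bdev   = bdev;
        bio->bi_sector = sector;

        for (i = 0; i < nr_pages; i++) {
                if (bio_add_page(bio, pages[i], PAGE_SIZE, 0))
                        continue;                       /* fitted in this bio */

                submit_bio(WRITE, bio);                 /* full: ship it and  */
                bio = bio_alloc(GFP_NOIO, nr_pages - i);/* start a fresh one  */
                bio->bi_bdev   = bdev;
                bio->bi_sector = sector + (i << (PAGE_SHIFT - 9));
                /* first add into an empty bio: guaranteed to succeed */
                bio_add_page(bio, pages[i], PAGE_SIZE, 0);
        }
        submit_bio(WRITE, bio);
}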

Speaking of...

dm_set_device_limits is still doing things wrong here, I think.

I posted this about two years ago, but somehow it got lost,
and it slipped out of my focus as well.
Reading this post reminded me ... there was something:

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 4b22feb..bc34901 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -519,10 +519,22 @@ int dm_set_device_limits(struct dm_target *ti, struct dm_dev *dev,
          * smaller I/O, just to be safe.
          */
 
-        if (q->merge_bvec_fn && !ti->type->merge)
+        if (q->merge_bvec_fn && !ti->type->merge) {
                 limits->max_sectors =
                         min_not_zero(limits->max_sectors,
                                      (unsigned int) (PAGE_SIZE >> 9));
+
+                /* Restricting max_sectors is not enough.
+                 * If someone uses bio_add_page to add 8 disjunct 512 byte
+                 * partial pages to a bio, it would succeed,
+                 * but could still cross a border of whatever restrictions
+                 * are below us (raid0 stripe boundary). An attempted
+                 * bio_split would not succeed, because bi_vcnt is 8.
+                 * E.g. the xen io layer is known to trigger this.
+                 */
+                limits->max_segments = 1;
+        }
+
         return 0;
 }
 EXPORT_SYMBOL_GPL(dm_set_device_limits);


Thanks,
Lars

 
Old 03-04-2010, 08:55 PM
Mikulas Patocka
 
blkdev: fix merge_bvec_fn return value checks

> > I think it should be redesigned in this way:
> > * the bio should be allowed to have arbitrary size, up to BIO_MAX_PAGES
> > * the code for splitting bios should be just at one place --- in the
> > generic block layer and it should work with all drivers. I.e.
> > q->make_request_fn will return that the request is too big for the given
> > driver, and the generic code will allocate a few more bios from a
> > per-request_queue mempool and split the bio.
> > * then, we would fix the raid1 bug and we would also simplify md and dm
> > --- remove the splitting code.
>
> The design certainly isn't perfect, but neither is the one that you
> suggest. For instance, what would the point of building bigger bios just
> to split them _everytime_ be? That's not good design, and it's certainly
> not helping performance much.

The point for building a big bio and splitting it is code size reduction.

You must realize that you need to split the request anyway. If the caller
needs to read/write M sectors and the device can process at most N
sectors, where N<M, then the i/o must be split. You can't get rid of that
split.

Now, the question is where this splitting happens. Currently, the
splitting happens in the issuers of the bio (for example dm-io, which is a
big chunk of code written just to work around the uncertain bio size
limits) as well as in the consumers of the bio (md, dm and request-based
drivers, if their maximum transfer size < PAGE_SIZE).

What I'm proposing is to put that bio splitting just in one place and
remove it from both the issuers and the consumers of the bio.

There is just one case, where it is not desirable to create larger
requests than the physical transfer size because of performance --- that
is page cache readahead. That could use the current approach of querying
the queue for limits and not stuffing more data than the device can
handle.

In all the other (non-readahead) cases, the caller is waiting for all the
data anyway, so there is no performance problem with creating larger
requests and splitting them.

> Alasdair had an idea for doing a map/unmap like interface instead, which
> I think is a lot better. I don't think he ever coded it up, though.
> Perhaps talk to him, if you want to improve our situation in this area.

It would increase code size, not reduce it.

> --
> Jens Axboe

BTW. see fs/bio.c:bio_split --- it uses a shared mempool. It can deadlock
if the devices depend on each other. You need to use per-queue mempool.
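
For comparison, a hedged sketch of what a per-queue pool could look like,
built on the existing bioset_create()/bio_alloc_bioset() primitives; the
split_bs member of the queue is invented for illustration:

/* Give every queue its own reserve of split bios, so two stacked devices
 * splitting the same I/O can never both block on one global pool. */
static int blk_alloc_split_bioset(struct request_queue *q)
{
        q->split_bs = bioset_create(4, 0);      /* 4 reserved bios, no front pad */
        return q->split_bs ? 0 : -ENOMEM;
}

static struct bio *blk_alloc_split_bio(struct request_queue *q,
                                       unsigned int nr_vecs)
{
        /* GFP_NOIO plus a per-queue reserve guarantees forward progress
         * even while the device below us is itself splitting */
        return bio_alloc_bioset(GFP_NOIO, nr_vecs, q->split_bs);
}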

Mikulas

 
Old 03-05-2010, 06:20 PM
Lars Ellenberg
 
blkdev: fix merge_bvec_fn return value checks

On Fri, Mar 05, 2010 at 05:37:16PM +0000, Alasdair G Kergon wrote:
> On Thu, Mar 04, 2010 at 06:59:21PM +0100, Lars Ellenberg wrote:
> > + /* Restricting max_sectors is not enough.
> > + * If someone uses bio_add_page to add 8 disjunct 512 byte
> > + * partial pages to a bio, it would succeed,
> > + * but could still cross a border of whatever restrictions
> > + * are below us (raid0 stripe boundary). An attempted
> > + * bio_split would not succeed, because bi_vcnt is 8.
> > + * E.g. the xen io layer is known to trigger this.
> > + */
>
> Sounds plausible.
>
> Do you or anyone reading this have example messages demonstrating the failure
> when this patch is not applied?

This was an issue about two years ago when using Xen VMs on DRBD clusters.
My original summary post once the issue was understood is
http://archives.free.net.ph/message/20080523.161104.054c63ef.html

(the gmane link therein seems to be broken now and is not really of
interest here anyway, but, just in case, it is equivalent to
http://archives.free.net.ph/message/20080410.212155.d9e0243d.html)

Actual error messages looked like
kernel: drbd3: bio would need to, but cannot, be split: (vcnt=8,idx=0,size=4096,sector=13113083)
kernel: drbd3: bio would need to, but cannot, be split: (vcnt=8,idx=0,size=4096,sector=13113149)
kernel: drbd3: bio would need to, but cannot, be split: (vcnt=8,idx=0,size=4096,sector=13113215)

bi_size is only 4k, but bi_vcnt is 8.

These had been submitted by Xen virtio on a typical VM image with the
legacy 63 sector offset into the first "partition".
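
For illustration, the kind of bio that trips this could be built like so
(hypothetical caller; bdev, sector, pages[] and offsets[] are assumed):

/* bi_size ends up at only 4096 bytes, well within a one-page max_sectors,
 * but bi_vcnt ends up at 8, so bio_split()'s BUG_ON(bi->bi_vcnt != 1)
 * makes the bio impossible to split at a lower-level boundary. */
static void eight_fragment_bio(struct block_device *bdev, sector_t sector,
                               struct page **pages, unsigned int *offsets)
{
        struct bio *bio = bio_alloc(GFP_NOIO, 8);
        int i;

        bio->bi_bdev   = bdev;
        bio->bi_sector = sector;        /* e.g. odd-aligned from the 63-sector offset */

        for (i = 0; i < 8; i++)         /* eight disjunct 512-byte segments, 4k total */
                bio_add_page(bio, pages[i], 512, offsets[i]);

        submit_bio(WRITE, bio);
}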

You'd have the exact same issue (with slightly different error printk's)
on md raid0, or any other bio consumer with boundary restrictions not
having "generic" bio split logic, but directly using bio_split(),
as that does
BUG_ON(bi->bi_vcnt != 1);
BUG_ON(bi->bi_idx != 0);

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

 
Old 03-05-2010, 08:56 PM
Neil Brown
 
blkdev: fix merge_bvec_fn return value checks

On Fri, 5 Mar 2010 08:30:59 +0100
Jens Axboe <jens.axboe@oracle.com> wrote:

> On Thu, Mar 04 2010, Mikulas Patocka wrote:
> > > > I think it should be redesigned in this way:
> > > > * the bio should be allowed to have arbitrary size, up to BIO_MAX_PAGES
> > > > * the code for splitting bios should be just at one place --- in the
> > > > generic block layer and it should work with all drivers. I.e.
> > > > q->make_request_fn will return that the request is too big for the given
> > > > driver, and the generic code will allocate a few more bios from a
> > > > per-request_queue mempool and split the bio.
> > > > * then, we would fix the raid1 bug and we would also simplify md and dm
> > > > --- remove the splitting code.
> > >
> > > The design certainly isn't perfect, but neither is the one that you
> > > suggest. For instance, what would the point of building bigger bios just
> > > to split them _everytime_ be? That's not good design, and it's certainly
> > > not helping performance much.
> >
> > The point for building a big bio and splitting it is code size reduction.
> >
> > You must realize that you need to split the request anyway. If the caller
> > needs to read/write M sectors and the device can process at most N
> > sectors, where N<M, then the i/o must be split. You can't get rid of that
> > split.
>
> If we consider the basic (and most used) file system IO path, then
> those 'M' sectors will usually be submitted in units that are smaller
> than 'N' anyway. It would be silly to allocate a bio for 'N' sectors
> (those get big), only to split it up again shortly.
>
> The point is not to build it too big to begin with, and keep the
> splitting as the slow path. That's how it was originally designed, and
> it'll surely stay that way. Doing the split handling in generic code is
> a good idea, it definitely should be. But I'm never going to make
> splitting a normal part of IO operations just because we allow
> arbitrarily large bios, sorry but that's insane.

I think that wherever we put the splitting, there is a lot to be gained by
making it easier to split things.

It is a while since I looked closely at this (about 30 months), but last time
I looked, low-level devices were allowed to modify the bi_iovec. This means
that a split has to copy the iovec as well as the bio which significantly
increased the complexity. If we stored an index into the current bio in
'struct request' rather than using bi_idx plus modifying bv_offset, then I
think we could make bi_iovec read-only and simplify splitting significantly.

I have some patches that did this, but they are very old. I could probably
refresh them if there was interest.


My preferred end-game would be to allow a bio of any size to be submitted to
any device. The device would be responsible for cutting it up if necessary.
For the elevator code, I wouldn't bother making new bios, but would teach
'struct request' to be able to reference part of a bio. So in the general
case, a 'request' would point to a list of bios, but would exclude a prefix
of the first and a suffix of the last. A single bio could then be referenced
by multiple requests if necessary.
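
A hedged sketch of the data structure this implies (none of these fields
exist in struct request today; the names are illustrative only):

/* Hypothetical: a request references a span of a read-only bio chain
 * instead of consuming bi_idx/bv_offset, so one bio can back several
 * requests after a split. */
struct request_span {
        struct bio      *first_bio;     /* chain linked via bi_next            */
        struct bio      *last_bio;
        unsigned int     first_offset;  /* bytes to skip in first_bio          */
        unsigned int     last_trim;     /* bytes to ignore at end of last_bio  */
};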

It would probably make sense to include a 'max_sectors' concept to discourage
large bios as doing that would probably allow things to run more smoothly
in general. But if this were a hint rather than a requirement then I think
it would make life a lot easier for a lot of code.

NeilBrown

 
Old 03-05-2010, 10:52 PM
Neil Brown
 
blkdev: fix merge_bvec_fn return value checks

On Fri, 5 Mar 2010 22:27:48 +0000
Alasdair G Kergon <agk@redhat.com> wrote:

> On Sat, Mar 06, 2010 at 08:56:51AM +1100, Neil Brown wrote:
> > My preferred end-game would be to allow a bio of any size to be submitted to
> > any device. The device would be responsible for cutting it up if necessary.
>
> From the dm point of view, splitting is undesirable - memory allocations from
> separate mempools, submitting the split-off parts could reorder/delay but must
> still respect barrier constraints etc. Splitting is the 'slow and complicated'
> path for us. We support it, but it is simpler and more efficient if the bio is
> created a suitable size in the first place - and the merge_bvec_fn does this
> for us most of the time.

I hadn't thought about barriers in this context previously, but on reflection
I don't think splitting makes barriers noticeably more complex than they are
already.
If you have multiple devices, then you pretty much have to handle a barrier
as:
- stall new requests
- send zero length barrier to all devices
- write the barrier request, either with the barrier flag set,
or followed by a zero-length barrier

I don't think splitting adds complexity.
I guess if you have a single device, but still want to split a request, there
might be some optimisations you have to forego...

But my first suggestion, that splitting could be made easier, still stands.
Maybe it doesn't have to be so complicated - then it wouldn't be so slow?

And do you honour merge_bvec_fn's of underlying devices? A quick grep
suggests you do only for dm-linear and dm-crypt. This suggests to me that it
is actually a hard interface to support completely in a stacked device, so we
might be better off without it.
In md, I check if merge_bvec_fn is defined and if it is, I just drop
max_sectors to 4K so it will never be needed. I figure that a performance
drop is better than a correctness error.
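
For reference, that check in md looks roughly like this (paraphrased from the
raid personalities' setup paths; q is the member device's queue and mddev the
array being assembled):

/* As we don't honour the member's merge_bvec_fn, never risk violating it:
 * a one-page request can never be in violation. */
if (q->merge_bvec_fn &&
    queue_max_sectors(mddev->queue) > (PAGE_SIZE >> 9))
        blk_queue_max_sectors(mddev->queue, PAGE_SIZE >> 9);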

Yes, splitting is undesirable, and if we can arrange that most requests don't
get split, then that is good. But there will always be some requests that
have to be split - whether by bio_add_page or something lower - and I think
that it makes sense to only require that the lowest level does the
splitting, as that is where the specifics of what might be required exist.

Thanks,
NeilBrown

 
Old 03-08-2010, 04:43 AM
Neil Brown
 
blkdev: fix merge_bvec_fn return value checks

On Fri, 5 Mar 2010 17:37:16 +0000
Alasdair G Kergon <agk@redhat.com> wrote:

> On Thu, Mar 04, 2010 at 06:59:21PM +0100, Lars Ellenberg wrote:
> > + /* Restricting max_sectors is not enough.
> > + * If someone uses bio_add_page to add 8 disjunct 512 byte
> > + * partial pages to a bio, it would succeed,
> > + * but could still cross a border of whatever restrictions
> > + * are below us (raid0 stripe boundary). An attempted
> > + * bio_split would not succeed, because bi_vcnt is 8.
> > + * E.g. the xen io layer is known to trigger this.
> > + */
>
> Sounds plausible.
>
> Do you or anyone reading this have example messages demonstrating the failure
> when this patch is not applied?

Yes. This

http://marc.info/?l=linux-raid&m=126672681521073&w=2

almost certainly refers to that problem.

NeilBrown


>
> Alasdair.
>

 
Old 03-08-2010, 08:01 AM
Mikulas Patocka
 
blkdev: fix merge_bvec_fn return value checks

> > The point for building a big bio and splitting it is code size reduction.
> >
> > You must realize that you need to split the request anyway. If the caller
> > needs to read/write M sectors and the device can process at most N
> > sectors, where N<M, then the i/o must be split. You can't get rid of that
> > split.
>
> If we consider the basic (and most used) file system IO path, then
> those 'M' sectors will usually be submitted in units that are smaller
> than 'N' anyway. It would be silly to allocate a bio for 'N' sectors
> (those get big), only to split it up again shortly.

I agree that the majority of IOs are page cache reads/writes and that they
shouldn't be split.

> The point is not to build it too big to begin with, and keep the
> splitting as the slow path. That's how it was originally designed, and
> it'll surely stay that way. Doing the split handling in generic code is
> a good idea, it definitely should be. But I'm never going to make
> splitting a normal part of IO operations just because we allow
> arbitrarily large bios, sorry but that's insane.
>
> > There is just one case, where it is not desirable to create larger
> > requests than the physical transfer size because of performance --- that
> > is page cache readahead. That could use the current approach of querying
> > the queue for limits and not stuffing more data than the device can
> > handle.
> >
> > In all the other (non-readahead) cases, the caller is waiting for all the
> > data anyway, so there is no performance problem with creating larger
> > requests and splitting them.
>
> But this I still think is crazy, sorry.

It doesn't look crazy to me:

Case 1: the caller constructs a request to dm-io. Dm-io allocates its
internal structure, splits the request into several bios, submits them,
waits for them, and calls the completion routine when they all finish.

Case 2: the caller constructs a big bio (without caring about device
limits) and submits it to the block layer; the block layer splits the bio
into several smaller bios (it doesn't even have to allocate the vector if
it splits at page boundaries), submits the smaller bios to the device and,
when they finish, signals the completion of the big bio.

Now tell me, why do you think that "Case 1" is better than "Case 2"? If
the bio gets split, you get exactly the same number of allocations in both
cases. If the bio isn't split, "Case 2" is even better (it saves dm-io
overhead).

> > > Alasdair had an idea for doing a map/unmap like interface instead, which
> > > I think is a lot better. I don't think he ever coded it up, though.
> > > Perhaps talk to him, if you want to improve our situation in this area.
> >
> > It would increase code size, not reduce it.
>
> What kind of answer is that?

If you increase code size in the interface, you make the interface more
complex and any further changes will be harder. It kills the software
slowly.

> > BTW. see fs/bio.c:bio_split --- it uses a shared mempool. It can deadlock
> > if the devices depend on each other. You need to use per-queue mempool.
>
> Only if you need to split the same original bio twice on the same IO
> path.

It may happen if you use stacked MD devices with different small stripe
sizes. But it's not common, because people usually set the stripe size to a
multiple of the page size.

Mikulas

> --
> Jens Axboe
>

 
