03-04-2010, 11:19 AM
Jens Axboe

blkdev: fix merge_bvec_fn return value checks

On Thu, Mar 04 2010, Mikulas Patocka wrote:
>
>
> On Wed, 3 Mar 2010, Jens Axboe wrote:
>
> > On Wed, Mar 03 2010, Dmitry Monakhov wrote:
> > > Mike Snitzer <snitzer@redhat.com> writes:
> > >
> > > > Linux has all sorts of internal interfaces that are "odd"... the current
> > > > 'q->merge_bvec_fn' interface included. But odd is not a problem (nor is
> > > > it "broken") unless you make changes that don't consider how the current
> > > > interface is defined.
> > > OK, then can you please explain some more historical questions:
> > > 1) Why can bio_add_page() not add less data than requested?
> > > It seems that it wouldn't make the caller's code much more complicated.
> > > Of course, a barrier bio is a special case; I don't consider it here.
> >
> > Because the caller may not expect that; a partial add may not make any
> > sense to the caller. The bio code obviously doesn't care. And it
> > certainly could complicate the caller a lot if they now need to issue
> > and wait for several bios instead of just a single one. Then a single
> > completion queue and wait_for_completion() is no longer enough.
>
> The whole thing is lame by design.
>
> As a consequence, the code for splitting these page-sized bios is being
> duplicated in md, dm and request-based devices (and maybe in other drivers).
>
> And there is a long-standing unfixable bug --- adding a device to raid1
> may fail if the device has a smaller request size than the other raid1 leg.
> Think about this:
>
> * thread 1:
> bio_add_page() -> returns ok
> ...
> bio_add_page() -> returns ok
> * scheduling to thread 2 *
> * thread 2:
> add a raid1 leg to the block device
> * scheduling to thread 1 *
> * thread 1:
> submit_bio() --- the bio is already oversized for the new raid1 leg and
> there is no way to shrink it.
>
>
> I think it should be redesigned in this way:
> * the bio should be allowed to have arbitrary size, up to BIO_MAX_PAGES
> * the code for splitting bios should be in just one place --- in the
> generic block layer --- and it should work with all drivers. I.e.,
> q->make_request_fn would return that the request is too big for the given
> driver, and the generic code would allocate a few more bios from a
> per-request_queue mempool and split the bio.
> * then we would fix the raid1 bug and also simplify md and dm by
> removing the splitting code.

The design certainly isn't perfect, but neither is the one that you
suggest. For instance, what would be the point of building bigger bios
just to split them _every time_? That's not good design, and it's
certainly not helping performance much.

Alasdair had an idea for doing a map/unmap-like interface instead, which
I think is a lot better. I don't think he ever coded it up, though.
Perhaps talk to him if you want to improve our situation in this area.

--
Jens Axboe

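To make the all-or-nothing contract above concrete, here is a minimal sketch
against the 2.6.33-era block API (submit_pages(), end_io and pages[] are
hypothetical, and the sketch assumes the device accepts at least one page per
bio): when bio_add_page() refuses a page, the caller submits what it has
built and retries the page in a fresh bio.

#include <linux/bio.h>
#include <linux/blkdev.h>

/*
 * Minimal sketch of the all-or-nothing bio_add_page() contract: an add
 * either takes the whole page or nothing, so on refusal the caller
 * submits the bio built so far and starts again with a fresh one.
 * submit_pages(), end_io and pages[] are hypothetical; the sketch
 * assumes the device accepts at least PAGE_SIZE per bio.
 */
static void submit_pages(struct block_device *bdev, sector_t sector,
                         struct page **pages, int nr_pages,
                         bio_end_io_t *end_io)
{
        struct bio *bio = NULL;
        int i;

        for (i = 0; i < nr_pages; i++) {
                if (!bio) {
                        bio = bio_alloc(GFP_NOIO, nr_pages - i);
                        bio->bi_bdev = bdev;
                        bio->bi_sector = sector;
                        bio->bi_end_io = end_io;
                }
                /* queue limits or a merge_bvec_fn may refuse the page */
                if (bio_add_page(bio, pages[i], PAGE_SIZE, 0) < PAGE_SIZE) {
                        submit_bio(WRITE, bio);
                        bio = NULL;
                        i--;    /* retry this page in a fresh bio */
                        continue;
                }
                sector += PAGE_SIZE >> 9;
        }
        if (bio)
                submit_bio(WRITE, bio);
}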
 
03-05-2010, 06:30 AM
Jens Axboe

blkdev: fix merge_bvec_fn return value checks

On Thu, Mar 04 2010, Mikulas Patocka wrote:
> > > I think it should be redesigned in this way:
> > > * the bio should be allowed to have arbitrary size, up to BIO_MAX_PAGES
> > > * the code for splitting bios should be in just one place --- in the
> > > generic block layer --- and it should work with all drivers. I.e.,
> > > q->make_request_fn would return that the request is too big for the given
> > > driver, and the generic code would allocate a few more bios from a
> > > per-request_queue mempool and split the bio.
> > > * then we would fix the raid1 bug and also simplify md and dm by
> > > removing the splitting code.
> >
> > The design certainly isn't perfect, but neither is the one that you
> > suggest. For instance, what would be the point of building bigger bios
> > just to split them _every time_? That's not good design, and it's
> > certainly not helping performance much.
>
> The point of building a big bio and splitting it is code-size reduction.
>
> You must realize that you need to split the request anyway. If the caller
> needs to read/write M sectors and the device can process at most N
> sectors, where N<M, then the i/o must be split. You can't get rid of that
> split.

If we consider the basic (and most used) file system IO path, then
those 'M' sectors will usually be submitted in units that are smaller
than 'N' anyway. It would be silly to allocate a bio for 'N' sectors
(those get big), only to split it up again shortly after.

The point is not to build it too big to begin with, and to keep the
splitting as the slow path. That's how it was originally designed, and
it'll surely stay that way. Doing the split handling in generic code is
a good idea; it definitely should be generic. But I'm never going to make
splitting a normal part of IO operations just because we allow
arbitrarily large bios, sorry, but that's insane.

> Now, the question is, where this splitting happens. Currently, the
> splitting happens in the issuers of the bio (for example dm-io, which is a
> big chunk of code written just to work around the uncertain bio size
> limits) as well as in the consumers of the bio (md, dm and request-based
> drivers, if their maximum transfer size < PAGE_SIZE).
>
> What I'm proposing is to put that bio splitting just in one place and
> remove it from both the issuers and the consumers of the bio.

Agree on that, we can make that generic (and should).

> There is just one case where it is not desirable to create requests
> larger than the physical transfer size, for performance reasons --- page
> cache readahead. That could use the current approach of querying the
> queue for limits and not stuffing in more data than the device can
> handle.
>
> In all the other (non-readahead) cases, the caller is waiting for all the
> data anyway, so there is no performance problem with creating larger
> requests and splitting them.

But this I still think is crazy, sorry.

> > Alasdair had an idea for doing a map/unmap-like interface instead, which
> > I think is a lot better. I don't think he ever coded it up, though.
> > Perhaps talk to him if you want to improve our situation in this area.
>
> It would increase code size, not reduce it.

What kind of answer is that?

> BTW, see fs/bio.c:bio_split --- it uses a shared mempool. It can deadlock
> if the devices depend on each other. You need to use a per-queue mempool.

Only if you need to split the same original bio twice on the same IO
path.

--
Jens Axboe

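As a sketch of the redesign argued over above (generic_split_and_resubmit(),
bio_split_from_pool() and q->split_pool are all invented for illustration;
bio_pair, bio_sectors(), queue_max_sectors() and generic_make_request() are
the real 2.6.33 interfaces), generic-layer splitting backed by a per-queue
mempool rather than the shared pool behind fs/bio.c:bio_split() might look
like this:

#include <linux/blkdev.h>

/*
 * Hypothetical sketch: split an oversized bio in the generic block
 * layer, drawing from a per-queue mempool (q->split_pool is invented
 * here) so that stacked devices splitting on top of one another cannot
 * deadlock on the single shared pool used by fs/bio.c:bio_split().
 * A real version would also have to respect merge_bvec_fn-style
 * boundaries, not just queue_max_sectors().
 */
static void generic_split_and_resubmit(struct request_queue *q,
                                       struct bio *bio)
{
        while (bio_sectors(bio) > queue_max_sectors(q)) {
                struct bio_pair *bp;

                /* hypothetical per-queue variant of bio_split() */
                bp = bio_split_from_pool(bio, q->split_pool,
                                         queue_max_sectors(q));
                generic_make_request(&bp->bio1);  /* the front half fits */
                bio = &bp->bio2;                  /* keep splitting the rest */
        }
        generic_make_request(bio);
}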
 
03-05-2010, 04:37 PM
Alasdair G Kergon

blkdev: fix merge_bvec_fn return value checks

On Thu, Mar 04, 2010 at 06:59:21PM +0100, Lars Ellenberg wrote:
> + /* Restricting max_sectors is not enough.
> + * If someone uses bio_add_page to add 8 disjunct 512 byte
> + * partial pages to a bio, it would succeed,
> + * but could still cross a border of whatever restrictions
> + * are below us (raid0 stripe boundary). An attempted
> + * bio_split would not succeed, because bi_vcnt is 8.
> + * E.g. the xen io layer is known to trigger this.
> + */

Sounds plausible.

Do you or anyone reading this have example messages demonstrating the
failure when this patch is not applied?

Alasdair.

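For reference, the scenario the quoted comment describes can be sketched as
follows (bdev, pages[] and the sector arithmetic are illustrative): eight
disjoint 512-byte adds each pass the max_sectors check individually, yet the
finished 4KiB bio straddles a raid0 stripe border, and the 2.6.33
fs/bio.c:bio_split() cannot repair it because it only handles single-vec
bios (BUG_ON(bi->bi_vcnt != 1)).

#include <linux/bio.h>

/*
 * Sketch of the failure mode: restricting max_sectors alone does not
 * stop a multi-vec bio from crossing a lower-level boundary.
 * bdev, pages[] and stripe_border are illustrative.
 */
static void trigger_split_failure(struct block_device *bdev,
                                  sector_t stripe_border,
                                  struct page **pages)
{
        struct bio *bio = bio_alloc(GFP_NOIO, 8);
        int i;

        bio->bi_bdev = bdev;                /* e.g. a raid0 array */
        bio->bi_sector = stripe_border - 4; /* 2KiB before the border */

        for (i = 0; i < 8; i++)             /* eight disjoint partial pages */
                bio_add_page(bio, pages[i], 512, 0);   /* each one succeeds */

        /*
         * The 4KiB bio now crosses the stripe border with bi_vcnt == 8;
         * bio_split() refuses multi-vec bios, so the lower level cannot
         * fix it up after the fact.
         */
        submit_bio(READ, bio);
}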
 
03-05-2010, 09:27 PM
Alasdair G Kergon

blkdev: fix merge_bvec_fn return value checks

On Sat, Mar 06, 2010 at 08:56:51AM +1100, Neil Brown wrote:
> My preferred end-game would be to allow a bio of any size to be submitted to
> any device. The device would be responsible for cutting it up if necessary.

From the dm point of view, splitting is undesirable --- the memory
allocations come from separate mempools, and the split-off parts, once
submitted, could be reordered or delayed but must still respect barrier
constraints, etc. Splitting is the 'slow and complicated' path for us. We
support it, but it is simpler and more efficient if the bio is created at a
suitable size in the first place --- and the merge_bvec_fn does this for us
most of the time.

Alasdair

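For context, the merge_bvec_fn interface the thread's subject line refers to
has the 2.6.33 signature shown below; next_boundary() stands in for whatever
limit a driver enforces. The return value is the number of bytes of the
proposed vec the queue will accept, and bio_add_page() treats any answer
smaller than the whole vec as a refusal --- which is how bios get created at
a suitable size in the first place.

#include <linux/blkdev.h>

/*
 * Sketch of a merge_bvec_fn: report how many bytes of the proposed
 * bio_vec fit before the next boundary (next_boundary() is a
 * hypothetical per-driver helper). bio_add_page() backs off entirely
 * if the answer is smaller than biovec->bv_len.
 */
static int example_merge_bvec(struct request_queue *q,
                              struct bvec_merge_data *bvm,
                              struct bio_vec *biovec)
{
        sector_t boundary = next_boundary(bvm->bi_sector);
        unsigned int room = (boundary - bvm->bi_sector) << 9;

        if (room < bvm->bi_size + biovec->bv_len)
                return room > bvm->bi_size ? room - bvm->bi_size : 0;

        return biovec->bv_len;      /* the whole vec fits */
}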
 
03-06-2010, 01:20 AM
Alasdair G Kergon

blkdev: fix merge_bvec_fn return value checks

On Sat, Mar 06, 2010 at 10:52:37AM +1100, Neil Brown wrote:
> But my first suggestion, that splitting could be made easier, still stands.

I'm in full agreement that we should try to make it easier.

> And do you honour the merge_bvec_fns of underlying devices? A quick grep
> suggests you do only for dm-linear and dm-crypt.

We implemented it to address some specific performance problems reported by
people using those targets. So far there hasn't been any pressing need to
implement it for other targets.

> This suggests to me that it
> is actually a hard interface to support completely in a stacked device, so we
> might be better off without it.

All this code is hard - but very useful. The direction I want to explore when
I finally get some time to work on this is one which will probably make more
use of, and extend, merge_bvec.

> that it makes sense to only require that the lowest level does the
> splitting, as that is where the specifics of what might be required exist.

We have a hybrid model at the moment - some stuff dealt with top-down, other
stuff bottom-up. This thread describes just one of several problems related to
this. In the next incarnation of device-mapper I hope we'll find a way to
eliminate all of them together. Maybe I'll finally get time later this year to
start working on these ideas properly.

Alasdair

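On honouring the merge_bvec_fn of underlying devices: in the spirit of
drivers/md/dm-linear.c:linear_merge() (underlying_bdev() and remap_sector()
are invented helpers here; the rest follows the 2.6.33 interface), a stacked
target can forward the query downwards roughly like this:

#include <linux/blkdev.h>

/*
 * Sketch of a stacked merge_bvec_fn that consults the underlying
 * device, in the spirit of dm-linear's linear_merge().
 * underlying_bdev() and remap_sector() are hypothetical helpers.
 */
static int stacked_merge_bvec(struct request_queue *q,
                              struct bvec_merge_data *bvm,
                              struct bio_vec *biovec)
{
        struct block_device *bdev = underlying_bdev(q);
        struct request_queue *uq = bdev_get_queue(bdev);

        if (!uq->merge_bvec_fn)
                return biovec->bv_len;  /* no restriction below us */

        /* rephrase the query in the lower device's terms */
        bvm->bi_bdev = bdev;
        bvm->bi_sector = remap_sector(q, bvm->bi_sector);
        return min_t(int, biovec->bv_len,
                     uq->merge_bvec_fn(uq, bvm, biovec));
}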
 
