04-12-2011, 06:20 AM
Jens Axboe

block: remove per-queue plugging

On 2011-04-12 00:58, hch@infradead.org wrote:
> Looking at the patch
> (http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=761e433f3de6fb8e369af9e5c08beb86286d023f)
>
> I'm not sure it's an optimal design. The flush callback really
> is a per-queue thing. Why isn't it a function pointer in the request
> queue, invoked during the blk_run_queue call once we're done with a
> given queue, before moving on to the next one?

I was thinking about this yesterday as well; the design didn't quite
feel right. Additionally, the user now has to track this state too,
i.e. whether the task is currently plugged or not.

I'll rewrite this.
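
(Roughly what you're suggesting, as I read it; sketch only, with the
callback hanging off the queue instead of being tracked per task:

	struct request_queue {
		...
		void (*unplugged_fn)(struct request_queue *q);
	};

	static void queue_unplugged(struct request_queue *q)
	{
		__blk_run_queue(q, false);
		if (q->unplugged_fn)	/* per-queue flush callback */
			q->unplugged_fn(q);
	}

modulo naming and where exactly the call happens.)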

--
Jens Axboe

 
04-12-2011, 08:36 AM
Jens Axboe

block: remove per-queue plugging

On 2011-04-12 03:12, hch@infradead.org wrote:
> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote:
>> Great, once you do that and XFS kills the blk_flush_plug() calls too,
>> then we can remove that export and make it internal only.
>
> Linus pulled the tree, so they are gone now. Btw, there are still some
> bits in the area that confuse me:

Great!

> - what's the point of queue_sync_plugs? It has a lot of comments
> that seem to pre-date the on-stack plugging, but except for that
> it's a trivial wrapper around blk_flush_plug, with an argument
> that is not used.

There's really no point to it anymore. Its existence was due to the
older revision that had to track write requests for serializing around
a barrier. I'll kill it, since we don't do that anymore.

> - is there a good reason for the existence of __blk_flush_plug? You'd
> get one additional instruction in the inlined version of
> blk_flush_plug when open-coding, but avoid the need for chained
> function calls.
> - Why is having a plug in blk_flush_plug marked unlikely? Note that
> unlikely is the static branch prediction hint to mark the case
> extremely unlikely and is even used for hot/cold partitioning. But
> when we call it we usually check beforehand if we actually have
> plugs, so it's actually likely to happen.

The existence and the out-of-line version are for the scheduler hook.
It should be an unlikely event to schedule with a plug held; normally
the plug should have been explicitly unplugged before that happens.
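
Roughly, the split looks like this (sketch from memory, modulo the
exact details in the tree):

	static inline void blk_flush_plug(struct task_struct *tsk)
	{
		struct blk_plug *plug = tsk->plug;

		/* cold path by design: schedule() with a plug still held */
		if (unlikely(plug))
			__blk_flush_plug(tsk, plug);
	}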

> - what is the point of blk_finish_plug? All callers have
> the plug on stack, and there's no good reason for adding the NULL
> check. Note that blk_start_plug doesn't have the NULL check either.

That one can probably go; I need to double-check that part since some
things changed.

> - Why does __blk_flush_plug call __blk_finish_plug, which might clear
> tsk->plug, just to set it back after the call? When manually inlining
> __blk_finish_plug into __blk_flush_plug it looks like:
>
> void __blk_flush_plug(struct task_struct *tsk, struct blk_plug *plug)
> {
> 	flush_plug_list(plug);
> 	if (plug == tsk->plug)
> 		tsk->plug = NULL;
> 	tsk->plug = plug;
> }
>
> it would seem much smarter to just call flush_plug_list directly.
> In fact it seems like the tsk->plug is not necessary at all, and
> all remaining __blk_flush_plug callers could be replaced with
> flush_plug_list.

It depends on whether this was an explicit unplug (e.g.
blk_finish_plug()), or whether it was an implicit event (e.g. on
schedule()). If we do it on schedule(), then we retain the plug after
the flush. Otherwise we clear it.
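
In other words, something like this sketch (illustrative only; the
helper name and details here are made up):

	static void flush_plug(struct task_struct *tsk, struct blk_plug *plug,
			       bool from_schedule)
	{
		flush_plug_list(plug);
		if (!from_schedule)
			tsk->plug = NULL;	/* explicit blk_finish_plug() */
		/* implicit flush from schedule(): keep tsk->plug set */
	}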

> - and of course the remaining issue of why io_schedule needs an
> explicit blk_flush_plug when schedule() already does one in
> case it actually needs to schedule.

Already answered in the other email.

--
Jens Axboe

 
04-12-2011, 12:22 PM
Dave Chinner

block: remove per-queue plugging

On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote:
> On 2011-04-12 03:12, hch@infradead.org wrote:
> > On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote:
> >> Great, once you do that and XFS kills the blk_flush_plug() calls too,
> >> then we can remove that export and make it internal only.
> >
> > Linus pulled the tree, so they are gone now. Btw, there are still some
> > bits in the area that confuse me:
>
> Great!
>
> > - what's the point of queue_sync_plugs? It has a lot of comments
> > that seem to pre-date the on-stack plugging, but except for that
> > it's a trivial wrapper around blk_flush_plug, with an argument
> > that is not used.
>
> There's really no point to it anymore. Its existence was due to the
> older revision that had to track write requests for serializing around
> a barrier. I'll kill it, since we don't do that anymore.
>
> > - is there a good reason for the existence of __blk_flush_plug? You'd
> > get one additional instruction in the inlined version of
> > blk_flush_plug when open-coding, but avoid the need for chained
> > function calls.
> > - Why is having a plug in blk_flush_plug marked unlikely? Note that
> > unlikely is the static branch prediction hint to mark the case
> > extremely unlikely and is even used for hot/cold partitioning. But
> > when we call it we usually check beforehand if we actually have
> > plugs, so it's actually likely to happen.
>
> The existence and the out-of-line version are for the scheduler hook.
> It should be an unlikely event to schedule with a plug held; normally
> the plug should have been explicitly unplugged before that happens.

Though if it does, haven't you just added a significant amount of
depth to the worst case stack usage? I'm seeing this sort of thing
from io_schedule():

Depth Size Location (40 entries)
----- ---- --------
0) 4256 16 mempool_alloc_slab+0x15/0x20
1) 4240 144 mempool_alloc+0x63/0x160
2) 4096 16 scsi_sg_alloc+0x4c/0x60
3) 4080 112 __sg_alloc_table+0x66/0x140
4) 3968 32 scsi_init_sgtable+0x33/0x90
5) 3936 48 scsi_init_io+0x31/0xc0
6) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0
7) 3856 112 sd_prep_fn+0x150/0xa90
8) 3744 48 blk_peek_request+0x6a/0x1f0
9) 3696 96 scsi_request_fn+0x60/0x510
10) 3600 32 __blk_run_queue+0x57/0x100
11) 3568 80 flush_plug_list+0x133/0x1d0
12) 3488 32 __blk_flush_plug+0x24/0x50
13) 3456 32 io_schedule+0x79/0x80

(This is from a page fault on ext3 that is doing page cache
readahead and blocking on a locked buffer.)

I've seen traces where mempool_alloc_slab enters direct reclaim
which adds another 1.5k of stack usage to this path. So I'm
extremely concerned that you've just reduced the stack available to
every thread by at least 2.5k of space...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

 
04-12-2011, 12:28 PM
Jens Axboe

block: remove per-queue plugging

On 2011-04-12 14:22, Dave Chinner wrote:
> On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote:
>> On 2011-04-12 03:12, hch@infradead.org wrote:
>>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote:
>>>> Great, once you do that and XFS kills the blk_flush_plug() calls too,
>>>> then we can remove that export and make it internal only.
>>>
>>> Linus pulled the tree, so they are gone now. Btw, there are still some
>>> bits in the area that confuse me:
>>
>> Great!
>>
>>> - what's the point of queue_sync_plugs? It has a lot of comments
>>> that seem to pre-date the on-stack plugging, but except for that
>>> it's a trivial wrapper around blk_flush_plug, with an argument
>>> that is not used.
>>
>> There's really no point to it anymore. Its existence was due to the
>> older revision that had to track write requests for serializing around
>> a barrier. I'll kill it, since we don't do that anymore.
>>
>>> - is there a good reason for the existence of __blk_flush_plug? You'd
>>> get one additional instruction in the inlined version of
>>> blk_flush_plug when open-coding, but avoid the need for chained
>>> function calls.
>>> - Why is having a plug in blk_flush_plug marked unlikely? Note that
>>> unlikely is the static branch prediction hint to mark the case
>>> extremely unlikely and is even used for hot/cold partitioning. But
>>> when we call it we usually check beforehand if we actually have
>>> plugs, so it's actually likely to happen.
>>
>> The existence and the out-of-line version are for the scheduler hook.
>> It should be an unlikely event to schedule with a plug held; normally
>> the plug should have been explicitly unplugged before that happens.
>
> Though if it does, haven't you just added a significant amount of
> depth to the worst case stack usage? I'm seeing this sort of thing
> from io_schedule():
>
> Depth Size Location (40 entries)
> ----- ---- --------
> 0) 4256 16 mempool_alloc_slab+0x15/0x20
> 1) 4240 144 mempool_alloc+0x63/0x160
> 2) 4096 16 scsi_sg_alloc+0x4c/0x60
> 3) 4080 112 __sg_alloc_table+0x66/0x140
> 4) 3968 32 scsi_init_sgtable+0x33/0x90
> 5) 3936 48 scsi_init_io+0x31/0xc0
> 6) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0
> 7) 3856 112 sd_prep_fn+0x150/0xa90
> 8) 3744 48 blk_peek_request+0x6a/0x1f0
> 9) 3696 96 scsi_request_fn+0x60/0x510
> 10) 3600 32 __blk_run_queue+0x57/0x100
> 11) 3568 80 flush_plug_list+0x133/0x1d0
> 12) 3488 32 __blk_flush_plug+0x24/0x50
> 13) 3456 32 io_schedule+0x79/0x80
>
> (This is from a page fault on ext3 that is doing page cache
> readahead and blocking on a locked buffer.)
>
> I've seen traces where mempool_alloc_slab enters direct reclaim
> which adds another 1.5k of stack usage to this path. So I'm
> extremely concerned that you've just reduced the stack available to
> every thread by at least 2.5k of space...

Yeah, that does not look great. If this turns out to be problematic, we
can move the queue runs for the unlikely case out of line, to kblockd.

But this really isn't that new; you could already enter the IO dispatch
path when doing IO (at submission time). So we had better be able to
handle that.

If it's a problem from the schedule()/io_schedule() path, then let's
ensure that those are truly unlikely events so we can punt them to
kblockd.


--
Jens Axboe

 
04-12-2011, 12:41 PM
Dave Chinner

block: remove per-queue plugging

On Tue, Apr 12, 2011 at 02:28:31PM +0200, Jens Axboe wrote:
> On 2011-04-12 14:22, Dave Chinner wrote:
> > On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote:
> >> On 2011-04-12 03:12, hch@infradead.org wrote:
> >>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote:
> >>>> Great, once you do that and XFS kills the blk_flush_plug() calls too,
> >>>> then we can remove that export and make it internal only.
> >>>
> >>> Linus pulled the tree, so they are gone now. Btw, there are still some
> >>> bits in the area that confuse me:
> >>
> >> Great!
> >>
> >>> - what's the point of queue_sync_plugs? It has a lot of comments
> >>> that seem to pre-date the on-stack plugging, but except for that
> >>> it's a trivial wrapper around blk_flush_plug, with an argument
> >>> that is not used.
> >>
> >> There's really no point to it anymore. Its existence was due to the
> >> older revision that had to track write requests for serializing around
> >> a barrier. I'll kill it, since we don't do that anymore.
> >>
> >>> - is there a good reason for the existence of __blk_flush_plug? You'd
> >>> get one additional instruction in the inlined version of
> >>> blk_flush_plug when open-coding, but avoid the need for chained
> >>> function calls.
> >>> - Why is having a plug in blk_flush_plug marked unlikely? Note that
> >>> unlikely is the static branch prediction hint to mark the case
> >>> extremely unlikely and is even used for hot/cold partitioning. But
> >>> when we call it we usually check beforehand if we actually have
> >>> plugs, so it's actually likely to happen.
> >>
> >> The existence and the out-of-line version are for the scheduler hook.
> >> It should be an unlikely event to schedule with a plug held; normally
> >> the plug should have been explicitly unplugged before that happens.
> >
> > Though if it does, haven't you just added a significant amount of
> > depth to the worst case stack usage? I'm seeing this sort of thing
> > from io_schedule():
> >
> > Depth Size Location (40 entries)
> > ----- ---- --------
> > 0) 4256 16 mempool_alloc_slab+0x15/0x20
> > 1) 4240 144 mempool_alloc+0x63/0x160
> > 2) 4096 16 scsi_sg_alloc+0x4c/0x60
> > 3) 4080 112 __sg_alloc_table+0x66/0x140
> > 4) 3968 32 scsi_init_sgtable+0x33/0x90
> > 5) 3936 48 scsi_init_io+0x31/0xc0
> > 6) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0
> > 7) 3856 112 sd_prep_fn+0x150/0xa90
> > 8) 3744 48 blk_peek_request+0x6a/0x1f0
> > 9) 3696 96 scsi_request_fn+0x60/0x510
> > 10) 3600 32 __blk_run_queue+0x57/0x100
> > 11) 3568 80 flush_plug_list+0x133/0x1d0
> > 12) 3488 32 __blk_flush_plug+0x24/0x50
> > 13) 3456 32 io_schedule+0x79/0x80
> >
> > (This is from a page fault on ext3 that is doing page cache
> > readahead and blocking on a locked buffer.)
> >
> > I've seen traces where mempool_alloc_slab enters direct reclaim
> > which adds another 1.5k of stack usage to this path. So I'm
> > extremely concerned that you've just reduced the stack available to
> > every thread by at least 2.5k of space...
>
> Yeah, that does not look great. If this turns out to be problematic, we
> can move the queue runs for the unlikely case out of line, to kblockd.
>
> But this really isn't that new; you could already enter the IO dispatch
> path when doing IO (at submission time). So we had better be able to
> handle that.

The problem I see is that IO is submitted when there's plenty of
stack available and would previously have been fine. However, now it
hits the plug, and then later on, after the thread consumes a lot
more stack, it, say, waits for a completion. We then schedule, it
unplugs the queue, and we add the IO stack in a place where there
isn't much space available.

So effectively we are moving around the places where stack is
consumed, and it's completely unpredictable where that stack is going
to land now.

> If it's a problem from the schedule()/io_schedule() path, then
> let's ensure that those are truly unlikely events so we can punt
> them to kblockd.

Rather than wait for an explosion to be reported before doing this,
why not just punt unplugs to kblockd unconditionally?

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

 
04-12-2011, 12:58 PM
Jens Axboe

block: remove per-queue plugging

On 2011-04-12 14:41, Dave Chinner wrote:
> On Tue, Apr 12, 2011 at 02:28:31PM +0200, Jens Axboe wrote:
>> On 2011-04-12 14:22, Dave Chinner wrote:
>>> On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote:
>>>> On 2011-04-12 03:12, hch@infradead.org wrote:
>>>>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote:
>>>>>> Great, once you do that and XFS kills the blk_flush_plug() calls too,
>>>>>> then we can remove that export and make it internal only.
>>>>>
>>>>> Linus pulled the tree, so they are gone now. Btw, there are still some
>>>>> bits in the area that confuse me:
>>>>
>>>> Great!
>>>>
>>>>> - what's the point of queue_sync_plugs? It has a lot of comments
>>>>> that seem to pre-date the on-stack plugging, but except for that
>>>>> it's a trivial wrapper around blk_flush_plug, with an argument
>>>>> that is not used.
>>>>
>>>> There's really no point to it anymore. Its existence was due to the
>>>> older revision that had to track write requests for serializing around
>>>> a barrier. I'll kill it, since we don't do that anymore.
>>>>
>>>>> - is there a good reason for the existence of __blk_flush_plug? You'd
>>>>> get one additional instruction in the inlined version of
>>>>> blk_flush_plug when open-coding, but avoid the need for chained
>>>>> function calls.
>>>>> - Why is having a plug in blk_flush_plug marked unlikely? Note that
>>>>> unlikely is the static branch prediction hint to mark the case
>>>>> extremely unlikely and is even used for hot/cold partitioning. But
>>>>> when we call it we usually check beforehand if we actually have
>>>>> plugs, so it's actually likely to happen.
>>>>
>>>> The existence and the out-of-line version are for the scheduler hook.
>>>> It should be an unlikely event to schedule with a plug held; normally
>>>> the plug should have been explicitly unplugged before that happens.
>>>
>>> Though if it does, haven't you just added a significant amount of
>>> depth to the worst case stack usage? I'm seeing this sort of thing
>>> from io_schedule():
>>>
>>> Depth Size Location (40 entries)
>>> ----- ---- --------
>>> 0) 4256 16 mempool_alloc_slab+0x15/0x20
>>> 1) 4240 144 mempool_alloc+0x63/0x160
>>> 2) 4096 16 scsi_sg_alloc+0x4c/0x60
>>> 3) 4080 112 __sg_alloc_table+0x66/0x140
>>> 4) 3968 32 scsi_init_sgtable+0x33/0x90
>>> 5) 3936 48 scsi_init_io+0x31/0xc0
>>> 6) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0
>>> 7) 3856 112 sd_prep_fn+0x150/0xa90
>>> 8) 3744 48 blk_peek_request+0x6a/0x1f0
>>> 9) 3696 96 scsi_request_fn+0x60/0x510
>>> 10) 3600 32 __blk_run_queue+0x57/0x100
>>> 11) 3568 80 flush_plug_list+0x133/0x1d0
>>> 12) 3488 32 __blk_flush_plug+0x24/0x50
>>> 13) 3456 32 io_schedule+0x79/0x80
>>>
>>> (This is from a page fault on ext3 that is doing page cache
>>> readahead and blocking on a locked buffer.)
>>>
>>> I've seen traces where mempool_alloc_slab enters direct reclaim
>>> which adds another 1.5k of stack usage to this path. So I'm
>>> extremely concerned that you've just reduced the stack available to
>>> every thread by at least 2.5k of space...
>>
>> Yeah, that does not look great. If this turns out to be problematic, we
>> can move the queue runs for the unlikely case out of line, to kblockd.
>>
>> But this really isn't that new; you could already enter the IO dispatch
>> path when doing IO (at submission time). So we had better be able to
>> handle that.
>
> The problem I see is that IO is submitted when there's plenty of
> stack available and would previously have been fine. However, now it
> hits the plug, and then later on, after the thread consumes a lot
> more stack, it, say, waits for a completion. We then schedule, it
> unplugs the queue, and we add the IO stack in a place where there
> isn't much space available.
>
> So effectively we are moving around the places where stack is
> consumed, and it's completely unpredictable where that stack is going
> to land now.

Isn't that example fairly contrived? If we ended up doing the IO
dispatch before, then the only difference now is the stack usage of
schedule() itself. Apart from that, as far as I can tell, there should
not be much difference.


>> If it's a problem from the schedule()/io_schedule() path, then
>> let's ensure that those are truly unlikely events so we can punt
>> them to kblockd.
>
> Rather than wait for an explosion to be reported before doing this,
> why not just punt unplugs to kblockd unconditionally?

Supposedly it's faster to do it inline rather than punt the dispatch.
But that may actually not be true, if you have multiple plugs going (and
thus multiple contenders for the queue lock on dispatch). So let's play
it safe and punt to kblockd; we can always revisit this later.
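
Something like the below, where the second argument to
__blk_run_queue() selects punting the queue run to kblockd instead of
running it inline: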

diff --git a/block/blk-core.c b/block/blk-core.c
index c6eaa1f..36b1a75 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2665,7 +2665,7 @@ static int plug_rq_cmp(void *priv, struct list_head *a, struct list_head *b)
static void queue_unplugged(struct request_queue *q, unsigned int depth)
{
trace_block_unplug_io(q, depth);
- __blk_run_queue(q, false);
+ __blk_run_queue(q, true);

if (q->unplugged_fn)
q->unplugged_fn(q);


--
Jens Axboe

 
04-12-2011, 01:31 PM
Dave Chinner

block: remove per-queue plugging

On Tue, Apr 12, 2011 at 02:58:46PM +0200, Jens Axboe wrote:
> On 2011-04-12 14:41, Dave Chinner wrote:
> > On Tue, Apr 12, 2011 at 02:28:31PM +0200, Jens Axboe wrote:
> >> On 2011-04-12 14:22, Dave Chinner wrote:
> >>> On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote:
> >>>> On 2011-04-12 03:12, hch@infradead.org wrote:
> >>>>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote:
> >>>>>> Great, once you do that and XFS kills the blk_flush_plug() calls too,
> >>>>>> then we can remove that export and make it internal only.
> >>>>>
> >>>>> Linus pulled the tree, so they are gone now. Btw, there are still some
> >>>>> bits in the area that confuse me:
> >>>>
> >>>> Great!
> >>>>
> >>>>> - what's the point of queue_sync_plugs? It has a lot of comments
> >>>>> that seem to pre-date the on-stack plugging, but except for that
> >>>>> it's a trivial wrapper around blk_flush_plug, with an argument
> >>>>> that is not used.
> >>>>
> >>>> There's really no point to it anymore. Its existence was due to the
> >>>> older revision that had to track write requests for serializing around
> >>>> a barrier. I'll kill it, since we don't do that anymore.
> >>>>
> >>>>> - is there a good reason for the existence of __blk_flush_plug? You'd
> >>>>> get one additional instruction in the inlined version of
> >>>>> blk_flush_plug when open-coding, but avoid the need for chained
> >>>>> function calls.
> >>>>> - Why is having a plug in blk_flush_plug marked unlikely? Note that
> >>>>> unlikely is the static branch prediction hint to mark the case
> >>>>> extremely unlikely and is even used for hot/cold partitioning. But
> >>>>> when we call it we usually check beforehand if we actually have
> >>>>> plugs, so it's actually likely to happen.
> >>>>
> >>>> The existence and the out-of-line version are for the scheduler hook.
> >>>> It should be an unlikely event to schedule with a plug held; normally
> >>>> the plug should have been explicitly unplugged before that happens.
> >>>
> >>> Though if it does, haven't you just added a significant amount of
> >>> depth to the worst case stack usage? I'm seeing this sort of thing
> >>> from io_schedule():
> >>>
> >>> Depth Size Location (40 entries)
> >>> ----- ---- --------
> >>> 0) 4256 16 mempool_alloc_slab+0x15/0x20
> >>> 1) 4240 144 mempool_alloc+0x63/0x160
> >>> 2) 4096 16 scsi_sg_alloc+0x4c/0x60
> >>> 3) 4080 112 __sg_alloc_table+0x66/0x140
> >>> 4) 3968 32 scsi_init_sgtable+0x33/0x90
> >>> 5) 3936 48 scsi_init_io+0x31/0xc0
> >>> 6) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0
> >>> 7) 3856 112 sd_prep_fn+0x150/0xa90
> >>> 8) 3744 48 blk_peek_request+0x6a/0x1f0
> >>> 9) 3696 96 scsi_request_fn+0x60/0x510
> >>> 10) 3600 32 __blk_run_queue+0x57/0x100
> >>> 11) 3568 80 flush_plug_list+0x133/0x1d0
> >>> 12) 3488 32 __blk_flush_plug+0x24/0x50
> >>> 13) 3456 32 io_schedule+0x79/0x80
> >>>
> >>> (This is from a page fault on ext3 that is doing page cache
> >>> readahead and blocking on a locked buffer.)
> >>>
> >>> I've seen traces where mempool_alloc_slab enters direct reclaim
> >>> which adds another 1.5k of stack usage to this path. So I'm
> >>> extremely concerned that you've just reduced the stack available to
> >>> every thread by at least 2.5k of space...
> >>
> >> Yeah, that does not look great. If this turns out to be problematic, we
> >> can move the queue runs for the unlikely case out of line, to kblockd.
> >>
> >> But this really isn't that new; you could already enter the IO dispatch
> >> path when doing IO (at submission time). So we had better be able to
> >> handle that.
> >
> > The problem I see is that IO is submitted when there's plenty of
> > stack available and would previously have been fine. However, now it
> > hits the plug, and then later on, after the thread consumes a lot
> > more stack, it, say, waits for a completion. We then schedule, it
> > unplugs the queue, and we add the IO stack in a place where there
> > isn't much space available.
> >
> > So effectively we are moving around the places where stack is
> > consumed, and it's completely unpredictable where that stack is going
> > to land now.
>
> Isn't that example fairly contrived?

I don't think so. E.g., in the XFS allocation path we do btree block
readahead, then go do the real work. The real work can end up with a
deeper stack before blocking on locks or completions unrelated to
the readahead, leading to schedule() being called and an unplug
being issued at that point. You might think it contrived, but if
you can't provide a guarantee that it can't happen, then I have to
assume it will happen.

My concern is that we're already under stack space stress in the
writeback path, so anything that has the potential to increase it
significantly is a major worry from my point of view...

> If we ended up doing the IO
> dispatch before, then the only difference now is the stack usage of
> schedule() itself. Apart from that, as far as I can tell, there should
> not be much difference.

There's a difference between IO submission and IO dispatch. IO
submission is submit_bio through to the plug; IO dispatch is from the
plug down to the disk. If they happen at the same place, there's no
problem. If IO dispatch is moved to schedule() via a plug....

> >> If it's a problem from the schedule()/io_schedule() path, then
> >> let's ensure that those are truly unlikely events so we can punt
> >> them to kblockd.
> >
> > Rather than wait for an explosion to be reported before doing this,
> > why not just punt unplugs to kblockd unconditionally?
>
> Supposedly it's faster to do it inline rather than punt the dispatch.
> But that may actually not be true, if you have multiple plugs going (and
> thus multiple contenders for the queue lock on dispatch). So let's play
> it safe and punt to kblockd; we can always revisit this later.

It's always best to play it safe when it comes to other people's
data....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

 
04-12-2011, 01:40 PM
Dave Chinner

block: remove per-queue plugging

On Tue, Apr 12, 2011 at 02:28:31PM +0200, Jens Axboe wrote:
> On 2011-04-12 14:22, Dave Chinner wrote:
> > On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote:
> >> On 2011-04-12 03:12, hch@infradead.org wrote:
> >>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote:
> >>> function calls.
> >>> - Why is having a plug in blk_flush_plug marked unlikely? Note that
> >>> unlikely is the static branch prediction hint to mark the case
> >>> extremely unlikely and is even used for hot/cold partitioning. But
> >>> when we call it we usually check beforehand if we actually have
> >>> plugs, so it's actually likely to happen.
> >>
> >> The existence and the out-of-line version are for the scheduler hook.
> >> It should be an unlikely event to schedule with a plug held; normally
> >> the plug should have been explicitly unplugged before that happens.
> >
> > Though if it does, haven't you just added a significant amount of
> > depth to the worst case stack usage? I'm seeing this sort of thing
> > from io_schedule():
> >
> > Depth Size Location (40 entries)
> > ----- ---- --------
> > 0) 4256 16 mempool_alloc_slab+0x15/0x20
> > 1) 4240 144 mempool_alloc+0x63/0x160
> > 2) 4096 16 scsi_sg_alloc+0x4c/0x60
> > 3) 4080 112 __sg_alloc_table+0x66/0x140
> > 4) 3968 32 scsi_init_sgtable+0x33/0x90
> > 5) 3936 48 scsi_init_io+0x31/0xc0
> > 6) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0
> > 7) 3856 112 sd_prep_fn+0x150/0xa90
> > 8) 3744 48 blk_peek_request+0x6a/0x1f0
> > 9) 3696 96 scsi_request_fn+0x60/0x510
> > 10) 3600 32 __blk_run_queue+0x57/0x100
> > 11) 3568 80 flush_plug_list+0x133/0x1d0
> > 12) 3488 32 __blk_flush_plug+0x24/0x50
> > 13) 3456 32 io_schedule+0x79/0x80
> >
> > (This is from a page fault on ext3 that is doing page cache
> > readahead and blocking on a locked buffer.)

FYI, the next step in the allocation chain adds >900 bytes to that
stack:

$ cat /sys/kernel/debug/tracing/stack_trace
Depth Size Location (47 entries)
----- ---- --------
0) 5176 40 zone_statistics+0xad/0xc0
1) 5136 288 get_page_from_freelist+0x2cf/0x840
2) 4848 304 __alloc_pages_nodemask+0x121/0x930
3) 4544 48 kmem_getpages+0x62/0x160
4) 4496 96 cache_grow+0x308/0x330
5) 4400 80 cache_alloc_refill+0x21c/0x260
6) 4320 64 kmem_cache_alloc+0x1b7/0x1e0
7) 4256 16 mempool_alloc_slab+0x15/0x20
8) 4240 144 mempool_alloc+0x63/0x160
9) 4096 16 scsi_sg_alloc+0x4c/0x60
10) 4080 112 __sg_alloc_table+0x66/0x140
11) 3968 32 scsi_init_sgtable+0x33/0x90
12) 3936 48 scsi_init_io+0x31/0xc0
13) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0
14) 3856 112 sd_prep_fn+0x150/0xa90
15) 3744 48 blk_peek_request+0x6a/0x1f0
16) 3696 96 scsi_request_fn+0x60/0x510
17) 3600 32 __blk_run_queue+0x57/0x100
18) 3568 80 flush_plug_list+0x133/0x1d0
19) 3488 32 __blk_flush_plug+0x24/0x50
20) 3456 32 io_schedule+0x79/0x80

That's close to 1800 bytes now, and that's not entering the reclaim
path. If I get one deeper than that, I'll be sure to post it.
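
(For anyone wanting to reproduce this: the traces come from the stack
tracer. Assuming a kernel built with CONFIG_STACK_TRACER, turn it on
with

	echo 1 > /proc/sys/kernel/stack_tracer_enabled

and the deepest stack seen so far shows up in
/sys/kernel/debug/tracing/stack_trace, as above.)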

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

 
04-12-2011, 01:45 PM
Jens Axboe

block: remove per-queue plugging

On 2011-04-12 15:31, Dave Chinner wrote:
> On Tue, Apr 12, 2011 at 02:58:46PM +0200, Jens Axboe wrote:
>> On 2011-04-12 14:41, Dave Chinner wrote:
>>> On Tue, Apr 12, 2011 at 02:28:31PM +0200, Jens Axboe wrote:
>>>> On 2011-04-12 14:22, Dave Chinner wrote:
>>>>> On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote:
>>>>>> On 2011-04-12 03:12, hch@infradead.org wrote:
>>>>>>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote:
>>>>>>>> Great, once you do that and XFS kills the blk_flush_plug() calls too,
>>>>>>>> then we can remove that export and make it internal only.
>>>>>>>
>>>>>>> Linus pulled the tree, so they are gone now. Btw, there are still some
>>>>>>> bits in the area that confuse me:
>>>>>>
>>>>>> Great!
>>>>>>
>>>>>>> - what's the point of queue_sync_plugs? It has a lot of comments
>>>>>>> that seem to pre-date the on-stack plugging, but except for that
>>>>>>> it's a trivial wrapper around blk_flush_plug, with an argument
>>>>>>> that is not used.
>>>>>>
>>>>>> There's really no point to it anymore. Its existence was due to the
>>>>>> older revision that had to track write requests for serializing around
>>>>>> a barrier. I'll kill it, since we don't do that anymore.
>>>>>>
>>>>>>> - is there a good reason for the existence of __blk_flush_plug? You'd
>>>>>>> get one additional instruction in the inlined version of
>>>>>>> blk_flush_plug when open-coding, but avoid the need for chained
>>>>>>> function calls.
>>>>>>> - Why is having a plug in blk_flush_plug marked unlikely? Note that
>>>>>>> unlikely is the static branch prediction hint to mark the case
>>>>>>> extremely unlikely and is even used for hot/cold partitioning. But
>>>>>>> when we call it we usually check beforehand if we actually have
>>>>>>> plugs, so it's actually likely to happen.
>>>>>>
>>>>>> The existence and the out-of-line version are for the scheduler hook.
>>>>>> It should be an unlikely event to schedule with a plug held; normally
>>>>>> the plug should have been explicitly unplugged before that happens.
>>>>>
>>>>> Though if it does, haven't you just added a significant amount of
>>>>> depth to the worst case stack usage? I'm seeing this sort of thing
>>>>> from io_schedule():
>>>>>
>>>>> Depth Size Location (40 entries)
>>>>> ----- ---- --------
>>>>> 0) 4256 16 mempool_alloc_slab+0x15/0x20
>>>>> 1) 4240 144 mempool_alloc+0x63/0x160
>>>>> 2) 4096 16 scsi_sg_alloc+0x4c/0x60
>>>>> 3) 4080 112 __sg_alloc_table+0x66/0x140
>>>>> 4) 3968 32 scsi_init_sgtable+0x33/0x90
>>>>> 5) 3936 48 scsi_init_io+0x31/0xc0
>>>>> 6) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0
>>>>> 7) 3856 112 sd_prep_fn+0x150/0xa90
>>>>> 8) 3744 48 blk_peek_request+0x6a/0x1f0
>>>>> 9) 3696 96 scsi_request_fn+0x60/0x510
>>>>> 10) 3600 32 __blk_run_queue+0x57/0x100
>>>>> 11) 3568 80 flush_plug_list+0x133/0x1d0
>>>>> 12) 3488 32 __blk_flush_plug+0x24/0x50
>>>>> 13) 3456 32 io_schedule+0x79/0x80
>>>>>
>>>>> (This is from a page fault on ext3 that is doing page cache
>>>>> readahead and blocking on a locked buffer.)
>>>>>
>>>>> I've seen traces where mempool_alloc_slab enters direct reclaim
>>>>> which adds another 1.5k of stack usage to this path. So I'm
>>>>> extremely concerned that you've just reduced the stack available to
>>>>> every thread by at least 2.5k of space...
>>>>
>>>> Yeah, that does not look great. If this turns out to be problematic, we
>>>> can move the queue runs for the unlikely case out of line, to kblockd.
>>>>
>>>> But this really isn't that new; you could already enter the IO dispatch
>>>> path when doing IO (at submission time). So we had better be able to
>>>> handle that.
>>>
>>> The problem I see is that IO is submitted when there's plenty of
>>> stack available and would previously have been fine. However, now it
>>> hits the plug, and then later on, after the thread consumes a lot
>>> more stack, it, say, waits for a completion. We then schedule, it
>>> unplugs the queue, and we add the IO stack in a place where there
>>> isn't much space available.
>>>
>>> So effectively we are moving around the places where stack is
>>> consumed, and it's completely unpredictable where that stack is going
>>> to land now.
>>
>> Isn't that example fairly contrived?
>
> I don't think so. E.g., in the XFS allocation path we do btree block
> readahead, then go do the real work. The real work can end up with a
> deeper stack before blocking on locks or completions unrelated to
> the readahead, leading to schedule() being called and an unplug
> being issued at that point. You might think it contrived, but if
> you can't provide a guarantee that it can't happen, then I have to
> assume it will happen.

If you ended up in lock_page() somewhere along the way, the path would
have been pretty much the same as it is now:

lock_page()
  __lock_page()
    __wait_on_bit_lock()
      sync_page()
        aops->sync_page()
          block_sync_page()
            __blk_run_backing_dev()

and the dispatch follows after that. If your schedules are only due to,
say, blocking on a mutex, then yes it'll be different. But is that
really the case?

I bet that worst-case stack usage is exactly the same as before, and
that's the only metric we really care about.

> My concern is that we're already under stack space stress in the
> writeback path, so anything that has the potential to increase it
> significantly is a major worry from my point of view...

I agree on writeback being a worry, and that's why I made the change
(since it makes sense for other reasons, too). I just don't think we
are worse off than before.

>> If we ended up doing the IO
>> dispatch before, then the only difference now is the stack usage of
>> schedule() itself. Apart from that, as far as I can tell, there should
>> not be much difference.
>
> There's a difference between IO submission and IO dispatch. IO
> submission is submit_bio through to the plug; IO dispatch is from the
> plug down to the disk. If they happen at the same place, there's no
> problem. If IO dispatch is moved to schedule() via a plug....

The IO submission can easily and non-deterministically turn into an IO
dispatch, so there's no real difference for the submitter. That was the
case before. With the explicit plug now, you _know_ that the IO
submission is only that and doesn't include IO dispatch. Not until you
schedule() or call blk_finish_plug(), both of which are events that you
can control.
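
So the submitter's view is deterministic now, roughly (a sketch of the
usage, not any particular caller):

	struct blk_plug plug;

	blk_start_plug(&plug);	/* from here on, submission only queues */
	submit_bio(READ, bio);	/* bio lands on the plug list, no dispatch */
	...
	blk_finish_plug(&plug);	/* explicit dispatch point, you control it */

with an implicit flush from schedule() being the only other place
dispatch can happen.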

>>>> If it's a problem from the schedule()/io_schedule() path, then
>>>> let's ensure that those are truly unlikely events so we can punt
>>>> them to kblockd.
>>>
>>> Rather than wait for an explosion to be reported before doing this,
>>> why not just punt unplugs to kblockd unconditionally?
>>
>> Supposedly it's faster to do it inline rather than punt the dispatch.
>> But that may actually not be true, if you have multiple plugs going (and
>> thus multiple contenders for the queue lock on dispatch). So let's play
>> it safe and punt to kblockd; we can always revisit this later.
>
> It's always best to play it safe when it comes to other people's
> data....

Certainly, but so far I see no real evidence that this is in fact any
safer.

--
Jens Axboe

 
04-12-2011, 01:48 PM
Jens Axboe

block: remove per-queue plugging

On 2011-04-12 15:40, Dave Chinner wrote:
> On Tue, Apr 12, 2011 at 02:28:31PM +0200, Jens Axboe wrote:
>> On 2011-04-12 14:22, Dave Chinner wrote:
>>> On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote:
>>>> On 2011-04-12 03:12, hch@infradead.org wrote:
>>>>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote:
>>>>> function calls.
>>>>> - Why is having a plug in blk_flush_plug marked unlikely? Note that
>>>>> unlikely is the static branch prediction hint to mark the case
>>>>> extremely unlikely and is even used for hot/cold partitioning. But
>>>>> when we call it we usually check beforehand if we actually have
>>>>> plugs, so it's actually likely to happen.
>>>>
>>>> The existence and the out-of-line version are for the scheduler hook.
>>>> It should be an unlikely event to schedule with a plug held; normally
>>>> the plug should have been explicitly unplugged before that happens.
>>>
>>> Though if it does, haven't you just added a significant amount of
>>> depth to the worst case stack usage? I'm seeing this sort of thing
>>> from io_schedule():
>>>
>>> Depth Size Location (40 entries)
>>> ----- ---- --------
>>> 0) 4256 16 mempool_alloc_slab+0x15/0x20
>>> 1) 4240 144 mempool_alloc+0x63/0x160
>>> 2) 4096 16 scsi_sg_alloc+0x4c/0x60
>>> 3) 4080 112 __sg_alloc_table+0x66/0x140
>>> 4) 3968 32 scsi_init_sgtable+0x33/0x90
>>> 5) 3936 48 scsi_init_io+0x31/0xc0
>>> 6) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0
>>> 7) 3856 112 sd_prep_fn+0x150/0xa90
>>> 8) 3744 48 blk_peek_request+0x6a/0x1f0
>>> 9) 3696 96 scsi_request_fn+0x60/0x510
>>> 10) 3600 32 __blk_run_queue+0x57/0x100
>>> 11) 3568 80 flush_plug_list+0x133/0x1d0
>>> 12) 3488 32 __blk_flush_plug+0x24/0x50
>>> 13) 3456 32 io_schedule+0x79/0x80
>>>
>>> (This is from a page fault on ext3 that is doing page cache
>>> readahead and blocking on a locked buffer.)
>
> FYI, the next step in the allocation chain adds >900 bytes to that
> stack:
>
> $ cat /sys/kernel/debug/tracing/stack_trace
> Depth Size Location (47 entries)
> ----- ---- --------
> 0) 5176 40 zone_statistics+0xad/0xc0
> 1) 5136 288 get_page_from_freelist+0x2cf/0x840
> 2) 4848 304 __alloc_pages_nodemask+0x121/0x930
> 3) 4544 48 kmem_getpages+0x62/0x160
> 4) 4496 96 cache_grow+0x308/0x330
> 5) 4400 80 cache_alloc_refill+0x21c/0x260
> 6) 4320 64 kmem_cache_alloc+0x1b7/0x1e0
> 7) 4256 16 mempool_alloc_slab+0x15/0x20
> 8) 4240 144 mempool_alloc+0x63/0x160
> 9) 4096 16 scsi_sg_alloc+0x4c/0x60
> 10) 4080 112 __sg_alloc_table+0x66/0x140
> 11) 3968 32 scsi_init_sgtable+0x33/0x90
> 12) 3936 48 scsi_init_io+0x31/0xc0
> 13) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0
> 14) 3856 112 sd_prep_fn+0x150/0xa90
> 15) 3744 48 blk_peek_request+0x6a/0x1f0
> 16) 3696 96 scsi_request_fn+0x60/0x510
> 17) 3600 32 __blk_run_queue+0x57/0x100
> 18) 3568 80 flush_plug_list+0x133/0x1d0
> 19) 3488 32 __blk_flush_plug+0x24/0x50
> 20) 3456 32 io_schedule+0x79/0x80
>
> That's close to 1800 bytes now, and that's not entering the reclaim
> path. If I get one deeper than that, I'll be sure to post it.

Do you have traces from 2.6.38, or are you just doing them now?

The path you quote above should not go into reclaim; it's a GFP_ATOMIC
allocation.

--
Jens Axboe

 
