01-22-2012, 11:21 AM
Boaz Harrosh

a few storage topics

On 01/19/2012 11:46 AM, Jan Kara wrote:
>>
>> OK That one is interesting. Because I'd imagine that the Kernel would not
>> start write-out on a busily modified page.
> So currently writeback doesn't use the fact how busily is page modified.
> After all whole mm has only two sorts of pages - active & inactive - which
> reflects how often page is accessed but says nothing about how often is it
> dirtied. So we don't have this information in the kernel and it would be
> relatively (memory) expensive to keep it.
>

Don't we? What about the information used by the IO elevators per io-group?
Is it not collected at redirty time, or only recorded by the time a bio is
submitted? How does the io-elevator keep small IO latency-bound behind a
heavy writer? We could use the reverse of that to avoid doing IO on the "too soon" data.

>> Some heavy modifying then a single write. If it's not so then there is
>> already great inefficiency, just now exposed, but was always there. The
>> "page-migrate" mentioned here will not help.
> Yes, but I believe RT guy doesn't redirty the page that often. It is just
> that if you have to meet certain latency criteria, you cannot afford a
> single case where you have to wait. And if you redirty pages, you are bound
> to hit PageWriteback case sooner or later.
>

OK, thanks. I needed this overview. What you mean is that, since writeback
fires periodically, there must be times when a page or group of pages is
right in the middle of being changed and the writeback catches only half of
the modification.

So What if we let the dirty data always wait that writeback timeout, if
the pages are "to-new" and memory condition is fine, then postpone the
writeout to the next round. (Assuming we have that information from the
first part)

>> Could we not better our page write-out algorithms to avoid heavy
>> contended pages?
> That's not so easy. Firstly, you'll have track and keep that information
> somehow. Secondly, it is better to writeout a busily dirtied page than to
> introduce a seek.

Sure, I'd say we just go by the timestamp of the first page in the group,
because I'd imagine that the application has changed that group of pages
at roughly the same time.

> Also definition of 'busy' differs for different purposes.
> So to make this useful the logic won't be trivial.

I don't think so. 1st: IO the oldest data. 2nd: postpone the IO of
"too new" data. So any dirtying has some "aging time" before we attack it. The
aging time is very much related to your writeback timer (which is
"the amount of memory buffer you want to keep" divided by your writeout rate).

> Thirdly, the benefit is
> questionable anyway (at least for most of realistic workloads) because
> flusher thread doesn't write the pages all that often - when there are not
> many pages, we write them out just once every couple of seconds, when we
> have lots of dirty pages we cycle through all of them so one page is not
> written that often.
>

Exactly, so let's make sure dirty data is always a "couple of seconds" old. Don't let
that timer sample data that has only just been dirtied.

Which brings me to another subject, the second case above: "when we have lots of
dirty pages". I wish we could talk at LSF/MM about how to not do a dumb cycle
over an sb's inodes but do a time-sorted write-out instead. The writeout always
starts from the lowest-addressed page (inode->i_index), so take the time-of-dirty of
that page as the sorting factor for the inode. And maybe keep a min-inode-dirty-time
per SB to prioritize among SBs.

Because, you see, elevator-less filesystems - that is, non-block-device BDIs like
NFS or exofs - have a problem. A heavy writer can easily and totally starve a slow
IOer (reader or writer). I can easily demonstrate how a heavy NFS writer slows
a KDE desktop to a crawl. We should start thinking about IO fairness and
interactivity at the VFS layer, so as to not let every non-block FS solve its
own problem all over again.

>> Do you have a more detailed description of the workload? Is it theoretically
>> avoidable?
> See https://lkml.org/lkml/2011/10/23/156. Using page migration or copyout
> would solve the problems of this guy.
>
> Honza

Thanks
Boaz

 
01-23-2012, 03:18 PM
Jan Kara

a few storage topics

On Sun 22-01-12 14:21:51, Boaz Harrosh wrote:
> On 01/19/2012 11:46 AM, Jan Kara wrote:
> >>
> >> OK That one is interesting. Because I'd imagine that the Kernel would not
> >> start write-out on a busily modified page.
> > So currently writeback doesn't use the fact how busily is page modified.
> > After all whole mm has only two sorts of pages - active & inactive - which
> > reflects how often page is accessed but says nothing about how often is it
> > dirtied. So we don't have this information in the kernel and it would be
> > relatively (memory) expensive to keep it.
> >
>
> Don't we? what about the information used by the IO elevators per-io-group.
> Is it not collected at redirty time. Is it only recorded by the time a bio
> is submitted? How does the io-elevator keeps small IO behind heavy writer
> latency bound? We could use the reverse of that to not IO the "too soon"
The IO elevator is at a rather different level. It only starts tracking
something once we have a struct request, so it knows nothing about
redirtying, or even about pages as such. Also, prioritization works only at
request granularity. Sure, big requests will take longer to complete, but the
maximum request size is relatively low (512k by default), so writing a
maximum-sized request isn't that much slower than writing 4k. So it works OK
in practice.
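
For reference, that per-queue limit is visible in sysfs; a small sketch that
just reads it back, assuming a disk named sda (adjust the path for your
device):

#include <stdio.h>

/* Print the current and hardware-maximum request size for one disk.
 * The sysfs paths assume a device named "sda"; adjust as needed. */
int main(void)
{
    const char *files[] = {
        "/sys/block/sda/queue/max_sectors_kb",      /* current limit */
        "/sys/block/sda/queue/max_hw_sectors_kb",   /* hardware limit */
    };
    int i;

    for (i = 0; i < 2; i++) {
        FILE *f = fopen(files[i], "r");
        int kb;

        if (!f) {
            perror(files[i]);
            continue;
        }
        if (fscanf(f, "%d", &kb) == 1)
            printf("%s: %d KB\n", files[i], kb);
        fclose(f);
    }
    return 0;
}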

> >> Some heavy modifying then a single write. If it's not so then there is
> >> already great inefficiency, just now exposed, but was always there. The
> >> "page-migrate" mentioned here will not help.
> > Yes, but I believe RT guy doesn't redirty the page that often. It is just
> > that if you have to meet certain latency criteria, you cannot afford a
> > single case where you have to wait. And if you redirty pages, you are bound
> > to hit PageWriteback case sooner or later.
> >
>
> OK, thanks. I need this overview. What you mean is that since the writeback
> fires periodically then there must be times when the page or group of pages
> are just in the stage of changing and the writeback takes only half of the
> modification.
>
> So What if we let the dirty data always wait that writeback timeout, if
What do you mean by writeback timeout?

> the pages are "to-new" and memory condition is fine, then postpone the
And what do you mean by "to-new"?

> writeout to the next round. (Assuming we have that information from the
> first part)
Sorry, I don't understand your idea...

> >> Could we not better our page write-out algorithms to avoid heavy
> >> contended pages?
> > That's not so easy. Firstly, you'll have track and keep that information
> > somehow. Secondly, it is better to writeout a busily dirtied page than to
> > introduce a seek.
>
> Sure I'd say we just go on the timestamp of the first page in the group.
> Because I'd imagine that the application has changed that group of pages
> ruffly at the same time.
We don't have a timestamp on a page. What we have is a timestamp on an
inode. Ideally that would be the time when the oldest dirty page in the inode
was dirtied. Practically, we cannot really keep that information (e.g.
after writing out just some of the dirty pages in an inode), so it is a rather
crude approximation of that.

> > Also definition of 'busy' differs for different purposes.
> > So to make this useful the logic won't be trivial.
>
> I don't think so. 1st: io the oldest data. 2nd: Postpone the IO of
> "too new data". So any dirtying has some "aging time" before attack. The
> aging time is very much related to your writeback timer. (Which is
> "the amount of memory buffer you want to keep" divide by your writeout-rate)
Again I repeat - you don't want to introduce a seek into your IO stream
only because a single page got dirtied too recently. For randomly
written files there's always some compromise between how linear you want the
IO to be and how much you want to reflect page aging. Currently we go for
'totally linear', which is easier to do and generally better for throughput.

> > Thirdly, the benefit is
> > questionable anyway (at least for most of realistic workloads) because
> > flusher thread doesn't write the pages all that often - when there are not
> > many pages, we write them out just once every couple of seconds, when we
> > have lots of dirty pages we cycle through all of them so one page is not
> > written that often.
>
> Exactly, so lets make sure dirty is always "couple of seconds" old. Don't let
> that timer sample data that is just been dirtied.
>
> Which brings me to another subject in the second case "when we have lots of
> dirty pages". I wish we could talk at LSF/MM about how to not do a dumb cycle
> on sb's inodes but do a time sort write-out. The writeout is always started
> from the lowest addressed page (inode->i_index) so take the time-of-dirty of
> that page as the sorting factor of the inode. And maybe keep a min-inode-dirty-time
> per SB to prioritize on SBs.
Boaz, we already do track inodes by dirty time and do writeback in that
order. Go read that code in fs/fs-writeback.c.
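
Very roughly, the ordering amounts to something like the toy model below -
inodes queued by the time they first got dirty and flushed oldest-first. The
struct and field names are only illustrative; the real logic in
fs/fs-writeback.c is considerably more involved:

#include <stdio.h>
#include <stdlib.h>

/* Toy model: sort "inodes" by the time they first became dirty and flush
 * the oldest ones first, the way the per-bdi dirty list is meant to work. */

struct toy_inode {
    unsigned long ino;
    unsigned long dirtied_when;   /* e.g. jiffies of first dirtying */
};

static int by_dirtied_when(const void *a, const void *b)
{
    const struct toy_inode *x = a, *y = b;

    if (x->dirtied_when < y->dirtied_when)
        return -1;
    if (x->dirtied_when > y->dirtied_when)
        return 1;
    return 0;
}

int main(void)
{
    struct toy_inode dirty[] = {
        { 12, 5000 }, { 7, 1200 }, { 42, 3300 },
    };
    size_t i, n = sizeof(dirty) / sizeof(dirty[0]);

    qsort(dirty, n, sizeof(dirty[0]), by_dirtied_when);

    for (i = 0; i < n; i++)
        printf("flush inode %lu (dirtied at %lu)\n",
               dirty[i].ino, dirty[i].dirtied_when);
    return 0;
}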

> Because you see elevator-less FileSystems. Which are none-block-dev BDIs like
> NFS or exofs have a problem. An heavy writer can easily totally starve a slow
> IOer (read or write). I can easily demonstrate how an NFS heavy writer starves
> a KDE desktop to a crawl.
Currently, we rely on the IO scheduler to protect light writers / readers.
You are right that for non-block filesystems that is problematic, because
for them it is not hard for heavy writers to starve light readers. But
that doesn't seem like a problem of writeback but rather a problem of the
NFS client or exofs? Especially in the reader-vs-writer case, writeback
simply doesn't have enough information and isn't the right place to solve
your problems. And I agree it would be stupid to duplicate the code in CFQ in
several places, so maybe you could lift some parts of it and generalize them
enough that they can be used by others.
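
For block-backed filesystems, one existing knob in that direction is the
per-process IO priority that CFQ honours. A sketch of dropping a bulk writer
into the idle class - the ioprio constants are hand-defined here to mirror
the kernel's encoding, so treat them as an assumption and check them against
your headers:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* There is no glibc wrapper for ioprio_set(); these constants mirror the
 * kernel's ioprio encoding (class in the top bits, data below). */
#define IOPRIO_CLASS_SHIFT      13
#define IOPRIO_PRIO_VALUE(c, d) (((c) << IOPRIO_CLASS_SHIFT) | (d))
#define IOPRIO_CLASS_IDLE       3
#define IOPRIO_WHO_PROCESS      1

int main(void)
{
    /* Put the calling process into the idle IO class so CFQ only services
     * its requests when the disk is otherwise unused. */
    if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0)) == -1) {
        perror("ioprio_set");
        return 1;
    }
    printf("now running in the idle IO class\n");
    /* ... the heavy background writes would go here ... */
    return 0;
}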

> We should be starting to think on IO fairness and interactivity at the
> VFS layer. So to not let every none-block-FS solve it's own problem all
> over again.

Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR

 
01-23-2012, 03:30 PM
Jan Kara

a few storage topics

On Sun 22-01-12 13:31:38, Boaz Harrosh wrote:
> On 01/19/2012 11:39 PM, Andrea Arcangeli wrote:
> > On Thu, Jan 19, 2012 at 09:52:11PM +0100, Jan Kara wrote:
> >> anything. So what will be cheaper depends on how often pages under IO are
> >> redirtied. This is rather rare because pages aren't flushed all that often.
> >> So the effect of stable pages is not observable on throughput. But you can
> >> certainly see it on max latency...
> >
> > I see your point. A problem with migrate though is that the page must
> > be pinned by the I/O layer to prevent migration to free the page under
> > I/O, or how else it could be safe to read from a freed page? And if
> > the page is pinned migration won't work at all. See page_freeze_refs
> > in migrate_page_move_mapping. So the pinning issue would need to be
> > handled somehow. It's needed for example when there's an O_DIRECT
> > read, and the I/O is going to the page, if the page is migrated in
> > that case, we'd lose a part of the I/O. Differentiating how many page
> > pins are ok to be ignored by migration won't be trivial but probably
> > possible to do.
> >
> > Another way maybe would be to detect when there's too much re-dirtying
> > of pages in flight in a short amount of time, and to start the bounce
> > buffering and stop waiting, until the re-dirtying stops, and then you
> > stop the bounce buffering. But unlike migration, it can't prevent an
> > initial burst of high fault latency...
>
> Or just change that RT program, which is, one, latency-bound but, two, does
> unpredictable, statistically bad things to a memory-mapped file.
Right. That's what I told the RT guy as well. But he didn't like to
hear that because it meant more coding for him.

> Can a memory-mapped-file writer have some control over the time of
> writeback with data_sync or such, or is it purely: timer fired, kernel sees
> a dirty page, starts a writeout? What if the application maps a
> portion of the file at a time, and the kernel gets lazier on an active
> memory-mapped region? (That's what Windows NT does. It will never IO a mapped
> section except in OOM conditions. The application needs to map small sections
> and unmap to IO. It's more of a direct_io than an mmap.)
You can always start writeback with sync_file_range() but you have no
guarantees about what writeback does. Also, if you need to redirty the page
permanently (e.g. it's the head of your transaction log), there's simply no
good time when it can be written if you also want stable pages.
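
A minimal example of that - it dirties a mapped range and then asks for
writeback to start on it, with the caveat above that this only starts the IO
and guarantees nothing about when it completes (file name and sizes are
arbitrary):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 1 << 20;               /* 1MB region, arbitrary */
    int fd = open("testfile", O_RDWR | O_CREAT, 0644);

    if (fd < 0 || ftruncate(fd, len) < 0) {
        perror("setup");
        return 1;
    }

    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    memset(map, 'x', len);                    /* dirty the mapped pages */

    /* Ask the kernel to start writeback of this byte range now instead of
     * waiting for the periodic flusher.  This only *starts* the IO. */
    if (sync_file_range(fd, 0, len, SYNC_FILE_RANGE_WRITE) < 0)
        perror("sync_file_range");

    munmap(map, len);
    close(fd);
    return 0;
}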

> In any case, if you are very latency sensitive, an mmap writeout is bad for
> you. Not only because of this new problem, but because mmap writeout can
> sync with tons of other things that are due to memory management (as mentioned
> by Andrea). The best for a latency-sensitive application is asynchronous
> direct-io by far. Only with asynchronous direct-io can you have any real
> control over your latency. (I understand they used to have an empirically
> observed latency bound, but that is just luck, not real control.)
>
> BTW: The application mentioned would probably not want its IO bounced at
> the block layer; otherwise why would it use mmap if not to prevent
> the copy induced by buffered IO?
Yeah, I'm not sure why their design was as it was.
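
For comparison, a stripped-down sketch of the asynchronous direct-IO pattern
Boaz is recommending, using libaio (link with -laio; the 4096-byte alignment
and the file name are just assumptions for the example):

#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 4096;                 /* one aligned block */
    void *buf;

    if (posix_memalign(&buf, 4096, len))     /* O_DIRECT needs alignment */
        return 1;
    memset(buf, 'x', len);

    int fd = open("testfile", O_RDWR | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    io_context_t ctx = 0;
    if (io_setup(1, &ctx) < 0) {
        fprintf(stderr, "io_setup failed\n");
        return 1;
    }

    struct iocb cb, *cbs[1] = { &cb };
    io_prep_pwrite(&cb, fd, buf, len, 0);    /* async write at offset 0 */

    if (io_submit(ctx, 1, cbs) != 1) {
        fprintf(stderr, "io_submit failed\n");
        return 1;
    }

    /* The application keeps control here and can do its latency-sensitive
     * work; it reaps the completion whenever it chooses to. */
    struct io_event ev;
    io_getevents(ctx, 1, 1, &ev, NULL);

    io_destroy(ctx);
    close(fd);
    free(buf);
    return 0;
}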

> All that said, a mount option to ext4 (is ext4 used?) to revert to the old
> behavior is the easiest solution. When we originally brought this up at LSF,
> my thought was that the block request queue should have some flag that says
> need_stable_pages. If it is set by the likes of dm/md-raid, iscsi-with-data-signing,
> DIF-enabled devices and so on, and the FS does not guarantee/want stable pages,
> then an IO bounce is set up. But if it is not set, then the likes of ext4 need
> not bother.
There's no mount option. The behavior is on unconditionally. And so far I
have not seen enough people complain to introduce something like that -
automatic logic is a different thing of course. That might be nice to have.

Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR

 
01-23-2012, 04:53 PM
Andrea Arcangeli

a few storage topics

On Mon, Jan 23, 2012 at 05:18:57PM +0100, Jan Kara wrote:
> requst granularity. Sure, big requests will take longer to complete but
> maximum request size is relatively low (512k by default) so writing maximum
> sized request isn't that much slower than writing 4k. So it works OK in
> practice.

Totally unrelated to writeback, but the big merged 512k requests
actually add some measurable I/O scheduler latency, and that in
turn slightly diminishes the fairness that cfq could provide with a
smaller max request size. Probably even more measurable with SSDs (but
then SSDs are even faster).
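
To put rough numbers on it (purely illustrative): on a disk streaming at
about 100MB/s, one 512k request keeps the device busy for roughly 5ms, while
a 4k read transfers in well under a millisecond. A sync reader that ends up
queued behind even a handful of maximum-sized writes can therefore see tens
of milliseconds of added latency that the scheduler can no longer do anything
about once those requests have been dispatched.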

 
01-23-2012, 05:28 PM
Jeff Moyer

a few storage topics

Andrea Arcangeli <aarcange@redhat.com> writes:

> On Mon, Jan 23, 2012 at 05:18:57PM +0100, Jan Kara wrote:
>> requst granularity. Sure, big requests will take longer to complete but
>> maximum request size is relatively low (512k by default) so writing maximum
>> sized request isn't that much slower than writing 4k. So it works OK in
>> practice.
>
> Totally unrelated to the writeback, but the merged big 512k requests
> actually adds up some measurable I/O scheduler latencies and they in
> turn slightly diminish the fairness that cfq could provide with
> smaller max request size. Probably even more measurable with SSDs (but
> then SSDs are even faster).

Are you speaking from experience? If so, what workloads were negatively
affected by merging, and how did you measure that?

Cheers,
Jeff

 
01-23-2012, 05:56 PM
Andrea Arcangeli

a few storage topics

On Mon, Jan 23, 2012 at 01:28:08PM -0500, Jeff Moyer wrote:
> Are you speaking from experience? If so, what workloads were negatively
> affected by merging, and how did you measure that?

Any workload where two processes compete for access to the same disk,
with one process issuing big requests (usually async writes) and the other
small ones (usually sync reads). The one with the small 4k requests
(usually reads) gets some artificial latency if the big requests are
512k. Vivek did a recent measurement to verify the issue is still
there, and it's basically a hardware issue. Software can't do much
other than possibly reducing the max request size when we notice such
an I/O pattern coming in cfq. I did old measurements, which is how I knew
about it, but they were so ancient they're worthless by now; this is why
Vivek had to repeat them to verify the issue still existed on recent
hardware before we could assume anything.

These days, with cgroups, it may be a bit more relevant as max write
bandwidth may be secondary to latency/QoS.

 
01-23-2012, 06:19 PM
Jeff Moyer

a few storage topics

Andrea Arcangeli <aarcange@redhat.com> writes:

> On Mon, Jan 23, 2012 at 01:28:08PM -0500, Jeff Moyer wrote:
>> Are you speaking from experience? If so, what workloads were negatively
>> affected by merging, and how did you measure that?
>
> Any workload where two processes compete for accessing the same disk
> and one process writes big requests (usually async writes), the other
> small (usually sync reads). The one with the small 4k requests
> (usually reads) gets some artificial latency if the big requests are
> 512k. Vivek did a recent measurement to verify the issue is still
> there, and it's basically an hardware issue. Software can't do much
> other than possibly reducing the max request size when we notice such
> an I/O pattern coming in cfq. I did old measurements that's how I knew
> it, but they were so ancient they're worthless by now, this is why
> Vivek had to repeat it to verify before we could assume it still
> existed on recent hardware.
>
> These days with cgroups it may be a bit more relevant as max write
> bandwidth may be secondary to latency/QoS.

Thanks, Vivek was able to point me at the old thread:
http://www.spinics.net/lists/linux-fsdevel/msg44191.html

Cheers,
Jeff

 
01-24-2012, 02:15 PM
Chris Mason

a few storage topics

On Mon, Jan 23, 2012 at 01:28:08PM -0500, Jeff Moyer wrote:
> Andrea Arcangeli <aarcange@redhat.com> writes:
>
> > On Mon, Jan 23, 2012 at 05:18:57PM +0100, Jan Kara wrote:
> >> requst granularity. Sure, big requests will take longer to complete but
> >> maximum request size is relatively low (512k by default) so writing maximum
> >> sized request isn't that much slower than writing 4k. So it works OK in
> >> practice.
> >
> > Totally unrelated to the writeback, but the merged big 512k requests
> > actually adds up some measurable I/O scheduler latencies and they in
> > turn slightly diminish the fairness that cfq could provide with
> > smaller max request size. Probably even more measurable with SSDs (but
> > then SSDs are even faster).
>
> Are you speaking from experience? If so, what workloads were negatively
> affected by merging, and how did you measure that?

https://lkml.org/lkml/2011/12/13/326

This patch is another example, although for a slightly different reason.
I really have no idea yet what the right answer is in a generic sense,
but you don't need a 512k request to see higher latencies from merging.

-chris

 
01-24-2012, 03:56 PM
Christoph Hellwig

a few storage topics

On Tue, Jan 24, 2012 at 10:15:04AM -0500, Chris Mason wrote:
> https://lkml.org/lkml/2011/12/13/326
>
> This patch is another example, although for a slight different reason.
> I really have no idea yet what the right answer is in a generic sense,
> but you don't need a 512K request to see higher latencies from merging.

That assumes the 512k request is created by merging. We have enough
workloads that create large I/O from the get-go, and not splitting it up
and eventually merging it again would be a big win. E.g. I'm
currently looking at a distributed block device which uses internal 4MB
chunks, and increasing the maximum request size to that dramatically
increases the read performance.
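
As a sketch, the tuning from userspace looks roughly like this - the device
name sdb is an assumption, the value must not exceed max_hw_sectors_kb, and
it needs root:

#include <stdio.h>

/* Raise the per-request size cap for one queue to 4096 KB to match the
 * 4MB chunk size mentioned above.  The kernel rejects values above
 * max_hw_sectors_kb.  The device name is an assumption - adjust as needed. */
int main(void)
{
    FILE *f = fopen("/sys/block/sdb/queue/max_sectors_kb", "w");

    if (!f) {
        perror("max_sectors_kb");
        return 1;
    }
    if (fprintf(f, "4096\n") < 0)
        perror("write");
    fclose(f);
    return 0;
}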

 
01-24-2012, 04:01 PM
Andreas Dilger

a few storage topics

Cheers, Andreas

On 2012-01-24, at 9:56, Christoph Hellwig <hch@infradead.org> wrote:

> On Tue, Jan 24, 2012 at 10:15:04AM -0500, Chris Mason wrote:
>> https://lkml.org/lkml/2011/12/13/326
>>
>> This patch is another example, although for a slight different reason.
>> I really have no idea yet what the right answer is in a generic sense,
>> but you don't need a 512K request to see higher latencies from merging.
>
> That assumes the 512k requests is created by merging. We have enough
> workloads that create large I/O from the get go, and not splitting them
> and eventually merging them again would be a big win. E.g. I'm
> currently looking at a distributed block device which uses internal 4MB
> chunks, and increasing the maximum request size to that dramatically
> increases the read performance.
>

 
