01-27-2012, 04:25 AM
Andreas Dilger

a few storage topics

On 2012-01-26, at 8:27 PM, Wu Fengguang wrote:
> On Fri, Jan 27, 2012 at 09:34:49AM +1100, Dave Chinner wrote:
>> On Thu, Jan 26, 2012 at 11:25:56AM -0500, Vivek Goyal wrote:
>>> On Wed, Jan 25, 2012 at 02:35:52PM +0800, Wu Fengguang wrote:
>>>>> It would also be
>>>>> possible to trigger on the size of the device so that the 32MB USB stick
>>>>> doesn't sit busy for a minute with readahead that is useless.
>>>>
>>>> Yeah, I do have a patch for shrinking readahead size based on device size.
>>>
>>> Should it be a udev rule that changes read_ahead_kb based on device
>>> size, instead of a kernel patch?
>>
>> That's effectively what vendors like SGI have been doing since udev
>> was first introduced, though more often the rules are based on device
>> type rather than size, e.g. a 64GB device might be a USB flash drive
>> now, but a 40GB device might be a really fast SSD....
>
> Fair enough. I'll drop this kernel policy patch:
>
> block: limit default readahead size for small devices
> https://lkml.org/lkml/2011/12/19/89
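
(A udev rule along those lines might look like the sketch below; the device
matches and read_ahead_kb values are purely illustrative, not what any vendor
actually ships:)

# keep readahead small on non-rotational (flash/SSD) devices,
# larger on rotational disks
SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/read_ahead_kb}="128"
SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/read_ahead_kb}="1024"

Rules keyed on ATTR{size} are possible too, though udev only does string/glob
matching, so size thresholds usually end up as digit-count globs or a small
helper PROGRAM.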

Fengguang,
Doesn't the kernel derive at least some idea of the speed of a device
from the writeback changes that you made? It would be very useful
if we could get at least a rough metric of device performance
in the kernel and use that as input to the readahead window size as well.

Cheers, Andreas






01-27-2012, 06:53 AM
Wu Fengguang

a few storage topics

On Thu, Jan 26, 2012 at 10:25:33PM -0700, Andreas Dilger wrote:
[snip]
> Doesn't the kernel derive at least some idea of the speed of a device
> from the writeback changes that you made? It would be very useful
> if we could get at least a rough metric of device performance
> in the kernel and use that as input to the readahead window size as well.

Yeah, we now have bdi->write_bandwidth (exported as "BdiWriteBandwidth"
in /debug/bdi/8:0/stats) for estimating the bdi write bandwidth.

However, the value does not reflect the sequential throughput in some
cases:

1) when doing random writes
2) when doing mixed reads+writes
3) when not enough I/O has been issued
4) in the rare case when writes repeatedly hit a small area, so that
they effectively go to the internal disk buffer at high speed

So there are still some challenges in getting a reliably usable
runtime estimation.
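
For illustration only, a heuristic of the kind Andreas suggests might look
roughly like this userspace-style C sketch (nothing like this exists in the
kernel; the 100ms target and the clamps are made-up example numbers):

#include <stdio.h>

/*
 * Sketch only: derive a readahead size from a rough bandwidth estimate,
 * aiming for ~100ms worth of I/O, clamped to [128KB, 4MB].  The latency
 * target and the clamps are arbitrary example values, not kernel policy.
 */
static unsigned long ra_kb_from_bandwidth(unsigned long bw_kbps)
{
        unsigned long ra_kb = bw_kbps / 10;     /* ~100ms of I/O */

        if (ra_kb < 128)
                ra_kb = 128;                    /* floor: 128KB */
        if (ra_kb > 4096)
                ra_kb = 4096;                   /* ceiling: 4MB */
        return ra_kb;
}

int main(void)
{
        /* e.g. a slow USB stick vs. a fast SSD */
        printf("5MB/s   -> %lu KB readahead\n", ra_kb_from_bandwidth(5000));
        printf("500MB/s -> %lu KB readahead\n", ra_kb_from_bandwidth(500000));
        return 0;
}

The point is only the shape of the mapping: more measured bandwidth, a larger
window, within sane bounds.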

Thanks,
Fengguang

01-27-2012, 04:03 PM
Ted Ts'o

a few storage topics

On Thu, Jan 26, 2012 at 05:29:03AM -0700, Andreas Dilger wrote:
>
> Ext4 will also align IO to 1MB boundaries (from the start of
> LUN/partition) by default. If the mke2fs code detects the
> underlying RAID geometry (or the sysadmin sets this manually with
> tune2fs) it will store this in the superblock for the allocator to
> pick a better alignment.

(Still in Hawaii on vacation, but picked this up while I was quickly
scanning through e-mail.)

This is true only if you're using the special (non-upstreamed) Lustre
interfaces for writing Lustre objects. The writepages interface
doesn't have all of the necessary smarts to do the right thing. It's
been on my todo list to look at, but I've mostly been concentrating on
single-disk file systems, since that's what we use at Google. (GFS can
scale across many, many file systems and servers, and avoiding RAID means
fast fsck recoveries and simplifies things, since we don't have to worry
about RAID-related failures, etc.)

Eventually I'd like ext4 to handle RAID better, but unless you're
forced to support really large files, I've come around to believing
that n=3 replication or Reed-Solomon encoding across multiple servers
is a much better way of achieving data robustness, so it's just not
been high on my list of priorities. I'm much more interested in
making sure ext4 works well under high memory pressure, and other
cloud-related issues.

- Ted

02-03-2012, 11:37 AM
Wu Fengguang

a few storage topics

On Thu, Jan 26, 2012 at 11:40:47AM -0500, Loke, Chetan wrote:
> > From: Andrea Arcangeli [mailto:aarcange@redhat.com]
> > Sent: January 25, 2012 5:46 PM
>
> ....
>
> > Way more important is to have feedback on the readahead hits, to be
> > sure that when readahead is raised to the maximum the hit rate is near
> > 100%, and to fall back to lower readahead if we don't get that hit rate.
> > But that's not a VM problem; it's a readahead issue only.
> >
>
> A quick Google search turned up http://kerneltrap.org/node/6642
>
> Interesting thread to follow. I haven't looked further as to what was
> merged and what wasn't.
>
> A quote from the patch: "It works by peeking into the file cache and
> check if there are any history pages present or accessed."
> Now, I don't understand much about this, but I would think digging through
> the file cache isn't needed(?). So, yes, a simple RA hit-rate feedback could
> be fine.
>
> And 'maybe' for adaptive RA just increase the RA blocks by 1 (or some
> N) over a period of time. No more smartness. A simple 10-line function is
> easy to debug/maintain. That is, a scaled-down version of
> ramp-up/ramp-down. Don't go crazy ramping up/down after every RA (like
> SCSI LLDD madness). Wait for some event to happen.
>
> I can see where Andrew Morton's concerns could be (just my
> interpretation). We may not want to end up like protocol state-machine
> code: TCP slow-start, then increase, then congestion, then back off.
> Hmmm, slow-start is a problem for my business logic, so let's
> speed up slow-start.

Loke,

Thrashing-safe readahead can be as simple as:

readahead_size = min(nr_history_pages, MAX_READAHEAD_PAGES)

No need for more slow-start or back-off magic.

This is because nr_history_pages is a lower estimate of the thrashing
threshold:

     chunk A            chunk B                   chunk C               head

  l01    l11            l12    l21                l22
| |-->|-->|       |------>|-->|                |------>|
| +-------+       +-----------+                +-------------+              |
| |   #   |       |     #     |                |      #      |              |
| +-------+       +-----------+                +-------------+              |
| |<==============|<===========================|<===========================|
       L0                     L1                             L2

Let f(l) = L be a map from
    l: the number of pages read by the stream
to
    L: the number of pages pushed into the inactive_list in the meantime
then
    f(l01)        <= L0
    f(l11 + l12)   = L1
    f(l21 + l22)   = L2
    ...
    f(l01 + l11 + ...) <= L0 + L1 + ...
                       <= Length(inactive_list) = f(thrashing-threshold)

So the count of continuous history pages left in the inactive_list is always
a lower estimate of the true thrashing threshold. Given a stable workload,
the readahead size will keep ramping up and then stabilize in the range

(thrashing_threshold/2, thrashing_threshold)
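
The rule itself is tiny in code. A standalone C sketch (page_cached() is a
stand-in for a real page-cache lookup, and MAX_READAHEAD_PAGES is just an
example cap; this is not the kernel implementation):

#include <stdbool.h>

#define MAX_READAHEAD_PAGES 512   /* example cap: 2MB with 4KB pages */

/*
 * Sketch only: size the next readahead window for one stream as
 * readahead_size = min(nr_history_pages, MAX_READAHEAD_PAGES), where
 * nr_history_pages is how many of the stream's most recently read
 * pages are still cached.
 */
unsigned long thrashing_safe_readahead(bool (*page_cached)(unsigned long),
                                       unsigned long offset)
{
        unsigned long nr_history = 0;

        /* walk backwards from the current read position while the
         * pages are still present in the cache */
        while (nr_history < MAX_READAHEAD_PAGES &&
               offset > nr_history &&
               page_cached(offset - 1 - nr_history))
                nr_history++;

        return nr_history ? nr_history : 1;   /* always read something */
}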

Thanks,
Fengguang

02-03-2012, 11:55 AM
Wu Fengguang

a few storage topics

On Wed, Jan 25, 2012 at 04:40:23PM +0000, Steven Whitehouse wrote:
> Hi,
>
> On Wed, 2012-01-25 at 11:22 -0500, Loke, Chetan wrote:
> > > If the reason for not setting a larger readahead value is just that it
> > > might increase memory pressure and thus decrease performance, is it
> > > possible to use a suitable metric from the VM in order to set the value
> > > automatically according to circumstances?
> > >
> >
> > How about tracking heuristics for 'read-hits from previous read-aheads'? If the hits are in an acceptable range (user-configurable knob?) then keep going, else back off a little on the read-ahead?
> >
> > > Steve.
> >
> > Chetan Loke
>
> I'd been wondering about something similar to that. The basic scheme
> would be:
>
> - Set a page flag when readahead is performed
> - Clear the flag when the page is read (or on page fault for mmap)
> (i.e. when it is first used after readahead)
>
> Then when the VM scans for pages to eject from cache, check the flag and
> keep an exponential average (probably on a per-cpu basis) of the rate at
> which such flagged pages are ejected. That number can then be used to
> reduce the max readahead value.
>
> The questions are whether this would provide a fast enough reduction in
> readahead size to avoid problems, and whether the extra complication is
> worth it compared with using an overall metric for memory pressure.
>
> There may well be better solutions though,
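
For illustration, the kind of feedback loop Steve describes might be sketched
like this in C (all names and constants here are invented; this is not
existing kernel code):

/*
 * Sketch only: global feedback on wasted readahead.  Whenever reclaim
 * evicts a page that was read ahead but never used, bump a counter;
 * periodically fold the counter into an exponential moving average and
 * shrink the readahead cap when the waste rate is high.
 */
static unsigned long ra_waste_events;       /* flagged pages evicted unused */
static unsigned long ra_waste_avg;          /* exponential moving average */
static unsigned long max_readahead_kb = 1024;

void note_unused_readahead_page_evicted(void)
{
        ra_waste_events++;
}

void update_readahead_limit(void)           /* called periodically */
{
        /* avg = 7/8 * avg + 1/8 * new sample */
        ra_waste_avg = (ra_waste_avg * 7 + ra_waste_events) / 8;
        ra_waste_events = 0;

        if (ra_waste_avg > 100 && max_readahead_kb > 128)
                max_readahead_kb /= 2;      /* heavy waste: back off */
        else if (ra_waste_avg < 10 && max_readahead_kb < 1024)
                max_readahead_kb *= 2;      /* little waste: recover */
}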

The caveat is that on a consistently thrashed machine, the readahead
size is better determined per read stream.

Repeated readahead thrashing typically happens in a file server with a
large number of concurrent clients. For example, if there are 1000
read streams each doing 1MB readahead, then since there are 2 readahead
windows per stream, there could be up to 2GB of readahead pages, which
will surely be thrashed on a server with only 1GB of memory.

Typically the 1000 clients will have different read speeds. A few of
them will be doing 1MB/s, while most others may be doing 100KB/s. In this
case, we should only decrease the readahead size for the 100KB/s clients.
The 1MB/s clients won't actually see readahead thrashing at all, and
we want them to keep doing large 1MB I/O to achieve good disk utilization.

So we need something better than the "global feedback" scheme, and we
do have such a solution. As said in my other email, the number of
history pages remaining in the page cache is a good estimate of a
particular read stream's thrashing-safe readahead size.

Thanks,
Fengguang

