Linux Archive > Redhat > Device-mapper Development

 
 
 
01-24-2012, 04:06 PM
Andrea Arcangeli

a few storage topics

On Tue, Jan 24, 2012 at 11:56:31AM -0500, Christoph Hellwig wrote:
> That assumes the 512k requests is created by merging. We have enough
> workloads that create large I/O from the get go, and not splitting them
> and eventually merging them again would be a big win. E.g. I'm
> currently looking at a distributed block device which uses internal 4MB
> chunks, and increasing the maximum request size to that dramatically
> increases the read performance.

It depends on the device, though: if it's a normal disk, it likely only
reduces the number of DMA ops without increasing performance too
much. Most disks should reach platter speed at 64KB, so a larger request
only saves a bit of CPU on interrupts and the like.

But I think nobody here was suggesting reducing the request size by
default. cfq should easily notice when multiple queues are submitting
I/O in the same time range. In addition to specifying the maximum
request DMA size it can handle, a device could specify the minimum
request size at which it reaches platter speed, and cfq could degrade to
that minimum when multiple queues are running in parallel over the same
millisecond or so. A reading process will show up in the I/O queue again
almost immediately, but it is out of the queue for a little while until
its data has been copied to userland, so cfq would need to keep requests
down to the smallest size at which the device still reaches platter
speed for a little while. Then, if no other queue presents itself, it
could double the request size with each unit of time until it reaches
the maximum again. Maybe that could work, maybe not. Waiting behind a
4MB request once sounds better than having every 4k metadata seek read
wait behind 4MB each time.
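
As a rough sketch of that idea (toy code, not kernel code; every name
below is invented, and the per-millisecond tick is an assumption), the
degrade/restore logic could look like:

    /* Toy model of the proposed heuristic -- invented names, not kernel APIs. */
    #include <stdbool.h>

    struct rq_size_state {
        unsigned int max_kb;      /* device's maximum request size */
        unsigned int platter_kb;  /* smallest size that still reaches platter speed */
        unsigned int cur_kb;      /* size currently allowed by the scheduler */
    };

    /* Imagined to run once per time unit (say, per millisecond of dispatching). */
    static void update_request_cap(struct rq_size_state *s, bool multiple_queues_active)
    {
        if (multiple_queues_active) {
            /* Competing queues: degrade to the smallest "full speed" size. */
            s->cur_kb = s->platter_kb;
        } else if (s->cur_kb < s->max_kb) {
            /* No competition: double back up toward the device maximum. */
            s->cur_kb *= 2;
            if (s->cur_kb > s->max_kb)
                s->cur_kb = s->max_kb;
        }
    }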

 
01-24-2012, 04:08 PM
Chris Mason

a few storage topics

On Tue, Jan 24, 2012 at 11:56:31AM -0500, Christoph Hellwig wrote:
> On Tue, Jan 24, 2012 at 10:15:04AM -0500, Chris Mason wrote:
> > https://lkml.org/lkml/2011/12/13/326
> >
> > This patch is another example, although for a slight different reason.
> > I really have no idea yet what the right answer is in a generic sense,
> > but you don't need a 512K request to see higher latencies from merging.
>
> That assumes the 512k requests is created by merging. We have enough
> workloads that create large I/O from the get go, and not splitting them
> and eventually merging them again would be a big win. E.g. I'm
> currently looking at a distributed block device which uses internal 4MB
> chunks, and increasing the maximum request size to that dramatically
> increases the read performance.

Is this read latency or read throughput? If you're waiting on the whole 4MB
anyway, I'd expect one request to be better for both. But Andrea's
original question was about the impact of the big request on other requests
being serviced by the drive... there's really not much we can do about
that outside of more knobs for the admin.
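
One knob of that sort already exists: max_sectors_kb in sysfs caps how
large a request the block layer will build for a device. A minimal
sketch of reading and lowering it (the device name sdb and the 128KB cap
are only examples, and it needs root):

    /* Sketch: read and lower max_sectors_kb for one device via sysfs. */
    #include <stdio.h>

    int main(void)
    {
        const char *path = "/sys/block/sdb/queue/max_sectors_kb"; /* example device */
        unsigned int cur;
        FILE *f = fopen(path, "r+");

        if (!f) { perror(path); return 1; }
        if (fscanf(f, "%u", &cur) == 1)
            printf("current max_sectors_kb: %u\n", cur);
        rewind(f);
        fprintf(f, "128\n");    /* cap requests at 128KB, for example */
        fclose(f);
        return 0;
    }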

-chris

 
01-24-2012, 04:08 PM
Andreas Dilger

a few storage topics

On 2012-01-24, at 9:56, Christoph Hellwig <hch@infradead.org> wrote:
> On Tue, Jan 24, 2012 at 10:15:04AM -0500, Chris Mason wrote:
>> https://lkml.org/lkml/2011/12/13/326
>>
>> This patch is another example, although for a slight different reason.
>> I really have no idea yet what the right answer is in a generic sense,
>> but you don't need a 512K request to see higher latencies from merging.
>
> That assumes the 512k requests is created by merging. We have enough
> workloads that create large I/O from the get go, and not splitting them
> and eventually merging them again would be a big win. E.g. I'm
> currently looking at a distributed block device which uses internal 4MB
> chunks, and increasing the maximum request size to that dramatically
> increases the read performance.

(sorry about last email, hit send by accident)

I don't think we can have a "one size fits all" policy here. On most RAID devices the IO size needs to be at least 1MB, and with newer devices 4MB gives better performance.

One of the reasons that Lustre used to hack around the VFS and VM APIs so much is exactly to avoid splitting read/write requests into pages and then depending on the elevator to reconstruct a good-sized IO out of them.

Things have gotten better with newer kernels, but there is still a ways to go w.r.t. allowing large IO requests to pass unhindered through to disk (or at least as far as ensuring that the IO is aligned to the underlying disk geometry).
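
The geometry hints in question are already exported per device under
/sys/block/<dev>/queue/ (minimum_io_size and optimal_io_size, alongside
the max_sectors_kb limits). A small sketch that dumps them, assuming a
device named sdb purely as an example:

    /* Sketch: print the I/O size hints the kernel exports for one device. */
    #include <stdio.h>

    static void show(const char *dev, const char *attr)
    {
        char path[256], buf[64];
        FILE *f;

        snprintf(path, sizeof(path), "/sys/block/%s/queue/%s", dev, attr);
        f = fopen(path, "r");
        if (f && fgets(buf, sizeof(buf), f))
            printf("%-20s %s", attr, buf);
        if (f)
            fclose(f);
    }

    int main(void)
    {
        const char *dev = "sdb";   /* example device name */
        const char *attrs[] = {
            "logical_block_size", "physical_block_size",
            "minimum_io_size", "optimal_io_size",
            "max_sectors_kb", "max_hw_sectors_kb",
        };

        for (unsigned int i = 0; i < sizeof(attrs) / sizeof(attrs[0]); i++)
            show(dev, attrs[i]);
        return 0;
    }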

Cheers, Andreas

 
01-24-2012, 04:12 PM
Jeff Moyer

a few storage topics

Chris Mason <chris.mason@oracle.com> writes:

> On Mon, Jan 23, 2012 at 01:28:08PM -0500, Jeff Moyer wrote:
>> Andrea Arcangeli <aarcange@redhat.com> writes:
>>
>> > On Mon, Jan 23, 2012 at 05:18:57PM +0100, Jan Kara wrote:
>> >> requst granularity. Sure, big requests will take longer to complete but
>> >> maximum request size is relatively low (512k by default) so writing maximum
>> >> sized request isn't that much slower than writing 4k. So it works OK in
>> >> practice.
>> >
>> > Totally unrelated to the writeback, but the merged big 512k requests
>> > actually adds up some measurable I/O scheduler latencies and they in
>> > turn slightly diminish the fairness that cfq could provide with
>> > smaller max request size. Probably even more measurable with SSDs (but
>> > then SSDs are even faster).
>>
>> Are you speaking from experience? If so, what workloads were negatively
>> affected by merging, and how did you measure that?
>
> https://lkml.org/lkml/2011/12/13/326
>
> This patch is another example, although for a slight different reason.
> I really have no idea yet what the right answer is in a generic sense,
> but you don't need a 512K request to see higher latencies from merging.

Well, this patch has almost nothing to do with merging, right? It's about
keeping I/O from the I/O scheduler for too long (or, prior to on-stack
plugging, it was about keeping the queue plugged for too long). And,
I'm pretty sure that the testing involved there was with deadline or
noop, nothing to do with CFQ fairness. ;-)

However, this does bring to light the bigger problem of optimizing for
the underlying storage and the workload requirements. Some tuning can
be done in the I/O scheduler, but the plugging definitely circumvents
that a little bit.

-Jeff

 
01-24-2012, 04:32 PM
Chris Mason

a few storage topics

On Tue, Jan 24, 2012 at 12:12:30PM -0500, Jeff Moyer wrote:
> Chris Mason <chris.mason@oracle.com> writes:
>
> > On Mon, Jan 23, 2012 at 01:28:08PM -0500, Jeff Moyer wrote:
> >> Andrea Arcangeli <aarcange@redhat.com> writes:
> >>
> >> > On Mon, Jan 23, 2012 at 05:18:57PM +0100, Jan Kara wrote:
> >> >> requst granularity. Sure, big requests will take longer to complete but
> >> >> maximum request size is relatively low (512k by default) so writing maximum
> >> >> sized request isn't that much slower than writing 4k. So it works OK in
> >> >> practice.
> >> >
> >> > Totally unrelated to the writeback, but the merged big 512k requests
> >> > actually adds up some measurable I/O scheduler latencies and they in
> >> > turn slightly diminish the fairness that cfq could provide with
> >> > smaller max request size. Probably even more measurable with SSDs (but
> >> > then SSDs are even faster).
> >>
> >> Are you speaking from experience? If so, what workloads were negatively
> >> affected by merging, and how did you measure that?
> >
> > https://lkml.org/lkml/2011/12/13/326
> >
> > This patch is another example, although for a slight different reason.
> > I really have no idea yet what the right answer is in a generic sense,
> > but you don't need a 512K request to see higher latencies from merging.
>
> Well, this patch has almost nothing to with merging, right? It's about
> keeping I/O from the I/O scheduler for too long (or, prior to on-stack
> plugging, it was about keeping the queue plugged for too long). And,
> I'm pretty sure that the testing involved there was with deadline or
> noop, nothing to do with CFQ fairness. ;-)
>
> However, this does bring to light the bigger problem of optimizing for
> the underlying storage and the workload requirements. Some tuning can
> be done in the I/O scheduler, but the plugging definitely circumvents
> that a little bit.

Well, it's merging in the sense that we know with perfect accuracy how
often it happens (all the time) and how big an impact it had on latency.
You're right that it isn't related to fairness because in this workload
the only IO being sent down was these writes, and only one process was
doing it.

I mention it mostly because the numbers go against all common sense (at
least for me). Storage just isn't as predictable anymore.

The benchmarking team later reported the patch improved latencies on all
I/O, not just the log writer. This one box is fairly consistent.

-chris

 
01-24-2012, 04:59 PM
Martin K. Petersen

a few storage topics

>>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:

Mike> 1) expose WRITE SAME via higher level interface (ala
Mike> sb_issue_discard) for more efficient zeroing on SCSI devices
Mike> that support it

I actually thought I had submitted those patches as part of the thin
provisioning update. Looks like I held them back for some reason. I'll
check my notes to figure out why and get the kit merged forward ASAP!
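
For comparison, the discard side that the sb_issue_discard analogy
points at is already reachable from userspace through the BLKDISCARD
ioctl; a WRITE SAME-backed zeroing helper would presumably sit behind a
similar interface. A minimal sketch (the device name and range are
examples only, and this genuinely discards data, so don't aim it at
anything you care about):

    /* Sketch: discard a byte range on a block device with BLKDISCARD. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/fs.h>

    int main(void)
    {
        uint64_t range[2] = { 0, 1ULL << 20 }; /* offset and length in bytes (1MB) */
        int fd = open("/dev/sdb", O_WRONLY);   /* example device -- destroys data! */

        if (fd < 0) { perror("open"); return 1; }
        if (ioctl(fd, BLKDISCARD, &range) < 0)
            perror("BLKDISCARD");
        close(fd);
        return 0;
    }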


Mike> 4) is anyone working on an interface to GET LBA STATUS?
Mike> - Martin Petersen added GET LBA STATUS support to scsi_debug,
Mike> but is there a vision for how tools (e.g. pvmove) could
Mike> access such info in a uniform way across different vendors'
Mike> storage?

I hadn't thought of that use case. Going to be a bit tricky given how
GET LBA STATUS works...
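
Until a uniform kernel interface shows up, the only generic route is the
SCSI passthrough. A hedged sketch of issuing GET LBA STATUS through
SG_IO follows; the CDB and response layout are written from memory of
SBC-3, so double-check them against the spec, and /dev/sdb is just an
example (sg3_utils' sg_get_lba_status does the same thing more
carefully):

    /* Sketch: GET LBA STATUS (SERVICE ACTION IN(16), service action 0x12) via SG_IO. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <scsi/sg.h>

    int main(void)
    {
        unsigned char cdb[16] = { 0x9e, 0x12 }; /* opcode + service action */
        unsigned char resp[4096], sense[32];
        struct sg_io_hdr io;
        uint64_t lba = 0;                       /* starting LBA to query */
        int fd = open("/dev/sdb", O_RDONLY);    /* example device */

        if (fd < 0) { perror("open"); return 1; }

        /* Bytes 2-9: starting LBA (big endian); bytes 10-13: allocation length. */
        for (int i = 0; i < 8; i++)
            cdb[2 + i] = lba >> (56 - 8 * i);
        cdb[10] = sizeof(resp) >> 24;
        cdb[11] = sizeof(resp) >> 16;
        cdb[12] = sizeof(resp) >> 8;
        cdb[13] = sizeof(resp) & 0xff;

        memset(&io, 0, sizeof(io));
        io.interface_id = 'S';
        io.dxfer_direction = SG_DXFER_FROM_DEV;
        io.cmd_len = sizeof(cdb);
        io.cmdp = cdb;
        io.dxfer_len = sizeof(resp);
        io.dxferp = resp;
        io.mx_sb_len = sizeof(sense);
        io.sbp = sense;
        io.timeout = 60000;                     /* milliseconds */

        if (ioctl(fd, SG_IO, &io) < 0) { perror("SG_IO"); return 1; }

        /* First descriptor starts at byte 8: 8-byte LBA, 4-byte block count,
         * then a provisioning status nibble (0 mapped, 1 deallocated, 2 anchored). */
        uint32_t nblocks = ((uint32_t)resp[16] << 24) | (resp[17] << 16) |
                           (resp[18] << 8) | resp[19];
        printf("first extent: %u blocks, status %u\n", nblocks, resp[20] & 0x0f);
        close(fd);
        return 0;
    }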

--
Martin K. Petersen Oracle Linux Engineering

 
01-24-2012, 05:05 PM
Jeff Moyer

a few storage topics

Andreas Dilger <adilger@dilger.ca> writes:

> On 2012-01-24, at 9:56, Christoph Hellwig <hch@infradead.org> wrote:
>> On Tue, Jan 24, 2012 at 10:15:04AM -0500, Chris Mason wrote:
>>> https://lkml.org/lkml/2011/12/13/326
>>>
>>> This patch is another example, although for a slight different reason.
>>> I really have no idea yet what the right answer is in a generic sense,
>>> but you don't need a 512K request to see higher latencies from merging.
>>
>> That assumes the 512k requests is created by merging. We have enough
>> workloads that create large I/O from the get go, and not splitting them
>> and eventually merging them again would be a big win. E.g. I'm
>> currently looking at a distributed block device which uses internal 4MB
>> chunks, and increasing the maximum request size to that dramatically
>> increases the read performance.
>
> (sorry about last email, hit send by accident)
>
> I don't think we can have a "one size fits all" policy here. In most
> RAID devices the IO size needs to be at least 1MB, and with newer
> devices 4MB gives better performance.

Right, and there's more to it than just I/O size. There's access
pattern, and more importantly, workload and related requirements
(latency vs throughput).

> One of the reasons that Lustre used to hack so much around the VFS and
> VM APIs is exactly to avoid the splitting of read/write requests into
> pages and then depend on the elevator to reconstruct a good-sized IO
> out of it.
>
> Things have gotten better with newer kernels, but there is still a
> ways to go w.r.t. allowing large IO requests to pass unhindered
> through to disk (or at least as far as enduring that the IO is aligned
> to the underlying disk geometry).

I've been wondering whether it's gotten better, so I decided to run a few
quick tests.

Kernel version 3.2.0, storage: HP EVA FC array, I/O scheduler: cfq,
max_sectors_kb: 1024, test program: dd

ext3:
- buffered writes and buffered O_SYNC writes, all 1MB block size show 4k
I/Os passed down to the I/O scheduler
- buffered 1MB reads are a little better, typically in the 128k-256k
range when they hit the I/O scheduler.

ext4:
- buffered writes: 512K I/Os show up at the elevator
- buffered O_SYNC writes: data is again 512KB, journal writes are 4K
- buffered 1MB reads get down to the scheduler in 128KB chunks

xfs:
- buffered writes: 1MB I/Os show up at the elevator
- buffered O_SYNC writes: 1MB I/Os
- buffered 1MB reads: 128KB chunks show up at the I/O scheduler

So, ext4 is doing better than ext3, but still not perfect. xfs is
kicking ass for writes, but reads are still split up.
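
For anyone who wants to reproduce the write half of this, it is just
large sequential writes through the page cache, with O_SYNC for the
second case; a rough C equivalent of the dd runs is below (the file name
and total size are assumptions, and the request sizes reaching the
elevator still have to be observed with blktrace or similar):

    /* Sketch: 1MB buffered O_SYNC writes, roughly dd bs=1M oflag=sync. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t bs = 1024 * 1024;  /* 1MB writes */
        const int count = 256;          /* 256MB total -- arbitrary */
        char *buf = malloc(bs);
        /* Drop O_SYNC for the plain buffered-write case. */
        int fd = open("/mnt/test/file", O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0644);

        if (!buf || fd < 0) { perror("setup"); return 1; }
        memset(buf, 'a', bs);
        for (int i = 0; i < count; i++)
            if (write(fd, buf, bs) != (ssize_t)bs) { perror("write"); return 1; }
        close(fd);
        free(buf);
        return 0;
    }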

Cheers,
Jeff

 
01-24-2012, 05:14 PM
Jeff Moyer

a few storage topics

Chris Mason <chris.mason@oracle.com> writes:

> On Tue, Jan 24, 2012 at 12:12:30PM -0500, Jeff Moyer wrote:
>> Chris Mason <chris.mason@oracle.com> writes:
>>
>> > On Mon, Jan 23, 2012 at 01:28:08PM -0500, Jeff Moyer wrote:
>> >> Andrea Arcangeli <aarcange@redhat.com> writes:
>> >>
>> >> > On Mon, Jan 23, 2012 at 05:18:57PM +0100, Jan Kara wrote:
>> >> >> requst granularity. Sure, big requests will take longer to complete but
>> >> >> maximum request size is relatively low (512k by default) so writing maximum
>> >> >> sized request isn't that much slower than writing 4k. So it works OK in
>> >> >> practice.
>> >> >
>> >> > Totally unrelated to the writeback, but the merged big 512k requests
>> >> > actually adds up some measurable I/O scheduler latencies and they in
>> >> > turn slightly diminish the fairness that cfq could provide with
>> >> > smaller max request size. Probably even more measurable with SSDs (but
>> >> > then SSDs are even faster).
>> >>
>> >> Are you speaking from experience? If so, what workloads were negatively
>> >> affected by merging, and how did you measure that?
>> >
>> > https://lkml.org/lkml/2011/12/13/326
>> >
>> > This patch is another example, although for a slight different reason.
>> > I really have no idea yet what the right answer is in a generic sense,
>> > but you don't need a 512K request to see higher latencies from merging.
>>
>> Well, this patch has almost nothing to with merging, right? It's about
>> keeping I/O from the I/O scheduler for too long (or, prior to on-stack
>> plugging, it was about keeping the queue plugged for too long). And,
>> I'm pretty sure that the testing involved there was with deadline or
>> noop, nothing to do with CFQ fairness. ;-)
>>
>> However, this does bring to light the bigger problem of optimizing for
>> the underlying storage and the workload requirements. Some tuning can
>> be done in the I/O scheduler, but the plugging definitely circumvents
>> that a little bit.
>
> Well, its merging in the sense that we know with perfect accuracy how
> often it happens (all the time) and how big an impact it had on latency.
> You're right that it isn't related to fairness because in this workload
> the only IO being sent down was these writes, and only one process was
> doing it.
>
> I mention it mostly because the numbers go against all common sense (at
> least for me). Storage just isn't as predictable anymore.

Right, strange that we saw an improvement with the patch even on FC
storage. So, it's not just fast SSDs that benefit.

> The benchmarking team later reported the patch improved latencies on all
> io, not just the log writer. This one box is fairly consistent.

We've been running tests with that patch as well, and I've yet to find a
downside. I haven't yet run the original synthetic workload, since I
wanted real-world data first. It's on my list to keep poking at it. I
haven't yet run against really slow storage, either, which I expect to
show some regression with the patch.

Cheers,
Jeff

 
01-24-2012, 05:40 PM
Christoph Hellwig

a few storage topics

On Tue, Jan 24, 2012 at 01:05:50PM -0500, Jeff Moyer wrote:
> - buffered writes and buffered O_SYNC writes, all 1MB block size show 4k
> I/Os passed down to the I/O scheduler
> - buffered 1MB reads are a little better, typically in the 128k-256k
> range when they hit the I/O scheduler.
>
> ext4:
> - buffered writes: 512K I/Os show up at the elevator
> - buffered O_SYNC writes: data is again 512KB, journal writes are 4K
> - buffered 1MB reads get down to the scheduler in 128KB chunks
>
> xfs:
> - buffered writes: 1MB I/Os show up at the elevator
> - buffered O_SYNC writes: 1MB I/Os
> - buffered 1MB reads: 128KB chunks show up at the I/O scheduler
>
> So, ext4 is doing better than ext3, but still not perfect. xfs is
> kicking ass for writes, but reads are still split up.

All three filesystems use the generic mpages code for reads, so they
all get the same (bad) I/O patterns. Looks like we need to fix this up
ASAP.
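
One variable worth isolating here, if the 128KB read chunks track the
default readahead window (read_ahead_kb is 128 on most setups) rather
than the mpages code itself -- that is an assumption, not a confirmed
cause: the mpage read path only gets to build BIOs as large as readahead
hands it, so widening the window per file is a cheap way to test the
theory. A sketch (the file path is an example; on Linux the
POSIX_FADV_SEQUENTIAL hint roughly doubles the per-file readahead
window):

    /* Sketch: sequential 1MB reads after asking for a larger readahead window. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t bs = 1024 * 1024;
        char *buf = malloc(bs);
        int fd = open("/mnt/test/file", O_RDONLY);  /* example path */
        ssize_t n;

        if (!buf || fd < 0) { perror("setup"); return 1; }
        /* Hint sequential access; on Linux this widens this file's readahead. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
        while ((n = read(fd, buf, bs)) > 0)
            ;
        close(fd);
        free(buf);
        return 0;
    }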

 
01-24-2012, 06:07 PM
Chris Mason

a few storage topics

On Tue, Jan 24, 2012 at 01:40:54PM -0500, Christoph Hellwig wrote:
> On Tue, Jan 24, 2012 at 01:05:50PM -0500, Jeff Moyer wrote:
> > - buffered writes and buffered O_SYNC writes, all 1MB block size show 4k
> > I/Os passed down to the I/O scheduler
> > - buffered 1MB reads are a little better, typically in the 128k-256k
> > range when they hit the I/O scheduler.
> >
> > ext4:
> > - buffered writes: 512K I/Os show up at the elevator
> > - buffered O_SYNC writes: data is again 512KB, journal writes are 4K
> > - buffered 1MB reads get down to the scheduler in 128KB chunks
> >
> > xfs:
> > - buffered writes: 1MB I/Os show up at the elevator
> > - buffered O_SYNC writes: 1MB I/Os
> > - buffered 1MB reads: 128KB chunks show up at the I/O scheduler
> >
> > So, ext4 is doing better than ext3, but still not perfect. xfs is
> > kicking ass for writes, but reads are still split up.
>
> All three filesystems use the generic mpages code for reads, so they
> all get the same (bad) I/O patterns. Looks like we need to fix this up
> ASAP.

Can you easily run btrfs through the same rig? We don't use mpages and
I'm curious.

-chris

 
