03-24-2009, 10:02 AM
"John A. Sullivan III"

Shell Scripts or Arbitrary Priority Callouts?

On Tue, 2009-03-24 at 09:39 +0200, Pasi Kärkkäinen wrote:
> On Mon, Mar 23, 2009 at 05:46:36AM -0400, John A. Sullivan III wrote:
> > On Sun, 2009-03-22 at 17:27 +0200, Pasi Kärkkäinen wrote:
> > > On Fri, Mar 20, 2009 at 06:01:23AM -0400, John A. Sullivan III wrote:
> > > > >
> > > > > John:
> > > > >
> > > > > Thanks for the reply.
> > > > >
> > > > > I ended up writing a small C program to do the priority computation for me.
> > > > >
> > > > > I have two sets of FC-AL shelves attached to two dual-channel Qlogic
> > > > > cards. That gives me two paths to each disk. I have about 56 spindles
> > > > > in the current configuration, and am tying them together with md
> > > > > software raid.
> > > > >
> > > > > Now, even though each disk says it handles concurrent I/O on each
> > > > > port, my testing indicates that throughput suffers when using multibus
> > > > > by about 1/2 (from ~60 MB/sec sustained I/O with failover to 35 MB/sec
> > > > > when using multibus).
> > > > >
> > > > > However, with failover, I am effectively using only one channel on
> > > > > each card. With my custom priority callout, I more or less match the
> > > > > disks with even numbers to the even numbered scsi channels with a
> > > > > higher priority. Same with the odd numbered disks and odd numbered
> > > > > channels. The odds are 2ndary on even and vice versa. It seems to work
> > > > > rather well, and appears to spread the load nicely.
> > > > >
> > > > > Thanks again for your help!
> > > > >
> > > > I'm really glad you brought up the performance problem. I had posted
> > > > about it a few days ago but it seems to have gotten lost. We are really
> > > > struggling with performance issues when attempting to combine multiple
> > > > paths (in the case of multipath to one big target) or targets (in the
> > > > case of software RAID0 across several targets) rather than using, in
> > > > effect, JBODs. In our case, we are using iSCSI.
> > > >
> > > > Like you, we found that using multibus caused almost a linear drop in
> > > > performance. Round robin across two paths was half as much as aggregate
> > > > throughput to two separate disks, four paths, one fourth.
> > > >
> > > > We also tried striping across the targets with software RAID0 combined
> > > > with failover multipath - roughly the same effect.
> > > >
> > > > We really don't want to be forced to treat SAN attached disks as
> > > > JBODs. Has anyone cracked this problem of using them in either multibus
> > > > or RAID0 so we can present them as a single device to the OS and still
> > > > load balance multiple paths? This is a HUGE problem for us so any help
> > > > is greatly appreciated. Thanks - John
> > >
> > > Hello.
> > >
> > > Hmm.. just a guess, but could this be related to the fact that if your paths
> > > to the storage are different iSCSI sessions (open-iscsi _doesn't_ support
> > > multiple connections per session aka MC/s), then there is a separate SCSI
> > > command queue per path.. and if SCSI requests are split across those queues
> > > they can get out-of-order and that causes performance drop?
> > >
> > > See:
> > > http://www.nabble.com/round-robin-with-vmware-initiator-and-iscsi-target-td21958346.html
> > >
> > > Especially the reply from Ross (CC). Maybe he has some comments
> > >
> > > -- Pasi
> > <snip>
> > I'm trying to spend a little time on this today and am really feeling my
> > ignorance on the way iSCSI works. It looks like linux-iscsi supports
> > MC/S but has not been in active development and will not even compile on
> > my 2.6.27 kernel.
> >
> > To simplify matters, I did put each SAN interface on a separate network.
> > Thus, all the different sessions. If I place them all on the same
> > network and use the iface parameters of open-iscsi, does that eliminate
> > the out-of-order problem and allow me to achieve the performance
> > scalability I'm seeking from dm-multipath in multibus mode? Thanks -
> > John
>
> If you use ifaces feature of open-iscsi, you still get separate sessions.
>
> open-iscsi just does not support MC/s
>
> I think core-iscsi does support MC/s..
>
> Then again, you should play with the different multipath settings and
> tweak how often IOs are split across the paths etc.. maybe that helps.
>
> -- Pasi
<snip>
I think we're pretty much at the end of our options here, but I'll document
what I've found thus far for closure.

Indeed, there seems to be no way around the session problem. Core-iscsi
does seem to support MC/s but has not been updated in years. It did not
compile with my 2.6.27 kernel and, given that others seem to have had
the same problem, I did not spend a lot of time troubleshooting it.

We did play with the multipath rr_min_io settings and smaller always
seemed to be better until we got into very large numbers of sessions. We
were testing on a dual quad core AMD Shanghai 2378 system with 32 GB
RAM, a quad port Intel e1000 card and two on-board nvidia forcedeth
ports with disktest using 4K blocks to mimic the file system using
sequential reads (and some sequential writes).
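To make the effect of rr_min_io concrete, here is a simplified model of round-robin path selection: a path receives rr_min_io consecutive I/Os before the selector moves to the next path. This is only an illustration of the idea, not dm-multipath's actual implementation, and the function and path names are made up.

```python
from itertools import cycle
from typing import List


def assign_paths(num_ios: int, paths: List[str], rr_min_io: int) -> List[str]:
    """Send rr_min_io consecutive I/Os down one path, then switch to the next."""
    if rr_min_io < 1:
        raise ValueError("rr_min_io must be at least 1")
    path_cycle = cycle(paths)
    current = next(path_cycle)
    assignments = []
    for i in range(num_ios):
        # Move to the next path once the current one has had rr_min_io I/Os.
        if i > 0 and i % rr_min_io == 0:
            current = next(path_cycle)
        assignments.append(current)
    return assignments


if __name__ == "__main__":
    paths = ["path0", "path1", "path2", "path3"]  # hypothetical path names
    for rr in (1, 10, 100):
        used = sorted(set(assign_paths(40, paths, rr)))
        print(f"rr_min_io={rr}: 40 I/Os touched {used}")
```

In this model a short burst of I/O only reaches every path when rr_min_io is small, which may be part of why small values did better at low thread counts.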

With a single thread, there was no difference at all - only about 12.79
MB/s no matter what we did. With 10 threads and only two interfaces,
there was only a slight difference between rr=1 (81.2 MB/s), rr=10 (78.87)
and rr=100 (80).

However, when we opened to three and four interfaces, there was a huge
jump for rr=1 (100.4, 105.95) versus rr=10 (80.5, 80.75) and rr=100
(74.3, 77.6).

At 100 threads on three or four ports, the best performance shifted to
rr=10 (327 MB/s, 335) rather than rr=1 (291.7, 290.1) or rr=100 (216.3).
At 400 threads, rr=100 started to overtake rr=10 slightly.

This was using all e1000 interfaces. Our first four port test included
one of the on board ports and performance was dramatically less than
three e1000 ports. Subsequent testing tweaking forcedeth parameters
from defaults yielded no improvement.

After solving the I/O scheduler problem, dm RAID0 behaved better. It
still did not give us anywhere near a fourfold increase (four disks on
four separate ports) but only marginal improvement (14.3 MB/s) using c=8
(to fit into a jumbo packet, match the zvol block size on the back end
and be two block sizes). It did, however, give the best balance of
performance being just slightly slower than rr=1 at 10 threads and
slightly slower than rr=10 at 100 threads though not scaling as well to
400 threads.
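The following sketch shows how a plain RAID0 layout maps a byte offset to a member disk, using the chunk size of 8 KB and four disks mentioned above. It assumes the textbook striping formula and is not taken from the dm or md source; the function name is hypothetical.

```python
from typing import Tuple


def stripe_target(offset_bytes: int, chunk_bytes: int, num_disks: int) -> Tuple[int, int]:
    """Map a logical byte offset to (disk index, byte offset on that disk)."""
    chunk_index = offset_bytes // chunk_bytes
    disk = chunk_index % num_disks
    stripe_row = chunk_index // num_disks
    disk_offset = stripe_row * chunk_bytes + offset_bytes % chunk_bytes
    return disk, disk_offset


if __name__ == "__main__":
    chunk = 8 * 1024   # 8 KB chunk, as in the test described above
    block = 4 * 1024   # 4 KB requests, as used with disktest
    for n in range(8):
        disk, disk_offset = stripe_target(n * block, chunk, 4)
        print(f"request {n}: disk {disk}, offset {disk_offset}")
```

Under this layout, a single thread issuing one 4 KB request at a time only ever has one disk busy, so striping alone would not be expected to raise single-thread throughput.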

Thus, collective throughput is acceptable but individual throughput is
still awful.

Thanks, all - John
--
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
jsullivan@opensourcedevel.com

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
03-24-2009, 10:57 AM
Pasi Kärkkäinen

Shell Scripts or Arbitrary Priority Callouts?

On Tue, Mar 24, 2009 at 07:02:41AM -0400, John A. Sullivan III wrote:
> > > <snip>
> > > I'm trying to spend a little time on this today and am really feeling my
> > > ignorance on the way iSCSI works It looks like linux-iscsi supports
> > > MC/S but has not been in active development and will not even compile on
> > > my 2.6.27 kernel.
> > >
> > > To simplify matters, I did put each SAN interface on a separate network.
> > > Thus, all the different sessions. If I place them all on the same
> > > network and use the iface parameters of open-iscsi, does that eliminate
> > > the out-of-order problem and allow me to achieve the performance
> > > scalability I'm seeking from dm-multipath in multibus mode? Thanks -
> > > John
> >
> > If you use ifaces feature of open-iscsi, you still get separate sessions.
> >
> > open-iscsi just does not support MC/s
> >
> > I think core-iscsi does support MC/s..
> >
> > Then you again you should play with the different multipath settings, and
> > tweak how often IOs are split to different paths etc.. maybe that helps.
> >
> > -- Pasi
> <snip>
> I think we're pretty much at the end of our options here but I document
> what I've found thus far for closure.
>
> Indeed, there seems to be no way around the session problem. Core-iscsi
> does seem to support MC/s but has not been updated in years. It did not
> compile with my 2.6.27 kernel and, given that others seem to have had
> the same problem, I did not spend a lot of time troubleshooting it.
>

The core-iscsi developer seems to be actively developing at least the
new iSCSI target (LIO target).. I think he has been testing it with
core-iscsi, so maybe there's a newer version somewhere?

> We did play with the multipath rr_min_io settings and smaller always
> seemed to be better until we got into very large numbers of session. We
> were testing on a dual quad core AMD Shanghai 2378 system with 32 GB
> RAM, a quad port Intel e1000 card and two on-board nvidia forcedeth
> ports with disktest using 4K blocks to mimic the file system using
> sequential reads (and some sequential writes).
>

Nice hardware. Btw are you using jumbo frames or flow control for iSCSI
traffic?

> With a single thread, there was no difference at all - only about 12.79
> MB/s no matter what we did. With 10 threads and only two interfaces,
> there was only a slight difference between rr=1 (81.2 MB/s), rr=10 (78.87)
> and rr=100 (80).
>
> However, when we opened to three and four interfaces, there was a huge
> jump for rr=1 (100.4, 105.95) versus rr=10 (80.5, 80.75) and rr=100
> (74.3, 77.6).
>
> At 100 threads on three or four ports, the best performance shifted to
> rr=10 (327 MB/s, 335) rather than rr=1 (291.7, 290.1) or rr=100 (216.3).
> At 400 threads, rr=100 started to overtake rr=10 slightly.
>
> This was using all e1000 interfaces. Our first four port test included
> one of the on board ports and performance was dramatically less than
> three e1000 ports. Subsequent testing tweaking forcedeth parameters
> from defaults yielded no improvement.
>
> After solving the I/O scheduler problem, dm RAID0 behaved better. It
> still did not give us anywhere near a fourfold increase (four disks on
> four separate ports) but only marginal improvement (14.3 MB/s) using c=8
> (to fit into a jumbo packet, match the zvol block size on the back end
> and be two block sizes). It did, however, give the best balance of
> performance being just slightly slower than rr=1 at 10 threads and
> slightly slower than rr=10 at 100 threads though not scaling as well to
> 400 threads.
>

When you used dm RAID0 you didn't have any multipath configuration, right?

What kind of stripe size and other settings did you have for RAID0?

What kind of performance do you get using just a single iscsi session (and
thus just a single path), no multipathing, no DM RAID0 ? Just a filesystem
directly on top of the iscsi /dev/sd? device.

> Thus, collective throughput is acceptable but individual throughput is
> still awful.
>

Sounds like there's some other problem if individual throughput is bad? Or did
you mean performance with a single disktest IO thread is bad, but with multiple
disktest threads it's good.. that would make more sense.

-- Pasi

 
03-24-2009, 11:21 AM
"John A. Sullivan III"

Shell Scripts or Arbitrary Priority Callouts?

On Tue, 2009-03-24 at 13:57 +0200, Pasi Kärkkäinen wrote:
> On Tue, Mar 24, 2009 at 07:02:41AM -0400, John A. Sullivan III wrote:
> > > > <snip>
> > > > I'm trying to spend a little time on this today and am really feeling my
> > > > ignorance on the way iSCSI works It looks like linux-iscsi supports
> > > > MC/S but has not been in active development and will not even compile on
> > > > my 2.6.27 kernel.
> > > >
> > > > To simplify matters, I did put each SAN interface on a separate network.
> > > > Thus, all the different sessions. If I place them all on the same
> > > > network and use the iface parameters of open-iscsi, does that eliminate
> > > > the out-of-order problem and allow me to achieve the performance
> > > > scalability I'm seeking from dm-multipath in multibus mode? Thanks -
> > > > John
> > >
> > > If you use ifaces feature of open-iscsi, you still get separate sessions.
> > >
> > > open-iscsi just does not support MC/s
> > >
> > > I think core-iscsi does support MC/s..
> > >
> > > Then you again you should play with the different multipath settings, and
> > > tweak how often IOs are split to different paths etc.. maybe that helps.
> > >
> > > -- Pasi
> > <snip>
> > I think we're pretty much at the end of our options here but I document
> > what I've found thus far for closure.
> >
> > Indeed, there seems to be no way around the session problem. Core-iscsi
> > does seem to support MC/s but has not been updated in years. It did not
> > compile with my 2.6.27 kernel and, given that others seem to have had
> > the same problem, I did not spend a lot of time troubleshooting it.
> >
>
> Core-iscsi developer seems to be active developing at least the
> new iSCSI target (LIO target).. I think he has been testing it with
> core-iscsi, so maybe there's newer version somewhere?
>
> > We did play with the multipath rr_min_io settings and smaller always
> > seemed to be better until we got into very large numbers of session. We
> > were testing on a dual quad core AMD Shanghai 2378 system with 32 GB
> > RAM, a quad port Intel e1000 card and two on-board nvidia forcedeth
> > ports with disktest using 4K blocks to mimic the file system using
> > sequential reads (and some sequential writes).
> >
>
> Nice hardware. Btw are you using jumbo frames or flow control for iSCSI
> traffic?
>
> > With a single thread, there was no difference at all - only about 12.79
> > MB/s no matter what we did. With 10 threads and only two interfaces,
> > there was only a slight difference between rr=1 (81.2 MB/s), rr=10 (78.87)
> > and rr=100 (80).
> >
> > However, when we opened to three and four interfaces, there was a huge
> > jump for rr=1 (100.4, 105.95) versus rr=10 (80.5, 80.75) and rr=100
> > (74.3, 77.6).
> >
> > At 100 threads on three or four ports, the best performance shifted to
> > rr=10 (327 MB/s, 335) rather than rr=1 (291.7, 290.1) or rr=100 (216.3).
> > At 400 threads, rr=100 started to overtake rr=10 slightly.
> >
> > This was using all e1000 interfaces. Our first four port test included
> > one of the on board ports and performance was dramatically less than
> > three e1000 ports. Subsequent testing tweaking forcedeth parameters
> > from defaults yielded no improvement.
> >
> > After solving the I/O scheduler problem, dm RAID0 behaved better. It
> > still did not give us anywhere near a fourfold increase (four disks on
> > four separate ports) but only marginal improvement (14.3 MB/s) using c=8
> > (to fit into a jumbo packet, match the zvol block size on the back end
> > and be two block sizes). It did, however, give the best balance of
> > performance being just slightly slower than rr=1 at 10 threads and
> > slightly slower than rr=10 at 100 threads though not scaling as well to
> > 400 threads.
> >
>
> When you used dm RAID0 you didn't have any multipath configuration, right?
Correct although we also did test successfully with multipath in
failover mode and RAID0.
>
> What kind of stripe size and other settings you had for RAID0?
Chunk size was 8KB with four disks.
>
> What kind of performance do you get using just a single iscsi session (and
> thus just a single path), no multipathing, no DM RAID0 ? Just a filesystem
> directly on top of the iscsi /dev/sd? device.
Miserable - same roughly 12 MB/s.
>
> > Thus, collective throughput is acceptable but individual throughput is
> > still awful.
> >
>
> Sounds like there's some other problem if invidual throughput is bad? Or did
> you mean performance with a single disktest IO thread is bad, but using multiple
> disktest threads it's good.. that would make more sense
Yes, the latter. Single thread (I assume mimicking a single disk
operation, e.g., copying a large file) is miserable - much slower than
local disk despite the availability of huge bandwidth. We start
utilizing the bandwidth when multiplying concurrent disk activity into
the hundreds.

I am guessing the single thread performance problem is an open-iscsi
issue but I was hoping multipath would help us work around it by
utilizing multiple sessions per disk operation. I suppose that is where
we run into the command ordering problem unless there is something else
afoot. Thanks - John
<snip>
 
03-24-2009, 02:01 PM
Pasi Kärkkäinen

Shell Scripts or Arbitrary Priority Callouts?

On Tue, Mar 24, 2009 at 08:21:45AM -0400, John A. Sullivan III wrote:
> >
> > Core-iscsi developer seems to be active developing at least the
> > new iSCSI target (LIO target).. I think he has been testing it with
> > core-iscsi, so maybe there's newer version somewhere?
> >
> > > We did play with the multipath rr_min_io settings and smaller always
> > > seemed to be better until we got into very large numbers of session. We
> > > were testing on a dual quad core AMD Shanghai 2378 system with 32 GB
> > > RAM, a quad port Intel e1000 card and two on-board nvidia forcedeth
> > > ports with disktest using 4K blocks to mimic the file system using
> > > sequential reads (and some sequential writes).
> > >
> >
> > Nice hardware. Btw are you using jumbo frames or flow control for iSCSI
> > traffic?
> >

Dunno if you noticed this..


> > >
> >
> > When you used dm RAID0 you didn't have any multipath configuration, right?
> Correct although we also did test successfully with multipath in
> failover mode and RAID0.
> >

OK.

> > What kind of stripe size and other settings you had for RAID0?
> Chunk size was 8KB with four disks.
> >

Did you try much bigger sizes, e.g. 128 kB?

> > What kind of performance do you get using just a single iscsi session (and
> > thus just a single path), no multipathing, no DM RAID0 ? Just a filesystem
> > directly on top of the iscsi /dev/sd? device.
> Miserable - same roughly 12 MB/s.

OK, here's your problem. Was this reads or writes, btw? Did you tune the
readahead settings?

Can you paste your iSCSI session settings negotiated with the target?

> >
> > Sounds like there's some other problem if invidual throughput is bad? Or did
> > you mean performance with a single disktest IO thread is bad, but using multiple
> > disktest threads it's good.. that would make more sense
> Yes, the latter. Single thread (I assume mimicking a single disk
> operation, e.g., copying a large file) is miserable - much slower than
> local disk despite the availability of huge bandwidth. We start
> utilizing the bandwidth when multiplying concurrent disk activity into
> the hundreds.
>
> I am guessing the single thread performance problem is an open-iscsi
> issue but I was hoping multipath would help us work around it by
> utilizing multiple sessions per disk operation. I suppose that is where
> we run into the command ordering problem unless there is something else
> afoot. Thanks - John

You should be able to get many times the throughput you get now.. just with
a single path/session.

What kind of latency do you have from the initiator to the target/storage?

Try with for example 4 kB ping:
ping -s 4096 <ip_of_the_iscsi_target>

1000ms divided by the roundtrip you get from ping should give you maximum
possible IOPS using a single path..

4 kB * IOPS == max bandwidth you can achieve.
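The rule of thumb above can be written as a small calculation. This is only a sketch of that arithmetic: the function name is made up, the 0.5 ms round trip is an arbitrary example value rather than a measurement, and it assumes exactly one I/O in flight at a time with decimal units (1 MB = 1000 kB).

```python
def single_outstanding_io_limit(rtt_ms: float, io_size_kb: float = 4.0):
    """Return (max IOPS, max MB/s) when only one I/O is in flight at a time."""
    if rtt_ms <= 0:
        raise ValueError("rtt_ms must be positive")
    iops = 1000.0 / rtt_ms                 # 1000 ms divided by the round trip
    mb_per_s = iops * io_size_kb / 1000.0  # kB/s converted to MB/s
    return iops, mb_per_s


if __name__ == "__main__":
    # Example value only: substitute the average rtt reported by ping.
    iops, mb_per_s = single_outstanding_io_limit(rtt_ms=0.5)
    print(f"{iops:.0f} IOPS, {mb_per_s:.1f} MB/s")
```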

-- Pasi

 
03-24-2009, 02:09 PM
Pasi Kärkkäinen

Shell Scripts or Arbitrary Priority Callouts?

On Tue, Mar 24, 2009 at 05:01:04PM +0200, Pasi Kärkkäinen wrote:
> On Tue, Mar 24, 2009 at 08:21:45AM -0400, John A. Sullivan III wrote:
> > >
> > > Core-iscsi developer seems to be active developing at least the
> > > new iSCSI target (LIO target).. I think he has been testing it with
> > > core-iscsi, so maybe there's newer version somewhere?
> > >
> > > > We did play with the multipath rr_min_io settings and smaller always
> > > > seemed to be better until we got into very large numbers of session. We
> > > > were testing on a dual quad core AMD Shanghai 2378 system with 32 GB
> > > > RAM, a quad port Intel e1000 card and two on-board nvidia forcedeth
> > > > ports with disktest using 4K blocks to mimic the file system using
> > > > sequential reads (and some sequential writes).
> > > >
> > >
> > > Nice hardware. Btw are you using jumbo frames or flow control for iSCSI
> > > traffic?
> > >
>
> Dunno if you noticed this..
>
>
> > > >
> > >
> > > When you used dm RAID0 you didn't have any multipath configuration, right?
> > Correct although we also did test successfully with multipath in
> > failover mode and RAID0.
> > >
>
> OK.
>
> > > What kind of stripe size and other settings you had for RAID0?
> > Chunk size was 8KB with four disks.
> > >
>
> Did you try with much bigger sizes.. 128 kB ?
>
> > > What kind of performance do you get using just a single iscsi session (and
> > > thus just a single path), no multipathing, no DM RAID0 ? Just a filesystem
> > > directly on top of the iscsi /dev/sd? device.
> > Miserable - same roughly 12 MB/s.
>
> OK, Here's your problem. Was this btw reads or writes? Did you tune
> readahead-settings?
>
> Can paste your iSCSI session settings negotiated with the target?
>
> > >
> > > Sounds like there's some other problem if invidual throughput is bad? Or did
> > > you mean performance with a single disktest IO thread is bad, but using multiple
> > > disktest threads it's good.. that would make more sense
> > Yes, the latter. Single thread (I assume mimicking a single disk
> > operation, e.g., copying a large file) is miserable - much slower than
> > local disk despite the availability of huge bandwidth. We start
> > utilizing the bandwidth when multiplying concurrent disk activity into
> > the hundreds.
> >
> > I am guessing the single thread performance problem is an open-iscsi
> > issue but I was hoping multipath would help us work around it by
> > utilizing multiple sessions per disk operation. I suppose that is where
> > we run into the command ordering problem unless there is something else
> > afoot. Thanks - John
>
> You should be able to get many times the throughput you get now.. just with
> a single path/session.
>
> What kind of latency do you have from the initiator to the target/storage?
>
> Try with for example 4 kB ping:
> ping -s 4096 <ip_of_the_iscsi_target>
>
> 1000ms divided by the roundtrip you get from ping should give you maximum
> possible IOPS using a single path..
>
> 4 kB * IOPS == max bandwidth you can achieve.
>

Maybe I should have been clearer about that.. assuming you're measuring
4 kB IOs with disktest, and you have 1 outstanding IO at a time, then the
above is the max throughput you can get.

Higher block/IO size and a higher number of outstanding IOs will give you
better throughput.

I think CFQ disk elevator/scheduler has a bug that prevents queue depths
bigger than 1 outstanding IO.. so don't use that. "noop" might be a good idea.
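As a rough illustration of why larger I/Os and more outstanding I/Os help, the latency bound can be scaled by both. This is an idealised sketch: it assumes the round trip does not grow with I/O size or queue depth, which will not hold exactly in practice. The function name, the example values and the 125 MB/s cap (the raw rate of a 1 Gbit/s link) are illustrative assumptions.

```python
from typing import Optional


def estimated_throughput_mb_per_s(
    rtt_ms: float,
    io_size_kb: float,
    outstanding_ios: int = 1,
    link_limit_mb_per_s: Optional[float] = None,
) -> float:
    """Idealised latency-bound throughput, optionally capped by the link rate."""
    if rtt_ms <= 0 or outstanding_ios < 1:
        raise ValueError("rtt_ms must be positive and outstanding_ios >= 1")
    iops = outstanding_ios * 1000.0 / rtt_ms
    mb_per_s = iops * io_size_kb / 1000.0
    if link_limit_mb_per_s is not None:
        mb_per_s = min(mb_per_s, link_limit_mb_per_s)
    return mb_per_s


if __name__ == "__main__":
    # Example values only: 0.5 ms round trip, gigabit link capped at 125 MB/s.
    for size_kb, depth in [(4, 1), (4, 8), (64, 1)]:
        mb = estimated_throughput_mb_per_s(0.5, size_kb, depth, 125.0)
        print(f"{size_kb} kB x {depth} outstanding: {mb:.1f} MB/s")
```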

-- Pasi

 
03-24-2009, 02:43 PM
"John A. Sullivan III"

Shell Scripts or Arbitrary Priority Callouts?

I greatly appreciate the help. I'll answer in the thread below as well
as consolidating answers to the questions posed in your other email.

On Tue, 2009-03-24 at 17:01 +0200, Pasi Kärkkäinen wrote:
> On Tue, Mar 24, 2009 at 08:21:45AM -0400, John A. Sullivan III wrote:
> > >
> > > Core-iscsi developer seems to be active developing at least the
> > > new iSCSI target (LIO target).. I think he has been testing it with
> > > core-iscsi, so maybe there's newer version somewhere?
> > >
> > > > We did play with the multipath rr_min_io settings and smaller always
> > > > seemed to be better until we got into very large numbers of session. We
> > > > were testing on a dual quad core AMD Shanghai 2378 system with 32 GB
> > > > RAM, a quad port Intel e1000 card and two on-board nvidia forcedeth
> > > > ports with disktest using 4K blocks to mimic the file system using
> > > > sequential reads (and some sequential writes).
> > > >
> > >
> > > Nice hardware. Btw are you using jumbo frames or flow control for iSCSI
> > > traffic?
> > >
>
> Dunno if you noticed this..
We are actually quite enthusiastic about the environment and the
project. We hope to have many of these hosting about 400 VServer guests
running virtual desktops from the X2Go project. It's not my project but
I don't mind plugging them as I think it is a great technology.

We are using jumbo frames. The ProCurve 2810 switches explicitly state
to NOT use flow control and jumbo frames simultaneously. We tried it
anyway but with poor results.
>
>
> > > >
> > >
> > > When you used dm RAID0 you didn't have any multipath configuration, right?
> > Correct although we also did test successfully with multipath in
> > failover mode and RAID0.
> > >
>
> OK.
>
> > > What kind of stripe size and other settings you had for RAID0?
> > Chunk size was 8KB with four disks.
> > >
>
> Did you try with much bigger sizes.. 128 kB ?
We tried slightly larger sizes - 16KB and 32KB I believe and observed
performance degradation. In fact, in some scenarios 4KB chunk sizes
gave us better performance than 8KB.
>
> > > What kind of performance do you get using just a single iscsi session (and
> > > thus just a single path), no multipathing, no DM RAID0 ? Just a filesystem
> > > directly on top of the iscsi /dev/sd? device.
> > Miserable - same roughly 12 MB/s.
>
> OK, Here's your problem. Was this btw reads or writes? Did you tune
> readahead-settings?
12MBps is sequential reading but sequential writing is not much
different. We did tweak readahead to 1024. We did not want to go much
larger in order to maintain balance with the various data patterns -
some of which are random and some of which may not read linearly.
>
> Can paste your iSCSI session settings negotiated with the target?
Pardon my ignorance but, other than packet traces, how do I show the
final negotiated settings?
>
> > >
> > > Sounds like there's some other problem if invidual throughput is bad? Or did
> > > you mean performance with a single disktest IO thread is bad, but using multiple
> > > disktest threads it's good.. that would make more sense
> > Yes, the latter. Single thread (I assume mimicking a single disk
> > operation, e.g., copying a large file) is miserable - much slower than
> > local disk despite the availability of huge bandwidth. We start
> > utilizing the bandwidth when multiplying concurrent disk activity into
> > the hundreds.
> >
> > I am guessing the single thread performance problem is an open-iscsi
> > issue but I was hoping multipath would help us work around it by
> > utilizing multiple sessions per disk operation. I suppose that is where
> > we run into the command ordering problem unless there is something else
> > afoot. Thanks - John
>
> You should be able to get many times the throughput you get now.. just with
> a single path/session.
>
> What kind of latency do you have from the initiator to the target/storage?
>
> Try with for example 4 kB ping:
> ping -s 4096 <ip_of_the_iscsi_target>
We have about 400 microseconds - that seems a bit high
rtt min/avg/max/mdev = 0.275/0.337/0.398/0.047 ms

>
> 1000ms divided by the roundtrip you get from ping should give you maximum
> possible IOPS using a single path..
>
1000 / 0.4 = 2500
> 4 kB * IOPS == max bandwidth you can achieve.
2500 * 4KB = 10 MBps
Hmm . . . seems like what we are getting. Is that an abnormally high
latency? We have tried playing with interrupt coalescing on the
initiator side but without significant effect. Thanks for putting
together the formula for me. Not only does it help me understand but it
means I can work on addressing the latency issue without setting up and
running disk tests.

I would love to use larger block sizes as you suggest in your other
email but, on AMD64, I believe we are stuck with 4KB. I've not seen any
way to change it and would gladly do so if someone knows how.

CFQ was indeed a problem. It would not scale with increasing the number
of threads. noop, deadline, and anticipatory all fared much better. We
are currently using noop for the iSCSI targets. Thanks again - John
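For anyone wanting to confirm which scheduler a device is using, Linux exposes it in /sys/block/<device>/queue/scheduler, with the active scheduler shown in square brackets. A minimal sketch for reading it is below; the helper names and the device name "sdb" are assumptions for illustration.

```python
import re


def active_scheduler(sysfs_text: str) -> str:
    """Extract the bracketed (active) scheduler from the sysfs file contents."""
    match = re.search(r"\[([^\]]+)\]", sysfs_text)
    if match is None:
        raise ValueError("no active scheduler marked in: %r" % sysfs_text)
    return match.group(1)


def read_scheduler(device: str) -> str:
    """Read the active I/O scheduler for a block device such as 'sdb'."""
    with open(f"/sys/block/{device}/queue/scheduler") as handle:
        return active_scheduler(handle.read())


if __name__ == "__main__":
    # Parsing a sample line; call read_scheduler("sdb") on a real system.
    print(active_scheduler("noop anticipatory deadline [cfq]\n"))
```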
>
> -- Pasi
>
 
03-24-2009, 03:36 PM
Pasi Kärkkäinen

Shell Scripts or Arbitrary Priority Callouts?

On Tue, Mar 24, 2009 at 11:43:20AM -0400, John A. Sullivan III wrote:
> I greatly appreciate the help. I'll answer in the thread below as well
> as consolidating answers to the questions posed in your other email.
>
> On Tue, 2009-03-24 at 17:01 +0200, Pasi Kärkkäinen wrote:
> > On Tue, Mar 24, 2009 at 08:21:45AM -0400, John A. Sullivan III wrote:
> > > >
> > > > Core-iscsi developer seems to be active developing at least the
> > > > new iSCSI target (LIO target).. I think he has been testing it with
> > > > core-iscsi, so maybe there's newer version somewhere?
> > > >
> > > > > We did play with the multipath rr_min_io settings and smaller always
> > > > > seemed to be better until we got into very large numbers of session. We
> > > > > were testing on a dual quad core AMD Shanghai 2378 system with 32 GB
> > > > > RAM, a quad port Intel e1000 card and two on-board nvidia forcedeth
> > > > > ports with disktest using 4K blocks to mimic the file system using
> > > > > sequential reads (and some sequential writes).
> > > > >
> > > >
> > > > Nice hardware. Btw are you using jumbo frames or flow control for iSCSI
> > > > traffic?
> > > >
> >
> > Dunno if you noticed this..
> We are actually quite enthusiastic about the environment and the
> project. We hope to have many of these hosting about 400 VServer guests
> running virtual desktops from the X2Go project. It's not my project but
> I don't mind plugging them as I think it is a great technology.
>
> We are using jumbo frames. The ProCurve 2810 switches explicitly state
> to NOT use flow control and jumbo frames simultaneously. We tried it
> anyway but with poor results.

Ok.

iirc the 2810 does not have very big buffers per port, so you might be better
off using flow control instead of jumbos.. then again I'm not sure how good
HP's flow control implementation is?

The whole point of flow control is to prevent packet loss/drop.. this is done
by sending pause frames before the port buffers get full. If the port buffers
get full then the switch doesn't have any other option than to drop the
packets.. and this causes tcp retransmits -> causes delay, and tcp slows down
to prevent further packet drops.

Flow control "pause frames" cause less delay than tcp retransmits.

Do you see tcp retransmits with "netstat -s" ? Check both the target and the initiators.
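On Linux the TCP retransmit counter is also available in /proc/net/snmp as the RetransSegs field of the Tcp: lines, which is convenient for checking before and after a test run. The sketch below parses that file's two-line header/value layout; the function name is made up and the sample numbers are invented purely to show the format.

```python
from typing import Dict


def tcp_retrans_stats(snmp_text: str) -> Dict[str, int]:
    """Return OutSegs and RetransSegs from the contents of /proc/net/snmp."""
    tcp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Tcp:")]
    if len(tcp_lines) < 2:
        raise ValueError("expected a Tcp: header line and a Tcp: value line")
    names, values = tcp_lines[0][1:], tcp_lines[1][1:]
    stats = dict(zip(names, (int(value) for value in values)))
    return {"OutSegs": stats["OutSegs"], "RetransSegs": stats["RetransSegs"]}


if __name__ == "__main__":
    # Invented sample data; on a real host read open("/proc/net/snmp").read().
    sample = (
        "Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens "
        "AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts\n"
        "Tcp: 1 200 120000 -1 10 5 0 0 3 1000 2000 20 0 1\n"
    )
    print(tcp_retrans_stats(sample))
```

Comparing the counter before and after a disktest run on both the initiator and the target shows whether retransmits are occurring during the test.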

> >
> >
> > > > >
> > > >
> > > > When you used dm RAID0 you didn't have any multipath configuration, right?
> > > Correct although we also did test successfully with multipath in
> > > failover mode and RAID0.
> > > >
> >
> > OK.
> >
> > > > What kind of stripe size and other settings you had for RAID0?
> > > Chunk size was 8KB with four disks.
> > > >
> >
> > Did you try with much bigger sizes.. 128 kB ?
> We tried slightly larger sizes - 16KB and 32KB I believe and observed
> performance degradation. In fact, in some scenarios 4KB chunk sizes
> gave us better performance than 8KB.

Ok.

> >
> > > > What kind of performance do you get using just a single iscsi session (and
> > > > thus just a single path), no multipathing, no DM RAID0 ? Just a filesystem
> > > > directly on top of the iscsi /dev/sd? device.
> > > Miserable - same roughly 12 MB/s.
> >
> > OK, Here's your problem. Was this btw reads or writes? Did you tune
> > readahead-settings?
> 12MBps is sequential reading but sequential writing is not much
> different. We did tweak readahead to 1024. We did not want to go much
> larger in order to maintain balance with the various data patterns -
> some of which are random and some of which may not read linearly.

I did some benchmarking earlier between two servers: one running the ietd
target with 'nullio' and the other running the open-iscsi initiator. Both were
using a single gigabit NIC.

I remember getting very close to full gigabit speed at least with bigger
block sizes. I can't remember how much I got with 4 kB blocks.

Those tests were made with dd.

nullio target is a good way to benchmark your network and initiator and
verify everything is correct.

Also it's good to first test with, for example, FTP and iperf to verify that
the network is working properly between the target and the initiator and that
all the other basic settings are correct.

Btw have you tuned the tcp stacks of the servers? Bigger default tcp window
size, bigger maximum tcp window size etc..

> >
> > Can you paste your iSCSI session settings negotiated with the target?
> Pardon my ignorance but, other than packet traces, how do I show the
> final negotiated settings?

Try:

iscsiadm -i -m session
iscsiadm -m session -P3


> >
> > > >
> > > > Sounds like there's some other problem if individual throughput is bad? Or did
> > > > you mean performance with a single disktest IO thread is bad, but using multiple
> > > > disktest threads it's good.. that would make more sense
> > > Yes, the latter. Single thread (I assume mimicking a single disk
> > > operation, e.g., copying a large file) is miserable - much slower than
> > > local disk despite the availability of huge bandwidth. We start
> > > utilizing the bandwidth when multiplying concurrent disk activity into
> > > the hundreds.
> > >
> > > I am guessing the single thread performance problem is an open-iscsi
> > > issue but I was hoping multipath would help us work around it by
> > > utilizing multiple sessions per disk operation. I suppose that is where
> > > we run into the command ordering problem unless there is something else
> > > afoot. Thanks - John
> >
> > You should be able to get many times the throughput you get now.. just with
> > a single path/session.
> >
> > What kind of latency do you have from the initiator to the target/storage?
> >
> > Try with for example 4 kB ping:
> > ping -s 4096 <ip_of_the_iscsi_target>
> We have about 400 micro seconds - that seems a bit high
> rtt min/avg/max/mdev = 0.275/0.337/0.398/0.047 ms
>

Yeah.. that's a bit high.

> >
> > 1000ms divided by the roundtrip you get from ping should give you maximum
> > possible IOPS using a single path..
> >
> 1000 / 0.4 = 2500
> > 4 kB * IOPS == max bandwidth you can achieve.
> 2500 * 4KB = 10 MBps
> Hmm . . . seems like what we are getting. Is that an abnormally high
> latency? We have tried playing with interrupt coalescing on the
> initiator side but without significant effect. Thanks for putting
> together the formula for me. Not only does it help me understand but it
> means I can work on addressing the latency issue without setting up and
> running disk tests.
>
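
For reference, the rule of thumb above fits in a few lines of Python. This is only a sketch under the same assumption as the mail, namely one outstanding request at a time so that every I/O costs one full round trip. The function names are made up for illustration, and KB is taken as 1000 bytes to match the 10 MBps figure.

```python
def max_iops(rtt_ms):
    """Upper bound on serial I/Os per second for a round-trip time in ms."""
    return 1000.0 / rtt_ms

def max_bandwidth_mbps(rtt_ms, block_kb):
    """Upper bound on throughput in MB/s (decimal units) for serial I/O."""
    return max_iops(rtt_ms) * block_kb / 1000.0

# Figures from the mail: 0.4 ms round trip, 4 KB blocks.
print(max_iops(0.4))
print(max_bandwidth_mbps(0.4, 4))
```

The same functions show why larger blocks help: at an unchanged round trip, the bound grows in proportion to the block size until the link itself becomes the limit.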

I think Ross suggested in some other thread the following settings for e1000
NICs:

"Set the e1000s InterruptThrottleRate=1 and their TxRingBufferSize=4096
and RxRingBufferSize=4096 (verify those option names with a modinfo)
and add those to modprobe.conf."

> I would love to use larger block sizes as you suggest in your other
> email but, on AMD64, I believe we are stuck with 4KB. I've not seen any
> way to change it and would gladly do so if someone knows how.
>

Are we talking about filesystem block sizes? That shouldn't be a problem if
your application uses larger blocksizes for read/write operations..

Try for example with:
dd if=/dev/zero of=/iscsilun/file.bin bs=1024k count=1024

and optionally add "oflag=direct" (or iflag=direct) if you want to make sure
caches do not mess up the results.
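
If dd is not handy, a rough equivalent can be sketched in Python. This is an assumption-laden illustration, not a replacement for the dd test: the function name is made up, it calls fsync at the end instead of bypassing the page cache the way oflag=direct does, and the demo writes a small file to a temporary directory rather than to the iSCSI LUN.

```python
import os
import tempfile
import time

def timed_write(path, block_size, count):
    """Write count blocks of zeros to path and return throughput in MB/s."""
    buf = b"\0" * block_size
    start = time.perf_counter()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for _ in range(count):
            view = memoryview(buf)
            while view:                      # handle short writes
                written = os.write(fd, view)
                view = view[written:]
        os.fsync(fd)                         # make sure data left the cache
    finally:
        os.close(fd)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return block_size * count / elapsed / 1e6

# Small demo in a temporary directory; point path at the LUN for a real test.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "file.bin")
    print("%.1f MB/s" % timed_write(demo_path, 1024 * 1024, 8))
```

Comparing a run with 4 KB blocks against one with 1 MB blocks on the same LUN gives a quick view of how much the per-request latency costs.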

> CFQ was indeed a problem. It would not scale with increasing the number
> of threads. noop, deadline, and anticipatory all fared much better. We
> are currently using noop for the iSCSI targets. Thanks again - John
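
To confirm which scheduler a device is really using, the active entry can be read from sysfs, where it is shown in square brackets. A minimal sketch follows; the function names are made up, and the device name in the comment is only an example taken from elsewhere in this thread.

```python
def active_scheduler(sysfs_text):
    """Return the bracketed entry from e.g. 'noop anticipatory deadline [cfq]'."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None

def read_scheduler(device):
    """Read the active scheduler of a block device such as 'sdah'."""
    with open("/sys/block/%s/queue/scheduler" % device) as f:
        return active_scheduler(f.read())

print(active_scheduler("noop anticipatory deadline [cfq]"))
```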

Yep. And no problem.. hopefully I'm able to help and point you in the right
direction

-- Pasi

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 03-24-2009, 04:30 PM
"John A. Sullivan III"
 
Default Shell Scripts or Arbitrary Priority Callouts?

Thanks very much, again, and, again, I'll reply in the text - John

On Tue, 2009-03-24 at 18:36 +0200, Pasi Kärkkäinen wrote:
> On Tue, Mar 24, 2009 at 11:43:20AM -0400, John A. Sullivan III wrote:
> > I greatly appreciate the help. I'll answer in the thread below as well
> > as consolidating answers to the questions posed in your other email.
> >
> > On Tue, 2009-03-24 at 17:01 +0200, Pasi Kärkkäinen wrote:
> > > On Tue, Mar 24, 2009 at 08:21:45AM -0400, John A. Sullivan III wrote:
> > > > >
> > > > > Core-iscsi developer seems to be active developing at least the
> > > > > new iSCSI target (LIO target).. I think he has been testing it with
> > > > > core-iscsi, so maybe there's newer version somewhere?
> > > > >
> > > > > > We did play with the multipath rr_min_io settings and smaller always
> > > > > > > seemed to be better until we got into very large numbers of sessions. We
> > > > > > were testing on a dual quad core AMD Shanghai 2378 system with 32 GB
> > > > > > RAM, a quad port Intel e1000 card and two on-board nvidia forcedeth
> > > > > > ports with disktest using 4K blocks to mimic the file system using
> > > > > > sequential reads (and some sequential writes).
> > > > > >
> > > > >
> > > > > Nice hardware. Btw are you using jumbo frames or flow control for iSCSI
> > > > > traffic?
> > > > >
> > >
> > > Dunno if you noticed this..
> > We are actually quite enthusiastic about the environment and the
> > project. We hope to have many of these hosting about 400 VServer guests
> > running virtual desktops from the X2Go project. It's not my project but
> > I don't mind plugging them as I think it is a great technology.
> >
> > We are using jumbo frames. The ProCurve 2810 switches explicitly state
> > to NOT use flow control and jumbo frames simultaneously. We tried it
> > anyway but with poor results.
>
> Ok.
>
> iirc 2810 does not have very big buffers per port, so you might be better
> using flow control instead of jumbos.. then again I'm not sure how good flow
> control implementation HP has?
>
> The whole point of flow control is to prevent packet loss/drop.. this happens
> with sending pause frames before the port buffers get full. If port buffers
> get full then the switch doesn't have any other option than to drop the
> packets.. and this causes tcp-retransmits -> causes delay and tcp slows down
> to prevent further packet drops.
>
> flow control "pause frames" cause less delay than tcp-retransmits.
>
> Do you see tcp retransmits with "netstat -s" ? Check both the target and the initiators.
Thankfully this is an area of some expertise for me (unlike disk I/O,
obviously). We have been pretty thorough about checking the network
path. We've not seen any upper layer retransmission or buffer
overflows.
>
> > >
> > >
> > > > > >
> > > > >
> > > > > When you used dm RAID0 you didn't have any multipath configuration, right?
> > > > Correct although we also did test successfully with multipath in
> > > > failover mode and RAID0.
> > > > >
> > >
> > > OK.
> > >
> > > > > What kind of stripe size and other settings you had for RAID0?
> > > > Chunk size was 8KB with four disks.
> > > > >
> > >
> > > Did you try with much bigger sizes.. 128 kB ?
> > We tried slightly larger sizes - 16KB and 32KB I believe and observed
> > performance degradation. In fact, in some scenarios 4KB chunk sizes
> > gave us better performance than 8KB.
>
> Ok.
>
> > >
> > > > > What kind of performance do you get using just a single iscsi session (and
> > > > > thus just a single path), no multipathing, no DM RAID0 ? Just a filesystem
> > > > > directly on top of the iscsi /dev/sd? device.
> > > > Miserable - same roughly 12 MB/s.
> > >
> > > OK, Here's your problem. Was this btw reads or writes? Did you tune
> > > readahead-settings?
> > 12MBps is sequential reading but sequential writing is not much
> > different. We did tweak readahead to 1024. We did not want to go much
> > larger in order to maintain balance with the various data patterns -
> > some of which are random and some of which may not read linearly.
>
> I did some benchmarking earlier between two servers; other one running ietd
> target with 'nullio' and other running open-iscsi initiator. Both using a single gigabit NIC.
>
> I remember getting very close to full gigabit speed at least with bigger
> block sizes. I can't remember how much I got with 4 kB blocks.
>
> Those tests were made with dd.
Yes, if we use 64KB blocks, we can saturate a Gig link. With larger
sizes, we can push over 3 Gbps over the four gig links in the test
environment.
>
> nullio target is a good way to benchmark your network and initiator and
> verify everything is correct.
>
> Also it's good to first test for example with FTP and Iperf to verify
> network is working properly between target and the initiator and all the
> other basic settings are correct.
We did flood ping the network and had all interfaces operating at near
capacity. The network itself looks very healthy.
>
> Btw have you configured tcp stacks of the servers? Bigger default tcp window
> size, bigger maximum tcp window size etc..
Yep, tweaked transmit queue length, receive and transmit windows, net
device backlogs, buffer space, disabled nagle, and even played with the
dirty page watermarks.
>
> > >
> > > Can you paste your iSCSI session settings negotiated with the target?
> > Pardon my ignorance but, other than packet traces, how do I show the
> > final negotiated settings?
>
> Try:
>
> iscsiadm -i -m session
> iscsiadm -m session -P3
>
Here's what it says. Pretty much as expected. We are using COMSTAR on
the target and took some traces to see what COMSTAR was expecting. We
set the open-iscsi parameters to match:

Current Portal: 172.x.x.174:3260,2
Persistent Portal: 172.x.x.174:3260,2
**********
Interface:
**********
Iface Name: default
Iface Transport: tcp
Iface Initiatorname: iqn.2008-05.biz.ssi:vd-gen
Iface IPaddress: 172.x.x.162
Iface HWaddress: default
Iface Netdev: default
SID: 32
iSCSI Connection State: LOGGED IN
iSCSI Session State: LOGGED_IN
Internal iscsid Session State: NO CHANGE
************************
Negotiated iSCSI params:
************************
HeaderDigest: None
DataDigest: None
MaxRecvDataSegmentLength: 131072
MaxXmitDataSegmentLength: 8192
FirstBurstLength: 65536
MaxBurstLength: 524288
ImmediateData: Yes
InitialR2T: Yes
MaxOutstandingR2T: 1
************************
Attached SCSI devices:
************************
Host Number: 39 State: running
scsi39 Channel 00 Id 0 Lun: 0
Attached scsi disk sdah State: running

>
> > >
> > > > >
> > > > > Sounds like there's some other problem if individual throughput is bad? Or did
> > > > > you mean performance with a single disktest IO thread is bad, but using multiple
> > > > > disktest threads it's good.. that would make more sense
> > > > Yes, the latter. Single thread (I assume mimicking a single disk
> > > > operation, e.g., copying a large file) is miserable - much slower than
> > > > local disk despite the availability of huge bandwidth. We start
> > > > utilizing the bandwidth when multiplying concurrent disk activity into
> > > > the hundreds.
> > > >
> > > > I am guessing the single thread performance problem is an open-iscsi
> > > > issue but I was hoping multipath would help us work around it by
> > > > utilizing multiple sessions per disk operation. I suppose that is where
> > > > we run into the command ordering problem unless there is something else
> > > > afoot. Thanks - John
> > >
> > > You should be able to get many times the throughput you get now.. just with
> > > a single path/session.
> > >
> > > What kind of latency do you have from the initiator to the target/storage?
> > >
> > > Try with for example 4 kB ping:
> > > ping -s 4096 <ip_of_the_iscsi_target>
> > We have about 400 micro seconds - that seems a bit high
> > rtt min/avg/max/mdev = 0.275/0.337/0.398/0.047 ms
> >
>
> Yeah.. that's a bit high.
Actually, with more testing, we're seeing it stretch up to over 700
micro-seconds. I'll attach a raft of data I collected at the end of
this email.
>
> > >
> > > 1000ms divided by the roundtrip you get from ping should give you maximum
> > > possible IOPS using a single path..
> > >
> > 1000 / 0.4 = 2500
> > > 4 kB * IOPS == max bandwidth you can achieve.
> > 2500 * 4KB = 10 MBps
> > Hmm . . . seems like what we are getting. Is that an abnormally high
> > latency? We have tried playing with interrupt coalescing on the
> > initiator side but without significant effect. Thanks for putting
> > together the formula for me. Not only does it help me understand but it
> > means I can work on addressing the latency issue without setting up and
> > running disk tests.
> >
>
> I think Ross suggested in some other thread the following settings for e1000
> NICs:
>
> "Set the e1000s InterruptThrottleRate=1 and their TxRingBufferSize=4096
> and RxRingBufferSize=4096 (verify those option names with a modinfo)
> and add those to modprobe.conf."
We did try playing with the ring buffer but to no avail. Modinfo does
not seem to display the current settings. We did try playing with
setting the InterruptThrottleRate to 1 but again to no avail. As I'll
mention later, I suspect the issue might be the opensolaris based
target.
>
> > I would love to use larger block sizes as you suggest in your other
> > email but, on AMD64, I believe we are stuck with 4KB. I've not seen any
> > way to change it and would gladly do so if someone knows how.
> >
>
> Are we talking about filesystem block sizes? That shouldn't be a problem if
> your application uses larger blocksizes for read/write operations..
>
Yes, file system block size. When we try rough, end user style tests,
e.g., large file copies, we seem to get the performance indicated by 4KB
blocks, i.e., lousy!
> Try for example with:
> dd if=/dev/zero of=/iscsilun/file.bin bs=1024k count=1024
Large block sizes can make the system truly fly so we suspect you are
absolutely correct about latency being the issue. We did do our testing
with raw interfaces by the way.
>
> and optionally add "oflag=direct" (or iflag=direct) if you want to make sure
> caches do not mess up the results.
>
> > CFQ was indeed a problem. It would not scale with increasing the number
> > of threads. noop, deadline, and anticipatory all fared much better. We
> > are currently using noop for the iSCSI targets. Thanks again - John
>
> Yep. And no problems.. hopefully I'm able to help and guide to right
> direction
<snip>
I did a little digging and calculating and here is what I came up with
and sent to Nexenta. Please tell me if I am on the right track.

I am using jumbo frames and should be able to get two 4KB blocks
per frame. Total size should be 8192 + 78 (TCP + IP + Ethernet + CRC;
oops, we need to add iSCSI too. What size is the iSCSI header?) + 12
(interframe gap) = 8282 bytes. Transmission latency should be 8282 *
8 / 1,000,000,000 = 66.3 micro-seconds. Switch latency is 5.7
micro-seconds so let's say network latency is 72, or to be safe 75
micro-seconds, one way. The only additional latency should be added by
the network stacks on the target and initiator.
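
The arithmetic above can be reproduced with a short sketch. The function name is made up, and the byte counts are simply the ones from the paragraph; the iSCSI header is left out here just as it is in the 8282-byte total, so the real figure would be slightly higher.

```python
def serialization_delay_us(payload_bytes, overhead_bytes, gap_bytes, link_bps):
    """Time in micro-seconds to clock one frame onto the wire."""
    total_bytes = payload_bytes + overhead_bytes + gap_bytes
    return total_bytes * 8 / link_bps * 1e6

# 8192 bytes of data, 78 bytes of headers and CRC, 12 bytes interframe gap,
# on a 1 Gbit/s link.
wire_us = serialization_delay_us(8192, 78, 12, 1_000_000_000)
switch_us = 5.7
one_way_us = wire_us + switch_us

print("wire: %.1f us, one way with switch: %.1f us" % (wire_us, one_way_us))
```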

Current round trip latency between the initiator (Linux) and target
(Nexenta) is around 400 micro-seconds and fluctuates significantly:

Hmm . . this is worse than the last test:
PING 172.30.13.158 (172.30.13.158) 8192(8220) bytes of data.
8200 bytes from 172.30.13.158: icmp_seq=1 ttl=255 time=1.36 ms
8200 bytes from 172.30.13.158: icmp_seq=2 ttl=255 time=0.638 ms
8200 bytes from 172.30.13.158: icmp_seq=3 ttl=255 time=0.622 ms
8200 bytes from 172.30.13.158: icmp_seq=4 ttl=255 time=0.603 ms
8200 bytes from 172.30.13.158: icmp_seq=5 ttl=255 time=0.586 ms
8200 bytes from 172.30.13.158: icmp_seq=6 ttl=255 time=0.564 ms
8200 bytes from 172.30.13.158: icmp_seq=7 ttl=255 time=0.553 ms
8200 bytes from 172.30.13.158: icmp_seq=8 ttl=255 time=0.525 ms
8200 bytes from 172.30.13.158: icmp_seq=9 ttl=255 time=0.508 ms
8200 bytes from 172.30.13.158: icmp_seq=10 ttl=255 time=0.490 ms
8200 bytes from 172.30.13.158: icmp_seq=11 ttl=255 time=0.472 ms
8200 bytes from 172.30.13.158: icmp_seq=12 ttl=255 time=0.454 ms
8200 bytes from 172.30.13.158: icmp_seq=13 ttl=255 time=0.436 ms
8200 bytes from 172.30.13.158: icmp_seq=14 ttl=255 time=0.674 ms
8200 bytes from 172.30.13.158: icmp_seq=15 ttl=255 time=0.399 ms
8200 bytes from 172.30.13.158: icmp_seq=16 ttl=255 time=0.638 ms
8200 bytes from 172.30.13.158: icmp_seq=17 ttl=255 time=0.620 ms
8200 bytes from 172.30.13.158: icmp_seq=18 ttl=255 time=0.601 ms
8200 bytes from 172.30.13.158: icmp_seq=19 ttl=255 time=0.583 ms
8200 bytes from 172.30.13.158: icmp_seq=20 ttl=255 time=0.563 ms
8200 bytes from 172.30.13.158: icmp_seq=21 ttl=255 time=0.546 ms
8200 bytes from 172.30.13.158: icmp_seq=22 ttl=255 time=0.518 ms
8200 bytes from 172.30.13.158: icmp_seq=23 ttl=255 time=0.501 ms
8200 bytes from 172.30.13.158: icmp_seq=24 ttl=255 time=0.481 ms
8200 bytes from 172.30.13.158: icmp_seq=25 ttl=255 time=0.463 ms
8200 bytes from 172.30.13.158: icmp_seq=26 ttl=255 time=0.443 ms
8200 bytes from 172.30.13.158: icmp_seq=27 ttl=255 time=0.682 ms
8200 bytes from 172.30.13.158: icmp_seq=28 ttl=255 time=0.404 ms
8200 bytes from 172.30.13.158: icmp_seq=29 ttl=255 time=0.644 ms
8200 bytes from 172.30.13.158: icmp_seq=30 ttl=255 time=0.624 ms
8200 bytes from 172.30.13.158: icmp_seq=31 ttl=255 time=0.605 ms
8200 bytes from 172.30.13.158: icmp_seq=32 ttl=255 time=0.586 ms
8200 bytes from 172.30.13.158: icmp_seq=33 ttl=255 time=0.566 ms
^C
--- 172.30.13.158 ping statistics ---
33 packets transmitted, 33 received, 0% packet loss, time 32000ms
rtt min/avg/max/mdev = 0.399/0.574/1.366/0.161 ms

There is nothing going on in the network. So we are seeing 574
micro-seconds total with only 150 micro-seconds attributed to
transmission. And we see a wide variation in latency.
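
To put a number on the part that is not transmission time, the summary line can be parsed and the 150 micro-second wire estimate subtracted. This is a sketch only: the function name is made up, and the pattern matches the Linux ping summary format shown above (the Solaris summary is laid out differently and would need its own pattern).

```python
import re

def parse_ping_rtt(summary_line):
    """Parse 'rtt min/avg/max/mdev = a/b/c/d ms' into a dict of floats (ms)."""
    m = re.search(r"=\s*([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+)\s*ms", summary_line)
    if m is None:
        raise ValueError("no rtt summary found")
    keys = ("min", "avg", "max", "mdev")
    return dict(zip(keys, (float(x) for x in m.groups())))

rtt = parse_ping_rtt("rtt min/avg/max/mdev = 0.399/0.574/1.366/0.161 ms")
wire_round_trip_us = 150.0                     # estimate from the mail
stack_overhead_us = rtt["avg"] * 1000.0 - wire_round_trip_us

print("unexplained latency: %.0f us" % stack_overhead_us)
```

With these figures roughly 424 of the 574 micro-seconds are spent somewhere other than on the wire.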

I then tested the latency between interfaces on the initiator and the
target. Here is what I get for internal latency on the Linux initiator:
PING 172.30.13.18 (172.30.13.18) from 172.30.13.146 : 8192(8220) bytes of data.
8200 bytes from 172.30.13.18: icmp_seq=1 ttl=64 time=0.033 ms
8200 bytes from 172.30.13.18: icmp_seq=2 ttl=64 time=0.019 ms
8200 bytes from 172.30.13.18: icmp_seq=3 ttl=64 time=0.019 ms
8200 bytes from 172.30.13.18: icmp_seq=4 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=5 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=6 ttl=64 time=0.017 ms
8200 bytes from 172.30.13.18: icmp_seq=7 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=8 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=9 ttl=64 time=0.019 ms
8200 bytes from 172.30.13.18: icmp_seq=10 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=11 ttl=64 time=0.019 ms
8200 bytes from 172.30.13.18: icmp_seq=12 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=13 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=14 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=15 ttl=64 time=0.019 ms
8200 bytes from 172.30.13.18: icmp_seq=16 ttl=64 time=0.017 ms
8200 bytes from 172.30.13.18: icmp_seq=17 ttl=64 time=0.019 ms
8200 bytes from 172.30.13.18: icmp_seq=18 ttl=64 time=0.017 ms
8200 bytes from 172.30.13.18: icmp_seq=19 ttl=64 time=0.019 ms
8200 bytes from 172.30.13.18: icmp_seq=20 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=21 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=22 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=23 ttl=64 time=0.019 ms
8200 bytes from 172.30.13.18: icmp_seq=24 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=25 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=26 ttl=64 time=0.017 ms
8200 bytes from 172.30.13.18: icmp_seq=27 ttl=64 time=0.019 ms
8200 bytes from 172.30.13.18: icmp_seq=28 ttl=64 time=0.018 ms
8200 bytes from 172.30.13.18: icmp_seq=29 ttl=64 time=0.018 ms
^C
--- 172.30.13.18 ping statistics ---
29 packets transmitted, 29 received, 0% packet loss, time 27999ms
rtt min/avg/max/mdev = 0.017/0.018/0.033/0.005 ms

A very consistent 18 micro-seconds.

Here is what I get from the Z200:
root@disk01:/etc# ping -s -i e1000g6 172.30.13.190 4096
PING 172.30.13.190: 4096 data bytes
4104 bytes from 172.30.13.190: icmp_seq=0. time=0.104 ms
4104 bytes from 172.30.13.190: icmp_seq=1. time=0.081 ms
4104 bytes from 172.30.13.190: icmp_seq=2. time=0.067 ms
4104 bytes from 172.30.13.190: icmp_seq=3. time=0.083 ms
4104 bytes from 172.30.13.190: icmp_seq=4. time=0.097 ms
4104 bytes from 172.30.13.190: icmp_seq=5. time=0.043 ms
4104 bytes from 172.30.13.190: icmp_seq=6. time=0.048 ms
4104 bytes from 172.30.13.190: icmp_seq=7. time=0.050 ms
4104 bytes from 172.30.13.190: icmp_seq=8. time=0.043 ms
4104 bytes from 172.30.13.190: icmp_seq=9. time=0.043 ms
4104 bytes from 172.30.13.190: icmp_seq=10. time=0.043 ms
4104 bytes from 172.30.13.190: icmp_seq=11. time=0.042 ms
4104 bytes from 172.30.13.190: icmp_seq=12. time=0.043 ms
4104 bytes from 172.30.13.190: icmp_seq=13. time=0.043 ms
4104 bytes from 172.30.13.190: icmp_seq=14. time=0.042 ms
4104 bytes from 172.30.13.190: icmp_seq=15. time=0.047 ms
4104 bytes from 172.30.13.190: icmp_seq=16. time=0.072 ms
4104 bytes from 172.30.13.190: icmp_seq=17. time=0.080 ms
4104 bytes from 172.30.13.190: icmp_seq=18. time=0.070 ms
4104 bytes from 172.30.13.190: icmp_seq=19. time=0.066 ms
4104 bytes from 172.30.13.190: icmp_seq=20. time=0.086 ms
4104 bytes from 172.30.13.190: icmp_seq=21. time=0.068 ms
4104 bytes from 172.30.13.190: icmp_seq=22. time=0.079 ms
4104 bytes from 172.30.13.190: icmp_seq=23. time=0.068 ms
4104 bytes from 172.30.13.190: icmp_seq=24. time=0.069 ms
4104 bytes from 172.30.13.190: icmp_seq=25. time=0.070 ms
4104 bytes from 172.30.13.190: icmp_seq=26. time=0.095 ms
4104 bytes from 172.30.13.190: icmp_seq=27. time=0.095 ms
4104 bytes from 172.30.13.190: icmp_seq=28. time=0.073 ms
4104 bytes from 172.30.13.190: icmp_seq=29. time=0.071 ms
4104 bytes from 172.30.13.190: icmp_seq=30. time=0.071 ms
^C
----172.30.13.190 PING Statistics----
31 packets transmitted, 31 packets received, 0% packet loss
round-trip (ms) min/avg/max/stddev = 0.042/0.066/0.104/0.019

Notice it is several times longer latency with much wider variation.
How do we tune the opensolaris network stack to reduce its latency? I'd
really like to improve the individual user experience. I can tell them
it's like commuting to work on the train instead of the car during rush
hour - faster when there's lots of traffic but slower when there is not,
but they will judge the product by their individual experiences more
than their collective experiences. Thus, I really want to improve the
individual disk operation throughput.

Latency seems to be our key. If the initiator and target each added only
20 micro-seconds of latency, the round trip would be roughly 200
micro-seconds. That would almost triple the throughput from what we are
currently seeing.
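
As a quick check of that estimate, here is a sketch of the latency budget. The function name is made up, and it assumes that serial throughput scales inversely with round-trip time, which is the same assumption behind the IOPS formula earlier in the thread.

```python
def latency_budget_us(wire_round_trip_us, per_host_us, hosts=2):
    """Round trip = wire time plus a fixed cost added by each host."""
    return wire_round_trip_us + per_host_us * hosts

measured_us = 574.0                            # average from the ping above
target_us = latency_budget_us(150.0, 20.0)     # 150 wire + 20 per host
speedup = measured_us / target_us

print("target %.0f us, speedup %.2fx" % (target_us, speedup))
```

The exact budget is 190 micro-seconds, which gives about 3.0x; rounding up to 200 as in the mail gives about 2.9x, hence "almost triple".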

Unfortunately, I'm a bit ignorant of tweaking networks on opensolaris.
I can certainly learn but am I headed in the right direction or is this
direction of investigation misguided? Thanks - John

--
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
jsullivan@opensourcedevel.com

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society


 
Old 03-24-2009, 05:17 PM
Pasi Kärkkäinen
 
Default Shell Scripts or Arbitrary Priority Callouts?

On Tue, Mar 24, 2009 at 01:30:10PM -0400, John A. Sullivan III wrote:
> Thanks very much, again, and, again, I'll reply in the text - John
>

Np

> >
> > iirc 2810 does not have very big buffers per port, so you might be better
> > using flow control instead of jumbos.. then again I'm not sure how good flow
> > control implementation HP has?
> >
> > The whole point of flow control is to prevent packet loss/drop.. this happens
> > with sending pause frames before the port buffers get full. If port buffers
> > get full then the switch doesn't have any other option than to drop the
> > packets.. and this causes tcp-retransmits -> causes delay and tcp slows down
> > to prevent further packet drops.
> >
> > flow control "pause frames" cause less delay than tcp-retransmits.
> >
> > Do you see tcp retransmits with "netstat -s" ? Check both the target and the initiators.
> Thankfully this is an area of some expertise for me (unlike disk I/O,
> obviously). We have been pretty thorough about checking the network
> path. We've not seen any upper layer retransmission or buffer
> overflows.

Good

> > > > > > What kind of performance do you get using just a single iscsi session (and
> > > > > > thus just a single path), no multipathing, no DM RAID0 ? Just a filesystem
> > > > > > directly on top of the iscsi /dev/sd? device.
> > > > > Miserable - same roughly 12 MB/s.
> > > >
> > > > OK, Here's your problem. Was this btw reads or writes? Did you tune
> > > > readahead-settings?
> > > 12MBps is sequential reading but sequential writing is not much
> > > different. We did tweak readahead to 1024. We did not want to go much
> > > larger in order to maintain balance with the various data patterns -
> > > some of which are random and some of which may not read linearly.
> >
> > I did some benchmarking earlier between two servers; other one running ietd
> > target with 'nullio' and other running open-iscsi initiator. Both using a single gigabit NIC.
> >
> > I remember getting very close to full gigabit speed at least with bigger
> > block sizes. I can't remember how much I got with 4 kB blocks.
> >
> > Those tests were made with dd.
> Yes, if we use 64KB blocks, we can saturate a Gig link. With larger
> sizes, we can push over 3 Gbps over the four gig links in the test
> environment.

That's good.

> >
> > nullio target is a good way to benchmark your network and initiator and
> > verify everything is correct.
> >
> > Also it's good to first test for example with FTP and Iperf to verify
> > network is working properly between target and the initiator and all the
> > other basic settings are correct.
> We did flood ping the network and had all interfaces operating at near
> capacity. The network itself looks very healthy.

Ok.

> >
> > Btw have you configured tcp stacks of the servers? Bigger default tcp window
> > size, bigger maximum tcp window size etc..
> Yep, tweaked transmit queue length, receive and transmit windows, net
> device backlogs, buffer space, disabled nagle, and even played with the
> dirty page watermarks.

That's all taken care of then

Also on the target?

> >
> > > >
> > > > Can you paste your iSCSI session settings negotiated with the target?
> > > Pardon my ignorance but, other than packet traces, how do I show the
> > > final negotiated settings?
> >
> > Try:
> >
> > iscsiadm -i -m session
> > iscsiadm -m session -P3
> >
> Here's what it says. Pretty much as expected. We are using COMSTAR on
> the target and took some traces to see what COMSTAR was expecting. We
> set the open-iscsi parameters to match:
>
> Current Portal: 172.x.x.174:3260,2
> Persistent Portal: 172.x.x.174:3260,2
> **********
> Interface:
> **********
> Iface Name: default
> Iface Transport: tcp
> Iface Initiatorname: iqn.2008-05.biz.ssi:vd-gen
> Iface IPaddress: 172.x.x.162
> Iface HWaddress: default
> Iface Netdev: default
> SID: 32
> iSCSI Connection State: LOGGED IN
> iSCSI Session State: LOGGED_IN
> Internal iscsid Session State: NO CHANGE
> ************************
> Negotiated iSCSI params:
> ************************
> HeaderDigest: None
> DataDigest: None
> MaxRecvDataSegmentLength: 131072
> MaxXmitDataSegmentLength: 8192
> FirstBurstLength: 65536
> MaxBurstLength: 524288
> ImmediateData: Yes
> InitialR2T: Yes

I guess InitialR2T could be No for a bit better performance?

MaxXmitDataSegmentLength looks small?
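
To compare these values across many sessions, the "Key: Value" lines of the dump can be turned into a dictionary. The sketch below is an illustration only: the function name is made up, the sample is copied from the output quoted above, and which values count as good is left to the reader rather than decided by the code.

```python
def parse_params(dump):
    """Collect 'Key: Value' lines; headings and separator lines are skipped."""
    params = {}
    for line in dump.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            params[key.strip()] = value.strip()
    return params

SAMPLE = """\
************************
Negotiated iSCSI params:
************************
HeaderDigest: None
DataDigest: None
MaxRecvDataSegmentLength: 131072
MaxXmitDataSegmentLength: 8192
FirstBurstLength: 65536
MaxBurstLength: 524288
ImmediateData: Yes
InitialR2T: Yes
MaxOutstandingR2T: 1
"""

params = parse_params(SAMPLE)
# The two settings questioned above.
for name in ("InitialR2T", "MaxXmitDataSegmentLength"):
    print(name, "=", params[name])
```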

> > > > You should be able to get many times the throughput you get now.. just with
> > > > a single path/session.
> > > >
> > > > What kind of latency do you have from the initiator to the target/storage?
> > > >
> > > > Try with for example 4 kB ping:
> > > > ping -s 4096 <ip_of_the_iscsi_target>
> > > We have about 400 micro seconds - that seems a bit high
> > > rtt min/avg/max/mdev = 0.275/0.337/0.398/0.047 ms
> > >
> >
> > Yeah.. that's a bit high.
> Actually, with more testing, we're seeing it stretch up to over 700
> micro-seconds. I'll attach a raft of data I collected at the end of
> this email.

Ok.

> > I think Ross suggested in some other thread the following settings for e1000
> > NICs:
> >
> > "Set the e1000s InterruptThrottleRate=1 and their TxRingBufferSize=4096
> > and RxRingBufferSize=4096 (verify those option names with a modinfo)
> > and add those to modprobe.conf."
> We did try playing with the ring buffer but to no avail. Modinfo does
> not seem to display the current settings. We did try playing with
> setting the InterruptThrottleRate to 1 but again to no avail. As I'll
> mention later, I suspect the issue might be the opensolaris based
> target.

Could be..

> >
> > > I would love to use larger block sizes as you suggest in your other
> > > email but, on AMD64, I believe we are stuck with 4KB. I've not seen any
> > > way to change it and would gladly do so if someone knows how.
> > >
> >
> > Are we talking about filesystem block sizes? That shouldn't be a problem if
> > your application uses larger blocksizes for read/write operations..
> >
> Yes, file system block size. When we try rough, end user style tests,
> e.g., large file copies, we seem to get the performance indicated by 4KB
> blocks, i.e., lousy!

Yep.. try upgrading to 10 Gbit Ethernet for much lower latency

> > Try for example with:
> > dd if=/dev/zero of=/iscsilun/file.bin bs=1024k count=1024
> Large block sizes can make the system truly fly so we suspect you are
> absolutely correct about latency being the issue. We did do our testing
> with raw interfaces by the way.

Ok.

> <snip>
> I did a little digging and calculating and here is what I came up with
> and sent to Nexenta. Please tell me if I am on the right track.
>
> I am using jumbo frames and should be able to get 2 4KB blocks
> per frame. Total size should be 8192 + 78 (TCP + IP + Ethernet + CRC
> -oops we need to add iSCSI -what size is the iSCSI header?) + 12
> (interframe gap) = 8282 bytes. Transmission latency should be 8282 *
> 8 / 1,000,000,000 = 66.3 micro-seconds. Switch latency is 5.7
> microseconds so let's say network latency is 72 - well let's say 75
> micro-seconds. The only additional latency should be added by the
> network stacks on the target and initiator.
>
> Current round trip latency between the initiator (Linux) and target
> (Nexenta) is around 400 micro-seconds and fluctuates significantly:
>
> Hmm . . this is worse than the last test:
> PING 172.30.13.158 (172.30.13.158) 8192(8220) bytes of data.

> --- 172.30.13.158 ping statistics ---
> 33 packets transmitted, 33 received, 0% packet loss, time 32000ms
> rtt min/avg/max/mdev = 0.399/0.574/1.366/0.161 ms
>
> There is nothing going on in the network. So we are seeing 574
> micro-seconds total with only 150 micro-seconds attributed to
> transmission. And we see a wide variation in latency.
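
A small Python sketch of the comparison being made here, assuming the
summary lines are copied from ping as shown in this thread. The helper
name and the 150 microsecond constant (twice the roughly 75 microsecond
one-way estimate given earlier in this message) are illustrative
assumptions, not anything taken from a tool.

```python
import re

def parse_ping_summary(line):
    """Return (min, avg, max, dev) in milliseconds from a ping summary line.

    Handles both the Linux 'rtt min/avg/max/mdev = ...' form and the
    Solaris 'round-trip (ms) min/avg/max/stddev = ...' form.
    """
    match = re.search(r"=\s*([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+)", line)
    if match is None:
        raise ValueError(f"no ping summary found in: {line!r}")
    return tuple(float(value) for value in match.groups())

linux_line = "rtt min/avg/max/mdev = 0.399/0.574/1.366/0.161 ms"
rtt_min, rtt_avg, rtt_max, rtt_dev = parse_ping_summary(linux_line)

# Round-trip wire time estimate: 2 x ~75 us one way (assumption from the text).
ESTIMATED_WIRE_RTT_US = 150
unexplained_us = rtt_avg * 1000 - ESTIMATED_WIRE_RTT_US
print(f"avg RTT {rtt_avg * 1000:.0f} us, unexplained {unexplained_us:.0f} us")
```

The remainder is the part of the round trip that the wire and switch do
not account for, which is what points at the network stacks on either
end.
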
>

Yeah something wrong there.. How much latency do you have between different
initiator machines?

> I then tested the latency between interfaces on the initiator and the
> target. Here is what I get for internal latency on the Linux initiator:
> PING 172.30.13.18 (172.30.13.18) from 172.30.13.146 : 8192(8220) bytes
> of data.
> --- 172.30.13.18 ping statistics ---
> 29 packets transmitted, 29 received, 0% packet loss, time 27999ms
> rtt min/avg/max/mdev = 0.017/0.018/0.033/0.005 ms
>
> A very consistent 18 micro-seconds.
>

Yeah, I take it that's not through network/switch

> Here is what I get from the Z200:
> root@disk01:/etc# ping -s -i e1000g6 172.30.13.190 4096
> PING 172.30.13.190: 4096 data bytes
> ----172.30.13.190 PING Statistics----
> 31 packets transmitted, 31 packets received, 0% packet loss
> round-trip (ms) min/avg/max/stddev = 0.042/0.066/0.104/0.019
>

Big difference.. I'm not familiar with Solaris, so can't really suggest what
to tune there..

> Notice it is several times longer latency with much wider variation.
> How do we tune the opensolaris network stack to reduce its latency? I'd
> really like to improve the individual user experience. I can tell them
> it's like commuting to work on the train instead of the car during rush
> hour - faster when there's lots of traffic but slower when there is not,
> but they will judge the product by their individual experiences more
> than their collective experiences. Thus, I really want to improve the
> individual disk operation throughput.
>
> Latency seems to be our key. If I can add only 20 micro-seconds of
> latency from initiator and target each, that would be roughly 200 micro
> seconds. That would almost triple the throughput from what we are
> currently seeing.
>

Indeed

> Unfortunately, I'm a bit ignorant of tweaking networks on opensolaris.
> I can certainly learn but am I headed in the right direction or is this
> direction of investigation misguided? Thanks - John
>

Low latency is the key for good (iSCSI) SAN performance, as it directly
gives you more (possible) IOPS.

Other option is to configure software/settings so that there are multiple
outstanding IO's on the fly.. then you're not limited with the latency (so much).
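
To put rough numbers on both points, here is a minimal Python sketch of
a deliberately simple model: each I/O costs one network round trip, and
with several I/Os outstanding that many blocks complete per round trip.
It ignores disk service time, protocol overhead and link bandwidth, and
the function name is made up for illustration. The round-trip values are
the ping averages reported in this thread (574 and 337 microseconds)
plus the 200 microsecond target John mentions.

```python
def sync_throughput_mb_s(block_bytes, rtt_us, outstanding_ios=1):
    """Upper bound on throughput when each I/O costs one round trip.

    With `outstanding_ios` requests in flight, that many blocks complete
    per round trip. Ignores disk service time and bandwidth limits.
    """
    iops = outstanding_ios * 1_000_000 / rtt_us
    return iops * block_bytes / 1_000_000

for rtt_us in (574, 337, 200):
    print(f"{rtt_us} us RTT, 4 KB, 1 outstanding: "
          f"{sync_throughput_mb_s(4096, rtt_us):.1f} MB/s")

print(f"574 us RTT, 4 KB, 8 outstanding: "
      f"{sync_throughput_mb_s(4096, 574, outstanding_ios=8):.1f} MB/s")
```

With one outstanding 4 KB request this gives about 7.1, 12.2 and 20.5
MB/s respectively. The 337 microsecond case lands close to the roughly
12 MB/s reported earlier in the thread, and going from 574 to 200
microseconds is a bit under a factor of three, which fits the "almost
triple" estimate. Treat it as a sanity check on the reasoning, not as a
prediction.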

-- Pasi

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 03-25-2009, 02:41 AM
"John A. Sullivan III"
 
Default Shell Scripts or Arbitrary Priority Callouts?

On Tue, 2009-03-24 at 20:17 +0200, Pasi Kärkkäinen wrote:
> On Tue, Mar 24, 2009 at 01:30:10PM -0400, John A. Sullivan III wrote:
> > Thanks very much, again, and, again, I'll reply in the text - John
> >
>
> Np
>
> > >
> > > iirc 2810 does not have very big buffers per port, so you might be better
> > > using flow control instead of jumbos.. then again I'm not sure how good flow
> > > control implementation HP has?
> > >
> > > The whole point of flow control is to prevent packet loss/drop.. this happens
> > > with sending pause frames before the port buffers get full. If port buffers
> > > get full then the switch doesn't have any other option than to drop the
> > > packets.. and this causes tcp-retransmits -> causes delay and tcp slows down
> > > to prevent further packet drops.
> > >
> > > flow control "pause frames" cause less delay than tcp-retransmits.
> > >
> > > Do you see tcp retransmits with "netstat -s" ? Check both the target and the initiators.
> > Thankfully this is an area of some expertise for me (unlike disk I/O -
> > obviously ). We have been pretty thorough about checking the
> > network path. We've not seen any upper layer retransmission or buffer
> > overflows.
>
> Good
>
> > > > > > > What kind of performance do you get using just a single iscsi session (and
> > > > > > > thus just a single path), no multipathing, no DM RAID0 ? Just a filesystem
> > > > > > > directly on top of the iscsi /dev/sd? device.
> > > > > > Miserable - same roughly 12 MB/s.
> > > > >
> > > > > OK, Here's your problem. Was this btw reads or writes? Did you tune
> > > > > readahead-settings?
> > > > 12MBps is sequential reading but sequential writing is not much
> > > > different. We did tweak readahead to 1024. We did not want to go much
> > > > larger in order to maintain balance with the various data patterns -
> > > > some of which are random and some of which may not read linearly.
> > >
> > > I did some benchmarking earlier between two servers; other one running ietd
> > > target with 'nullio' and other running open-iscsi initiator. Both using a single gigabit NIC.
> > >
> > > I remember getting very close to full gigabit speed at least with bigger
> > > block sizes. I can't remember how much I got with 4 kB blocks.
> > >
> > > Those tests were made with dd.
> > Yes, if we use 64KB blocks, we can saturate a Gig link. With larger
> > sizes, we can push over 3 Gbps over the four gig links in the test
> > environment.
>
> That's good.
>
> > >
> > > nullio target is a good way to benchmark your network and initiator and
> > > verify everything is correct.
> > >
> > > Also it's good to first test for example with FTP and Iperf to verify
> > > network is working properly between target and the initiator and all the
> > > other basic settings are correct.
> > We did flood ping the network and had all interfaces operating at near
> > capacity. The network itself looks very healthy.
>
> Ok.
>
> > >
> > > Btw have you configured tcp stacks of the servers? Bigger default tcp window
> > > size, bigger maximum tcp window size etc..
> > Yep, tweaked transmit queue length, receive and transmit windows, net
> > device backlogs, buffer space, disabled nagle, and even played with the
> > dirty page watermarks.
>
> That's all taken care of then
>
> Also on the target?
>
> > >
> > > > >
> > > > > Can you paste your iSCSI session settings negotiated with the target?
> > > > Pardon my ignorance but, other than packet traces, how do I show the
> > > > final negotiated settings?
> > >
> > > Try:
> > >
> > > iscsiadm -i -m session
> > > iscsiadm -m session -P3
> > >
> > Here's what it says. Pretty much as expected. We are using COMSTAR on
> > the target and took some traces to see what COMSTAR was expecting. We
> > set the open-iscsi parameters to match:
> >
> > Current Portal: 172.x.x.174:3260,2
> > Persistent Portal: 172.x.x.174:3260,2
> > **********
> > Interface:
> > **********
> > Iface Name: default
> > Iface Transport: tcp
> > Iface Initiatorname: iqn.2008-05.biz.ssi:vd-gen
> > Iface IPaddress: 172.x.x.162
> > Iface HWaddress: default
> > Iface Netdev: default
> > SID: 32
> > iSCSI Connection State: LOGGED IN
> > iSCSI Session State: LOGGED_IN
> > Internal iscsid Session State: NO CHANGE
> > ************************
> > Negotiated iSCSI params:
> > ************************
> > HeaderDigest: None
> > DataDigest: None
> > MaxRecvDataSegmentLength: 131072
> > MaxXmitDataSegmentLength: 8192
> > FirstBurstLength: 65536
> > MaxBurstLength: 524288
> > ImmediateData: Yes
> > InitialR2T: Yes
>
> I guess InitialR2T could be No for a bit better performance?
>
> MaxXmitDataSegmentLength looks small?
>
> > > > > You should be able to get many times the throughput you get now.. just with
> > > > > a single path/session.
> > > > >
> > > > > What kind of latency do you have from the initiator to the target/storage?
> > > > >
> > > > > Try with for example 4 kB ping:
> > > > > ping -s 4096 <ip_of_the_iscsi_target>
> > > > We have about 400 micro seconds - that seems a bit high
> > > > rtt min/avg/max/mdev = 0.275/0.337/0.398/0.047 ms
> > > >
> > >
> > > Yeah.. that's a bit high.
> > Actually, with more testing, we're seeing it stretch up to over 700
> > micro-seconds. I'll attach a raft of data I collected at the end of
> > this email.
>
> Ok.
>
> > > I think Ross suggested in some other thread the following settings for e1000
> > > NICs:
> > >
> > > "Set the e1000s InterruptThrottleRate=1 and their TxRingBufferSize=4096
> > > and RxRingBufferSize=4096 (verify those option names with a modinfo)
> > > and add those to modprobe.conf."
> > We did try playing with the ring buffer but to no avail. Modinfo does
> > not seem to display the current settings. We did try playing with
> > setting the InterruptThrottleRate to 1 but again to no avail. As I'll
> > mention later, I suspect the issue might be the opensolaris based
> > target.
>
> Could be..
>
> > >
> > > > I would love to use larger block sizes as you suggest in your other
> > > > email but, on AMD64, I believe we are stuck with 4KB. I've not seen any
> > > > way to change it and would gladly do so if someone knows how.
> > > >
> > >
> > > Are we talking about filesystem block sizes? That shouldn't be a problem if
> > > your application uses larger blocksizes for read/write operations..
> > >
> > Yes, file system block size. When we try rough, end user style tests,
> > e.g., large file copies, we seem to get the performance indicated by 4KB
> > blocks, i.e., lousy!
>
> Yep.. try upgrading to 10 Gbit Ethernet for much lower latency
>
> > > Try for example with:
> > > dd if=/dev/zero of=/iscsilun/file.bin bs=1024k count=1024
> > Large block sizes can make the system truly fly so we suspect you are
> > absolutely correct about latency being the issue. We did do our testing
> > with raw interfaces by the way.
>
> Ok.
>
> > <snip>
> > I did a little digging and calculating and here is what I came up with
> > and sent to Nexenta. Please tell me if I am on the right track.
> >
> > I am using jumbo frames and should be able to get 2 4KB blocks
> > per frame. Total size should be 8192 + 78 (TCP + IP + Ethernet + CRC
> > -oops we need to add iSCSI -what size is the iSCSI header?) + 12
> > (interframe gap) = 8282 bytes. Transmission latency should be 8282 *
> > 8 / 1,000,000,000 = 66.3 micro-seconds. Switch latency is 5.7
> > microseconds so let's say network latency is 72 - well let's say 75
> > micro-seconds. The only additional latency should be added by the
> > network stacks on the target and initiator.
> >
> > Current round trip latency between the initiator (Linux) and target
> > (Nexenta) is around 400 micro-seconds and fluctuates significantly:
> >
> > Hmm . . this is worse than the last test:
> > PING 172.30.13.158 (172.30.13.158) 8192(8220) bytes of data.
>
> > --- 172.30.13.158 ping statistics ---
> > 33 packets transmitted, 33 received, 0% packet loss, time 32000ms
> > rtt min/avg/max/mdev = 0.399/0.574/1.366/0.161 ms
> >
> > There is nothing going on in the network. So we are seeing 574
> > micro-seconds total with only 150 micro-seconds attributed to
> > transmission. And we see a wide variation in latency.
> >
>
> Yeah something wrong there.. How much latency do you have between different
> initiator machines?
>
> > I then tested the latency between interfaces on the initiator and the
> > target. Here is what I get for internal latency on the Linux initiator:
> > PING 172.30.13.18 (172.30.13.18) from 172.30.13.146 : 8192(8220) bytes
> > of data.
> > --- 172.30.13.18 ping statistics ---
> > 29 packets transmitted, 29 received, 0% packet loss, time 27999ms
> > rtt min/avg/max/mdev = 0.017/0.018/0.033/0.005 ms
> >
> > A very consistent 18 micro-seconds.
> >
>
> Yeah, I take it that's not through network/switch
>
> > Here is what I get from the Z200:
> > root@disk01:/etc# ping -s -i e1000g6 172.30.13.190 4096
> > PING 172.30.13.190: 4096 data bytes
> > ----172.30.13.190 PING Statistics----
> > 31 packets transmitted, 31 packets received, 0% packet loss
> > round-trip (ms) min/avg/max/stddev = 0.042/0.066/0.104/0.019
> >
>
> Big difference.. I'm not familiar with Solaris, so can't really suggest what
> to tune there..
>
> > Notice it is several times longer latency with much wider variation.
> > How do we tune the opensolaris network stack to reduce its latency? I'd
> > really like to improve the individual user experience. I can tell them
> > it's like commuting to work on the train instead of the car during rush
> > hour - faster when there's lots of traffic but slower when there is not,
> > but they will judge the product by their individual experiences more
> > than their collective experiences. Thus, I really want to improve the
> > individual disk operation throughput.
> >
> > Latency seems to be our key. If I can add only 20 micro-seconds of
> > latency from initiator and target each, that would be roughly 200 micro
> > seconds. That would almost triple the throughput from what we are
> > currently seeing.
> >
>
> Indeed
>
> > Unfortunately, I'm a bit ignorant of tweaking networks on opensolaris.
> > I can certainly learn but am I headed in the right direction or is this
> > direction of investigation misguided? Thanks - John
> >
>
> Low latency is the key for good (iSCSI) SAN performance, as it directly
> gives you more (possible) IOPS.
>
> Other option is to configure software/settings so that there are multiple
> outstanding IO's on the fly.. then you're not limited with the latency (so much).
>
> -- Pasi
<snip>
Ross has been of enormous help offline. Indeed, disabling jumbo packets
produced an almost 50% increase in single threaded throughput. We are
pretty well set although still a bit disappointed in the latency we are
seeing in opensolaris and have escalated to the vendor about addressing
it.

The one piece which is still a mystery is why using four targets on
four separate interfaces striped with dmadm RAID0 does not produce an
aggregate of slightly less than four times the IOPS of a single target
on a single interface. This would not seem to be the out of order SCSI
command problem of multipath. One of life's great mysteries yet to be
revealed. Thanks again, all - John
--
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
jsullivan@opensourcedevel.com

http://www.spiritualoutreach.com
Making Christianity intelligible to secular society


 
