Debian User: LVM write performance
08-13-2011, 04:40 PM
Stan Hoeppner

LVM write performance

On 8/13/2011 9:45 AM, Ivan Shmakov wrote:
>>>>>> Stan Hoeppner <stan@hardwarefreak.com> writes:
>
> […]
>
> > The horrible performance with bs=512 is likely due to the LVM block
> > size being 4096, and forcing block writes that are 1/8th normal size,
> > causing lots of merging. If you divide 120MB/s by 8 you get 15MB/s,
> > which IIRC from your original post, is approximately the write
> > performance you were seeing, which was 19MB/s.
>
> I'm not an expert in that matter either, but I don't seem to
> recall that LVM uses any “blocks”, other than, of course, the
> LVM “extents.”
>
> What's more important in my opinion is that 4096 is exactly the
> platform's page size.
>
> --cut: vgcreate(8) --
> -s, --physicalextentsize PhysicalExtentSize[kKmMgGtT]
> Sets the physical extent size on physical volumes of this volume
> group. A size suffix (k for kilobytes up to t for terabytes) is
> optional, megabytes is the default if no suffix is present. The
> default is 4 MB and it must be at least 1 KB and a power of 2.
> --cut: vgcreate(8) --

To use a water analogy, an extent is a pool used for storing data; it
has nothing to do with how the payload is transferred. A block is a
bucket used to carry data to and from the pool.

If one fills his bucket only 1/8th full, it will take 8 times as many
trips (transfers) to fill the pool vs carrying a full bucket each time.
This is inefficient. This is a factor in the OP's problem. This is a
very coarse analogy, and maybe not the best, but gets the overall point
across.

The LVM block (bucket) size is 4kB, which does indeed match the page
size, and that is important. It also matches the default filesystem
block size of all Linux filesystems. This is not a coincidence:
everything in Linux is optimized around a 4kB page size, whether for
memory management or for I/O.
And to drive the point home that this isn't an LVM or RAID problem, but
a proper use of dd problem, here's a demonstration of the phenomenon on
a single low end internal 7.2k SATA disk w/16MB cache, with a partition
formatted with XFS, write barriers enabled:

$ dd if=/dev/zero of=./test1 bs=512 count=1000000
512000000 bytes (512 MB) copied, 16.2892 s, 31.4 MB/s

$ dd if=/dev/zero of=./test1 bs=1024 count=500000
512000000 bytes (512 MB) copied, 10.5173 s, 48.7 MB/s

$ dd if=/dev/zero of=./test1 bs=2048 count=250000
512000000 bytes (512 MB) copied, 7.77854 s, 65.8 MB/s

$ dd if=/dev/zero of=./test1 bs=4096 count=125000
512000000 bytes (512 MB) copied, 6.64778 s, 77.0 MB/s

$ dd if=/dev/zero of=./test1 bs=8192 count=62500
512000000 bytes (512 MB) copied, 6.10967 s, 83.8 MB/s

$ dd if=/dev/zero of=./test1 bs=16384 count=31250
512000000 bytes (512 MB) copied, 6.11042 s, 83.8 MB/s

This test system is rather old, having only 384MB RAM. I tested with
and without conv=fsync and the results are the same. This clearly
demonstrates that one should always use a 4kB block size with dd,
whether the target is an HDD or SSD, LVM or mdraid, or hardware RAID.
Floppy drives, tape, and other slower devices probably need a different
dd block size.
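
For reference, these 4kB figures are easy to check on a given box. A
few lines of C (an illustrative sketch, not part of the tests above)
print the page size and the block size a mounted filesystem reports:

#include <stdio.h>
#include <unistd.h>
#include <sys/statvfs.h>

int main(int argc, char **argv)
{
    /* Defaults to the current directory; pass a mount point to check
       another filesystem. */
    const char *path = (argc > 1) ? argv[1] : ".";
    struct statvfs sv;

    printf("page size: %ld bytes\n", sysconf(_SC_PAGESIZE));

    if (statvfs(path, &sv) == 0)
        printf("%s: filesystem block size: %lu bytes\n",
               path, (unsigned long)sv.f_bsize);
    else
        perror("statvfs");

    return 0;
}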

--
Stan



 
08-14-2011, 07:14 AM
Dion Kant

LVM write performance

On 08/13/2011 03:55 PM, Stan Hoeppner wrote:
> My explanation to you wasn't fully correct. I confused specifying no
> block size with specifying an insanely large block size. The other post
> I was referring to dealt with people using a 1GB (or larger) block size
> because it made the math easier for them when wanting to write a large
> test file.
Ok, that makes sense. It is the large specified bs that makes dd buffer
the data first.

> Instead of dividing their total file size by 4096 and using the result
> for "bs=4096 count=X" (which is the proper method I described to you)
> they were simply specifying, for example, "bs=2G count=1" to write a 2
> GB test file. Doing this causes the massive buffering I described, and
> consequently, horrible performance, typically by a factor of 10 or more,
> depending on the specific system.
>
> The horrible performance with bs=512 is likely due to the LVM block size
> being 4096, and forcing block writes that are 1/8th normal size, causing
> lots of merging. If you divide 120MB/s by 8 you get 15MB/s, which IIRC
> from your original post, is approximately the write performance you were
> seeing, which was 19MB/s.
Recall that I already took LVM out of the loop, so now I am doing the
experiment writing data straight to the block device, in my case
/dev/sdb4. (If writing at the block device level does not perform, how
will LVM be able to perform?)

Inspired by your advice, I did some more investigation. I wrote a small
test program, taking dd out of the loop as well. It writes 1 GB of test
data with increasing block sizes directly to /dev/sdb4. Here are some
results:

root@dom0-2:~# ./bw
Writing 1 GB
bs time rate
(bytes) (s) (MiB/s)
1 85.476 12.5619
2 33.016 32.5218
4 23.6675 45.3679
8 20.112 53.3881
16 18.76 57.2356
32 17.872 60.0795
64 17.636 60.8834
128 17.096 62.8064
256 17.188 62.4704
512 16.8482 63.7303
1024 57.6053 18.6396
2048 57.94 18.532
4096 17.016 63.1019
8192 16.604 64.6675
16384 16.452 65.2649
32768 17.132 62.6748
65536 16.256 66.052
131072 16.44 65.3127
262144 16.264 66.0194
524288 16.388 65.5199

The good and problematic block sizes do not really coincide with the
ones I observe with dd, but the odd behaviour is there. There are some
magic block sizes {1, 1024, 2048} which cause a drop in performance.
Looking at the vmstat output at the same time, I see unexpected bi
(blocks in) and the interrupt rate goes sky high.

In my case it is the ahci driver handling the writes. Here is the
vmstat trace for the bs=1 write; I add some more observations below it:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 0 6379780 23820 112616 0 0 0 0 78 82 0 0 100 0
0 0 0 6379780 23820 112616 0 0 0 0 77 80 0 0 100 0
0 0 0 6379780 23820 112616 0 0 0 0 79 80 0 0 100 0
0 0 0 6379780 23820 112616 0 0 0 0 78 82 0 0 100 0
0 0 0 6379780 23820 112616 0 0 0 0 76 80 0 0 100 0
0 0 0 6379780 23820 112616 0 0 0 0 77 83 0 0 100 0
0 0 0 6379780 23820 112616 0 0 0 0 75 80 0 0 100 0
0 0 0 6379780 23820 112616 0 0 0 0 82 82 0 0 100 0
0 0 0 6379780 23820 112616 0 0 0 0 90 93 0 0 100 0
1 0 0 6376796 27132 112524 0 0 828 4 400 531 0 0 100 0
1 0 0 6346416 57496 112560 0 0 7590 0 2408 3877 5 0 92 3
1 0 0 6315788 88048 112580 0 0 7638 0 2435 3903 7 0 90 3
1 1 0 6284416 118548 112540 0 0 7624 0 2428 3903 6 0 91 3
1 0 0 6253168 148896 112564 0 0 7586 0 2403 3875 6 0 91 3
1 0 0 6221920 179284 112484 0 0 7596 0 2414 3884 5 0 93 2
1 0 0 6190672 209648 112540 0 0 7590 0 2417 3877 5 0 93 2
0 1 0 6160540 239796 112536 0 0 7540 0 6240 3851 6 0 76 18
0 1 0 6129540 269952 112584 0 0 7538 0 6255 3856 6 0 86 8
1 0 0 6098292 300116 112504 0 0 7540 0 6233 3853 5 0 89 6
1 0 0 6067540 330280 112552 0 0 7538 0 6196 3857 6 0 87 7
1 0 0 6036540 360452 112536 0 0 7542 0 6281 3868 5 0 89 6
1 0 0 6005540 390608 112464 0 0 7540 0 6268 3856 6 0 85 8
1 0 0 5974292 420788 112516 0 0 7542 0 6246 3865 6 0 86 7
1 0 0 5943416 450952 112444 0 0 7540 0 6253 3860 5 0 88 6
1 0 0 5912540 481128 112488 0 0 7546 0 6226 3861 6 0 86 7
1 0 0 5881292 511300 112472 0 0 7540 0 6225 3860 5 0 89 6
1 0 0 5850292 541456 112464 0 0 7538 0 6192 3858 6 0 86 7
0 2 0 5817664 570260 112516 0 0 7200 40706 5990 4820 6 0 81 13
0 2 0 5789268 597752 112472 0 0 6870 0 5775 5251 5 0 80 15
1 1 0 5760996 625164 112676 0 0 6854 8192 5795 5248 5 0 73 21
1 1 0 5732476 653232 112572 0 0 7014 8192 5285 5362 5 0 82 13
1 1 0 5704080 680924 112676 0 0 6922 0 2340 5290 3 0 92 5
1 1 0 5674504 709444 112540 0 0 7130 8192 2404 5469 5 0 71 24
1 1 0 5646184 737144 112484 0 0 6924 0 2320 5293 5 0 85 10
1 1 0 5617460 765004 112484 0 0 6966 8192 5844 5329 5 0 75 20
2 2 0 5588264 793288 112500 0 0 7068 8192 5313 5404 4 0 85 11
1 1 0 5559556 821084 112628 0 0 6948 0 2326 5309 8 0 78 14
1 1 0 5530468 849304 112476 0 0 7054 8192 2374 5395 5 0 75 20
1 1 0 5501892 876956 112464 0 0 6912 8192 2321 5285 5 0 85 10
0 2 0 5472936 905044 112584 0 0 7024 0 5889 5370 5 0 70 25
0 2 0 5444476 933096 112596 0 0 7010 8192 5874 5360 4 0 82 13
0 2 0 5415520 960924 112476 0 0 6960 0 5841 5323 6 0 70 24
1 1 0 5386580 989096 112696 0 0 7038 8192 5282 5384 6 0 69 25
2 2 0 5357624 1017164 112688 0 0 7016 0 2358 5362 4 0 89 7
1 1 0 5328428 1045280 112580 0 0 7028 8192 2356 5379 5 0 80 15
0 2 0 5296688 1072396 112540 0 0 6778 50068 2314 5194 0 0 99 1
0 2 0 5297044 1072396 112616 0 0 0 64520 317 176 0 0 75 24
0 2 0 5297044 1072396 112616 0 0 0 64520 310 175 0 0 77 23
0 2 0 5297044 1072396 112616 0 0 0 64520 300 161 0 0 85 15
0 2 0 5297052 1072396 112616 0 0 0 72204 317 180 0 0 77 22
0 2 0 5297052 1072396 112616 0 0 0 64520 307 170 0 0 84 16
0 1 0 5300540 1072396 112616 0 0 0 21310 309 203 0 0 98 2
1 0 0 6351440 52252 112680 0 0 54 25 688 343 1 1 63 35
1 0 0 6269720 133036 112600 0 0 0 0 575 88 7 0 93 0
1 0 0 6186516 213812 112560 0 0 0 0 568 83 9 0 91 0
1 0 0 6103560 294588 112512 0 0 0 0 569 85 6 0 94 0
1 0 0 6020852 375428 112688 0 0 0 0 571 84 9 0 90 0
1 0 0 5937896 456244 112664 0 0 0 0 571 86 7 0 93 0

Writing to /dev/sdb4 starts when there is a step in the interrupt
column. As long as the interrupt rate is high there is bi related to
this writing. After the initial buffering there is a first write to the
disk at 40MB/s averaged over 2 seconds. Then there are only a couple of
8MB/s writes, and in the meantime the (kernel) buffer grows up to
1072396 kB. Then the driver starts writing at the expected rates and
the interrupt rate goes down to a reasonable level. It is only at the
end of the write that the ahci driver gives back its buffer memory.
After this, when the interrupt rate goes to a level of about 570, the
ahci driver is swallowing the second write iteration, with a block size
of 2 bytes.

Here is the code fragment responsible for writing and measuring:

sync();
gettimeofday(&tstart, &tz);
for (int i = 0; i < Ntot/N; ++i)
    sdb4->write(buf, N);    // write N bytes per call, Ntot bytes in total
sdb4->flush();
sdb4->close();
sync();
gettimeofday(&tstop, &tz);

N is the block size and sdb4 is

ofstream* sdb4 = new ofstream("/dev/sdb4", ofstream::binary);


I think Stan is right that this may be something in the ahci kernel driver.

I have a 3ware controller lying around. I might repeat the experiments
with it and post the results here if someone is interested.

Dion

> If my explanation doesn't seem thorough enough, that's because I'm not
> a kernel expert. I just have a little better than average knowledge and
> understanding of some aspects of the kernel.
>
> If you want a really good explanation of the reasons behind this dd
> block size behavior while writing to a raw LVM device, try posting to
> lkml proper or one of the sub lists dealing with LVM and the block
> layer. Also, I'm sure some of the expert developers on the XFS list
> could answer this as well, though it would be a little OT there, unless
> of course your filesystem test yielding the 120MB/s was using XFS.
>
> -- Stan


 
08-14-2011, 11:23 AM
Dion Kant

LVM write performance

On 08/14/2011 09:14 AM, Dion Kant wrote:
> The good and problematic block sizes do not really coincide with the
> ones I observe with dd, but the odd behaviour is there.
When testing on Linux kernel 2.6.37.6-0.5-xen, I found that a sync()
call does not guarantee that the buffers are actually written to disk.
This forced me to switch to writing through a file descriptor, so that
I can use fsync() to determine the moment when the data has really been
written to disk. Now the results coincide with the ones obtained with
dd. (Obviously dd uses file descriptors as well.) Forget about the
previous results; they are wrong because of libstdc++ stream buffering,
and I had not checked how those buffers are actually written to kernel
space.

Now I obtain:

dom0-2:~ # ./bw
Writing 1 GB
bs time rate
(bytes) (s) (MiB/s)
128 66.6928 16.0998
256 57.1125 18.8005
512 57.219 18.7655
1024 56.6571 18.9516
2048 55.5829 19.3179
4096 14.9638 71.7558
8192 15.6889 68.4395
16384 16.3382 65.7197
32768 15.2223 70.5372
65536 15.2356 70.4757
131072 15.2417 70.4474
262144 16.4634 65.2201
524288 15.2347 70.4802

The best result is obtained with Stan's golden rule of bs=4096; there
are a lot of interrupts when the bs is not an integral multiple of
4096.

int fd = open("/dev/sdb4", O_WRONLY | O_APPEND);

...

gettimeofday(&tstart, &tz);
for (int i = 0; i < Ntot/N; ++i)
    written += write(fd, buf, N);   // N bytes per write() call
fsync(fd);                          // wait until the data has reached the disk
close(fd);
gettimeofday(&tstop, &tz);
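
For completeness, a self-contained version of such a test could look
roughly like the sketch below (an illustrative reconstruction, not the
actual bw program; the device path and the 1 GB total are taken from
the discussion above):

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

int main(int argc, char **argv)
{
    const char *dev = (argc > 1) ? argv[1] : "/dev/sdb4";
    size_t N = (argc > 2) ? (size_t)atol(argv[2]) : 4096;  /* block size */
    const size_t Ntot = 1UL << 30;                         /* 1 GiB total */

    char *buf = calloc(1, N);
    if (buf == NULL) { perror("calloc"); return 1; }

    int fd = open(dev, O_WRONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct timeval tstart, tstop;
    gettimeofday(&tstart, NULL);

    size_t written = 0;
    while (written < Ntot) {
        ssize_t n = write(fd, buf, N);
        if (n < 0) { perror("write"); return 1; }
        written += (size_t)n;
    }
    fsync(fd);                     /* wait until the data reaches the disk */
    close(fd);

    gettimeofday(&tstop, NULL);
    double secs = (tstop.tv_sec - tstart.tv_sec)
                + (tstop.tv_usec - tstart.tv_usec) / 1e6;
    printf("bs=%zu: %.3f s, %.2f MiB/s\n",
           N, secs, written / secs / (1024.0 * 1024.0));
    return 0;
}

Run it once per block size of interest, e.g. ./bw /dev/sdb4 512.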

Dion


 
08-14-2011, 11:56 AM
Stan Hoeppner

LVM write performance

On 8/14/2011 2:14 AM, Dion Kant wrote:
> On 08/13/2011 03:55 PM, Stan Hoeppner wrote:
>> My explanation to you wasn't fully correct. I confused specifying no
>> block size with specifying an insanely large block size. The other post
>> I was referring to dealt with people using a 1GB (or larger) block size
>> because it made the math easier for them when wanting to write a large
>> test file.
> Ok, that makes sense. It is the large specified bs which make the dd is
> going to buffer the data first.

Yep, it's really dramatic on machines with low memory due to swapping.
When I first tested this phenomenon with a 1GB dd block size on my
machine with only 384 MB RAM and a 1GB swap partition, it took many
minutes to complete, vs tens of seconds using a 4kB block size. Almost
all of the 2GB of test data was being pushed into swap, then read back
from swap and written to the file (swap and file on the same physical
disk). This is one of the reasons I keep this old machine around:
problems of this nature show up more quickly and are more easily
identified.

>> Instead of dividing their total file size by 4096 and using the result
>> for "bs=4096 count=X" (which is the proper method I described to you)
>> they were simply specifying, for example, "bs=2G count=1" to write a 2
>> GB test file. Doing this causes the massive buffering I described, and
>> consequently, horrible performance, typically by a factor of 10 or more,
>> depending on the specific system.
>>
>> The horrible performance with bs=512 is likely due to the LVM block size
>> being 4096, and forcing block writes that are 1/8th normal size, causing
>> lots of merging. If you divide 120MB/s by 8 you get 15MB/s, which IIRC
>> from your original post, is approximately the write performance you were
>> seeing, which was 19MB/s.
> Recall that I took LVM out of the loop already. So now I am doing the
> experiment with writing data straight to the block device. In my case
> /dev/sdb4. (If writing on the block device level does not perform, how
> will LVM be able to perform?)
>
> Inspired by your advice, I did some more investigation on this. I wrote
> a small test program, i.e. taking dd out of the loop as well. It writes
> 1 GB test data with increasing block sizes directly to /dev/sdb4. Here
> are some results:
>
> root@dom0-2:~# ./bw
> Writing 1 GB
> bs time rate
> (bytes) (s) (MiB/s)
> 1 85.476 12.5619
> 2 33.016 32.5218
> 4 23.6675 45.3679
> 8 20.112 53.3881
> 16 18.76 57.2356
> 32 17.872 60.0795
> 64 17.636 60.8834
> 128 17.096 62.8064
> 256 17.188 62.4704
> 512 16.8482 63.7303
> 1024 57.6053 18.6396
> 2048 57.94 18.532
> 4096 17.016 63.1019
> 8192 16.604 64.6675
> 16384 16.452 65.2649
> 32768 17.132 62.6748
> 65536 16.256 66.052
> 131072 16.44 65.3127
> 262144 16.264 66.0194
> 524288 16.388 65.5199

The dips at 1024 & 2048 are strange, but not entirely unexpected.

> The good and problematic block sizes do not really coincide with the
> ones I observe with dd, but the odd behaviour is there. There are some
> magic block sizes {1,1024, 2048} which cause a drop in performance.
> Looking at vmstat output at the same time I see unexpected bi and the
> interrupt rate goes sky high.
>
> In my case it is the ahci driver handling the writes. Here is the vmstat
> trace belonging to the bs=1 write and I add some more observations below:

Yeah, every platform will have quirks.

> […]
>
> Writing to /dev/sdb4 starts when there is a step in the interrupt
> column. As long as the interrupts are high there is bi related to this
> writing. After initial buffering there is a first write to the disk at
> 40MB/s averaged over 2 seconds. Then only a couple of 8MB/s writes and
> in the mean time the (kernel) buffer is growing up to 1072396 kB. Then
> the driver starts writing at expected rates and the interrupt rate goes
> down to a reasonable level. It is only at the end of the write that the
> ahci driver gives back its buffer memory. After this, when the interrupt
> rate goes to a level of about 570, the ahci driver is swallowing the
> second write iteration with a block size of 2 bytes.

What do you see when you insert large delays between iterations, or run
each iteration after clearing the baffles, i.e.
$ echo 3 > /proc/sys/vm/drop_caches
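
(If the test program is to drop the caches itself between runs, the
same thing can be done from C; a small sketch, which needs root:)

#include <stdio.h>
#include <unistd.h>

/* Equivalent of: sync; echo 3 > /proc/sys/vm/drop_caches */
static int drop_caches(void)
{
    sync();                         /* write out dirty data first */
    FILE *f = fopen("/proc/sys/vm/drop_caches", "w");
    if (f == NULL)
        return -1;
    fputs("3\n", f);                /* 3 = page cache + dentries/inodes */
    return fclose(f);
}

int main(void)
{
    if (drop_caches() != 0)
        perror("drop_caches");
    return 0;
}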

> Here is the code fragment responsible for writing and measuring:
>
> sync();
> gettimeofday(&tstart, &tz);
> for (int i=0; i<Ntot/N; ++i)
> sdb4->write(buf,N);
> sdb4->flush();
> sdb4->close();
> sync();
> gettimeofday(&tstop, &tz);
>
> N is the block size and sdb4 is
>
> ofstream* sdb4 = new ofstream("/dev/sdb4", ofstream::binary);
>
>
> I think Stan is right that this may be something in the ahci kernel driver.

Could be. Could just need tweaking, say queue_depth, elevator, etc.
Did you test with all 3 elevators or just one?

> I have some 3ware controller laying around. I might repeat the
> experiments with this and post them here if someone is interested.

If it doesn't have the hardware write cache enabled you will likely see
worse performance than with the current drive/controller.

--
Stan


 
08-14-2011, 12:30 PM
Dion Kant

LVM write performance

On 08/14/2011 01:23 PM, Dion Kant wrote:
> Forget
> about the previous results, they will be wrong because of libstdc++ stream
> buffering and I did not check how these buffers are actually written to
> kernel space.
libstdc++ uses writev to write out an array of buffers to kernel space:

User bs Actual bs
1 8191
2 8192
4 8192
8 8192
16 8192
32 8192
64 8192
128 8192
256 8192
512 8192
1024 1024
2048 2048
4096 4096
8192 8192

Except for writing single user bytes, libstdc++ does a good job of
gathering the data into 8192-byte buffers. From a user bs of 1024
upward it passes the user's block size through unchanged when writing
the data to kernel space. That explains the results I obtained with the
write method of ofstream: in all cases where the kernel is addressed
with a buffer size that is an integral multiple of 4096, the
performance is good.

I think the fact that the buffer is one byte short (8191 instead of
8192) in the single-byte case leaves room for improvement in libstdc++.
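
For anyone who has not seen it, the writev(2) gather pattern looks, in
its simplest form, like this (a minimal illustrative sketch, unrelated
to the libstdc++ internals themselves):

#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
    char a[] = "first buffer ";
    char b[] = "second buffer\n";

    /* Two separate user buffers, handed to the kernel in one system call. */
    struct iovec iov[2];
    iov[0].iov_base = a;  iov[0].iov_len = strlen(a);
    iov[1].iov_base = b;  iov[1].iov_len = strlen(b);

    if (writev(STDOUT_FILENO, iov, 2) < 0)
        perror("writev");
    return 0;
}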

Dion



 
08-19-2011, 09:38 PM
Dion Kant

LVM write performance

On 08/14/2011 02:30 PM, Dion Kant wrote:
> On 08/14/2011 01:23 PM, Dion Kant wrote:
>> Forget
>> about the previous results, they will be wrong because of libstdc++ stream
>> buffering and I did not check how these buffers are actually written to
>> kernel space.
> libstdc++ uses writev to write out an array of buffers to kernel space
>
> User bs Actual bs
> 1 8191
> 2 8192
> 4 8192
> 8 8192
> 16 8192
> 32 8192
> 64 8192
> 128 8192
> 256 8192
> 512 8192
> 1024 1024
> 2048 2048
> 4096 4096
> 8192 8192
>
> Except for writing single user bytes, libstdc++ does a good job in gathering the data into buffers with an integral buffer size of 8192 bytes. From a user bs of 1024 and further, it sticks to this buffer size for writing the data to kernel space. So that explains the results I obtained with the write method of ofstream. For all cases the kernel is addressed with a buffer size which is an integral multiple of 4096 the performance is good.
>
> I think the one to less buffer size for the single byte case provides an option for improvement of libstdc++.
>
> Dion

I now think I understand the "strange" behaviour for block sizes that
are not an integral multiple of 4096 bytes. (Of course you guys already
knew the answer but just didn't want to make it easy for me to find
it.)

Newer disks have a sector size of 4096 bytes. They may still report 512
bytes, but that is only to keep some ancient OSes working.

When a block write is not an integral multiple of 4096 bytes, for
example 512, 4095 or 8191 bytes, the driver must first read the sector,
modify it, and finally write it back to the disk. This explains the bi
and the increased number of interrupts.

I did some Google searches but did not find much. Can someone confirm
this hypothesis?
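
(For what it is worth, the sector sizes the kernel thinks the drive
reports can be queried directly; a sketch, with the device path only as
an example. The same values are exposed in
/sys/block/<dev>/queue/logical_block_size and physical_block_size.)

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
    const char *dev = (argc > 1) ? argv[1] : "/dev/sdb";
    int fd = open(dev, O_RDONLY);   /* usually needs root or disk group */
    if (fd < 0) { perror("open"); return 1; }

    int logical = 0;
    unsigned int physical = 0;

    if (ioctl(fd, BLKSSZGET, &logical) == 0)
        printf("%s: logical sector size  = %d bytes\n", dev, logical);
    if (ioctl(fd, BLKPBSZGET, &physical) == 0)
        printf("%s: physical sector size = %u bytes\n", dev, physical);

    close(fd);
    return 0;
}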

Best regards,

Dion


 
08-19-2011, 10:53 PM
Stan Hoeppner

LVM write performance

On 8/19/2011 4:38 PM, Dion Kant wrote:

> I now think I understand the "strange" behaviour for block sizes not an
> integral multiple of 4096 bytes. (Of course you guys already knew the
> answer but just didn't want to make it easy for me to find the answer.)
>
> The newer disks today have a sector size of 4096 bytes. They may still
> be reporting 512 bytes, but this is to keep some ancient OS-es working.
>
> When a block write is not an integral of 4096 bytes, for example 512
> bytes, 4095 or 8191 bytes, the driver must first read the sector, modify
> it and finally write it back to the disk. This explains the bi and the
> increased number of interrupts.
>
> I did some Google searches but did not find much. Can someone confirm
> this hypothesis?

The read-modify-write performance penalty of unaligned partitions on the
"Advanced Format" drives (4KB native sectors) is a separate unrelated issue.

As I demonstrated earlier in this thread, the performance drop seen when
using dd with block sizes less than 4KB affects traditional 512B/sector
drives as well. If one has a misaligned partition on an Advanced Format
drive, one takes a double performance hit when dd bs is less than 4KB.

Again, everything in (x86) Linux is optimized around the 'magic' 4KB
size, including page size, filesystem block size, and LVM block size.

BTW, did you run your test with each of the elevators, as I recommended?
Do the following, testing dd after each change.

$ echo deadline > /sys/block/sdX/queue/scheduler
$ echo noop > /sys/block/sdX/queue/scheduler
$ echo cfq > /sys/block/sdX/queue/scheduler

Also, just for fun, and interesting results, increase your read_ahead_kb
from the default 128 to 512.

$ echo 512 > /sys/block/sdX/queue/read_ahead_kb

These changes are volatile so a reboot clears them in the event you're
unable to change them back to the defaults for any reason. This is
easily avoidable if you simply cat the files and write down the values
before changing them. After testing, echo the default values back in.

--
Stan


 
08-30-2011, 08:17 PM
Dion Kant

LVM write performance

On 08/20/2011 12:53 AM, Stan Hoeppner wrote:
> On 8/19/2011 4:38 PM, Dion Kant wrote:
>
>> I now think I understand the "strange" behaviour for block sizes not an
>> integral multiple of 4096 bytes. (Of course you guys already knew the
>> answer but just didn't want to make it easy for me to find the answer.)
>>
>> The newer disks today have a sector size of 4096 bytes. They may still
>> be reporting 512 bytes, but this is to keep some ancient OS-es working.
>>
>> When a block write is not an integral of 4096 bytes, for example 512
>> bytes, 4095 or 8191 bytes, the driver must first read the sector, modify
>> it and finally write it back to the disk. This explains the bi and the
>> increased number of interrupts.
>>
>> I did some Google searches but did not find much. Can someone confirm
>> this hypothesis?
>
> The read-modify-write performance penalty of unaligned partitions on the
> "Advanced Format" drives (4KB native sectors) is a separate unrelated issue.
>
> As I demonstrated earlier in this thread, the performance drop seen when
> using dd with block sizes less than 4KB affects traditional 512B/sector
> drives as well. If one has a misaligned partition on an Advanced Format
> drive, one takes a double performance hit when dd bs is less than 4KB.
>
> Again, everything in (x86) Linux is optimized around the 'magic' 4KB
> size, including page size, filesystem block size, and LVM block size.


Ok, I have done some browsing through the kernel sources, and I
understand the VFS a bit better now. When a read/write is issued on a
block device file, the block size is 4096 bytes, i.e. reads/writes to
the disk are done in blocks equal to the page cache size, the magic
4KB.

Submitting a request with a block size which is not an integral
multiple of 4096 bytes results in a call to ll_rw_block(READ, 1, &bh),
which reads the 4096-byte blocks, one by one, into the page cache. This
must be done before the user data can be used to partially update the
buffer page concerned in the cache. After being updated, the buffer is
flagged dirty and finally written to disk (8 sectors of 512 bytes).

I found a nice debugging switch which helps in monitoring the process:

echo 1 > /proc/sys/vm/block_dump

makes all bio requests be logged as kernel output.

Example:

dd of=/dev/vg/d1 if=/dev/zero bs=4095 count=2 conv=sync

[ 239.977384] dd(6110): READ block 0 on dm-3
[ 240.026952] dd(6110): READ block 8 on dm-3
[ 240.027735] dd(6110): WRITE block 0 on dm-3
[ 240.027754] dd(6110): WRITE block 8 on dm-3

The ll_rw_block(READ, 1, &bh) is causing the reads which can be seen
when monitoring with vmstat. The tests given below (as you requested)
were carried out before I gained a better understanding of the VFS.
However, the remaining questions I still have are:

1. Why are the partial block updates (through ll_rw_block(READ, 1, &bh))
so dramatically slow compared to other reads from the disk?

2. Furthermore, remember the much better performance I reported when
mounting a file system on the block device first, before accessing the
disk through the block device file. If I find some more spare time I
will do some more digging in the kernel. Maybe I will find that a
different set of f_ops is then used by the Virtual Filesystem Switch
for accessing the raw block device.

> BTW, did you run your test with each of the elevators, as I recommended?
> Do the following, testing dd after each change.

$ echo 128 > /sys/block/sdc/queue/read_ahead_kb

dom0-2:~ # echo deadline > /sys/block/sdc/queue/scheduler
dom0-2:~ # cat /sys/block/sdc/queue/scheduler
noop [deadline] cfq
dom0-2:~ # ./bw
Writing 1 GB
bs time rate
(bytes) (s) (MiB/s)
512 54.0373 19.8704 1024
1024 54.2937 19.7765 1024
2048 52.1781 20.5784 1024
4096 13.751 78.0846 1024
8192 13.8519 77.5159 1024

dom0-2:~ # echo noop > /sys/block/sdc/queue/scheduler
dom0-2:~ # cat /sys/block/sdc/queue/scheduler
[noop] deadline cfq
dom0-2:~ # ./bw
Writing 1 GB
bs time rate
(bytes) (s) (MiB/s)
512 53.9634 19.8976 1024
1024 52.0421 20.6322 1024
2048 54.0437 19.868 1024
4096 13.9612 76.9088 1024
8192 13.8183 77.7043 1024

dom0-2:~ # echo cfq > /sys/block/sdc/queue/scheduler
dom0-2:~ # cat /sys/block/sdc/queue/scheduler
noop deadline [cfq]
dom0-2:~ # ./bw
Writing 1 GB
bs time rate
(bytes) (s) (MiB/s)
512 56.0087 19.171 1024
1024 56.345 19.0565 1024
2048 56.0436 19.159 1024
4096 15.1232 70.9999 1024
8192 15.4236 69.6168 1024



> Also, just for fun, and interesting results, increase your read_ahead_kb
> from the default 128 to 512.
>
> $ echo 512 > /sys/block/sdX/queue/read_ahead_kb

$ echo deadline > /sys/block/sdX/queue/scheduler

dom0-2:~ # ./bw
Writing 1 GB
bs time rate
(bytes) (s) (MiB/s)
512 54.1023 19.8465 1024
1024 52.1824 20.5767 1024
2048 54.3797 19.7453 1024
4096 13.7252 78.2315 1024
8192 13.727 78.2211 1024

$ echo noop > /sys/block/sdX/queue/scheduler

dom0-2:~ # ./bw
Writing 1 GB
bs time rate
(bytes) (s) (MiB/s)
512 54.0853 19.8527 1024
1024 54.525 19.6927 1024
2048 50.6829 21.1855 1024
4096 14.1272 76.0051 1024
8192 13.914 77.1701 1024

$ echo cfq > /sys/block/sdX/queue/scheduler

dom0-2:~ # ./bw
Writing 1 GB
bs time rate
(bytes) (s) (MiB/s)
512 56.0274 19.1646 1024
1024 55.7614 19.256 1024
2048 56.5394 18.991 1024
4096 16.0562 66.8739 1024
8192 17.3842 61.7654 1024



Differences between deadline and noop are on the order of 2 to 3 % in
favour of deadline. The run with the cfq elevator is remarkable: it
clearly performs worse, about 20% less (compared to the highest result)
for the 512 read_ahead_kb case. Another try with the same settings:

dom0-2:~ # ./bw
Writing 1 GB
bs time rate
(bytes) (s) (MiB/s)
512 56.8122 18.8999 1024
1024 56.5486 18.9879 1024
2048 56.2555 19.0869 1024
4096 14.886 72.1311 1024
8192 15.461 69.4486 1024

so it looks like the previous result was at the low end of the
statistical variation.

> These changes are volatile so a reboot clears them in the event you're
> unable to change them back to the defaults for any reason. This is
> easily avoidable if you simply cat the files and write down the values
> before changing them. After testing, echo the default values back in.

I did some testing on a newer system with an AOC-USAS-S4i Adaptec
AACRAID controller on a Supermicro board. It uses the aacraid driver.
This controller supports RAID0, 1 and 10, but by configuring it so that
it presents the disks to Linux as four single-disk RAID0 units (the
controller cannot do JBOD), we obtained much better performance with
Linux software RAID0, or striping with LVM, or LVM on top of RAID0,
than with RAID0 managed by the controller. Now we obtain 300 to 350
MByte/s sustained write performance, versus about 150 MB/s when using
the controller.

We use 4 ST32000644NS drives.

Repeating the tests on this system gives similar results, except that
the 2 TB drives have about 50% better write performance.

capture4:~ # cat /sys/block/sdc/queue/read_ahead_kb
128
capture4:~ # cat /sys/block/sdc/queue/scheduler
noop [deadline] cfq

capture4:~ # ./bw /dev/sdc1
Writing 1 GB
bs time rate
(bytes) (s) (MiB/s)
8192 8.5879 125.03 1024
4096 8.54407 125.671 1024
2048 65.0727 16.5007 1024

Note the performance drop by a factor of about 8 when halving the bs
from 4096 to 2048.

Reading from the drive is 8.8% faster and works for all block sizes:

capture4:~ # ./br /dev/sdc1
Reading 1 GB
bs time rate
(bytes) (s) (MiB/s)
512 7.86782 136.473 1024
1024 7.85202 136.747 1024
2048 7.85979 136.612 1024
4096 7.86932 136.447 1024
8192 7.8509 136.767 1024

dd gives similar results:

capture4:~ # dd if=/dev/sdc1 of=/dev/null bs=512 count=2097152
2097152+0 records in
2097152+0 records out
1073741824 bytes (1.1 GB) copied, 7.85281 s, 137 MB/s

Dion
 
