Fix a crash when block device is read and block size is changed at the same time
Mikulas Patocka <mpatocka@redhat.com> writes:
> On Fri, 31 Aug 2012, Mikulas Patocka wrote:
>
>> Hi
>>
>> This is a series of patches to prevent a crash when when someone is
>> reading block device and block size is changed simultaneously. (the crash
>> is already happening in the production environment)
>>
>> The first patch adds a rw-lock to struct block_device, but doesn't use the
>> lock anywhere. The reason why I submit this as a separate patch is that on
>> my computer adding an unused field to this structure affects performance
>> much more than any locking changes.
>>
>> The second patch uses the rw-lock. The lock is locked for read when doing
>> I/O on the block device and it is locked for write when changing block
>> size.
>>
>> The third patch converts the rw-lock to a percpu rw-lock for better
>> performance, to avoid cache line bouncing.
>>
>> The fourth patch is an alternate percpu rw-lock implementation using RCU
>> by Eric Dumazet. It avoids any atomic instruction in the hot path.
>>
>> Mikulas
>
> I tested performance of patches. I created 4GB ramdisk, I initially filled
> it with zeros (so that ramdisk allocation-on-demand doesn't affect the
> results).
>
> I ran fio to perform 8 concurrent accesses on 8 core machine (two
> Barcelona Opterons):
> time fio --rw=randrw --size=4G --bs=512 --filename=/dev/ram0 --direct=1
> --name=job1 --name=job2 --name=job3 --name=job4 --name=job5 --name=job6
> --name=job7 --name=job8
>
> The results actually show that the size of struct block_device and
> alignment of subsequent fields in struct inode have far more effect on
> result that the type of locking used. (struct inode is placed just after
> struct block_device in "struct bdev_inode" in fs/block-dev.c)
>
> plain kernel 3.5.3: 57.9s
> patch 1: 43.4s
> patches 1,2: 43.7s
> patches 1,2,3: 38.5s
> patches 1,2,3,4: 58.6s
>
> You can see that patch 1 improves the time by 14.5 seconds, but all that
> patch 1 does is adding an unused structure field.
>
> Patch 3 is 4.9 seconds faster than patch 1, althogh patch 1 does no
> locking at all and patch 3 does per-cpu locking. So, the reason for the
> speedup is different sizeof of struct block_device (and subsequently,
> different alignment of struct inode), rather than locking improvement.
How many runs did you do? Did you see much run to run variation?
> I would be interested if other people did performance testing of the
> patches too.
I'll do some testing next week, but don't expect to get to it before
Wednesday.
Cheers,
Jeff
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
09-17-2012, 09:19 PM
Jeff Moyer
Fix a crash when block device is read and block size is changed at the same time
Jeff Moyer <jmoyer@redhat.com> writes:
> Mikulas Patocka <mpatocka@redhat.com> writes:
>> I would be interested if other people did performance testing of the
>> patches too.
>
> I'll do some testing next week, but don't expect to get to it before
> Wednesday.
Sorry for taking so long on this. I managed to get access to an 80cpu
(160 threads) system with 1TB of memory. I installed a pcie ssd into
this machine and did some testing against the raw block device.
I've attached the fio job file I used. Basically, I tested sequential
reads, sequential writes, random reads, random writes, and then a mix of
sequential reads and writes, and a mix of random reads and writes. All
tests used direct I/O to the block device, and each number shown is an
average of 5 runs. I had to pin the fio processes to the same numa node
as the pcie adapter in order to get low run-to-run variations. Because
of the numa factor, I was unable to get reliable results running
processes against all of the 160 threads on the system. The runs below
have 4 processes, each pushing a queue depth of 1024.
So, on to the results. I haven't fully investigated them yet, but I
plan to as they are rather surprising.
The first patch in the series simply adds a semaphore to the
block_device structure. Mikulas, you had mentioned that this managed to
have a large effect on your test load. In my case, this didn't seem to
make any difference at all:
The headings are:
BW = bandwidth in KB/s
IOPS = I/Os per second
msec = number of miliseconds the run took (smaller is better)
usr = %user time
sys = %system time
csw = context switches
The first two tables show the results of each run. In this case, the
first is the unpatched kernel, and the second is the one with the
block_device structure change. The third table is the % difference
between the two. A positive number indicates the second run had a
larger average than the first. I found that the context switch rate was
rather unpredictable, so I really should have just left that out of the
reporting.
As you can see, adding a member to struct block_device did not really
change the results.
Next up is the patch that actually uses the rw semaphore to protect
access to the block size. Here are the results:
As you can see, there were modest gains in write, read, and randread.
This is somewhat unexpected, as you would think that introducing locking
would not *help* performance! Investigating the standard deviations for
each set of 5 runs shows that the performance difference is significant
(the standard deviation is reported as a percentage of the average):
This is a table of standard deviations for the 5 runs comprising the
above average with this kernel: 3.6.0-rc5.mikulas.1+
Wow! Switching to the per-cpu semaphore implementation just boosted the
performance of the I/O path big-time. Note that the system time also
goes down! So, we get better throughput and less system time. This
sounds too good to be true. ;-) Here are the standard deviations
(again, shown as percentages) for the .3+ kernel:
Next up, I'm going to get some perf and blktrace data from these runs to
see if I can identify why there is such a drastic change in
performance. I will also attempt to run the tests against a different
vendor's adapter, and maybe against some FC storage if I can set that up.
[randrw4]
offset=24g
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
09-18-2012, 05:22 PM
Jeff Moyer
Fix a crash when block device is read and block size is changed at the same time
Mikulas Patocka <mpatocka@redhat.com> writes:
> Hi Jeff
>
> Thanks for testing.
>
> It would be interesting ... what happens if you take the patch 3, leave
> "struct percpu_rw_semaphore bd_block_size_semaphore" in "struct
> block_device", but remove any use of the semaphore from fs/block_dev.c? -
> will the performance be like unpatched kernel or like patch 3? It could be
> that the change in the alignment affects performance on your CPU too, just
> differently than on my CPU.
I'll give it a try and report back.
> What is the CPU model that you used for testing?
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
09-18-2012, 06:58 PM
Jeff Moyer
Fix a crash when block device is read and block size is changed at the same time
Mikulas Patocka <mpatocka@redhat.com> writes:
> On Tue, 18 Sep 2012, Jeff Moyer wrote:
>
>> Mikulas Patocka <mpatocka@redhat.com> writes:
>>
>> > Hi Jeff
>> >
>> > Thanks for testing.
>> >
>> > It would be interesting ... what happens if you take the patch 3, leave
>> > "struct percpu_rw_semaphore bd_block_size_semaphore" in "struct
>> > block_device", but remove any use of the semaphore from fs/block_dev.c? -
>> > will the performance be like unpatched kernel or like patch 3? It could be
>> > that the change in the alignment affects performance on your CPU too, just
>> > differently than on my CPU.
>>
>> I'll give it a try and report back.
>>
>> > What is the CPU model that you used for testing?
>>
>> http://ark.intel.com/products/53570/Intel-Xeon-Processor-E7-2860-%2824M-Cache-2_26-GHz-6_40-GTs-Intel-QPI%29
>>
> BTW. why did you use just 4 processes? - the processor has 10 cores and 20
> threads (so theoretically, you could run 20 processes bound on a single
> numa node). Were the results not stable with more than 4 processes?
There is no good reason for it. Since I was able to show some
differences in performance, I didn't see the need to scale beyond 4. I
can certainly bump the count up if/when that becomes interesting.
Cheers,
Jeff
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
09-18-2012, 08:11 PM
Jeff Moyer
Fix a crash when block device is read and block size is changed at the same time
Mikulas Patocka <mpatocka@redhat.com> writes:
> Hi Jeff
>
> Thanks for testing.
>
> It would be interesting ... what happens if you take the patch 3, leave
> "struct percpu_rw_semaphore bd_block_size_semaphore" in "struct
> block_device", but remove any use of the semaphore from fs/block_dev.c? -
> will the performance be like unpatched kernel or like patch 3? It could be
> that the change in the alignment affects performance on your CPU too, just
> differently than on my CPU.
It turns out to be exactly the same performance as with the 3rd patch
applied, so I guess it does have something to do with cache alignment.
Here is the patch (against vanilla) I ended up testing. Let me know if
I've botched it somehow.
So, I next up I'll play similar tricks to what you did (padding struct
block_device in all kernels) to eliminate the differences due to
structure alignment and provide a clear picture of what the locking
effects are.
Thanks!
Jeff
diff --git a/drivers/char/raw.c b/drivers/char/raw.c
index 54a3a6d..0bb207e 100644
--- a/drivers/char/raw.c
+++ b/drivers/char/raw.c
@@ -285,7 +285,7 @@ static long raw_ctl_compat_ioctl(struct file *file, unsigned int cmd,
int set_blocksize(struct block_device *bdev, int size)
{
+ struct address_space *mapping;
+
/* Size must be a power of two, and between 512 and PAGE_SIZE */
if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
return -EINVAL;
@@ -124,6 +126,16 @@ int set_blocksize(struct block_device *bdev, int size)
if (size < bdev_logical_block_size(bdev))
return -EINVAL;
+ /* Check that the block device is not memory mapped */
+ mapping = bdev->bd_inode->i_mapping;
+ mutex_lock(&mapping->i_mmap_mutex);
+ if (!prio_tree_empty(&mapping->i_mmap) ||
+ !list_empty(&mapping->i_mmap_nonlinear)) {
+ mutex_unlock(&mapping->i_mmap_mutex);
+ return -EBUSY;
+ }
+ mutex_unlock(&mapping->i_mmap_mutex);
+
/* Don't change the size if it is same as current */
if (bdev->bd_block_size != size) {
sync_blockdev(bdev);
@@ -131,6 +143,7 @@ int set_blocksize(struct block_device *bdev, int size)
bdev->bd_inode->i_blkbits = blksize_bits(size);
kill_bdev(bdev);
}
+
return 0;
}
/*
* It's silly to have NR_OPEN bigger than NR_FILE, but you can change
@@ -724,6 +725,8 @@ struct block_device {
int bd_fsfreeze_count;
/* Mutex for freeze */
struct mutex bd_fsfreeze_mutex;
+ /* A semaphore that prevents I/O while block size is being changed */
+ struct percpu_rw_semaphore bd_block_size_semaphore;
};
/*
@@ -2564,6 +2567,8 @@ extern int generic_segment_checks(const struct iovec *iov,
unsigned long *nr_segs, size_t *count, int access_flags);