Fix Crash when IO is being submitted and block size is changed
On Thu, 19 Jul 2012, Jeff Moyer wrote:
> Mikulas Patocka <mpatocka@redhat.com> writes:
>
> > On Tue, 17 Jul 2012, Jeff Moyer wrote:
> >
>
> >> > This is the patch that fixes this crash: it takes a rw-semaphore around
> >> > all direct-IO path.
> >> >
> >> > (note that if someone is concerned about performance, the rw-semaphore
> >> > could be made per-cpu --- take it for read on the current CPU and take it
> >> > for write on all CPUs).
> >>
> >> Here we go again. :-) I believe we had at one point tried taking a rw
> >> semaphore around GUP inside of the direct I/O code path to fix the fork
> >> vs. GUP race (that still exists today). When testing that, the overhead
> >> of the semaphore was *way* too high to be considered an acceptable
> >> solution. I've CC'd Larry Woodman, Andrea, and Kosaki Motohiro who all
> >> worked on that particular bug. Hopefully they can give better
> >> quantification of the slowdown than my poor memory.
> >>
> >> Cheers,
> >> Jeff
> >
> > Both down_read and up_read together take 82 ticks on Core2, 69 ticks on
> > AMD K10, 62 ticks on UltraSparc2 if the target is in L1 cache. So, if
> > percpu rw_semaphores were used, it would slow down only by this amount.
>
> Sorry, I'm not familiar with per-cpu rw semaphores. Where are they
> implemented?
Here I'm resending the upstream patches with percpu rw-semaphores; the
percpu rw-semaphores themselves are implemented in the next patch.
(For Jeff: you can use your patch for RHEL-6 that you did for performance
testing, with the change that I proposed).
Mikulas
---
blockdev: fix a crash when block size is changed and I/O is issued simultaneously
The kernel may crash when block size is changed and I/O is issued
simultaneously.
Because some subsystems (udev or lvm) may read any block device anytime,
the bug actually puts any code that changes a block device size in
jeopardy.
The crash can be reproduced by placing "msleep(1000)" in
blkdev_get_blocks just before "bh->b_size = max_blocks <<
inode->i_blkbits;".
Then run "dd if=/dev/ram0 of=/dev/null bs=4k count=1 iflag=direct".
While it is waiting in msleep, run "blockdev --setbsz 2048 /dev/ram0".
You get a BUG.
The direct and non-direct I/O code is written with the assumption that
the block size does not change. It doesn't seem practical to fix these
crashes one by one: there may be many crash possibilities when the block
size changes at a certain place, and it is impossible to find them all
and verify the code.
This patch introduces a new rw-lock bd_block_size_semaphore. The lock is
taken for read during I/O. It is taken for write when changing block
size. Consequently, block size can't be changed while I/O is being
submitted.
For asynchronous I/O, the patch only prevents block size change while
the I/O is being submitted. The block size can change when the I/O is in
progress or when the I/O is being finished. This is acceptable because
there are no accesses to block size when asynchronous I/O is being
finished.
The patch also prevents the block size from changing while the device is
mapped with mmap.
 int set_blocksize(struct block_device *bdev, int size)
 {
+	struct address_space *mapping;
+
 	/* Size must be a power of two, and between 512 and PAGE_SIZE */
 	if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
 		return -EINVAL;
@@ -124,6 +126,20 @@ int set_blocksize(struct block_device *b
 	if (size < bdev_logical_block_size(bdev))
 		return -EINVAL;
+	/* Prevent starting I/O or mapping the device */
+	down_write(&bdev->bd_block_size_semaphore);
+
+	/* Check that the block device is not memory mapped */
+	mapping = bdev->bd_inode->i_mapping;
+	mutex_lock(&mapping->i_mmap_mutex);
+	if (!prio_tree_empty(&mapping->i_mmap) ||
+	    !list_empty(&mapping->i_mmap_nonlinear)) {
+		mutex_unlock(&mapping->i_mmap_mutex);
+		up_write(&bdev->bd_block_size_semaphore);
+		return -EBUSY;
+	}
+	mutex_unlock(&mapping->i_mmap_mutex);
+
 	/* Don't change the size if it is same as current */
 	if (bdev->bd_block_size != size) {
 		sync_blockdev(bdev);
@@ -131,6 +147,9 @@ int set_blocksize(struct block_device *b
 		bdev->bd_inode->i_blkbits = blksize_bits(size);
 		kill_bdev(bdev);
 	}
+
+	up_write(&bdev->bd_block_size_semaphore);
+
 	return 0;
 }
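
The argument checks at the top of set_blocksize and the value stored in i_blkbits can be tried out in userspace with the sketch below. It assumes a 4096-byte page (SKETCH_PAGE_SIZE), and is_pow2, check_block_size and size_to_bits are hypothetical helpers written for this illustration, not the kernel's is_power_of_2 or blksize_bits.

```c
#include <errno.h>
#include <stdbool.h>

#define SKETCH_PAGE_SIZE 4096	/* assumption: 4 KiB pages */

static bool is_pow2(unsigned int n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Same rules as the checks in set_blocksize: a power of two between 512
 * and the page size, and not smaller than the logical block size.
 */
static int check_block_size(int size, int logical_block_size)
{
	if (size > SKETCH_PAGE_SIZE || size < 512 ||
	    !is_pow2((unsigned int)size))
		return -EINVAL;
	if (size < logical_block_size)
		return -EINVAL;
	return 0;
}

/* Base-2 logarithm of a power-of-two block size, e.g. 512 gives 9. */
static unsigned int size_to_bits(unsigned int size)
{
	unsigned int bits = 0;

	while (size > 1) {
		size >>= 1;
		bits++;
	}
	return bits;
}
```

For example, check_block_size(2048, 512) accepts the size used in the reproducer above, while 1000 (not a power of two) and 8192 (larger than the assumed page size) are rejected with -EINVAL.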