08-31-2012, 08:11 PM
Jeff Moyer
 
Fix a crash when block device is read and block size is changed at the same time

Mikulas Patocka <mpatocka@redhat.com> writes:

> On Fri, 31 Aug 2012, Mikulas Patocka wrote:
>
>> Hi
>>
>> This is a series of patches to prevent a crash when someone is
>> reading a block device and the block size is changed simultaneously. (The
>> crash is already happening in production environments.)
>>
>> The first patch adds a rw-lock to struct block_device, but doesn't use the
>> lock anywhere. The reason why I submit this as a separate patch is that on
>> my computer adding an unused field to this structure affects performance
>> much more than any locking changes.
>>
>> The second patch uses the rw-lock. The lock is locked for read when doing
>> I/O on the block device and it is locked for write when changing block
>> size.
>>
>> The third patch converts the rw-lock to a percpu rw-lock for better
>> performance, to avoid cache line bouncing.
>>
>> The fourth patch is an alternate percpu rw-lock implementation using RCU
>> by Eric Dumazet. It avoids any atomic instruction in the hot path.
>>
>> Mikulas
>
> I tested the performance of the patches. I created a 4GB ramdisk and
> initially filled it with zeros (so that ramdisk allocation-on-demand
> doesn't affect the results).
>
> I ran fio to perform 8 concurrent accesses on an 8-core machine (two
> Barcelona Opterons):
> time fio --rw=randrw --size=4G --bs=512 --filename=/dev/ram0 --direct=1
> --name=job1 --name=job2 --name=job3 --name=job4 --name=job5 --name=job6
> --name=job7 --name=job8
>
> The results actually show that the size of struct block_device and the
> alignment of subsequent fields in struct inode have a far greater effect
> on the result than the type of locking used. (struct inode is placed just
> after struct block_device in "struct bdev_inode" in fs/block_dev.c.)
>
> plain kernel 3.5.3: 57.9s
> patch 1: 43.4s
> patches 1,2: 43.7s
> patches 1,2,3: 38.5s
> patches 1,2,3,4: 58.6s
>
> You can see that patch 1 improves the time by 14.5 seconds, but all that
> patch 1 does is add an unused structure field.
>
> Patch 3 is 4.9 seconds faster than patch 1, although patch 1 does no
> locking at all and patch 3 does per-cpu locking. So, the reason for the
> speedup is the different size of struct block_device (and consequently
> the different alignment of struct inode), rather than any locking
> improvement.
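
(To make the quoted locking scheme concrete: the protocol in patch 2
reduces to roughly the sketch below. This is an illustrative sketch only,
not code from the series -- the function names are invented, only the
bd_block_size_semaphore field name matches the actual patches, and patch 3
swaps the plain rw semaphore for a per-cpu variant with the same
semantics.)

/* Readers: every block-device I/O path takes the lock shared. */
ssize_t blkdev_read_sketch(struct block_device *bdev, void *buf, size_t len)
{
	ssize_t ret;

	down_read(&bdev->bd_block_size_semaphore);
	ret = do_blockdev_io(bdev, buf, len);	/* i_blkbits is stable here */
	up_read(&bdev->bd_block_size_semaphore);
	return ret;
}

/* Writer: changing the block size excludes all in-flight I/O. */
int set_blocksize_sketch(struct block_device *bdev, int size)
{
	down_write(&bdev->bd_block_size_semaphore);
	sync_blockdev(bdev);
	bdev->bd_block_size = size;
	bdev->bd_inode->i_blkbits = blksize_bits(size);
	kill_bdev(bdev);
	up_write(&bdev->bd_block_size_semaphore);
	return 0;
}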

How many runs did you do? Did you see much run-to-run variation?

> I would be interested if other people did performance testing of the
> patches too.

I'll do some testing next week, but don't expect to get to it before
Wednesday.

Cheers,
Jeff

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
09-17-2012, 09:19 PM
Jeff Moyer
 
Fix a crash when block device is read and block size is changed at the same time

Jeff Moyer <jmoyer@redhat.com> writes:
> Mikulas Patocka <mpatocka@redhat.com> writes:
>> I would be interested if other people did performance testing of the
>> patches too.
>
> I'll do some testing next week, but don't expect to get to it before
> Wednesday.

Sorry for taking so long on this. I managed to get access to an 80-CPU
(160-thread) system with 1TB of memory. I installed a PCIe SSD into
this machine and did some testing against the raw block device.

I've attached the fio job file I used. Basically, I tested sequential
reads, sequential writes, random reads, random writes, and then a mix of
sequential reads and writes, and a mix of random reads and writes. All
tests used direct I/O to the block device, and each number shown is an
average of 5 runs. I had to pin the fio processes to the same NUMA node
as the PCIe adapter in order to get low run-to-run variation. Because
of the NUMA factor, I was unable to get reliable results running
processes against all 160 threads on the system. The runs below
have 4 processes, each pushing a queue depth of 1024.

So, on to the results. I haven't fully investigated them yet, but I
plan to as they are rather surprising.

The first patch in the series simply adds a semaphore to the
block_device structure. Mikulas, you had mentioned that this managed to
have a large effect on your test load. In my case, this didn't seem to
make any difference at all:

3.6.0-rc5+-job.fio-run2/output-avg
READ WRITE CPU
Job Name BW IOPS msec BW IOPS msec usr sys csw
write1 0 0 0 748522 187130 44864 16.34 60.65 3799440.00
read1 690615 172653 48602 0 0 0 13.45 61.42 4044720.00
randwrite1 0 0 0 716406 179101 46839 29.03 52.79 3151140.00
randread1 683466 170866 49108 0 0 0 25.92 54.67 3081610.00
readwrite1 377518 94379 44450 377645 94410 44450 15.49 64.32 3139240.00
randrw1 355815 88953 47178 355733 88933 47178 27.96 54.24 2944570.00
3.6.0-rc5.mikulas.1+-job.fio-run2/output-avg
READ WRITE CPU
Job Name BW IOPS msec BW IOPS msec usr sys csw
write1 0 0 0 764037 191009 43925 17.14 60.15 3536950.00
read1 696880 174220 48152 0 0 0 13.90 61.74 3710168.00
randwrite1 0 0 0 737331 184332 45511 29.82 52.71 2869440.00
randread1 689319 172329 48684 0 0 0 26.38 54.58 2927411.00
readwrite1 387651 96912 43294 387799 96949 43294 16.06 64.92 2814340.00
randrw1 360298 90074 46591 360304 90075 46591 28.53 54.10 2793120.00
%diff
READ WRITE CPU
Job Name BW IOPS msec BW IOPS msec usr sys csw
write1 0 0 0 0 0 0 0.00 0.00 -6.91
read1 0 0 0 0 0 0 0.00 0.00 -8.27
randwrite1 0 0 0 0 0 0 0.00 0.00 -8.94
randread1 0 0 0 0 0 0 0.00 0.00 -5.00
readwrite1 0 0 0 0 0 0 0.00 0.00 -10.35
randrw1 0 0 0 0 0 0 0.00 0.00 -5.14


The headings are:
BW = bandwidth in KB/s
IOPS = I/Os per second
msec = number of milliseconds the run took (smaller is better)
usr = %user time
sys = %system time
csw = context switches

The first two tables show the results of each run. In this case, the
first is the unpatched kernel, and the second is the one with the
block_device structure change. The third table is the % difference
between the two. A positive number indicates the second run had a
larger average than the first. I found that the context switch rate was
rather unpredictable, so I really should have just left that out of the
reporting.
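
(As a concrete example of how the %diff values are computed: for the
write1 context-switch column, 100 * (3536950 - 3799440) / 3799440 = -6.91,
which is the value shown above. Differences of small magnitude are
apparently rounded and reported as 0.)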

As you can see, adding a member to struct block_device did not really
change the results.
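
(For reference, this is the container structure whose layout is at issue,
from fs/block_dev.c; growing struct block_device shifts the offset, and
therefore the cache-line alignment, of the embedded struct inode:)

struct bdev_inode {
	struct block_device bdev;
	struct inode vfs_inode;		/* placed right after bdev, so its
					   alignment moves whenever bdev
					   grows or shrinks */
};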


Next up is the patch that actually uses the rw semaphore to protect
access to the block size. Here are the results:

3.6.0-rc5.mikulas.1+-job.fio-run2/output-avg
READ WRITE CPU
Job Name BW IOPS msec BW IOPS msec usr sys csw
write1 0 0 0 764037 191009 43925 17.14 60.15 3536950.00
read1 696880 174220 48152 0 0 0 13.90 61.74 3710168.00
randwrite1 0 0 0 737331 184332 45511 29.82 52.71 2869440.00
randread1 689319 172329 48684 0 0 0 26.38 54.58 2927411.00
readwrite1 387651 96912 43294 387799 96949 43294 16.06 64.92 2814340.00
randrw1 360298 90074 46591 360304 90075 46591 28.53 54.10 2793120.00
3.6.0-rc5.mikulas.2+-job.fio-run2/output-avg
READ WRITE CPU
Job Name BW IOPS msec BW IOPS msec usr sys csw
write1 0 0 0 816713 204178 41108 16.60 62.06 3159574.00
read1 749437 187359 44800 0 0 0 13.91 63.69 3190050.00
randwrite1 0 0 0 747534 186883 44941 29.96 53.23 2617590.00
randread1 734627 183656 45699 0 0 0 27.02 56.27 2403191.00
readwrite1 396113 99027 42397 396120 99029 42397 14.50 63.21 3460140.00
randrw1 374408 93601 44806 374556 93638 44806 28.46 54.33 2688985.00
%diff
READ WRITE CPU
Job Name BW IOPS msec BW IOPS msec usr sys csw
write1 0 0 0 6 6 -6 0.00 0.00 -10.67
read1 7 7 -6 0 0 0 0.00 0.00 -14.02
randwrite1 0 0 0 0 0 0 0.00 0.00 -8.78
randread1 6 6 -6 0 0 0 0.00 0.00 -17.91
readwrite1 0 0 0 0 0 0 -9.71 0.00 22.95
randrw1 0 0 0 0 0 0 0.00 0.00 0.00

As you can see, there were modest gains in write, read, and randread.
This is somewhat unexpected, as you would think that introducing locking
would not *help* performance! Investigating the standard deviations for
each set of 5 runs shows that the performance difference is significant
(the standard deviation is reported as a percentage of the average):

This is a table of standard deviations for the 5 runs comprising the
above average with this kernel: 3.6.0-rc5.mikulas.1+

READ WRITE CPU
Job Name BW IOPS msec BW IOPS msec usr sys csw
write1 0 0 0 1 1 1 2.99 1.27 9.10
read1 0 0 0 0 0 0 2.12 0.53 5.03
randwrite1 0 0 0 0 0 0 1.25 0.49 5.52
randread1 1 1 1 0 0 0 1.81 1.18 10.04
readwrite1 2 2 2 2 2 2 11.35 1.86 26.83
randrw1 2 2 2 2 2 2 4.01 2.71 22.72

And here are the standard deviations for the .2+ kernel:

READ WRITE CPU
Job Name BW IOPS msec BW IOPS msec usr sys csw
write1 0 0 0 2 2 2 3.33 2.95 7.88
read1 2 2 2 0 0 0 6.44 2.30 19.27
randwrite1 0 0 0 3 3 3 0.18 0.52 1.71
randread1 2 2 2 0 0 0 3.72 2.34 23.70
readwrite1 3 3 3 3 3 3 3.35 2.61 7.38
randrw1 1 1 1 1 1 1 1.80 1.00 9.73


Next, we'll move on to the third patch in the series, which converts the
rw semaphore to a per-cpu semaphore.

3.6.0-rc5.mikulas.2+-job.fio-run2/output-avg
READ WRITE CPU
Job Name BW IOPS msec BW IOPS msec usr sys csw
write1 0 0 0 816713 204178 41108 16.60 62.06 3159574.00
read1 749437 187359 44800 0 0 0 13.91 63.69 3190050.00
randwrite1 0 0 0 747534 186883 44941 29.96 53.23 2617590.00
randread1 734627 183656 45699 0 0 0 27.02 56.27 2403191.00
readwrite1 396113 99027 42397 396120 99029 42397 14.50 63.21 3460140.00
randrw1 374408 93601 44806 374556 93638 44806 28.46 54.33 2688985.00
3.6.0-rc5.mikulas.3+-job.fio-run2/output-avg
READ WRITE CPU
Job Name BW IOPS msec BW IOPS msec usr sys csw
write1 0 0 0 870892 217723 38528 17.83 41.57 1697870.00
read1 1430164 357541 23462 0 0 0 14.41 56.00 241315.00
randwrite1 0 0 0 759789 189947 44163 31.48 36.36 1256040.00
randread1 1043830 260958 32146 0 0 0 31.89 44.39 185032.00
readwrite1 692567 173141 24226 692489 173122 24226 18.65 53.64 311255.00
randrw1 501208 125302 33469 501446 125361 33469 35.40 41.61 246391.00
%diff
READ WRITE CPU
Job Name BW IOPS msec BW IOPS msec usr sys csw
write1 0 0 0 6 6 -6 7.41 -33.02 -46.26
read1 90 90 -47 0 0 0 0.00 -12.07 -92.44
randwrite1 0 0 0 0 0 0 5.07 -31.69 -52.02
randread1 42 42 -29 0 0 0 18.02 -21.11 -92.30
readwrite1 74 74 -42 74 74 -42 28.62 -15.14 -91.00
randrw1 33 33 -25 33 33 -25 24.39 -23.41 -90.84

Wow! Switching to the per-cpu semaphore implementation just boosted the
performance of the I/O path big-time. Note that the system time also
goes down! So, we get better throughput and less system time. This
sounds too good to be true. ;-) Here are the standard deviations
(again, shown as percentages) for the .3+ kernel:

READ WRITE CPU
Job Name BW IOPS msec BW IOPS msec usr sys csw
write1 0 0 0 0 0 0 0.96 0.19 1.03
read1 0 0 0 0 0 0 1.82 0.24 2.46
randwrite1 0 0 0 0 0 0 0.40 0.39 0.68
randread1 0 0 0 0 0 0 0.53 0.31 2.02
readwrite1 0 0 0 0 0 0 2.73 4.07 33.27
randrw1 1 1 1 1 1 1 0.40 0.10 3.29

Again, there's no slop there, so the results are very reproducible.
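
(Why would a per-cpu semaphore help so much under concurrent readers? The
usual construction looks roughly like the sketch below -- a simplified
illustration of the general technique, not necessarily the exact
implementation from patch 3. The key point is that the reader fast path
touches only its own CPU's counter, so readers on different CPUs never
bounce a shared cache line the way a conventional rwsem's count field
does.)

struct sketch_percpu_rwsem {
	unsigned int __percpu *counters;	/* one reader count per CPU */
	bool locked;				/* a writer is active */
	struct mutex mtx;			/* serializes writers */
};

static void sketch_down_read(struct sketch_percpu_rwsem *s)
{
	rcu_read_lock();
	if (likely(!s->locked)) {
		this_cpu_inc(*s->counters);	/* private cache line */
		rcu_read_unlock();
		return;
	}
	rcu_read_unlock();
	/* slow path: sleep until the writer is finished (omitted) */
}

static void sketch_up_read(struct sketch_percpu_rwsem *s)
{
	/* the sum over all CPUs stays correct even if we migrated */
	this_cpu_dec(*s->counters);
}

static void sketch_down_write(struct sketch_percpu_rwsem *s)
{
	mutex_lock(&s->mtx);
	s->locked = true;
	synchronize_rcu();	/* all new readers now see locked == true */
	/* wait for the per-CPU counters to sum to zero (omitted) */
}

The trade-off is an expensive write side (synchronize_rcu() plus a walk
over every CPU's counter), which suits this use case: block-size changes
are rare, while I/O is constant.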

Finally, the last patch changes to an RCU-based rw semaphore
implementation. Here are the results for that, as compared with the
previous kernel:

3.6.0-rc5.mikulas.3+-job.fio-run2/output-avg
READ WRITE CPU
Job Name BW IOPS msec BW IOPS msec usr sys csw
write1 0 0 0 870892 217723 38528 17.83 41.57 1697870.00
read1 1430164 357541 23462 0 0 0 14.41 56.00 241315.00
randwrite1 0 0 0 759789 189947 44163 31.48 36.36 1256040.00
randread1 1043830 260958 32146 0 0 0 31.89 44.39 185032.00
readwrite1 692567 173141 24226 692489 173122 24226 18.65 53.64 311255.00
randrw1 501208 125302 33469 501446 125361 33469 35.40 41.61 246391.00
3.6.0-rc5.mikulas.4+-job.fio-run2/output-avg
READ WRITE CPU
Job Name BW IOPS msec BW IOPS msec usr sys csw
write1 0 0 0 812659 203164 41309 16.80 61.71 3208620.00
read1 739061 184765 45442 0 0 0 14.32 62.85 3375484.00
randwrite1 0 0 0 726971 181742 46192 30.00 52.33 2736270.00
randread1 719040 179760 46683 0 0 0 26.47 54.78 2914080.00
readwrite1 396670 99167 42309 396619 99154 42309 14.91 63.12 3412220.00
randrw1 374790 93697 44766 374807 93701 44766 28.42 54.10 2774690.00
%diff
READ WRITE CPU
Job Name BW IOPS msec BW IOPS msec usr sys csw
write1 0 0 0 -6 -6 7 -5.78 48.45 88.98
read1 -48 -48 93 0 0 0 0.00 12.23 1298.79
randwrite1 0 0 0 0 0 0 0.00 43.92 117.85
randread1 -31 -31 45 0 0 0 -17.00 23.41 1474.91
readwrite1 -42 -42 74 -42 -42 74 -20.05 17.67 996.28
randrw1 -25 -25 33 -25 -25 33 -19.72 30.02 1026.13

And we've lost a good bit of performance! Talk about
counter-intuitive. Here are the standard deviation numbers:

READ WRITE CPU
Job Name BW IOPS msec BW IOPS msec usr sys csw
write1 0 0 0 2 2 2 2.96 3.00 6.79
read1 3 3 3 0 0 0 6.52 2.82 21.86
randwrite1 0 0 0 2 2 2 0.71 0.55 4.07
randread1 1 1 1 0 0 0 4.13 2.31 20.12
readwrite1 1 1 1 1 1 1 4.14 2.64 6.12
randrw1 0 0 0 0 0 0 0.59 0.25 2.99


Here is a comparison of the vanilla kernel versus the best performing
patch in this series (patch 3 of 4):

3.6.0-rc5+-job.fio-run2/output-avg
READ WRITE CPU
Job Name BW IOPS msec BW IOPS msec usr sys csw
write1 0 0 0 748522 187130 44864 16.34 60.65 3799440.00
read1 690615 172653 48602 0 0 0 13.45 61.42 4044720.00
randwrite1 0 0 0 716406 179101 46839 29.03 52.79 3151140.00
randread1 683466 170866 49108 0 0 0 25.92 54.67 3081610.00
readwrite1 377518 94379 44450 377645 94410 44450 15.49 64.32 3139240.00
randrw1 355815 88953 47178 355733 88933 47178 27.96 54.24 2944570.00
3.6.0-rc5.mikulas.3+-job.fio-run2/output-avg
READ WRITE CPU
Job Name BW IOPS msec BW IOPS msec usr sys csw
write1 0 0 0 870892 217723 38528 17.83 41.57 1697870.00
read1 1430164 357541 23462 0 0 0 14.41 56.00 241315.00
randwrite1 0 0 0 759789 189947 44163 31.48 36.36 1256040.00
randread1 1043830 260958 32146 0 0 0 31.89 44.39 185032.00
readwrite1 692567 173141 24226 692489 173122 24226 18.65 53.64 311255.00
randrw1 501208 125302 33469 501446 125361 33469 35.40 41.61 246391.00
%diff
READ WRITE CPU
Job Name BW IOPS msec BW IOPS msec usr sys csw
write1 0 0 0 16 16 -14 9.12 -31.46 -55.31
read1 107 107 -51 0 0 0 7.14 -8.82 -94.03
randwrite1 0 0 0 6 6 -5 8.44 -31.12 -60.14
randread1 52 52 -34 0 0 0 23.03 -18.80 -94.00
readwrite1 83 83 -45 83 83 -45 20.40 -16.60 -90.09
randrw1 40 40 -29 40 40 -29 26.61 -23.29 -91.63


Next up, I'm going to get some perf and blktrace data from these runs to
see if I can identify why there is such a drastic change in
performance. I will also attempt to run the tests against a different
vendor's adapter, and maybe against some FC storage if I can set that up.

Cheers,
Jeff

[global]
ioengine=libaio
direct=1
iodepth=1024
iodepth_batch=32
iodepth_batch_complete=1
blocksize=4k
filename=/dev/XXX
size=8g
group_reporting=1
readwrite=write

[write1]
offset=0

[write2]
offset=8g

[write3]
offset=16g

[write4]
offset=24g

[global]
readwrite=read

[read1]
stonewall
offset=0

[read2]
offset=8g

[read3]
offset=16g

[read4]
offset=24g

[global]
readwrite=randwrite

[randwrite1]
stonewall
offset=0

[randwrite2]
offset=8g

[randwrite3]
offset=16g

[randwrite4]
offset=24g

[global]
readwrite=randread

[randread1]
stonewall
offset=0

[randread2]
offset=8g

[randread3]
offset=16g

[randread4]
offset=24g

[global]
readwrite=readwrite

[readwrite1]
stonewall
offset=0

[readwrite2]
offset=8g

[readwrite3]
offset=16g

[readwrite4]
offset=24g

[global]
readwrite=randrw

[randrw1]
stonewall
offset=0

[randrw2]
offset=8g

[randrw3]
offset=16g

[randrw4]
offset=24g
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
09-18-2012, 05:22 PM
Jeff Moyer
 
Fix a crash when block device is read and block size is changed at the same time

Mikulas Patocka <mpatocka@redhat.com> writes:

> Hi Jeff
>
> Thanks for testing.
>
> It would be interesting ... what happens if you take patch 3, leave
> "struct percpu_rw_semaphore bd_block_size_semaphore" in "struct
> block_device", but remove any use of the semaphore from fs/block_dev.c?
> Will the performance be like the unpatched kernel or like patch 3? It
> could be that the change in alignment affects performance on your CPU
> too, just differently than on my CPU.

I'll give it a try and report back.

> What is the CPU model that you used for testing?

http://ark.intel.com/products/53570/Intel-Xeon-Processor-E7-2860-%2824M-Cache-2_26-GHz-6_40-GTs-Intel-QPI%29

Cheers,
Jeff

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
09-18-2012, 06:58 PM
Jeff Moyer
 
Fix a crash when block device is read and block size is changed at the same time

Mikulas Patocka <mpatocka@redhat.com> writes:

> On Tue, 18 Sep 2012, Jeff Moyer wrote:
>
>> Mikulas Patocka <mpatocka@redhat.com> writes:
>>
>> > Hi Jeff
>> >
>> > Thanks for testing.
>> >
>> > It would be interesting ... what happens if you take patch 3, leave
>> > "struct percpu_rw_semaphore bd_block_size_semaphore" in "struct
>> > block_device", but remove any use of the semaphore from fs/block_dev.c?
>> > Will the performance be like the unpatched kernel or like patch 3? It
>> > could be that the change in alignment affects performance on your CPU
>> > too, just differently than on my CPU.
>>
>> I'll give it a try and report back.
>>
>> > What is the CPU model that you used for testing?
>>
>> http://ark.intel.com/products/53570/Intel-Xeon-Processor-E7-2860-%2824M-Cache-2_26-GHz-6_40-GTs-Intel-QPI%29
>>
> BTW, why did you use just 4 processes? The processor has 10 cores and 20
> threads (so theoretically, you could run 20 processes bound to a single
> NUMA node). Were the results not stable with more than 4 processes?

There is no good reason for it. Since I was able to show some
differences in performance, I didn't see the need to scale beyond 4. I
can certainly bump the count up if/when that becomes interesting.

Cheers,
Jeff

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
09-18-2012, 08:11 PM
Jeff Moyer
 
Fix a crash when block device is read and block size is changed at the same time

Mikulas Patocka <mpatocka@redhat.com> writes:

> Hi Jeff
>
> Thanks for testing.
>
> It would be interesting ... what happens if you take patch 3, leave
> "struct percpu_rw_semaphore bd_block_size_semaphore" in "struct
> block_device", but remove any use of the semaphore from fs/block_dev.c?
> Will the performance be like the unpatched kernel or like patch 3? It
> could be that the change in alignment affects performance on your CPU
> too, just differently than on my CPU.

It turns out to be exactly the same performance as with the 3rd patch
applied, so I guess it does have something to do with cache alignment.
Here is the patch (against vanilla) I ended up testing. Let me know if
I've botched it somehow.

So, next up I'll play similar tricks to what you did (padding struct
block_device in all kernels) to eliminate the differences due to
structure alignment and provide a clear picture of what the locking
effects are.

Thanks!
Jeff


diff --git a/drivers/char/raw.c b/drivers/char/raw.c
index 54a3a6d..0bb207e 100644
--- a/drivers/char/raw.c
+++ b/drivers/char/raw.c
@@ -285,7 +285,7 @@ static long raw_ctl_compat_ioctl(struct file *file, unsigned int cmd,
 
 static const struct file_operations raw_fops = {
 	.read		= do_sync_read,
-	.aio_read	= generic_file_aio_read,
+	.aio_read	= blkdev_aio_read,
 	.write		= do_sync_write,
 	.aio_write	= blkdev_aio_write,
 	.fsync		= blkdev_fsync,
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 38e721b..c7514b5 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -116,6 +116,8 @@ EXPORT_SYMBOL(invalidate_bdev);
 
 int set_blocksize(struct block_device *bdev, int size)
 {
+	struct address_space *mapping;
+
 	/* Size must be a power of two, and between 512 and PAGE_SIZE */
 	if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
 		return -EINVAL;
@@ -124,6 +126,16 @@ int set_blocksize(struct block_device *bdev, int size)
 	if (size < bdev_logical_block_size(bdev))
 		return -EINVAL;
 
+	/* Check that the block device is not memory mapped */
+	mapping = bdev->bd_inode->i_mapping;
+	mutex_lock(&mapping->i_mmap_mutex);
+	if (!prio_tree_empty(&mapping->i_mmap) ||
+	    !list_empty(&mapping->i_mmap_nonlinear)) {
+		mutex_unlock(&mapping->i_mmap_mutex);
+		return -EBUSY;
+	}
+	mutex_unlock(&mapping->i_mmap_mutex);
+
 	/* Don't change the size if it is same as current */
 	if (bdev->bd_block_size != size) {
 		sync_blockdev(bdev);
@@ -131,6 +143,7 @@ int set_blocksize(struct block_device *bdev, int size)
 		bdev->bd_inode->i_blkbits = blksize_bits(size);
 		kill_bdev(bdev);
 	}
+
 	return 0;
 }
 
@@ -441,6 +454,12 @@ static struct inode *bdev_alloc_inode(struct super_block *sb)
 	struct bdev_inode *ei = kmem_cache_alloc(bdev_cachep, GFP_KERNEL);
 	if (!ei)
 		return NULL;
+
+	if (unlikely(percpu_init_rwsem(&ei->bdev.bd_block_size_semaphore))) {
+		kmem_cache_free(bdev_cachep, ei);
+		return NULL;
+	}
+
 	return &ei->vfs_inode;
 }
 
@@ -449,6 +468,8 @@ static void bdev_i_callback(struct rcu_head *head)
 	struct inode *inode = container_of(head, struct inode, i_rcu);
 	struct bdev_inode *bdi = BDEV_I(inode);
 
+	percpu_free_rwsem(&bdi->bdev.bd_block_size_semaphore);
+
 	kmem_cache_free(bdev_cachep, bdi);
 }
 
@@ -1567,6 +1588,19 @@ static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 	return blkdev_ioctl(bdev, mode, cmd, arg);
 }
 
+ssize_t blkdev_aio_read(struct kiocb *iocb, const struct iovec *iov,
+			unsigned long nr_segs, loff_t pos)
+{
+	ssize_t ret;
+	struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
+	percpu_rwsem_ptr p;
+
+	ret = generic_file_aio_read(iocb, iov, nr_segs, pos);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(blkdev_aio_read);
+
 /*
  * Write data to the block device. Only intended for the block device itself
  * and the raw driver which basically is a fake block device.
@@ -1578,6 +1612,7 @@ ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,
 			unsigned long nr_segs, loff_t pos)
 {
 	struct file *file = iocb->ki_filp;
+	struct block_device *bdev = I_BDEV(file->f_mapping->host);
 	struct blk_plug plug;
 	ssize_t ret;
 
@@ -1597,6 +1632,16 @@ ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,
 }
 EXPORT_SYMBOL_GPL(blkdev_aio_write);
 
+int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	int ret;
+	struct block_device *bdev = I_BDEV(file->f_mapping->host);
+
+	ret = generic_file_mmap(file, vma);
+
+	return ret;
+}
+
 /*
  * Try to release a page associated with block device when the system
  * is under memory pressure.
@@ -1627,9 +1672,9 @@ const struct file_operations def_blk_fops = {
 	.llseek		= block_llseek,
 	.read		= do_sync_read,
 	.write		= do_sync_write,
-	.aio_read	= generic_file_aio_read,
+	.aio_read	= blkdev_aio_read,
 	.aio_write	= blkdev_aio_write,
-	.mmap		= generic_file_mmap,
+	.mmap		= blkdev_mmap,
 	.fsync		= blkdev_fsync,
 	.unlocked_ioctl	= block_ioctl,
 #ifdef CONFIG_COMPAT
diff --git a/include/linux/fs.h b/include/linux/fs.h
index aa11047..15c481d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -10,6 +10,7 @@
 #include <linux/ioctl.h>
 #include <linux/blk_types.h>
 #include <linux/types.h>
+#include <linux/percpu-rwsem.h>
 
 /*
  * It's silly to have NR_OPEN bigger than NR_FILE, but you can change
@@ -724,6 +725,8 @@ struct block_device {
 	int			bd_fsfreeze_count;
 	/* Mutex for freeze */
 	struct mutex		bd_fsfreeze_mutex;
+	/* A semaphore that prevents I/O while block size is being changed */
+	struct percpu_rw_semaphore bd_block_size_semaphore;
 };
 
 /*
@@ -2564,6 +2567,8 @@ extern int generic_segment_checks(const struct iovec *iov,
 		unsigned long *nr_segs, size_t *count, int access_flags);
 
 /* fs/block_dev.c */
+extern ssize_t blkdev_aio_read(struct kiocb *iocb, const struct iovec *iov,
+			       unsigned long nr_segs, loff_t pos);
 extern ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,
 				unsigned long nr_segs, loff_t pos);
 extern int blkdev_fsync(struct file *filp, loff_t start, loff_t end,
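
(For the padding experiment mentioned above -- keeping sizeof(struct
block_device) identical across kernels without doing any locking -- the
control patch would presumably look something like this hypothetical hunk
against the vanilla tree, together with the <linux/percpu-rwsem.h> include
from the patch above; the bd_padding name is invented:)

--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ struct block_device {
 	/* Mutex for freeze */
 	struct mutex bd_fsfreeze_mutex;
+	/* padding only: match the layout of the patched kernels,
+	 * but do no locking at all */
+	char bd_padding[sizeof(struct percpu_rw_semaphore)];
 };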

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
