06-19-2012, 08:39 PM
Dave Chinner

Ext4 and xfs problems in dm-thin on allocation and discard

On Tue, Jun 19, 2012 at 04:21:30PM -0400, Ted Ts'o wrote:
> On Wed, Jun 20, 2012 at 06:06:31AM +1000, Dave Chinner wrote:
> > > But in general xfs is issuing discards with much smaller extents than
> > > ext4 does, e.g.:
> >
> > That's normal when you use -o discard - XFS sends extremely
> > fine-grained discards, as they have to be issued during the checkpoint
> > commit that frees the extent. Hence they can't be aggregated as is
> > done in ext4.
>
> Actually, ext4 is also sending the discards during (well, actually,
> after) the commit which frees the extent/inode. We do aggregate them
> while the commit is open, but once the transaction is committed, we
> send out the discards. I suspect the difference is in the granularity
> of the transactions between ext4 and xfs.

Exactly - XFS transactions are fine-grained, checkpoints are coarse.
We don't merge extents freed in fine-grained transactions inside
checkpoints. We probably could, but, well, it's complex to do in XFS
and merging adjacent requests is something the block layer is
supposed to do....

> > As it is, no-one really should be using -o discard - it is extremely
> > inefficient compared to a background fstrim run given that discards
> > are unqueued, blocking IOs. It's just a bad idea until the lower
> > layers get fixed to allow asynchronous, vectored discards and SATA
> > supports queued discards...
>
> What Dave said. :-) This is true for both ext4 and xfs.
>
> As a result, I can very easily see there being a distinction made
> between when we *do* want to pass the discards all the way down to the
> device, and when we only want the thinp layer to process them ---
> because for current devices, sending discards down to the physical
> device is very heavyweight.
>
> I'm not sure how we could do this without a nasty layering violation,
> but some way in which we could label fstrim discards versus "we've
> committed the unlink/truncate and so thinp can feel free to reuse
> these blocks" discards would be interesting to consider.

I think if we had better discard support from the block layer, it
wouldn't matter from a filesystem POV what discard support is
present in the layers below it. I think it's better to get the
block layer interface fixed than to add new request types/labels to
filesystems to work around the current deficiencies.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

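For reference, a minimal sketch of the recommendation above in command form; the device and mount point names are illustrative assumptions:

  # Online discard: the filesystem issues small, synchronous discards
  # as extents are freed (the behaviour discouraged above).
  mount -o discard /dev/mapper/thin /mnt

  # Preferred: mount without -o discard and trim in batches (e.g. from
  # cron), letting fstrim cover large, contiguous ranges in one pass.
  mount /dev/mapper/thin /mnt
  fstrim -v /mnt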
 
06-19-2012, 08:44 PM
Mike Snitzer

Ext4 and xfs problems in dm-thin on allocation and discard

On Tue, Jun 19 2012 at 3:58pm -0400,
Ted Ts'o <tytso@mit.edu> wrote:

> On Tue, Jun 19, 2012 at 11:28:56AM -0400, Mike Snitzer wrote:
> >
> > That is an lvm2 BZ but there is further kernel work needed.
> >
> > It should be noted that the "external origin" feature was added to the
> > thinp target with this commit:
> > http://git.kernel.org/linus/2dd9c257fbc243aa76ee6d
> >
> > It is a start, but the external origin is kept read-only and any writes
> > trigger allocation of new blocks within the thin-pool.
>
> Hmm... maybe this is what I had been told. I thought there was some
> feature where you could take a read-only thinp snapshot of an external
> volume (i.e., a pre-existing LVM2 volume, or a block device), and then
> after that, make read-write snapshots using the read-only snapshot as
> a base? Is that something that works today, or is planned? Or am I
> totally confused?

The commit I referenced basically provides that capability.

> And if it is something that works today, is there a web site or
> documentation file that gives a recipe for how to use it if we want to
> do some performance experiments (i.e., it doesn't have to be a user
> friendly interface if that's not ready yet).

Documentation/device-mapper/thin-provisioning.txt has details on how to
use dmsetup to create a thin device that uses a read-only external
origin volume (so all reads to unprovisioned areas of the thin device
will be remapped to the external origin -- "external" meaning the volume
outside of the thin-pool).

The creation of a thin device w/ a read-only external origin gets you
started with a thin device that is effectively a snapshot of the origin
volume. That thin device is read-write -- all writes are provisioned
from the thin-pool that is backing the thin device. And you can take
snapshots (or recursive snapshots) of that thin device.

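A minimal sketch of that recipe, following Documentation/device-mapper/thin-provisioning.txt; the device names and sector counts are illustrative assumptions:

  # Create a thin-pool: metadata dev, data dev, 128-sector (64KiB)
  # data block size, low-water mark of 32768 blocks.
  dmsetup create pool \
      --table "0 20971520 thin-pool /dev/mapper/meta /dev/mapper/data 128 32768"

  # Create thin device 0 in the pool...
  dmsetup message /dev/mapper/pool 0 "create_thin 0"

  # ...then map it with /dev/image as a read-only external origin:
  # reads of unprovisioned areas are remapped to /dev/image, and
  # writes are provisioned from the pool.
  dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 0 /dev/image"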
 
06-19-2012, 09:37 PM
Spelic

Ext4 and xfs problems in dm-thin on allocation and discard

On 06/19/12 22:06, Dave Chinner wrote:
> On Tue, Jun 19, 2012 at 02:48:59PM -0400, Mike Snitzer wrote:
>> On Tue, Jun 19 2012 at 10:44am -0400,
>> Mike Snitzer<snitzer@redhat.com> wrote:
>>> On Tue, Jun 19 2012 at 9:52am -0400,
>>> Spelic<spelic@shiftmail.org> wrote:
>>>> I do not know by what mechanism xfs fails to unmap
>>>> blocks from dm-thin, but it really can't.
>>>> If anyone has dm-thin installed he can try. This is 100%
>>>> reproducible for me.
>>> I was initially surprised by this considering the thinp-test-suite does
>>> test a compilebench workload against xfs and ext4 using online discard
>>> (-o discard).
>>>
>>> But I just modified that test to use a thin-pool with 'ignore_discard'
>>> and the test still passed on both ext4 and xfs.
>>>
>>> So there is more work needed in the thinp-test-suite to use blktrace
>>> hooks to verify that discards are occurring when the compilebench
>>> generated files are removed.
>>>
>>> I'll work through that and report back.
>> blktrace shows discards for both xfs and ext4.
>>
>> But in general xfs is issuing discards with much smaller extents than
>> ext4 does, e.g.:
> That's normal when you use -o discard - XFS sends extremely
> fine-grained discards, as they have to be issued during the checkpoint
> commit that frees the extent. Hence they can't be aggregated as is
> done in ext4.
>
> As it is, no-one really should be using -o discard - it is extremely
> inefficient compared to a background fstrim run given that discards
> are unqueued, blocking IOs. It's just a bad idea until the lower
> layers get fixed to allow asynchronous, vectored discards and SATA
> supports queued discards...



Could it be that the thin blocksize is larger than the discard
granularity used by xfs, so nothing ever gets unmapped?
I have tried thin pools with the default blocksize (64k afair with
lvm2) and 1MB.
HOWEVER I also have tried fstrim on xfs, and that is also not capable
of unmapping things from dm-thin.

What is the granularity with fstrim in xfs?
Sorry I can't access the machine right now; maybe tomorrow, or over
the weekend.


 
06-19-2012, 11:12 PM
Dave Chinner

Ext4 and xfs problems in dm-thin on allocation and discard

On Tue, Jun 19, 2012 at 11:37:54PM +0200, Spelic wrote:
> On 06/19/12 22:06, Dave Chinner wrote:
....
>
> Could it be that the thin blocksize is larger than the discard
> granularity used by xfs, so nothing ever gets unmapped?

For -o discard, possibly. For fstrim, unlikely.

> I have tried thin pools with the default blocksize (64k afair with
> lvm2) and 1MB.
> HOWEVER I also have tried fstrim on xfs, and that is also not capable
> of unmapping things from dm-thin.
> What is the granularity with fstrim in xfs?

Whatever granularity you passed to fstrim. You need to run an event
trace on XFS to find out whether it is issuing discards before going
any further...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

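For anyone wanting to run the verification Dave suggests, a sketch; it assumes the thin volume is /dev/mapper/thin mounted on /mnt, a kernel with tracing enabled, and the xfs_discard_extent tracepoint from the XFS FITRIM code of this era:

  # Watch which discards actually reach the thin device...
  blktrace -d /dev/mapper/thin -o - | blkparse -i - | grep ' D '

  # ...and trace the XFS FITRIM path directly while running fstrim.
  echo 1 > /sys/kernel/debug/tracing/events/xfs/xfs_discard_extent/enable
  fstrim -v /mnt
  cat /sys/kernel/debug/tracing/trace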
 
06-20-2012, 09:01 AM
Christoph Hellwig

Ext4 and xfs problems in dm-thin on allocation and discard

On Wed, Jun 20, 2012 at 06:39:38AM +1000, Dave Chinner wrote:
> Exactly - XFS transactions are fine-grained, checkpoints are coarse.
> We don't merge extents freed in fine-grained transactions inside
> checkpoints. We probably could, but, well, it's complex to do in XFS
> and merging adjacent requests is something the block layer is
> supposed to do....

Last time I checked it actually tries to do that for discard requests,
but then badly falls flat (=oopses). That's the reason why the XFS
transaction commit code still uses the highly suboptimal synchronous
blkdev_issue_discard instead of the async variant I wrote when designing
the code.

Another "issue" with the XFS discard pattern and the current block
layer implementation is that XFS frees a lot of small metadata like
inode clusters and btree blocks and discards them as well. If those
simply fill one of the vectors in a range ATA TRIM command and/or a
queueable command that's not much of an issue, but with the current
combination of non-queueable, non-vetored TRIM that's a fairly nasty
pattern.

So until the block layer is sorted out I cannot recommend actually
using -o discard. I planned to sort out the block layer issues ASAP
when writing that code, but other things have kept me busy ever since.

 
06-20-2012, 12:11 PM
Spelic

Ext4 and xfs problems in dm-thin on allocation and discard

Ok guys, I think I found the bug. One or more bugs.


Pool has chunksize 1MB.
In sysfs the thin volume has: queue/discard_max_bytes and
queue/discard_granularity are both 1048576.
And it has discard_alignment = 0, which based on sysfs-block
documentation is correct (a less misleading name would have been
discard_offset imho).

Here is the blktrace from ext4 fstrim:
...
252,9 17 498 0.030466556 841 Q D 19898368 + 2048 [fstrim]
252,9 17 499 0.030467501 841 Q D 19900416 + 2048 [fstrim]
252,9 17 500 0.030468359 841 Q D 19902464 + 2048 [fstrim]
252,9 17 501 0.030469313 841 Q D 19904512 + 2048 [fstrim]
252,9 17 502 0.030470144 841 Q D 19906560 + 2048 [fstrim]
252,9 17 503 0.030471381 841 Q D 19908608 + 2048 [fstrim]
252,9 17 504 0.030472473 841 Q D 19910656 + 2048 [fstrim]
252,9 17 505 0.030473504 841 Q D 19912704 + 2048 [fstrim]
252,9 17 506 0.030474561 841 Q D 19914752 + 2048 [fstrim]
252,9 17 507 0.030475571 841 Q D 19916800 + 2048 [fstrim]
252,9 17 508 0.030476423 841 Q D 19918848 + 2048 [fstrim]
252,9 17 509 0.030477341 841 Q D 19920896 + 2048 [fstrim]
252,9 17 510 0.034299630 841 Q D 19922944 + 2048 [fstrim]
252,9 17 511 0.034306880 841 Q D 19924992 + 2048 [fstrim]
252,9 17 512 0.034307955 841 Q D 19927040 + 2048 [fstrim]
252,9 17 513 0.034308928 841 Q D 19929088 + 2048 [fstrim]
252,9 17 514 0.034309945 841 Q D 19931136 + 2048 [fstrim]
252,9 17 515 0.034311007 841 Q D 19933184 + 2048 [fstrim]
252,9 17 516 0.034312008 841 Q D 19935232 + 2048 [fstrim]
252,9 17 517 0.034313122 841 Q D 19937280 + 2048 [fstrim]
252,9 17 518 0.034314013 841 Q D 19939328 + 2048 [fstrim]
252,9 17 519 0.034314940 841 Q D 19941376 + 2048 [fstrim]
252,9 17 520 0.034315835 841 Q D 19943424 + 2048 [fstrim]
252,9 17 521 0.034316662 841 Q D 19945472 + 2048 [fstrim]
252,9 17 522 0.034317547 841 Q D 19947520 + 2048 [fstrim]
...

Here is the blktrace from xfs fstrim:
252,12 16 1 0.000000000 554 Q D 96 + 2048 [fstrim]
252,12 16 2 0.000010149 554 Q D 2144 + 2048 [fstrim]
252,12 16 3 0.000011349 554 Q D 4192 + 2048 [fstrim]
252,12 16 4 0.000012584 554 Q D 6240 + 2048 [fstrim]
252,12 16 5 0.000013685 554 Q D 8288 + 2048 [fstrim]
252,12 16 6 0.000014660 554 Q D 10336 + 2048 [fstrim]
252,12 16 7 0.000015707 554 Q D 12384 + 2048 [fstrim]
252,12 16 8 0.000016692 554 Q D 14432 + 2048 [fstrim]
252,12 16 9 0.000017594 554 Q D 16480 + 2048 [fstrim]
252,12 16 10 0.000018539 554 Q D 18528 + 2048 [fstrim]
252,12 16 11 0.000019434 554 Q D 20576 + 2048 [fstrim]
252,12 16 12 0.000020879 554 Q D 22624 + 2048 [fstrim]
252,12 16 13 0.000021856 554 Q D 24672 + 2048 [fstrim]
252,12 16 14 0.000022786 554 Q D 26720 + 2048 [fstrim]
252,12 16 15 0.000023699 554 Q D 28768 + 2048 [fstrim]
252,12 16 16 0.000024672 554 Q D 30816 + 2048 [fstrim]
252,12 16 17 0.000025467 554 Q D 32864 + 2048 [fstrim]
252,12 16 18 0.000026374 554 Q D 34912 + 2048 [fstrim]
252,12 16 19 0.000027194 554 Q D 36960 + 2048 [fstrim]
252,12 16 20 0.000028137 554 Q D 39008 + 2048 [fstrim]
252,12 16 21 0.000029524 554 Q D 41056 + 2048 [fstrim]
252,12 16 22 0.000030479 554 Q D 43104 + 2048 [fstrim]
252,12 16 23 0.000031306 554 Q D 45152 + 2048 [fstrim]
252,12 16 24 0.000032134 554 Q D 47200 + 2048 [fstrim]
252,12 16 25 0.000032964 554 Q D 49248 + 2048 [fstrim]
252,12 16 26 0.000033794 554 Q D 51296 + 2048 [fstrim]


As you can see, while ext4 correctly aligns the discards to 1MB, xfs
does not.
It looks like an fstrim or xfs bug: they don't look at discard_alignment
(=0; a less misleading name would be discard_offset imho) or
discard_granularity (=1MB), and they don't align the requests to those.
Clearly dm-thin cannot unmap anything if a 1MB region is not fully
covered by a single discard. Note that specifying a large -m option
for fstrim does NOT widen the discards beyond 2048 sectors, and this
is correct because discard_max_bytes for that device is 1048576.
If discard_max_bytes could be made much larger, these kinds of bugs
could be ameliorated, especially in complex situations like layers over
layers, virtualization, etc.


Note that in ext4, too, there are parts of the discard without the 1MB
alignment, as seen with blktrace (outside my snippet), so this might
also need to be fixed, but most of it is aligned to 1MB. In xfs no
parts are aligned to 1MB.
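To make the misalignment concrete with the start sectors above: with a 1MB granularity (2048 sectors), a discard can only free a pool block if its start sector is a multiple of 2048. A quick shell check of the traces:

  # ext4's discards start on 1MB boundaries...
  echo $((19898368 % 2048))   # prints 0
  # ...while every xfs discard starts 96 sectors past one, so each
  # 2048-sector request straddles two pool blocks and neither block
  # can be unmapped.
  echo $((96 % 2048))         # prints 96
  echo $((2144 % 2048))       # prints 96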



Now, another problem:
Firstly I wanted to say that in my original post I missed the
conv=notrunc for dd: I complained about the performance because I
expected the zerofiles to be rewritten in place without block
re-provisioning by dm-thin, but clearly without conv=notrunc this was
not happening. I confirm that with conv=notrunc performance is high at
the first rewrite, in ext4 too, and the space occupied in the thin
volume does not increase at every rewrite by dd.

HOWEVER,
by NOT specifying conv=notrunc, the behaviour of dd / ext4 / dm-thin
differs depending on whether skip_block_zeroing is specified. If
skip_block_zeroing is not specified (provisioned blocks are pre-zeroed),
the space occupied by dd truncate + rewrite INCREASES at every rewrite,
while if skip_block_zeroing IS specified, dd truncate + rewrite DOES
NOT increase the space occupied on the thin volume. Note: try this on
ext4, not xfs.
This looks very strange to me. The only reason I can think of is some
kind of cooperative behaviour of ext4 with the variable
dm-X/queue/discard_zeroes_data, which is different in the two cases.
Can anyone give an explanation or check whether this is the intended
behaviour?
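For anyone reproducing the dd comparison, a sketch of the two cases; the path and sizes are illustrative assumptions, and pool usage can be compared between runs with dmsetup status:

  # Truncate + rewrite: without conv=notrunc dd truncates the file
  # first, so the filesystem frees and reallocates blocks on each run.
  dd if=/dev/zero of=/mnt/zerofile bs=1M count=512

  # In-place rewrite: conv=notrunc keeps the existing allocation, so
  # no new thin-pool blocks should be provisioned on rewrite.
  dd if=/dev/zero of=/mnt/zerofile bs=1M count=512 conv=notrunc

  # Reports used/total data blocks for the pool.
  dmsetup status pool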



And still an open question: why does the speed of provisioning new
blocks not increase with increasing chunk size (64K --> 1MB --> 16MB...),
not even when skip_block_zeroing has been set and there is no CoW?


 
06-20-2012, 10:53 PM
Dave Chinner

Ext4 and xfs problems in dm-thin on allocation and discard

On Wed, Jun 20, 2012 at 02:11:31PM +0200, Spelic wrote:
> Ok guys, I think I found the bug. One or more bugs.
>
>
> Pool has chunksize 1MB.
> In sysfs the thin volume has: queue/discard_max_bytes and
> queue/discard_granularity are both 1048576.
> And it has discard_alignment = 0, which based on sysfs-block
> documentation is correct (a less misleading name would have been
> discard_offset imho).
> Here is the blktrace from ext4 fstrim:
> ...
> 252,9 17 498 0.030466556 841 Q D 19898368 + 2048 [fstrim]
> 252,9 17 499 0.030467501 841 Q D 19900416 + 2048 [fstrim]
> 252,9 17 500 0.030468359 841 Q D 19902464 + 2048 [fstrim]
> ...
>
> Here is the blktrace from xfs fstrim:
> 252,12 16 1 0.000000000 554 Q D 96 + 2048 [fstrim]
> 252,12 16 2 0.000010149 554 Q D 2144 + 2048 [fstrim]
> 252,12 16 3 0.000011349 554 Q D 4192 + 2048 [fstrim]
> ...
>
>
> As you can see, while ext4 correctly aligns the discards to 1MB, xfs
> does not.

XFS just sends a large extent to blkdev_issue_discard(), and cares
nothing about discard alignment or granularity.

> It looks like an fstrim or xfs bug: they don't look at discard_alignment
> (=0; a less misleading name would be discard_offset imho) or
> discard_granularity (=1MB), and they don't align the requests to
> those.

It looks like blkdev_issue_discard() has reduced each discard to
bios of a single "granule" (1MB), and not aligned them, hence they
are ignored by dm-thinp.

What are the discard parameters exposed by dm-thinp in
/sys/block/<thinp-blkdev>/queue/discard* ?

It looks to me that dmthinp might be setting discard_max_bytes to
1MB rather than discard_granularity. Looking at dm-thin.c:

static void set_discard_limits(struct pool *pool, struct queue_limits *limits)
{
	/*
	 * FIXME: these limits may be incompatible with the pool's data device
	 */
	limits->max_discard_sectors = pool->sectors_per_block;

	/*
	 * This is just a hint, and not enforced. We have to cope with
	 * bios that overlap 2 blocks.
	 */
	limits->discard_granularity = pool->sectors_per_block << SECTOR_SHIFT;
	limits->discard_zeroes_data = pool->pf.zero_new_blocks;
}


Yes - discard_max_bytes == discard_granularity, and so
blkdev_issue_discard fails to align the request properly. As it is,
setting discard_max_bytes to the thinp block size is silly - it
means you'll never get range requests, and we send a discard for
every single block in a range rather than having the thinp code
iterate over a range itself.

i.e. this is not a filesystem bug that is causing the problem....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

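The discard parameters Dave asks about can be read straight from sysfs; a sketch, assuming the thin volume is dm-9 as in the (252,9) trace above:

  # With a 1MB pool chunk both queue limits report 1048576, i.e.
  # discard_max_bytes == discard_granularity, matching the reading of
  # set_discard_limits() above.
  grep . /sys/block/dm-9/queue/discard_granularity \
         /sys/block/dm-9/queue/discard_max_bytes \
         /sys/block/dm-9/discard_alignment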
 
06-21-2012, 05:47 PM
Mike Snitzer

Ext4 and xfs problems in dm-thin on allocation and discard

On Wed, Jun 20 2012 at 6:53pm -0400,
Dave Chinner <david@fromorbit.com> wrote:

> On Wed, Jun 20, 2012 at 02:11:31PM +0200, Spelic wrote:
> > Ok guys, I think I found the bug. One or more bugs.
....
>
> Yes - discard_max_bytes == discard_granularity, and so
> blkdev_issue_discard fails to align the request properly. As it is,
> setting discard_max_bytes to the thinp block size is silly - it
> means you'll never get range requests, and we sent a discard for
> every single block in a range rather than having the thinp code
> iterate over a range itself.

So 2 different issues:
1) blkdev_issue_discard isn't properly aligning
2) thinp should accept larger discards (up to the stacked
discard_max_bytes rather than setting an override)

> i.e. this is not a filesystem bug that is causing the problem....

Paolo Bonzini fixed blkdev_issue_discard to properly align some time
ago; unfortunately the patches slipped through the cracks (cc'ing Paolo,
Jens, and Christoph).

Here are references to Paolo's patches:
0/2 https://lkml.org/lkml/2012/3/14/323
1/2 https://lkml.org/lkml/2012/3/14/324
2/2 https://lkml.org/lkml/2012/3/14/325

Patch 2/2 specifically addresses the case where:
discard_max_bytes == discard_granularity

Paolo, any chance you could resend to Jens (maybe with hch's comments on
patch#2 accounted for)? Also, please add hch's Reviewed-by when
reposting.

(would love to see this fixed for 3.5-rcX, but if not, 3.6 it is)

 
06-21-2012, 11:29 PM
Dave Chinner

Ext4 and xfs problems in dm-thin on allocation and discard

On Thu, Jun 21, 2012 at 01:47:43PM -0400, Mike Snitzer wrote:
> On Wed, Jun 20 2012 at 6:53pm -0400,
> Dave Chinner <david@fromorbit.com> wrote:
>
> > On Wed, Jun 20, 2012 at 02:11:31PM +0200, Spelic wrote:
....
>
> So 2 different issues:
> 1) blkdev_issue_discard isn't properly aligning
> 2) thinp should accept larger discards (up to the stacked
> discard_max_bytes rather than setting an override)

Yes, in effect, but there's no real reason I can see why thinp can't
accept larger discard requests than the underlying stack and break
them up appropriately itself....

> > i.e. this is not a filesystem bug that is causing the problem....
>
> Paolo Bonzini fixed blkdev_issue_discard to properly align some time
> ago; unfortunately the patches slipped through the cracks (cc'ing Paolo,
> Jens, and Christoph).
>
> Here are references to Paolo's patches:
> 0/2 https://lkml.org/lkml/2012/3/14/323
> 1/2 https://lkml.org/lkml/2012/3/14/324
> 2/2 https://lkml.org/lkml/2012/3/14/325
>
> Patch 2/2 specifically addresses the case where:
> discard_max_bytes == discard_granularity
>
> Paolo, any chance you could resend to Jens (maybe with hch's comments on
> patch#2 accounted for)? Also, please add hch's Reviewed-by when
> reposting.
>
> (would love to see this fixed for 3.5-rcX but if not 3.6 it is?)

That would be good...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

 
07-01-2012, 02:53 PM
Paolo Bonzini

Ext4 and xfs problems in dm-thin on allocation and discard

On 21/06/2012 19:47, Mike Snitzer wrote:
> Paolo Bonzini fixed blkdev_issue_discard to properly align some time
> ago; unfortunately the patches slipped through the cracks (cc'ing Paolo,
> Jens, and Christoph).
>
> Here are references to Paolo's patches:
> 0/2 https://lkml.org/lkml/2012/3/14/323
> 1/2 https://lkml.org/lkml/2012/3/14/324
> 2/2 https://lkml.org/lkml/2012/3/14/325
>
> Patch 2/2 specifically addresses the case where:
> discard_max_bytes == discard_granularity
>
> Paolo, any chance you could resend to Jens (maybe with hch's comments on
> patch#2 accounted for)? Also, please add hch's Reviewed-by when
> reposting.

Sure, I'll do it this week. I just need to retest.

Paolo

 
