Linux Archive - Device-mapper Development (dm-devel)
dm-thin f.req. : SEEK_DATA / SEEK_HOLE / SEEK_DISCARD
(http://www.linux-archive.org/device-mapper-development/661791-dm-thin-f-req-seek_data-seek_hole-seek_discard.html)

Spelic 05-01-2012 12:53 PM

dm-thin f.req. : SEEK_DATA / SEEK_HOLE / SEEK_DISCARD
 
Dear dm-thin developers,
I think it would be immensely useful to have a SEEK_DATA / SEEK_HOLE
implementation for dm-thin, and/or even for the older non-thin
snapshotting mechanism.
This would make it possible to implement a mechanism like the acclaimed
"zfs send" with dm snapshots, i.e. to cheaply replicate a thin snapshot
remotely once the parent snapshot has already been replicated.
Extremely useful, imho.

Is there any plan to do that?
The "HOLE" would mean "data comes from parent snapshot/device", while
DATA is "data that has changed since the parent snapshot". Discarded
regions that were not discarded in the parent snapshot should preferably
appear as zeroed DATA and not HOLE, or a new type SEEK_DISCARD because
if you make it HOLE, you lose information (you lose: "such data region
was meaningful in the parent snapshot but is not meaningful in the child
snapshot", and this kind of information cannot be recovered later in any
way) and you lose the property that reading those regions return zeroed
data, which is a major problem for backups, see next paragraph.
Instead, if a discarded region returns zeroed DATA, not much information
is lost because any long string of zeroes is interchangeable with a
discard, i.e. you can detect zeroes and perform the discard afterwards.
A new type SEEK_DISCARD could still be better.
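
To make the use case concrete: the replication loop I have in mind is
just the standard sparse-file walk below, imagining for a moment that a
thin device accepted the same lseek(2) SEEK_DATA / SEEK_HOLE flags that
some filesystems already implement for regular files (purely
hypothetical today, so the sketch runs against a sparse file):

    #define _GNU_SOURCE          /* for SEEK_DATA / SEEK_HOLE */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Print the allocated ("DATA") extents of a sparse file -- the same
     * loop a replicator would run over a thin device if the block layer
     * grew SEEK_DATA / SEEK_HOLE support as requested above. */
    int main(int argc, char **argv)
    {
        if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        off_t end = lseek(fd, 0, SEEK_END);
        off_t pos = 0;

        for (;;) {
            off_t data = lseek(fd, pos, SEEK_DATA);   /* next allocated region */
            if (data < 0 || data >= end)
                break;                                /* past the last extent  */
            off_t hole = lseek(fd, data, SEEK_HOLE);  /* end of that region    */
            if (hole < 0)
                hole = end;
            printf("data: %lld..%lld\n", (long long)data, (long long)hole);
            /* A replicator would read [data, hole) here and ship it remotely. */
            pos = hole;
        }
        close(fd);
        return 0;
    }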


Another question / feature request: I would like to know whether
reading an area of a thin device after a discard is guaranteed to
return zeroes (and/or whether it can be identified as empty from
userspace via a SEEK_DATA / SEEK_HOLE or equivalent mechanism). This
would be very important for backups, so that an old, now-unused region
does not come out as barely compressible garbage.
If yes: how big does such a discarded area have to be before it is seen
from userspace as hole/zeroes: 512 bytes, 4K, or 64M? E.g. will a
512-byte discarded area surrounded by non-discarded data return zeroes
on read?
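
For what it's worth, the empirical check I have in mind is something
like the sketch below: discard part of a scratch device with the
BLKDISCARD ioctl, read the region back, and see whether it is zeroed
(destructive, so only for a throwaway test device given on the command
line):

    #include <fcntl.h>
    #include <linux/fs.h>        /* BLKDISCARD */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    /* Discard the first 1 MiB of a scratch device, then read the first
     * 4 KiB back and report whether it is all zeroes. */
    int main(int argc, char **argv)
    {
        if (argc != 2) { fprintf(stderr, "usage: %s <scratch-device>\n", argv[0]); return 1; }

        int fd = open(argv[1], O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        uint64_t range[2] = { 0, 1 << 20 };          /* byte offset, byte length */
        if (ioctl(fd, BLKDISCARD, &range) < 0) { perror("BLKDISCARD"); return 1; }

        char buf[4096], zero[4096] = { 0 };
        if (pread(fd, buf, sizeof buf, 0) != (ssize_t)sizeof buf) { perror("pread"); return 1; }
        printf("first 4k after discard is %szeroed\n",
               memcmp(buf, zero, sizeof buf) ? "NOT " : "");
        close(fd);
        return 0;
    }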


Thank you
S.


Christoph Hellwig 05-01-2012 01:08 PM

dm-thin f.req. : SEEK_DATA / SEEK_HOLE / SEEK_DISCARD
 
On Tue, May 01, 2012 at 02:53:22PM +0200, Spelic wrote:
> Dear dm-thin developers,
> I thought that it would be immensely useful to have a SEEK_DATA /
> SEEK_HOLE implementation for dm-thin and/or even for the older
> non-thin snapshotting mechanism.

You can't implement it directly, as device mapper doesn't implement the
file operations. But I think adding block device operations to back it
would be a good idea; they could also be implemented for arrays that do
thin provisioning, using the GET LBA STATUS SCSI command.
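
For SCSI devices that support logical block provisioning, GET LBA STATUS
can already be driven from userspace through SG_IO. The rough sketch
below decodes only the first returned descriptor and skips sense
handling entirely:

    #include <fcntl.h>
    #include <scsi/sg.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    /* Issue GET LBA STATUS (opcode 0x9e, service action 0x12) and print
     * the provisioning status of the region containing the given LBA. */
    int main(int argc, char **argv)
    {
        if (argc != 3) { fprintf(stderr, "usage: %s <device> <lba>\n", argv[0]); return 1; }
        uint64_t lba = strtoull(argv[2], NULL, 0);

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        uint8_t cdb[16] = { 0x9e, 0x12 };            /* SERVICE ACTION IN / GET LBA STATUS */
        for (int i = 0; i < 8; i++)                  /* starting LBA, big-endian, bytes 2-9 */
            cdb[2 + i] = lba >> (8 * (7 - i));
        uint8_t resp[32] = { 0 };
        cdb[13] = sizeof resp;                       /* allocation length, bytes 10-13 */

        uint8_t sense[32];
        struct sg_io_hdr io = {
            .interface_id    = 'S',
            .cmd_len         = sizeof cdb,
            .cmdp            = cdb,
            .dxfer_direction = SG_DXFER_FROM_DEV,
            .dxferp          = resp,
            .dxfer_len       = sizeof resp,
            .sbp             = sense,
            .mx_sb_len       = sizeof sense,
            .timeout         = 10000,                /* milliseconds */
        };
        if (ioctl(fd, SG_IO, &io) < 0) { perror("SG_IO"); return 1; }

        /* First LBA status descriptor starts at byte 8: 8 bytes starting
         * LBA, 4 bytes block count, then the provisioning status in the
         * low nibble (0 = mapped, 1 = deallocated, 2 = anchored). */
        uint32_t nblocks = (uint32_t)resp[16] << 24 | (uint32_t)resp[17] << 16 |
                           (uint32_t)resp[18] << 8  | resp[19];
        printf("lba %llu: %u blocks, status %u\n",
               (unsigned long long)lba, nblocks, resp[20] & 0x0f);
        close(fd);
        return 0;
    }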


Joe Thornber 05-01-2012 02:10 PM

dm-thin f.req. : SEEK_DATA / SEEK_HOLE / SEEK_DISCARD
 
On Tue, May 01, 2012 at 02:53:22PM +0200, Spelic wrote:
> Dear dm-thin developers,
> I thought that it would be immensely useful to have a SEEK_DATA /
> SEEK_HOLE implementation for dm-thin and/or even for the older
> non-thin snapshotting mechanism.
> This would allow to implement a mechanism like the acclaimed "zfs
> send" with dm snapshots, i.e. cheaply replicate a thin snapshot
> remotely once the parent snapshot has been replicated already.
> Extremely useful imho.
> Is there any plan to do that?

I'm planning to do replication via userland. There's a new message
that allows userland to access a read-only copy of the metadata. From
this, and using some intermediate snapshots we can work out what data
is changing and replicate it (asynchronously).

> The "HOLE" would mean "data comes from parent snapshot/device",
> while DATA is "data that has changed since the parent snapshot".

This sounds like the external snapshots feature that I just added.
See documentation in latest kernel.

> Another question / feature request: I would like to know if reading
> an area of a thin device after a discard is guaranteed to return
> zeroes (and/or can be identified as empty from userspace via a
> seek_data / seek_hole or equivalent mechanism).

A great question. If the discard exactly covers some dm-thin blocks,
then the mappings will be removed. Any future io to that block will
trigger the block to be reprovisioned. Whether you are guaranteed to
get zeroes back depends on whether you've set the block zeroing flag in
the pool.

Any partial block discards will get passed down to the underlying data
device (assuming you've selected that option). Any zeroing side
effects depend on the underlying device.
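
In other words, only the whole pool blocks that fall inside a discard
can have their mappings removed; the arithmetic is just rounding the
start of the discard up and its end down to block boundaries, as in this
trivial sketch (the block size is a free parameter):

    #include <stdint.h>
    #include <stdio.h>

    /* Which whole pool blocks does a discard of [offset, offset+len) cover?
     * Only those can have their mappings removed; the partial head and
     * tail are at best passed down to the data device. */
    static void covered_blocks(uint64_t offset, uint64_t len, uint64_t block_size)
    {
        uint64_t first = (offset + block_size - 1) / block_size;   /* round up   */
        uint64_t last  = (offset + len) / block_size;              /* round down */

        if (first < last)
            printf("blocks %llu..%llu can be unmapped\n",
                   (unsigned long long)first, (unsigned long long)(last - 1));
        else
            printf("no whole block covered; nothing unmapped\n");
    }

    int main(void)
    {
        uint64_t block = 512 * 1024;                 /* 512 KiB pool block size   */
        covered_blocks(0,    4096,      block);      /* small discard: nothing    */
        covered_blocks(0,    4 * block, block);      /* blocks 0..3 unmapped      */
        covered_blocks(1024, 2 * block, block);      /* misaligned: only block 1  */
        return 0;
    }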

As for identifying empty blocks from userland: there is an inherent
race here. What would you do with the info?

- Joe


Spelic 05-01-2012 03:52 PM

dm-thin f.req. : SEEK_DATA / SEEK_HOLE / SEEK_DISCARD
 
On 05/01/12 16:10, Joe Thornber wrote:
> On Tue, May 01, 2012 at 02:53:22PM +0200, Spelic wrote:
> > Dear dm-thin developers,
> > I thought that it would be immensely useful to have a SEEK_DATA /
> > SEEK_HOLE implementation for dm-thin and/or even for the older
> > non-thin snapshotting mechanism.
> > This would allow to implement a mechanism like the acclaimed "zfs
> > send" with dm snapshots, i.e. cheaply replicate a thin snapshot
> > remotely once the parent snapshot has been replicated already.
> > Extremely useful imho.
> > Is there any plan to do that?
>
> I'm planning to do replication via userland. There's a new message
> that allows userland to access a read-only copy of the metadata. From
> this, and using some intermediate snapshots we can work out what data
> is changing and replicate it (asynchronously).
>
> > The "HOLE" would mean "data comes from parent snapshot/device",
> > while DATA is "data that has changed since the parent snapshot".
>
> This sounds like the external snapshots feature that I just added.
> See documentation in latest kernel.


I'm looking at it right now.
Well, I was thinking of a parent snapshot and a child snapshot (or at
any rate an older and a more recent snapshot of the same device), so
I'm not sure that's the feature I need... probably I'm missing
something and need to study more.




> > Another question / feature request: I would like to know if reading
> > an area of a thin device after a discard is guaranteed to return
> > zeroes (and/or can be identified as empty from userspace via a
> > seek_data / seek_hole or equivalent mechanism).
>
> A great question. If the discard exactly covers some dm-thin blocks,


I'm not sure I have understood the full nomenclature of dm-thin yet :-)
... "dm-thin blocks" would be the same thing as the so-called "pool
blocksize" discussed in the thread "Re: [PATCH 2/2] dm thin: support
for non power of 2 pool blocksize", right? So that's customizable now
and not necessarily a power of 2...


But those are quite big anyway; the default is what, 64 megabytes?
(Which is in fact a good thing for preventing excessive
fragmentation...)


Now an obvious question:
If userspace sends multiple smaller discards that eventually cover the
whole block, the block will still be unmapped correctly, right?
If yes: then you do preserve the information about which part of the
block has already been discarded and which part has not... so it would
be possible to return zeroes when the unmapped sub-part of the block is
read... right?




> then the mappings will be removed. Any future io to that block will
> trigger the block to be reprovisioned.


(Note: here we are talking about a full block that is now unmapped, a
different situation from the one above.)
Ok, supposing I do *not* write, so it does not get reprovisioned: what
does reading from there return? Does it return zeroes, or does it
return nonzero data coming from the parent snapshot at the same offset?





> ...
> As for identifying empty blocks from userland: there is an inherent
> race here. What would you do with the info?


You are right, I would definitely need to take a snapshot prior to
reading it... so consider my question to be about reading a snapshot of
a device which has been partially discarded...



Thank you
S.


Joe Thornber 05-03-2012 09:14 AM

dm-thin f.req. : SEEK_DATA / SEEK_HOLE / SEEK_DISCARD
 
On Tue, May 01, 2012 at 05:52:45PM +0200, Spelic wrote:
> I'm looking at it right now
> Well, I was thinking at a parent snapshot and child snapshot (or
> anyway an older and a more recent snapshot of the same device) so
> I'm not sure that's the feature I needed... probably I'm missing
> something and need to study more

I'm not really following you here. You can have arbitrary depth of
snapshots (snaps of snaps) if that helps.

> I'm not sure I have understood the full nomenclature of dm-thin yet
> :-) ... "dm-thin blocks" would be the same thing as so called "pool
> blocksize" as talked in the thread " Re: [PATCH 2/2] dm thin:
> support for non power of 2 pool blocksize" right? so that's
> customizable now and not necessarily in power of 2...
>
> But those are anyway quite big, default is what, 64 megabytes?
> (which is in fact a good thing for preventing excessive
> fragmentation...)

Yes, this is the pool block size; it's the atomic unit used for
provisioning and copy-on-write. I think the LVM2 tools default this
to 512 _k_. You'd only set it to 64M if you had little interest in
snapshot performance.

> Now an obvious question:
> If userspace sends multiple smaller discards eventually covering the
> whole block, the block will still be unmapped correctly, right?

No, I don't track anything smaller than a block. (Note, blocks are
typically much smaller than you've been envisioning.)

> If yes: so you do preserve the information of what part of the block
> has already been discarded, and what part is not... so it would
> be possible to return zeroes if the unmapped sub-part of the block
> is being read... right?

No, but the underlying device may do ...

> >then the mappings will be removed. Any future io to that block will
> >trigger the block to be reprovisioned.
>
> (note: here we are talking of a full block now unmapped, different
> situation from above)
> Ok, supposing I do *not* write, so it does not get reprovisioned,
> what does reading from there return; does it return zeroes, or it
> returns nonzero data coming from the parent snapshot at the same
> offset?

zeroes.

> >...
> >As for identifying empty blocks from userland: there is an inherent
> >race here. What would you do with the info?
>
> You are right , I would definitely need to take a snapshot prior to
> reading that... so consider my question related to reading a
> snapshot of a device which has been partially discarded...

Y, I'll provide tools to let you do this. If you wish to help with
writing a replicator please email me. It's a project I'm keen to get
going.

- Joe


Spelic 05-04-2012 05:16 PM

dm-thin f.req. : SEEK_DATA / SEEK_HOLE / SEEK_DISCARD
 
On 05/03/12 11:14, Joe Thornber wrote:
> On Tue, May 01, 2012 at 05:52:45PM +0200, Spelic wrote:
> > I'm looking at it right now
> > Well, I was thinking at a parent snapshot and child snapshot (or
> > anyway an older and a more recent snapshot of the same device) so
> > I'm not sure that's the feature I needed... probably I'm missing
> > something and need to study more
>
> I'm not really following you here. You can have arbitrary depth of
> snapshots (snaps of snaps) if that helps.


I'm not following you either (you pointed me to the external snapshot
feature, but this would not be an "external origin", methinks...?).
This is probably irrelevant, though, after seeing the rest of the
replies, because I now finally understand what metadata is available
inside dm-thin. Thanks for such clear replies.


With your implementation there's the problem of fragmentation and RAID
alignment versus the discard implementation. With concurrent access to
many thin-provisioned devices, if the blocksize is small, fragmentation
is likely to come out badly: HDD streaming reads can suffer a lot on
fragmented areas (up to a factor of 1000), and on parity RAID, write
performance would also suffer. If instead the blocksize is set large
(such as one RAID stripe), block unmapping on discards is not likely to
work, because one discard per file would be received but most files
would be smaller than a thinpool block (smaller than a RAID stripe: in
fact it is recommended that the RAID chunk be made equal to the expected
average file size, so the average file size and average discard size
would be 1/N of the thinpool block size), so nothing would be
unprovisioned.


There would be another way to do it (please excuse my obvious arrogance;
I know I should write code instead of emails): two layers. The blocksize
for provisioning is e.g. 64M (this one should be customizable, like you
have now), while the blocksize for tracking writes and discards is e.g.
4K. You make the btree only for the 64M blocks, and inside each of those
you keep 2 bitmaps tracking its 16384 4K sub-blocks. One bit is "4K
block has been written"; if this is zero, reads go to the parent
snapshot (this avoids CoW costs when provisioning a new 64M block). The
other bit is "4K block has been discarded"; if this is set, reads return
zeroes, and if all 16384 bits are set, the 64M block gets unprovisioned.
This would play well with RAID alignment, with HDD fragmentation, with
CoW (normally no CoW is performed if writes are 4K or bigger... "read
optimizations" could do that afterwards if needed), with multiple small
discards, with tracking differences between a parent snapshot and the
current snapshot for remote replication, and with compressed backups,
which would see zeroes on all discarded areas.
It should be possible to add this to your implementation, because the
added metadata is just 2 more bitmaps for each block than what you have
now.
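
Roughly, the per-block metadata and read-path decision I'm imagining
would look something like the sketch below (the names and layout are
made up for illustration and are of course not the real dm-thin
metadata format):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* One 64 MiB provisioning block tracks its 16384 4 KiB sub-blocks
     * with two bitmaps (2 KiB each, so 4 KiB of extra metadata per block). */
    #define PROV_BLOCK_SIZE   (64u << 20)                           /* 64 MiB     */
    #define SUB_BLOCK_SIZE    (4u  << 10)                           /* 4 KiB      */
    #define SUBS_PER_BLOCK    (PROV_BLOCK_SIZE / SUB_BLOCK_SIZE)    /* 16384      */
    #define BITMAP_WORDS      (SUBS_PER_BLOCK / 64)                 /* 256 x u64  */

    struct prov_block {
        uint64_t data_block;                 /* where the 64 MiB block lives     */
        uint64_t written[BITMAP_WORDS];      /* bit set: sub-block was written   */
        uint64_t discarded[BITMAP_WORDS];    /* bit set: sub-block was discarded */
    };

    static bool test_bit(const uint64_t *map, unsigned i)
    {
        return (map[i / 64] >> (i % 64)) & 1;
    }

    /* Read-path decision for one 4 KiB sub-block, as described above. */
    enum read_source { READ_ZEROES, READ_THIS_DEVICE, READ_PARENT };

    static enum read_source classify_read(const struct prov_block *b, unsigned sub)
    {
        if (test_bit(b->discarded, sub))
            return READ_ZEROES;              /* discarded: reads return zeroes   */
        if (test_bit(b->written, sub))
            return READ_THIS_DEVICE;         /* locally written data             */
        return READ_PARENT;                  /* fall through to parent snapshot  */
    }

    /* If every sub-block has been discarded, the 64 MiB block can be
     * un-provisioned. */
    static bool can_unprovision(const struct prov_block *b)
    {
        for (unsigned w = 0; w < BITMAP_WORDS; w++)
            if (b->discarded[w] != ~0ULL)
                return false;
        return true;
    }

    int main(void)
    {
        static struct prov_block b;                  /* all-zero: fully unwritten */
        b.written[0]   |= 1;                         /* sub-block 0 written       */
        b.discarded[1] |= 1ULL << 3;                 /* sub-block 67 discarded    */

        printf("%d %d %d\n", classify_read(&b, 0),   /* READ_THIS_DEVICE (1)      */
                             classify_read(&b, 67),  /* READ_ZEROES (0)           */
                             classify_read(&b, 5));  /* READ_PARENT (2)           */
        printf("can unprovision: %d\n", can_unprovision(&b));   /* 0 */
        return 0;
    }
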
I would really like to try to write code for this, but unfortunately I
foresee I won't have time to write code for a good while.
With this I don't want to appear as if I don't appreciate your current
implementation, which is great work and was very much needed; in fact I
will definitely use it for our production systems after 3.4 is stable
(I was waiting for discards).




> Y, I'll provide tools to let you do this. If you wish to help with
> writing a replicator please email me. It's a project I'm keen to get
> going.


Thanks for the opportunity, but for now it seems I can only be a leech;
at most I have time for writing a few emails :-(


Thank you
S.


Joe Thornber 05-09-2012 07:55 AM

dm-thin f.req. : SEEK_DATA / SEEK_HOLE / SEEK_DISCARD
 
On Fri, May 04, 2012 at 07:16:52PM +0200, Spelic wrote:
> On 05/03/12 11:14, Joe Thornber wrote:
> >On Tue, May 01, 2012 at 05:52:45PM +0200, Spelic wrote:
> >>I'm looking at it right now
> >>Well, I was thinking at a parent snapshot and child snapshot (or
> >>anyway an older and a more recent snapshot of the same device) so
> >>I'm not sure that's the feature I needed... probably I'm missing
> >>something and need to study more
> >I'm not really following you here. You can have arbitrary depth of
> >snapshots (snaps of snaps) if that helps.
>
> I'm not following you either (you pointed me to the external
> snapshot feature but this would not be an "external origin"
> methinks...?),

Yes, it's a snapshot of an external origin.

> With your implementation there's the problem of fragmentation and
> RAID alignment vs discards implementation.

This is always going to be an issue with thin provisioning.

> (such as one RAID stripe), block unmapping on discards is not likely
> to work because one discard per file would be received but most
> files would be smaller than a thinpool block (smaller than a RAID
> stripe: in fact it is recommended that the raid chunk is made equal
> to the prospected average file size so average file size and average
> discard size would be 1/N of the thinpool block size) so nothing
> would be unprovisioned.

You're right. In general discard is an expensive operation (on all
devices, not just thin), so you want to use it infrequently and on
large chunks. I suspect that most people, rather than turning on
discard within the file system, will just periodically run a cleanup
program that inspects the fs and discards unused blocks.
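
Such a cleanup program essentially exists already as fstrim(8), which
drives the FITRIM ioctl against a mounted filesystem; a minimal version
is roughly:

    #include <fcntl.h>
    #include <linux/fs.h>      /* FITRIM, struct fstrim_range */
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    /* Ask the filesystem mounted at the given path to discard all of its
     * unused blocks -- essentially what the fstrim(8) utility does. */
    int main(int argc, char **argv)
    {
        if (argc != 2) { fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]); return 1; }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct fstrim_range range = {
            .start  = 0,
            .len    = ~0ULL,   /* whole filesystem */
            .minlen = 0,       /* let the fs pick a sensible minimum extent */
        };
        if (ioctl(fd, FITRIM, &range) < 0) { perror("FITRIM"); return 1; }

        /* The kernel updates range.len to the number of bytes trimmed. */
        printf("discarded %llu bytes\n", (unsigned long long)range.len);
        close(fd);
        return 0;
    }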

> There would be another way to do it (pls excuse my obvious arrogance
> and I know I should write code instead of write emails) two layers:
> blocksize for provisioning is e.g. 64M (this one should be
> customizable like you have now), while blocksize for tracking writes
> and discards is e.g. 4K. You make the btree only for the 64M blocks,
> and inside that you keep 2 bitmaps for tracking its 16384
> 4K-blocks.

Yes, we could track discards and aggregate them into bigger blocks.
Doing so would require more metadata and more commits (which are
synchronous operations). The two-block-size approach has a lot going
for it, but it adds a lot of complexity - I deliberately kept thin
simple. One concern I have is that it demotes snapshots to second-class
citizens, since they're composed of the smaller blocks and will not
have the adjacency properties of a thin device provisioned solely with
big blocks. I'd rather just do the CoW on the whole block, and boost
performance by putting an SSD (via a caching target) in front of the
data device. That way the CoW would complete very quickly, and could be
written back to the device slowly in the background iff it's
infrequently used.

- Joe


