Old 08-04-2008, 08:22 AM
FUJITA Tomonori
 
Default dm snapshot: shared exception store

This is a new implementation of dm-snapshot.

The important design differences from the current dm-snapshot are:

- It uses one exception store per origin device that is shared by all snapshots.
- It doesn't keep the complete exception tables in memory.

I took the exception store code from Zumastor (http://zumastor.org/).

Zumastor is remote replication software: a local server sends the delta
between two snapshots to a remote server, and the remote server applies the
delta in an atomic manner, so the data on the remote server is always
consistent.

The Zumastor snapshot code fulfills the above two requirements, but it is
implemented in user space: the dm kernel module sends information about a
request to user space, and the user-space daemon tells the kernel what to
do.

The Zumastor user-space daemon needs to take care of replication, so the
user-space approach makes sense there, but I think that a pure user-space
approach is overkill just for snapshots. I prefer to implement snapshots in
kernel space (as the current dm-snapshot does). I think that we can add
features for remote replication software like Zumastor on top of that, that
is, features to provide user space with the delta between two snapshots and
to apply that delta in an atomic manner (via ioctl or something else).
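
Purely as an illustration (nothing like this exists in the posted patch;
every name and field below is made up), the kind of delta interface I have
in mind might look roughly like:

/* Hypothetical sketch only -- not part of the posted patch. */
#include <linux/types.h>
#include <linux/ioctl.h>

struct dm_snap_delta {
	__u64 snap_from;	/* id of the older snapshot */
	__u64 snap_to;		/* id of the newer snapshot */
	__u64 start_chunk;	/* resume point for iterating large deltas */
	__u64 nr_chunks;	/* in: buffer capacity, out: entries returned */
	__u64 buf;		/* user pointer to an array of chunk numbers */
};

/* Illustrative ioctl number; a real interface would pick its own. */
#define DM_SNAP_GET_DELTA _IOWR('S', 0x01, struct dm_snap_delta)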

Note that the code is still in a very early stage. There are lots of
TODO items:

- snapshot deletion support
- writable snapshot support
- protection for unexpected events (probably journaling)
- performance improvement (handling exception cache and format, locking, etc)
- better integration with the current snapshot code
- improvement on error handling
- cleanups
- generating a delta between two snapshots
- applying a delta in an atomic manner

The patch against 2.6.26 is available at:

http://www.kernel.org/pub/linux/kernel/people/tomo/dm-snap/0001-dm-snapshot-dm-snapshot-shared-exception-store.patch


Here's an example (/dev/sdb1 as an origin device and /dev/sdg1 as a cow device):

- creates the set of an origin and a cow:

flax:~# echo 0 `blockdev --getsize /dev/sdb1` snapshot-origin /dev/sdb1 /dev/sdg1 P2 16 |dmsetup create work

- no snapshot yet:

flax:~# dmsetup status
work: 0 125017767 snapshot-origin : no snapshot


- creates one snapshot (the id of the snapshot is 0):

flax:~# dmsetup message /dev/mapper/work 0 snapshot create 0


- creates one snapshot (the id of the snapshot is 1):

flax:~# dmsetup message /dev/mapper/work 0 snapshot create 1


- there are two snapshots (#0 and #1):

flax:~# dmsetup status
work: 0 125017767 snapshot-origin 0 1


- let's access the snapshots:

flax:~# echo 0 `blockdev --getsize /dev/sdb1` snapshot /dev/sdb1 0|dmsetup create work-snap0
flax:~# echo 0 `blockdev --getsize /dev/sdb1` snapshot /dev/sdb1 1|dmsetup create work-snap1

flax:~# ls /dev/mapper/
control work work-snap0 work-snap1

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 08-06-2008, 07:14 PM
Mikulas Patocka
 
Default dm snapshot: shared exception store

Hi

I looked at it.

Alasdair had some concerns about the interface on the phone call. From my
point of view, Fujita's interface is OK (using messages to manipulate the
snapshot storage and targets to access the snapshots). Alasdair, could you
please be more specific about your concerns?

What I would propose to change in the upcoming redesign:

- develop it as a separate target, not as a patch against dm-snapshot. The
code reuse from dm-snapshot is minimal, and keeping the old code around will
likely consume more coding time than the potential code reuse will save.

- drop the limitation of a maximum of 64 snapshots. If we are going to
redesign it, we should design it without such a limit, so that we won't
have to redesign it again (why do we need more than 64 --- for example, to
take periodic snapshots every few minutes to record system activity). The
limit on the number of snapshots can be dropped if we index b-tree nodes by
a key that contains the chunk number and the range of snapshot numbers to
which it applies (see the sketch below this list).

- add some metadata caching; don't re-read the b-tree from the root node
from disk all the time. Ideally the cache should be integrated with the page
cache so that its size is tuned automatically (I'm not sure whether that can
be coded cleanly, though).

- the b-tree is a good structure; I'd create a log-structured filesystem to
hold it. The advantage is that it requires less synchronization overhead in
clustering. A log-structured filesystem also gives you crash recovery (with
minimal coding overhead) and has very good write performance.

- deleting a snapshot --- this needs to walk the whole b-tree, which is
slow. Keeping another b-tree of the chunks belonging to a given snapshot
would be overkill. I think the best solution would be to split the device
into large areas and use a per-snapshot bitmap that says whether the
snapshot has any exceptions allocated in the pertaining area (similar to the
dirty bitmap of raid1; a rough sketch also follows this list). For
short-lived snapshots this saves walking the b-tree. For long-lived
snapshots there is no way to speed it up... But delete performance is not
that critical anyway, because deletion can be done asynchronously without
the user waiting for it.
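
Roughly the kind of key I have in mind (a sketch only, with invented names
--- not code from any existing store):

#include <linux/types.h>

/* Sketch: a key carrying the chunk number plus the range of snapshot ids
   the exception applies to, so the format itself imposes no limit on the
   number of snapshots. */
struct shared_exception_key {
	__u64 chunk;		/* logical chunk number on the origin */
	__u32 snap_first;	/* first snapshot id covered by this entry */
	__u32 snap_last;	/* last snapshot id covered by this entry */
};

/* Keys sort by chunk first, then by the start of the snapshot range. */
static int exception_key_cmp(const struct shared_exception_key *a,
			     const struct shared_exception_key *b)
{
	if (a->chunk != b->chunk)
		return a->chunk < b->chunk ? -1 : 1;
	if (a->snap_first != b->snap_first)
		return a->snap_first < b->snap_first ? -1 : 1;
	return 0;
}

And a sketch of the per-area bitmap for fast deletion (again invented
names, just to show the idea):

#include <linux/types.h>
#include <linux/bitops.h>

#define AREA_SHIFT 18			/* example: 2^18 chunks per area */

struct snapshot_area_map {
	unsigned long *bits;		/* one bit per area */
	unsigned long nr_areas;
};

/* Mark the area containing @chunk when an exception is allocated for it. */
static void area_map_mark(struct snapshot_area_map *map, __u64 chunk)
{
	set_bit(chunk >> AREA_SHIFT, map->bits);
}

/* During deletion, only areas whose bit is set need a b-tree walk. */
static bool area_needs_walk(const struct snapshot_area_map *map,
			    unsigned long area)
{
	return test_bit(area, map->bits);
}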

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 08-09-2008, 05:01 AM
FUJITA Tomonori
 
Default dm snapshot: shared exception store

On Wed, 6 Aug 2008 15:14:50 -0400 (EDT)
Mikulas Patocka <mpatocka@redhat.com> wrote:

> Hi
>
> I looked at it.

Thanks! I didn't expect anyone to actually read the patch. I'll submit
patches in a more proper manner next time.


> Alasdair had some concerns about the interface on the phone call. From my
> point of view, Fujita's interface is OK (using messages to manipulate the
> snapshot storage and targets to access the snapshots). Alasdair, could you
> please be more specific about your concerns?

Yeah, we can't use dmsetup create/destroy to create/delete
snapshots. We need something different.

I have no strong opinion about it; whatever interface we pick is fine by me
as long as it works.


> What I would propose to change in the upcoming redesign:
>
> - develop it as a separate target, not as a patch against dm-snapshot. The
> code reuse from dm-snapshot is minimal, and keeping the old code around
> will likely consume more coding time than the potential code reuse will
> save.

It's fine by me if the maintainer prefers it. Alasdair?


> - drop the limitation of a maximum of 64 snapshots. If we are going to
> redesign it, we should design it without such a limit, so that we won't
> have to redesign it again (why do we need more than 64 --- for example, to
> take periodic snapshots every few minutes to record system activity). The
> limit on the number of snapshots can be dropped if we index b-tree nodes
> by a key that contains the chunk number and the range of snapshot numbers
> to which it applies.

Unfortunately, that's a limitation of the current b-tree format. As far as
I know, there is no existing code we can use that supports unlimited,
writable snapshots.


> - add some metadata caching; don't re-read the b-tree from the root node
> from disk all the time.

The current code already does that.


> Ideally the cache should be integrated with the page cache so that its
> size is tuned automatically (I'm not sure whether that can be coded
> cleanly, though).

Agreed. The current code invents its own cache code. I don't like that, but
there is no other option.


> - the b-tree is a good structure; I'd create a log-structured filesystem
> to hold it. The advantage is that it requires less synchronization
> overhead in clustering. A log-structured filesystem also gives you crash
> recovery (with minimal coding overhead) and has very good write
> performance.

A log-structured filesystem is pretty complex. Even though we don't need a
complete log-structured filesystem, it's still too complex, IMO.

Updating the b-tree on disk in a copy-on-write manner (as some of the latest
file systems do) is a possible option. Another option is journaling, as I
wrote.


> - deleting a snapshot --- this needs to walk the whole b-tree, which is
> slow. Keeping another b-tree of the chunks belonging to a given snapshot
> would be overkill. I think the best solution would be to split the device
> into large areas and use a per-snapshot bitmap that says whether the
> snapshot has any exceptions allocated in the pertaining area (similar to
> the dirty bitmap of raid1). For short-lived snapshots this saves walking
> the b-tree. For long-lived snapshots there is no way to speed it up... But
> delete performance is not that critical anyway, because deletion can be
> done asynchronously without the user waiting for it.

Yeah, it would be nice to delete a snapshot really quickly but it's
not a must.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 08-11-2008, 10:12 PM
Mikulas Patocka
 
Default dm snapshot: shared exception store

> > - drop the limitation of a maximum of 64 snapshots. If we are going to
> > redesign it, we should design it without such a limit, so that we won't
> > have to redesign it again (why do we need more than 64 --- for example,
> > to take periodic snapshots every few minutes to record system activity).
> > The limit on the number of snapshots can be dropped if we index b-tree
> > nodes by a key that contains the chunk number and the range of snapshot
> > numbers to which it applies.
>
> Unfortunately, that's a limitation of the current b-tree format. As far as
> I know, there is no existing code we can use that supports unlimited,
> writable snapshots.

So use a different format --- we at Red Hat plan to redesign it too. One of
the needed features is "rolling snapshots" --- i.e. you take a snapshot
every 5 minutes or so and keep them around. The result is a complete
history of the system's activity.

And the 64-snapshot limitation would not allow this. The problem with
adopting this format is that we would spend a lot of time developing and
finalizing it --- and then the requirement for rolling snapshots comes along
--- and we would have to throw it away and start from scratch. So I'd rather
do a b-tree without a limit on the number of snapshots from the beginning.

Another good thing would be the ability to compress several consecutive
chunks into one b-tree entry. With multiple snapshots, though, I don't think
there is a clean way to do it. Maybe design the format without this
capability and then use some dirty hack to compress consecutive chunks in
the most common cases (for example, when no one writes to the snapshots).

> > - add some metadata caching; don't re-read the b-tree from the root node
> > from disk all the time.
>
> The current code already does that.

I see. That GFP_NOFS allocation shouldn't be there, because
- it is not reliable
- it can recurse back into block writing via swapper (use GFP_NOIO to
avoid that)

The correct solution would be to preallocate one or more buffers in the
target constructor. At runtime, get additional buffers with GFP_NOIO, and if
that fails, fall back to a preallocated buffer --- this way it can handle a
temporary memory shortage without data corruption.

I'll write some generic code for that caching; I think it could be useful
for other targets too, so it'd be best to put it in the main dm module.
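
A minimal sketch of what I mean (illustrative names only, assuming a single
reserved buffer; the in-kernel mempool API implements essentially the same
reserve-and-fallback pattern):

#include <linux/slab.h>
#include <linux/mutex.h>
#include <linux/types.h>

struct chunk_buffer_pool {
	void *reserved;		/* allocated once in the target constructor */
	bool reserved_in_use;
	size_t chunk_size;
	struct mutex lock;
};

/* Get a buffer without risking deadlock under memory pressure. */
static void *chunk_buffer_get(struct chunk_buffer_pool *pool)
{
	void *buf = kmalloc(pool->chunk_size, GFP_NOIO);  /* not GFP_NOFS */

	if (buf)
		return buf;

	mutex_lock(&pool->lock);
	if (!pool->reserved_in_use) {
		pool->reserved_in_use = true;
		buf = pool->reserved;
	}
	mutex_unlock(&pool->lock);
	return buf;	/* NULL only if the reserve is already taken */
}

static void chunk_buffer_put(struct chunk_buffer_pool *pool, void *buf)
{
	mutex_lock(&pool->lock);
	if (buf == pool->reserved) {
		pool->reserved_in_use = false;
		buf = NULL;
	}
	mutex_unlock(&pool->lock);
	kfree(buf);	/* kfree(NULL) is a no-op */
}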

> > Ideally the cache should be integrated with the page cache so that its
> > size is tuned automatically (I'm not sure whether that can be coded
> > cleanly, though).
>
> Agreed. The current code invents its own cache code. I don't like that,
> but there is no other option.

Yes. Theoretically, you could create your own address_space_operations and
try to integrate it into memory management. Practically, it's hard to say
whether that would work (and whether it would remain maintainable as memory
management changes).

> > - the b-tree is a good structure; I'd create a log-structured filesystem
> > to hold it. The advantage is that it requires less synchronization
> > overhead in clustering. A log-structured filesystem also gives you crash
> > recovery (with minimal coding overhead) and has very good write
> > performance.
>
> A log-structured filesystem is pretty complex. Even though we don't need a
> complete log-structured filesystem, it's still too complex, IMO.

I think it's not really harder than journaling. Maybe it's even easier,
because with journaling you have replay code that is very hard to test and
debug (ext3 had a replay bug even recently). A log-structured filesystem has
no replay code; it is always consistent.

(I obviously don't mean developing a whole filesystem for this --- just
using the main idea of always writing forward into unallocated space.)

+ good for performance, since the majority of operations are writes
+ doesn't need cache synchronization for a cluster
+ can be read simultaneously by multiple cluster nodes while written by one
cluster node (all other formats require read/write exclusion)

> Updating the b-tree on disk in a copy-on-write manner (as some of the
> latest file systems do) is a possible option.

That is what I mean. When we modify a node, one possibility is to write the
b-tree blocks back up to the root into unallocated space. The other
possibility is to write just the one block to new space and mark it in the
superblock as "redirected" from its old location. When the array of
redirected blocks fills up, write all b-tree blocks up to the root and erase
the array of redirected blocks (this improves performance because you don't
have to write the full path up to the root on every block update).
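
To make the second possibility concrete, a rough sketch (invented names and
a fixed-size table, purely for illustration):

#include <linux/types.h>

#define NR_REDIRECTS 64

struct block_redirect {
	__u64 old_block;	/* original on-disk location */
	__u64 new_block;	/* location of the rewritten copy */
};

/* Small table kept with the superblock; when it fills up, the full b-tree
   paths are rewritten and the table is cleared. */
struct redirect_table {
	__u32 count;
	struct block_redirect slot[NR_REDIRECTS];
};

/* Translate a metadata block number through the redirect table before
   reading it from disk. */
static __u64 resolve_block(const struct redirect_table *t, __u64 block)
{
	__u32 i;

	for (i = 0; i < t->count; i++)
		if (t->slot[i].old_block == block)
			return t->slot[i].new_block;
	return block;
}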

Another question is where the superblock should be located. Just one
superblock at the beginning would be bad for disk seeks; maybe have a
superblock on each disk track (approximately --- we don't know where the
tracks really are), use a sequence counter to tell which one is the newest,
and write to the one nearest the data.


Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 08-11-2008, 11:34 PM
FUJITA Tomonori
 
Default dm snapshot: shared exception store

On Mon, 11 Aug 2008 18:12:08 -0400 (EDT)
Mikulas Patocka <mpatocka@redhat.com> wrote:

> > > - drop the limitation of a maximum of 64 snapshots. If we are going to
> > > redesign it, we should design it without such a limit, so that we
> > > won't have to redesign it again (why do we need more than 64 --- for
> > > example, to take periodic snapshots every few minutes to record system
> > > activity). The limit on the number of snapshots can be dropped if we
> > > index b-tree nodes by a key that contains the chunk number and the
> > > range of snapshot numbers to which it applies.
> >
> > Unfortunately, that's a limitation of the current b-tree format. As far
> > as I know, there is no existing code we can use that supports unlimited,
> > writable snapshots.
>
> So use a different format --- we at Red Hat plan to redesign it too. One of
> the needed features is "rolling snapshots" --- i.e. you take a snapshot
> every 5 minutes or so and keep them around. The result is a complete
> history of the system's activity.

I think that implementing a better format is far more difficult than you
think; for example, see the tux3 vs. HAMMER discussion between Daniel
Phillips and Matthew Dillon.

Unless Alasdair tells me that unlimited snapshots are a must, I probably
won't work on it. I'm focusing on integrating the snapshot feature into dm
cleanly.

Of course, I'm happy to use better snapshot code if it becomes available.


> And the 64-snapshot limitation would not allow this. The problem with
> adopting this format is that we would spend a lot of time developing and
> finalizing it --- and then the requirement for rolling snapshots comes
> along --- and we would have to throw it away and start from scratch. So
> I'd rather do a b-tree without a limit on the number of snapshots from the
> beginning.

The advantage of taking the snapshot code from Zumastor is that it has been
working for a while, so I don't expect much effort will be needed to
stabilize it. The main issue here is how to integrate it into dm nicely.

I think we have a version number in the super block to handle better
snapshot formats later.
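
For illustration only (the field names and magic below are placeholders,
not the actual on-disk layout), the idea is simply:

#include <linux/types.h>
#include <linux/string.h>
#include <linux/errno.h>

struct shared_store_superblock {
	__u8  magic[8];		/* identifies a shared exception store */
	__u32 version;		/* on-disk format revision */
	__u32 chunk_size;	/* chunk size in 512-byte sectors */
	__u64 root_block;	/* b-tree root of the exception index */
};

static int check_superblock(const struct shared_store_superblock *sb)
{
	if (memcmp(sb->magic, "EXAMPLE!", 8))	/* placeholder magic */
		return -EINVAL;
	if (sb->version > 1)			/* newer format than we know */
		return -EINVAL;
	return 0;
}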


> Another good thing would be the ability to compress several consecutive
> chunks into one b-tree entry. With multiple snapshots, though, I don't
> think there is a clean way to do it. Maybe design the format without this
> capability and then use some dirty hack to compress consecutive chunks in
> the most common cases (for example, when no one writes to the snapshots).
>
> > > - add some metadata caching; don't re-read the b-tree from the root
> > > node from disk all the time.
> >
> > The current code already does that.
>
> I see. That GFP_NOFS allocation shouldn't be there, because
> - it is not reliable
> - it can recurse back into block writing via swapper (use GFP_NOIO to
> avoid that)
>
> The correct solution would be to preallocate one or more buffers in the
> target constructor. At runtime, get additional buffers with GFP_NOIO, and
> if that fails, fall back to a preallocated buffer --- this way it can
> handle a temporary memory shortage without data corruption.
>
> I'll write some generic code for that caching; I think it could be useful
> for other targets too, so it'd be best to put it in the main dm module.

I'm not sure that other dm targets need such a feature, but I'm happy to
use it if it is provided. Next time, I'll submit this feature as a separate
patch.


> > > - the b-tree is a good structure; I'd create a log-structured
> > > filesystem to hold it. The advantage is that it requires less
> > > synchronization overhead in clustering. A log-structured filesystem
> > > also gives you crash recovery (with minimal coding overhead) and has
> > > very good write performance.
> >
> > A log-structured filesystem is pretty complex. Even though we don't need
> > a complete log-structured filesystem, it's still too complex, IMO.
>
> I think it's not really harder than journaling. Maybe it's even easier,
> because with journaling you have replay code that is very hard to test and
> debug (ext3 had a replay bug even recently). A log-structured filesystem
> has no replay code; it is always consistent.
>
> (I obviously don't mean developing a whole filesystem for this --- just
> using the main idea of always writing forward into unallocated space.)
>
> + good for performance, since the majority of operations are writes
> + doesn't need cache synchronization for a cluster
> + can be read simultaneously by multiple cluster nodes while written by
> one cluster node (all other formats require read/write exclusion)

A log-structured file system is much more difficult than journaling, and
it's not as good as it looks.

If a log-structured file system were really that nice, we would have tons
of log-structured file systems. In reality, we don't: AFAIK, no widely used
operating system (Linux, *BSD, Solaris, Windows, etc.) uses a log-structured
file system as its default file system.


> > Updating the b-tree on disk in a copy-on-write manner (as some of the
> > latest file systems do) is a possible option.
>
> That is what I mean.

Then, I don't think you are talking about a log-structured file
system. In general, we don't classify a copy-on-write file system like
ZFS as a log-structured file system.


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 08-12-2008, 12:15 AM
Steve VanDeBogart
 
Default dm snapshot: shared exception store

On Tue, 12 Aug 2008, FUJITA Tomonori wrote:

> On Mon, 11 Aug 2008 18:12:08 -0400 (EDT) Mikulas Patocka <mpatocka@redhat.com> wrote:
>
> > - drop the limitation of a maximum of 64 snapshots. If we are going to
> > redesign it, we should design it without such a limit, so that we won't
> > have to redesign it again (why do we need more than 64 --- for example,
> > to take periodic snapshots every few minutes to record system activity).
> > The limit on the number of snapshots can be dropped if we index b-tree
> > nodes by a key that contains the chunk number and the range of snapshot
> > numbers to which it applies.
>
> Unfortunately, that's a limitation of the current b-tree format. As far as
> I know, there is no existing code we can use that supports unlimited,
> writable snapshots.


I've recently worked on the limit of 64 snapshots and the storage cost of
2x64 bits per modified chunk. A btree format that fixes these two issues is
described in this post: http://lwn.net/Articles/288896/ If you have the
time / energy, I believe that this format will work well and be simple and
elegant. I can't speak for Daniel Phillips, but I suspect he is
concentrating on tux3 and not on getting this format into Zumastor.


If you don't want to implement "versioned pointers," an earlier format
change is implemented as a patch against Zumastor here:
http://groups.google.com/group/zumastor/browse_thread/thread/523ee7925add3dfc/a5d26a4b48fd8906?lnk=gst&q=#a5d26a4b48fd8906
It removes the 64-snapshot limit and reduces the metadata storage
requirements, but it does not support snapshots of snapshots. This patch has
undergone reasonable testing and can be considered beta-level code.


With both of these formats, in the context of the Zumastor codebase, the
number of snapshots is limited by the requirement that all metadata about a
specific chunk fit within a single btree node. This limits the number of
snapshots to approximately a quarter of the chunk size, i.e. 4k chunks would
support approximately 500 snapshots. Removing that restriction would
increase the number of supported snapshots by a factor of eight, at which
point the next restriction is encountered.


> > - deleting a snapshot --- this needs to walk the whole b-tree, which is
> > slow. Keeping another b-tree of the chunks belonging to a given snapshot
> > would be overkill. I think the best solution would be to split the
> > device into large areas and use a per-snapshot bitmap that says whether
> > the snapshot has any exceptions allocated in the pertaining area
> > (similar to the dirty bitmap of raid1). For short-lived snapshots this
> > saves walking the b-tree. For long-lived snapshots there is no way to
> > speed it up... But delete performance is not that critical anyway,
> > because deletion can be done asynchronously without the user waiting for
> > it.


I don't know if it would be useful for your port, but there are some
patches floating around the Zumastor mailing list that implement
background delete. They're not production ready but are a good start on
the implementation.

--
Steve

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 08-12-2008, 12:30 PM
Daniel Phillips
 
Default dm snapshot: shared exception store

Hi Steve,

On Monday 11 August 2008 17:15, Steve VanDeBogart wrote:
> On Tue, 12 Aug 2008, FUJITA Tomonori wrote:
> > On Mon, 11 Aug 2008 18:12:08 -0400 (EDT) Mikulas Patocka <mpatocka@redhat.com> wrote:
> >> - drop the limitation of a maximum of 64 snapshots. If we are going to
> >> redesign it, we should design it without such a limit, so that we won't
> >> have to redesign it again (why do we need more than 64 --- for example,
> >> to take periodic snapshots every few minutes to record system
> >> activity). The limit on the number of snapshots can be dropped if we
> >> index b-tree nodes by a key that contains the chunk number and the
> >> range of snapshot numbers to which it applies.
> >
> > Unfortunately, that's a limitation of the current b-tree format. As far
> > as I know, there is no existing code we can use that supports unlimited,
> > writable snapshots.
>
> I've recently worked on the limit of 64 snapshots and the storage cost of
> 2x64 bits per modified chunk. A btree format that fixes these two issues
> is described in this post: http://lwn.net/Articles/288896/ If you have the
> time / energy, I believe that this format will work well and be simple and
> elegant. I can't speak for Daniel Phillips, but I suspect he is
> concentrating on tux3 and not on getting this format into Zumastor.

It is very much the intention to get the versioned pointer code into
ddsnap. There is also this code:

http://tux3.org/tux3?f=81a1dd303e2a;file=user/test/dleaf.c

which implements a compressed leaf dictionary format that I believe you
last saw on a whiteboard a few weeks ago. It now works pretty well, in part
thanks to Shapor. The idea is to thoroughly shake out this code in Tux3 and
then backport it to ddsnap. But nothing stands in the way of somebody just
putting it in now.

Incidentally, it did turn out to be possible to make the group entries 32
bits. Demented code, to be honest, but the leaf compression is really good
while the speed is roughly the same as the existing code, and it has the
benefit of supporting 48-bit block numbers where the existing code only
supports 32. It also has the pleasant property that most of the memmoves
are zero bytes, because I got it right this time and put the leaf dictionary
upside down at the top of the block instead of having the exceptions at the
top.

You are right that I will not be merging this code in the immediate
future. Anybody who wants to take that on is more than welcome. It will
not be a hard project to integrate that code and the algorithms are quite
interesting.

Over time, a few other pieces of Tux3 will get merged back into ddsnap,
for example, the forward logging atomic update method to eliminate most
of the remaining journal overhead.

> With both of these formats, in the context of the Zumastor codebase, the
> number of snapshots is limited by the requirement that all metadata about
> a specific chunk fit within a single btree node. This limits the number of
> snapshots to approximately a quarter of the chunk size, i.e. 4k chunks
> would support approximately 500 snapshots.

One eighth the chunk size, you meant. Chunk pointers being 8 bytes,
and the leaf directory overhead being insignificant by the time a
block has been split down to just a single logical address.

> Removing that restriction would increase the number of supported
> snapshots by a factor of eight, at which point the next restriction
> is encountered.

I think the next restriction is the size of the version table in the
superblock, which is easily overcome. The one after that is the number of
bits available in the block pointer for the version, which can reasonably
be 16 with 48-bit block pointers, giving 2^16 user-visible snapshots, which
is getting pretty close to unlimited.

Regards,

Daniel

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 08-12-2008, 12:56 PM
Daniel Phillips
 
Default dm snapshot: shared exception store

Hi Tomonori,

An impressive patch. You just want to use getblk for alloc_chunk_buffer,
not vmalloc. That is what the buffer cache is for, and that is why I
emulated it in userspace.

More comments later.

Regards,

Daniel

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 08-12-2008, 01:14 PM
FUJITA Tomonori
 
Default dm snapshot: shared exception store

On Tue, 12 Aug 2008 05:56:57 -0700
Daniel Phillips <phillips@phunq.net> wrote:

> Hi tomonori,
>
> An impressive patch.

Thanks, and thanks a lot for the nice snapshot code.


> You just want to use getblk for alloc_chunk_buffer,
> not vmalloc.

I think that would mean caching all the chunks, both the btree chunks and
the data chunks (which are passed to the upper layer, such as file systems).
I don't think we want to cache the latter in dm.


> That is what the buffer cache is for, and that is why I
> emulated it in userspace.
>
> More comments later.

Thanks.

I'll post a new patchset in a more reasonable format shortly (tomorrow or
the day after tomorrow, hopefully).

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 08-12-2008, 07:00 PM
Daniel Phillips
 
Default dm snapshot: shared exception store

On Tuesday 12 August 2008 06:14, FUJITA Tomonori wrote:
> > You just want to use getblk for alloc_chunk_buffer,
> > not vmalloc.
>
> I think that would mean caching all the chunks, both the btree chunks and
> the data chunks (which are passed to the upper layer, such as file
> systems). I don't think we want to cache the latter in dm.

That is true. However, your code should not be reading data chunks into
memory at all. The only time the snapshot code has to read a data chunk is
when performing the copy from the origin to the snapshot store in
make_unique. Your code does not directly perform this task as far as I can
see; that would be done in a part of the dm snapshot code your patch does
not touch, which I seem to recall uses the kcopyd mechanism.

Regards,

Daniel

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
