Old 08-14-2011, 08:18 PM
Mikulas Patocka
 
Default convert dm-thin to use dm-bufio

Hi

I created patches that convert dm-thin to the dm-bufio interface (so that
it is consistent with dm-zeroed and dm-multisnap).

The patches are here:
http://people.redhat.com/mpatocka/patches/kernel/dm-thinp-bufio/

Advantages:

* It uses kmalloc, __get_free_pages or __vmalloc, so the block size is not
limited and memory fragmentation is not an issue. The original code only
used kmalloc.

* It has dynamic cache sizing; the cache size can be configured by the
user. If a buffer is unused for some time (a minute), the buffers are
freed. The original code had the cache size hardcoded and it couldn't be
changed.

* It submits bios directly (when possible) without the dm-io layer.
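The size-based choice between allocators can be sketched as follows. This is a userspace model; the thresholds and names are illustrative assumptions, not the actual dm-bufio cutoffs:

```c
#include <assert.h>

#define PAGE_SIZE 4096u

/* Small blocks come from the slab, a few contiguous pages from the page
 * allocator, and anything larger from vmalloc, so a big block size never
 * depends on finding physically contiguous memory. */
enum data_mode { DATA_MODE_SLAB, DATA_MODE_GET_FREE_PAGES, DATA_MODE_VMALLOC };

static enum data_mode choose_data_mode(unsigned block_size)
{
	if (block_size < PAGE_SIZE)		/* sub-page: kmalloc/slab */
		return DATA_MODE_SLAB;
	if (block_size <= 8 * PAGE_SIZE)	/* a few pages: __get_free_pages */
		return DATA_MODE_GET_FREE_PAGES;
	return DATA_MODE_VMALLOC;		/* large: __vmalloc */
}
```

The point is only the fallback order: contiguous allocations where they are cheap, virtually mapped memory where fragmentation would otherwise bite.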

Notes:

* Buffer locking is not supported; I suppose it is not used for anything
anyway. If it is used, tell me and I can add it after reviewing it.

* dm_bm_rebind_block_device changes the block device, but it is not used
for anything (dm-thinp never changes the device). I think it could be
removed.

* Two small bugs in dm-thin are fixed --- not closing the superblock
buffer on exit, and an improper termination sequence (the block devices
were closed before the buffer interface exited).
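The termination-sequence fix amounts to tearing down in the reverse of setup order. A toy model (the function names are illustrative, not the real dm-thin code):

```c
#include <assert.h>
#include <string.h>

static const char *teardown_log[3];
static int teardown_steps;

static void log_step(const char *what) { teardown_log[teardown_steps++] = what; }

static void release_superblock_buffer(void) { log_step("superblock"); }
static void destroy_buffer_interface(void)  { log_step("bufio"); }
static void close_block_devices(void)       { log_step("bdev"); }

/* Correct ordering: buffers are released before the devices that back
 * them. The original code skipped the superblock release entirely and
 * closed the devices while the buffer interface was still live. */
static void pool_destroy(void)
{
	release_superblock_buffer();
	destroy_buffer_interface();
	close_block_devices();
}
```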

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 08-15-2011, 09:04 AM
Joe Thornber
 
Default convert dm-thin to use dm-bufio

On Sun, Aug 14, 2011 at 04:18:29PM -0400, Mikulas Patocka wrote:
> * It has dynamic cache sizing, cache size can be configured by the user.
> In case of inactivity over some time (a buffer is unused for a minute),
> the buffers are freed. The original code had cache size hardcoded and it
> couldn't be changed.

A big advantage, something that would have to be done for dm-block-manager.c.

> * Submit bios directly (if we can) without dm-io layer.

Also good.

> Notes:
>
> * Buffer locking is not supported, I suppose it is not used for anything
> anyway. If it is used, tell me, I can add it after reviewing it.

Locking is used throughout the btree code, and the persistent-data
library in general. The only thing stopping us from getting the benefit
specifically in thinp is the big rw lock in dm_thin_metadata.c. As
the code matures I plan to relax this lock to allow one writer and
multiple readers. We've spoken about this before.

Aside from that, the locking is great for debugging. I'd have it for
that alone, and consider turning it off for release builds.

So, I'm not going to back down on the locking. If we merge
block-manager/bufio we need locking in the interface.

> * dm_bm_rebind_block_device changes the block device, but it is not used
> for anything (dm-thinp never changes the device). I think it could be
> removed.

I admit it's ugly, but it _is_ used. The need comes about from the
metadata device's lifetime being longer than that of the dm table that
opens it. So when a different pool table is loaded we reopen the
metadata device and 'rebind' it under the block-manager.

> * Two small bugs in dm-thin are fixed --- not closing the superblock
> buffer on exit and improper termination sequence (the block devices were
> closed before the buffer interface exited).

Thx, I'll grab these.


This comment in the bufio header worries me:

+ * WARNING: to avoid deadlocks, the thread can hold at most one buffer. Multiple
+ * threads can hold each one buffer simultaneously.

There are many cases where persistent-data holds 2 locks, (eg, rolling
locks as we update the spine of a btree). Also the superblock is
currently held while transactions are open (we can easily change
this). Is this limitation easy to get round?


- Joe

 
Old 08-15-2011, 06:26 PM
Mikulas Patocka
 
Default convert dm-thin to use dm-bufio

> > Notes:
> >
> > * Buffer locking is not supported, I suppose it is not used for anything
> > anyway. If it is used, tell me, I can add it after reviewing it.
>
> Locking is used throughout the btree code, and the persistent-data
> library in general. The only thing stopping us getting the benefit
> specifically in thinp is the big rw lock in dm_thin_metadata.c. As
> the code matures I plan to relax this lock to allow one writer,
> multiple readers. We've spoken about this before.

If you want to relax this locking, you need:

* Lock the root pointer specially

Thread 1 is splitting the root node and thread 2 is grabbing a read lock
on the root node. Then, when thread 1 unlocks the node, thread 2 locks the
node, but the node is no longer a root, so thread 2 sees only part of the
tree.

Also, 64-bit numbers are not atomic on 32-bit architectures, thus if one
thread is splitting the root node and another thread is reading the root
block number, it can read garbage.

* Don't allow concurrent reads while you commit

* Don't allow concurrent reads when you delete something from the tree

* Don't allow concurrent reads when you increase reference counts (i.e.
cloning existing devices)
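The 64-bit root-pointer problem can be handled with a sequence counter, in the spirit of the kernel's seqlock. This is an illustrative single-writer userspace model; real code would need the kernel's seqlock API and memory barriers, which are omitted here:

```c
#include <assert.h>
#include <stdint.h>

struct root_ptr {
	unsigned seq;		/* even = stable, odd = writer in progress */
	uint64_t root_block;	/* not atomic on 32-bit architectures */
};

static uint64_t read_root(const volatile struct root_ptr *r)
{
	unsigned s;
	uint64_t v;

	do {
		while ((s = r->seq) & 1)
			;			/* writer active: wait */
		v = r->root_block;
	} while (r->seq != s);			/* torn read: retry */
	return v;
}

static void write_root(volatile struct root_ptr *r, uint64_t new_root)
{
	r->seq++;				/* odd: readers will retry */
	r->root_block = new_root;
	r->seq++;				/* even: value is stable */
}
```

A reader either sees the counter unchanged (and therefore a consistent 64-bit value) or retries, so it can never return a half-updated root block number.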

> Aside from that the locking is great for debugging. I'd have it for
> that alone, and consider turning it off for release builds.
>
> So, I'm not going to back down on the locking. If we merge
> block-manager/bufio we need locking in the interface.
>
> > * dm_bm_rebind_block_device changes the block device, but it is not used
> > for anything (dm-thinp never changes the device). I think it could be
> > removed.
>
> I admit it's ugly, but it _is_ used. The need comes about from the
> metadata device's lifetime being longer than that of the dm table that
> opens it. So when a different pool table is loaded we reopen the
> metadata device and 'rebind' it under the block-manager.

BTW, what about this: have "pool_find" look up an existing pool based on
an equivalent "metadata_dev" instead of "pool_md", as it does now? I
think that would solve the rebind problem.

> > * Two small bugs in dm-thin are fixed --- not closing the superblock
> > buffer on exit and improper termination sequence (the block devices were
> > closed before the buffer interface exited).
>
> Thx, I'll grab these.
>
>
> This comment in the bufio header worries me:
>
> + * WARNING: to avoid deadlocks, the thread can hold at most one buffer. Multiple
> + * threads can hold each one buffer simultaneously.
>
> There are many cases where persistent-data holds 2 locks, (eg, rolling
> locks as we update the spine of a btree). Also the superblock is
> currently held while transactions are open (we can easily change
> this). Is this limitation easy to get round?

That comment is outdated; the code currently supports several buffers. But
only one thread at a time may hold multiple buffers (except threads that
do only dm_bm_read_try_lock).

Mikulas

> - Joe
>

 
Old 08-16-2011, 09:16 AM
Joe Thornber
 
Default convert dm-thin to use dm-bufio

On Mon, Aug 15, 2011 at 02:26:16PM -0400, Mikulas Patocka wrote:
> * Lock the root pointer specially
>
> Thread 1 is splitting the root node and thread 2 is grabbing a read lock
> on the root node. Then, when thread 1 unlocks the node, thread 2 locks the
> node, but the node is no longer a root, so thread 2 sees only part of the
> tree.

I'm not sure if you're talking about the superblock for the thin
metadata, or the root of any btree.

The lock I was talking about relaxing is the top-level rw semaphore in
the thin-metadata. The locks on individual _blocks_ will still be
exclusive for writes. So you can have concurrent lookups and inserts
happening in the btree, but the rolling locks scheme prevents races
(for instance a lookup overtaking an insert and accessing partially
written data). This is a standard technique for copy-on-write btrees.

> Also, 64-bit numbers are not atomic on 32-bit architectures, thus if one
> thread is splitting the root node and another thread is reading the root
> block number, it can read garbage.
>
> * Don't allow concurrent reads while you commit
>
> * Don't allow concurrent reads when you delete something from the tree
>
> * Don't allow concurrent reads when you increase reference counts (i.e.
> cloning existing devices)

See answer above, I wasn't suggesting allowing reading and writing to
the same block concurrently. So concurrent read with delete or
increasing ref counts is fine.

Concurrent reads during a commit are more complicated, but not for the
reasons you give. In this case we can let the reads run in parallel
with the commit, but we have to be careful to ensure they've completed
before we begin the new transaction, to remove the possibility of
recycling one of the blocks that a read is still accessing.
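That drain requirement can be modeled with a simple reader count. This is a single-threaded illustration with invented names; real code would need atomics and a waitqueue:

```c
#include <assert.h>

static int readers_in_flight;

static void read_begin(void) { readers_in_flight++; }
static void read_end(void)   { readers_in_flight--; }

/* The commit itself may overlap with in-flight reads... */
static int commit_may_proceed(void) { return 1; }

/* ...but the next transaction must wait for them all to finish, so
 * none of the blocks it recycles are still being read. */
static int new_transaction_may_begin(void) { return readers_in_flight == 0; }
```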

> > I admit it's ugly, but it _is_ used. The need comes about from the
> > metadata device's lifetime being longer than that of the dm table that
> > opens it. So when a different pool table is loaded we reopen the
> > metadata device and 'rebind' it under the block-manager.
>
> BTW. what about this --- if function "pool_find" found an existing pool
> based on equivalent "metadata_dev", instead of "pool_md" as it does now. I
> think it would solve the rebind problem.

Originally I did bind on the metadata dev - this was the reason for
the extra metadata dev parameter in the thin target lines which was
used as the key to find the pool.

I don't think it gets round the problem that the metadata dev is
initially opened by a pool target that cannot be destroyed (or hence
reloaded) until all the devs it's opened have been closed. So about a
month ago I was opening the metadata dev directly, rather than using
the dm_open_device thingy, which meant it could hang around longer
than the table. I do however think the way it is now is cleaner,
because we don't duplicate dm_open_device.

> > This comment in the bufio header worries me:
> >
> > + * WARNING: to avoid deadlocks, the thread can hold at most one buffer. Multiple
> > + * threads can hold each one buffer simultaneously.
> >
> That comment is outdated, it currently supports several buffers. But only
> one thread may hold multiple buffers (except threads that do only
> dm_bm_read_try_lock).

That sounds perfect (why didn't you say this on the call yesterday
when asked about this?)

For me the main attraction to switching to bufio is you don't use
dm-io.c.

Hopefully Alasdair will allow you to keep the different memory types;
I was allocating the buffers in bm either via kmalloc or via pages,
depending on size, but he wanted this changed to a kmem_cache in
order to 'segregate' the memory.

I'm not clear how you ascertain the correct working set size for your
cache, or choose when and how many blocks to flush. Have you got any
documentation for this? thinp shouldn't need a big cache, and I'm a
bit worried a 'grab another buffer if there's memory available' policy
will make it grow too much.

- Joe

 
Old 08-16-2011, 10:03 PM
Mikulas Patocka
 
Default convert dm-thin to use dm-bufio

> For me the main attraction to switching to bufio is you don't use
> dm-io.c.
>
> Hopefully Alasdair will allow you to keep the different memory types;
> I was allocating the buffers in bm either via kmalloc, or pages
> depending on size, but he wanted this changing to a kmem_cache in
> order to 'segregate' the memory.
>
> I'm not clear how you ascertain the correct working set size for your
> cache, or choose when and how many blocks to flush. Have you got any
> documentation for this? thinp shouldn't need a big cache and I'm a
> bit worried a 'grab another buffer if there's memory available' policy
> will make it grow too much.

The working set is 2% of memory or 1/4 of vmalloc space (whichever is
smaller). This can be changed in
"/sys/module/dm_bufio/parameters/dm_bufio_cache_size".

Buffers are freed when they are unused for 60 seconds (can be changed in
"/sys/module/dm_bufio/parameters/dm_bufio_max_age").

The total size is divided by the number of active clients using dm-bufio,
so that each client gets a fair share.

Writeback is done on commit. When 3/4 of the cache size is used,
background writeback is started.
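The sizing policy above reduces to a little arithmetic. A sketch with illustrative helper names (the 2%, 1/4 and 3/4 figures come from the description above):

```c
#include <assert.h>

/* Global limit: min(2% of memory, 1/4 of vmalloc space). */
static unsigned long long cache_limit(unsigned long long mem_bytes,
				      unsigned long long vmalloc_bytes)
{
	unsigned long long a = mem_bytes / 50;		/* 2% of memory */
	unsigned long long b = vmalloc_bytes / 4;	/* 1/4 of vmalloc */

	return a < b ? a : b;
}

/* The limit is split evenly among active clients. */
static unsigned long long client_share(unsigned long long limit,
				       unsigned clients)
{
	return clients ? limit / clients : limit;
}

/* Background writeback starts once 3/4 of the share is in use. */
static int needs_background_writeback(unsigned long long used,
				      unsigned long long share)
{
	return used > share * 3 / 4;
}
```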

Mikulas

> - Joe
>

 
Old 08-17-2011, 08:26 AM
Joe Thornber
 
Default convert dm-thin to use dm-bufio

On Tue, Aug 16, 2011 at 06:03:00PM -0400, Mikulas Patocka wrote:
> The working set is 2% of memory or 1/4 of vmalloc space (whichever is
> smaller). This can be changed in
> "/sys/module/dm_bufio/parameters/dm_bufio_cache_size".
>
> Buffers are freed when they are unused for 60 seconds (can be changed in
> "/sys/module/dm_bufio/parameters/dm_bufio_max_age").
>
> Total size is divided by the number of active clients using dm-bufio. So
> that each client gets a fair share.
>
> Writeback is done on commit. When 3/4 of the cache size is used,
> background writeback is started.

This all sounds good; get the locking interface in and I'll switch to
bufio straight away.

- Joe

 
Old 08-18-2011, 10:31 PM
Mikulas Patocka
 
Default convert dm-thin to use dm-bufio

> This all sounds good; get the locking interface in and I'll switch to
> bufio straight away.
>
> - Joe

I uploaded bufio-based block manager at
http://people.redhat.com/mpatocka/patches/kernel/dm-thinp-bufio/. It
supports locks, but it defines new functions down_write_non_owner and
up_write_non_owner.

Mikulas

 
Old 08-19-2011, 07:04 AM
Mike Snitzer
 
Default convert dm-thin to use dm-bufio

On Thu, Aug 18 2011 at 6:31pm -0400,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> > This all sounds good; get the locking interface in and I'll switch to
> > bufio straight away.
> >
> > - Joe
>
> I uploaded bufio-based block manager at
> http://people.redhat.com/mpatocka/patches/kernel/dm-thinp-bufio/. It
> supports locks, but it defines new functions down_write_non_owner and
> up_write_non_owner.

dm-bufio.patch:
drivers/md/Kconfig needs a more comprehensive description for DM_BUFIO's
help.


dm-thinp-bufio.patch:
1)
This drivers/md/persistent-data/dm-block-manager.h change avoids lots of
block manager interface churn:

-struct dm_block;
+#define dm_block dm_buffer
+#define dm_block_manager dm_bufio_client

But I think it'd be best, in the long run, to have a follow-on patch
that does away with the aliases and just uses the bufio structs
throughout the code. Anyway, we don't need to worry about this now. But
what you've done is a hack that should probably be cleaned up.

2)
dm_bm_rebind_block_device:
+ * !!! FIXME: remove this. It is supposedly unused.

I'll need to look closer at the overall changes to dm-block-manager.c,
but you've clearly gotten rid of struct dm_block_manager, so there is no
reference to a bdev that needs to be flushed.

Which begs the question:
Have you tested this bufio conversion with the thinp-test-suite?
git://github.com/jthornber/thinp-test-suite.git




Question for Joe:
You're making conflicting changes quickly enough that I wonder if you
and Mikulas will ever converge (e.g. why do multiple block managers need
to have access to the same metadata device!?).

- - [ ] Sometimes I'm getting kmem_cache name clashes in the block
+ - [X] Sometimes I'm getting kmem_cache name clashes in the block
manager. We're obviously not deleting the cache in a particular
error path. Recently introduced. :ejt:

commit 049bf17f41147ba3d51bac6ebee038d3d79a086c
Author: Joe Thornber <ejt@redhat.com>
Date: Wed Aug 17 13:46:12 2011 +0100

[block-manager, thin-metadata] Make the bm kmemcache name unique to
the pool.

ATM the kmem_cache identifier just includes the major/minor of the
device that the bm is looking at. This means you can't create 2 bms
that point to the same device without generating a nasty stack
trace.

This patch makes the name unique to the pool. It's a quick hack to
allow me to keep on testing. Could someone who cares about these
kmemcaches decide on a proper solution please.
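One minimal way to make the cache name unique per pool rather than per device, sketched in userspace (the name format and the counter are assumptions for illustration, not the actual patch):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fold a per-pool counter into the kmem_cache name, so two block
 * managers pointed at the same major:minor don't produce duplicate
 * cache names and trigger the stack trace described above. */
static unsigned pool_counter;

static void make_cache_name(char *buf, size_t len,
			    unsigned major, unsigned minor)
{
	snprintf(buf, len, "dm_bm-%u:%u-%u", major, minor, pool_counter++);
}
```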


 
Old 08-19-2011, 09:11 AM
Joe Thornber
 
Default convert dm-thin to use dm-bufio

On Fri, Aug 19, 2011 at 03:04:36AM -0400, Mike Snitzer wrote:
> Question for Joe:
> You're making conflicting changes quick enough that I wonder if you
> and Mikulas will ever converge (e.g. why do multiple block managers need
> to have access to the same metadata device!?).

They don't; my issue is with getting an oops if they do through user
error. I clearly said in the commit message that this was a hack to
get round issues introduced by agk's move to a kmemcache. Cook
something cleaner up between yourselves, or wait for me to look at it
again once I've got through some more pressing issues.

An alternative would be to iterate through all the pools in the system
and check whether any of them already has the same metadata device open.
Of course this doesn't catch the cases where stacking is used and
a metadata device eventually maps onto the same physical disk as an
existing md area.

- Joe

 
Old 08-19-2011, 09:12 AM
Joe Thornber
 
Default convert dm-thin to use dm-bufio

On Thu, Aug 18, 2011 at 06:31:13PM -0400, Mikulas Patocka wrote:
> I uploaded bufio-based block manager at
> http://people.redhat.com/mpatocka/patches/kernel/dm-thinp-bufio/. It
> supports locks, but it defines new functions down_write_non_owner and
> up_write_non_owner.

Thanks Mikulas, I'll try and get it merged over the next couple of days.

- Joe

 
