FAQ Search Today's Posts Mark Forums Read
» Video Reviews

» Linux Archive

Linux-archive is a website aiming to archive linux email lists and to make them easily accessible for linux users/developers.


» Sponsor

» Partners

» Sponsor

Go Back   Linux Archive > Redhat > Cluster Development

 
 
LinkBack Thread Tools
 
Old 12-17-2008, 09:03 PM
Joel Becker
 
Default configfs: Silence lockdep on mkdir(), rmdir() and configfs_depend_item()

On Wed, Dec 17, 2008 at 01:40:20PM -0800, Andrew Morton wrote:
> On Fri, 12 Dec 2008 16:29:11 +0100
> Louis Rilling <louis.rilling@kerlabs.com> wrote:
>
> > When attaching default groups (subdirs) of a new group (in mkdir() or
> > in configfs_register()), configfs recursively takes inode's mutexes
> > along the path from the parent of the new group to the default
> > subdirs. This is needed to ensure that the VFS will not race with
> > operations on these sub-dirs. This is safe for the following reasons:
> >
> > - the VFS allows one to lock first an inode and second one of its
> > children (The lock subclasses for this pattern are respectively
> > I_MUTEX_PARENT and I_MUTEX_CHILD);
> > - from this rule any inode path can be recursively locked in
> > descending order as long as it stays under a single mountpoint and
> > does not follow symlinks.
> >
> > Unfortunately lockdep does not know (yet?) how to handle such
> > recursion.
> >
> > I've tried to use Peter Zijlstra's lock_set_subclass() helper to
> > upgrade i_mutexes from I_MUTEX_CHILD to I_MUTEX_PARENT when we know
> > that we might recursively lock some of their descendant, but this
> > usage does not seem to fit the purpose of lock_set_subclass() because
> > it leads to several i_mutex locked with subclass I_MUTEX_PARENT by
> > the same task.
> >
> > >From inside configfs it is not possible to serialize those recursive
> > locking with a top-level one, because mkdir() and rmdir() are already
> > called with inodes locked by the VFS. So using some
> > mutex_lock_nest_lock() is not an option.
> >
> > I am proposing two solutions:
> > 1) one that wraps recursive mutex_lock()s with
> > lockdep_off()/lockdep_on().
> > 2) (as suggested earlier by Peter Zijlstra) one that puts the
> > i_mutexes recursively locked in different classes based on their
> > depth from the top-level config_group created. This
> > induces an arbitrary limit (MAX_LOCK_DEPTH - 2 == 46) on the
> > nesting of configfs default groups whenever lockdep is activated
> > but this limit looks reasonably high. Unfortunately, this alos
> > isolates VFS operations on configfs default groups from the others
> > and thus lowers the chances to detect locking issues.
> >
> > This patch implements solution 1).
> >
> > Solution 2) looks better from lockdep's point of view, but fails with
> > configfs_depend_item(). This needs to rework the locking
> > scheme of configfs_depend_item() by removing the variable lock recursion
> > depth, and I think that it's doable thanks to the configfs_dirent_lock.
> > For now, let's stick to solution 1).
> >
> > Signed-off-by: Louis Rilling <louis.rilling@kerlabs.com>
> > ---
> > fs/configfs/dir.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++ +++
> > 1 files changed, 59 insertions(+), 0 deletions(-)
> >
> > diff --git a/fs/configfs/dir.c b/fs/configfs/dir.c
> > index 8e93341..9c23583 100644
> > --- a/fs/configfs/dir.c
> > +++ b/fs/configfs/dir.c
> > @@ -553,12 +553,24 @@ static void detach_groups(struct config_group *group)
> >
> > child = sd->s_dentry;
> >
> > + /*
> > + * Note: we hide this from lockdep since we have no way
> > + * to teach lockdep about recursive
> > + * I_MUTEX_PARENT -> I_MUTEX_CHILD patterns along a path
> > + * in an inode tree, which are valid as soon as
> > + * I_MUTEX_PARENT -> I_MUTEX_CHILD is valid from a
> > + * parent inode to one of its children.
> > + */
> > + lockdep_off();
> > mutex_lock(&child->d_inode->i_mutex);
> > + lockdep_on();
> >
> > [etc]
> >
>
> Oh dear, what an unpleasant patch.
>
> Peter, can this be saved?

I'd love to see it work within lockdep, but it seems rather
hard, so that's why I recommended Louis cook up this version. I see you
picked it up in -mm. Do you want me to push it through ocfs2.git?

Joel


--

Life's Little Instruction Book #157

"Take time to smell the roses."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
 
Old 12-18-2008, 08:27 AM
Joel Becker
 
Default configfs: Silence lockdep on mkdir(), rmdir() and configfs_depend_item()

On Thu, Dec 18, 2008 at 08:26:48AM +0100, Peter Zijlstra wrote:
> On Wed, 2008-12-17 at 13:40 -0800, Andrew Morton wrote:
> > On Fri, 12 Dec 2008 16:29:11 +0100
> > Louis Rilling <louis.rilling@kerlabs.com> wrote:
> > > >From inside configfs it is not possible to serialize those recursive
> > > locking with a top-level one, because mkdir() and rmdir() are already
> > > called with inodes locked by the VFS. So using some
> > > mutex_lock_nest_lock() is not an option.

<snip>

> >
> > Oh dear, what an unpleasant patch.
> >
> > Peter, can this be saved?
>
> I'm still not sure why configfs is the only virtual filesystem that
> suffers this - something in its design is weird (and not so wonderfull).
>
> Also, while I usually applaud fine grained locking, is configfs really
> in need of that?, I mean, its not like its meant to create and remove
> directories all day every day at breakneck speed, right? AFAIK you just
> stomp some data in to configure some kernel stuff and then let it sit
> for the duration of whatever that kernel thing does while it does it.

It's not about breakneck speed, it's about living in the VFS.
But I think you get that later.

> See I'm not even clear on what configfs is.. and why its better than
> sysfs for example.. - /me goes read configfs.txt
>
> Right, so basically we avoid syscalls by making vfs ops do stuff..
> lovely - still not seeing it though - regular filesystems seems to cope
> just fine and they get to create arbitrary tree structures too.

It's about the default_groups and how they build up and tear
down small bits of tree.
A simple creation of a config_item, a mkdir(2), is a normal VFS
lock set and doesn't make lockdep unhappy. But if the new config_item
has a default_group or two, they need locking too. Not so much on
mkdir(2), but on rmdir(2).

> Yes lockdep has 2 major weaknesses - 1) it doesn't support arbitrary
> lock depths (and I've tried, twice, to fix that, but its a _hard_
> problem) and 2) it can't deal with arbitrary recursion.

I know it's hard, or I'd have sent you patches :-) In fact,
Louis tried to use the subclass bits to make this work to a depth of N
(where N was probably deep enough in practice). However, this creates
subclasses that don't get seen by the regular VFS locking - and the big
deal here is making sure configfs's use of i_mutex meshes with the VFS.
That is, his code made the warnings go away, but removed much of
lockdep's ability to see when we got the locking wrong.

> The thing is, in practise it turns out that reworking code to not run
> into these issues often makes the code better - if only for the fact
> that taking locks is expensive and doing less is better, and holding
> locks stifles concurrency, so holding less it better (yes, I said
> _often_, there likely are counter cases but I don't believe configfs is
> one of them).

This isn't about concurrency or speed. This is about safety
while configfs is attaching or (especially) detaching config_items from
the filesystem view it presents. When the VFS travels down a path, it
unlocks the trailing directory. We can't do that when tearing down
default groups, because we need to lock that small hunk and tear it out
atomically.

> Anyway - I'm against just turning lockdep off, that will make folks
> complacent and let the stuff rot to bits inside - and I for one will
> refuse to run anything using it (but since that only seems to be
> dlm/ocfs and I'm of the believe that centralized cluster stuff sucks
> rocks anyway that won't be a problem).

Oh, be nice :-)
You are absolutely right that turning off lockdep leaves the
possibility of complacency and bitrot. That's precisely why I didn't
like Louis' subclass solution - again, bitrot might go unnoticed.
Now, I know that I will be paying attention to the locking and
going over it with a fine-toothed comb. But I'd much rather have an
actual working lockdep analysis. Whether that means we find a way for
lockdep to describe what's happening here, or we find another way to
keep folks out of the tree we're removing, I don't care.

Joel

--

Life's Little Instruction Book #109

"Know how to drive a stick shift."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
 
Old 12-18-2008, 10:26 AM
Steven Whitehouse
 
Default configfs: Silence lockdep on mkdir(), rmdir() and configfs_depend_item()

Hi,

I have to say that when I brought this subject up, I didn't realise
quite how tricky this was going to be, however...

On Thu, 2008-12-18 at 01:27 -0800, Joel Becker wrote:
> On Thu, Dec 18, 2008 at 08:26:48AM +0100, Peter Zijlstra wrote:
> > On Wed, 2008-12-17 at 13:40 -0800, Andrew Morton wrote:
> > > On Fri, 12 Dec 2008 16:29:11 +0100
> > > Louis Rilling <louis.rilling@kerlabs.com> wrote:
> > > > >From inside configfs it is not possible to serialize those recursive
> > > > locking with a top-level one, because mkdir() and rmdir() are already
> > > > called with inodes locked by the VFS. So using some
> > > > mutex_lock_nest_lock() is not an option.
>
> <snip>
>
<more snipping>

> > Right, so basically we avoid syscalls by making vfs ops do stuff..
> > lovely - still not seeing it though - regular filesystems seems to cope
> > just fine and they get to create arbitrary tree structures too.
>
> It's about the default_groups and how they build up and tear
> down small bits of tree.
> A simple creation of a config_item, a mkdir(2), is a normal VFS
> lock set and doesn't make lockdep unhappy. But if the new config_item
> has a default_group or two, they need locking too. Not so much on
> mkdir(2), but on rmdir(2).
>
So if I've understood this correctly, the dentries created upon mkdir
live until such time as they are removed at some later date, presumably
with rmdir?

When creating the tree, would it be possible to build it starting from
the bottom and working towards the point of attachment and then to not
actually attach it until the last moment? That way it would not be
visible from userland until it was linked into the existing dir and that
solves the locking issue for mkdir I think.

As you say, rmdir seems the harder problem, but again, is it possible to
separate the unlink operation from the destruction of the tree by
keeping the tree, after its been unlinked, until the last userland
reference has gone away? At least I assume that is why the locking is
there. I may be a bit off the mark, but it seems like this is quite
similar to how the VFS does umount, so maybe there are some hints in
that code which may help us solve this issue?

Steve.
 
Old 12-18-2008, 09:58 PM
Joel Becker
 
Default configfs: Silence lockdep on mkdir(), rmdir() and configfs_depend_item()

On Thu, Dec 18, 2008 at 01:28:28PM +0100, Peter Zijlstra wrote:
> In fact, both (configfs) mkdir and rmdir seem to synchronize on
> su_mutex..
>
> mkdir B/C/bar
>
> C.i_mutex
> su_mutex
>
> vs
>
> rmdir foo
>
> parent(foo).i_mutex
> foo.i_mutex
> su_mutex
>
>
> once holding the rmdir su_mutex you can check foo's user-content, since
> any mkdir will be blocked. All you have to do is then re-validate in
> mkdir's su_mutex that !IS_DEADDIR(C).

We explicitly do not take any i_mutex locks after taking
su_mutex. That's an ABBA risk. su_mutex protects the hierarchy of
config_items. i_mutex protects the vfs view thereof.
If you look in mkdir, we take su_mutex, get a new item from the
client subsystem, then drop su_mutex. After that, we go about building
our filesystem structure, using i_mutex where appropriate. More
importantly is rmdir(2), where we use i_mutex in
configfs_detach_group(), but are not holding su_sem. Only when
configfs_detach_group() has successfully returned and we have torn down
the filesystem structure do we take su_mutex and tear down the
config_item structure.
In fact, we're part of the way there. Check out that
USET_DROPPING flag we set in detach_prep() while scanning for user
objects. That flags us racing mkdir(2). When we are done with
detach_prep(), we know that mkdir(2) calls racing behind us will do
nothing until we safely lock them out with the locking in
detach_group(). All mkdir(2) calls will have exited by the time we get
the mutex, and no new mkdir(2) call can start because we have the mutex.
Now look in detach_groups(). We drop the groups children before
marking them DEAD. Louis' plan, I think, is to perhaps mark a group
DEAD, disconnect it from the vfs, and then operate on its children. In
this fashion, perhaps we can unlock the trailing lock like a normal VFS
operation.
This will require some serious auditing, however, because now
vfs functions can get into the vfs objects behind us. And more vfs
changes affect us. Whereas the current locking relies on the vfs's
parent->child lock ordering only, something that isn't likely to be
changed.

Joel


--

"You cannot bring about prosperity by discouraging thrift. You cannot
strengthen the weak by weakening the strong. You cannot help the wage
earner by pulling down the wage payer. You cannot further the
brotherhood of man by encouraging class hatred. You cannot help the
poor by destroying the rich. You cannot build character and courage by
taking away a man's initiative and independence. You cannot help men
permanently by doing for them what they could and should do for
themselves."
- Abraham Lincoln

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
 
Old 01-28-2009, 02:05 AM
Joel Becker
 
Default configfs: Silence lockdep on mkdir(), rmdir() and configfs_depend_item()

Thank you both for keeping up on this. I owe Louis some review that I'm
going to try to get to.

On Mon, Jan 26, 2009 at 03:55:36PM +0100, Louis Rilling wrote:
> On 26/01/09 15:19 +0100, Peter Zijlstra wrote:
> > On Mon, 2009-01-26 at 15:00 +0100, Louis Rilling wrote:
> >
> > > > Its not a locking correctness thing, but simply not being able to do it
> > > > from the vfs calls because those assume locks held?
> > > >
> > > > Can't you simply punt the work to a worklet once you've created/removed
> > > > the non-default group, which can be done from within the vfs callback ?
> > >
> > > I'm not sure to understand your suggestion. Is this:
> > > 1) for mkdir(), create the non-default group, but without its default groups,
> > > and defer their creation to a worker which won't have constraints on locks held
> > > by any caller;
> > > 2) for rmdir(), unlink the non-default group, but without unlinking its default
> > > groups, and defer the recursive work to a lock-free context?
> > >
> > > For mkdir(), this may work. Maybe a bit confusing for userspace, since mkdir(A)
> > > returns as soon as A is created, but A may be populated later and userspace may
> > > rely on A being populated as soon as it is created (current behavior). As a
> > > configfs user, this makes my life harder...
> >
> > Right, so that is the whole crux of the matter?

The "appearance" of the entire default group hierarchy should be
atomic. When mkdir(2) returns, it is all there. This does make our
lives a little difficult inside configfs, but it makes everyone else's
lives much easier.

> Probably not. I'm not the maintainer of configfs, but I guess that Joel is a bit
> reluctant to deeply rework parts of something that actually works (conflicts
> with lockdep excepted).

That is true - it works, and safely, and the lockdep conflict is
the only real known issue. I'm not wary of changing it, I'm only wary
of breaking it. That is, I want to understand what is being changed and
be sure that we're keeping the correctness we have so far. I don't want
to change it and "hope" we got it right, you know? This makes me
cautious, but don't think I'm against change. As I stated, having
lockdep work for us is a good thing.

More replies to the other mails coming.

Joel

--

You can use a screwdriver to screw in screws or to clean your ears,
however, the latter needs real skill, determination and a lack of fear
of injuring yourself. It is much the same with JavaScript.
- Chris Heilmann

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
 
Old 01-28-2009, 02:41 AM
Joel Becker
 
Default configfs: Silence lockdep on mkdir(), rmdir() and configfs_depend_item()

On Mon, Jan 26, 2009 at 01:30:09PM +0100, Peter Zijlstra wrote:
> I don't think I was suggesting that. All you need is to serialize any
> mkdir/creat against the rmdir of the youngest non-default group, and you
> can do that by holding su_mutex.
>
> In rmdir, you already own all the i_mutex instances you need to uncouple
> the whole tree, all you need to do is validate that its indeed empty --
> you don't need i_mutex's for that, because you're holding su_mutex, and
> any concurrent mkdir/creat will be blocking on that.
>
> If you find it empty, just mark everybody DEAD, drop su_mutex and
> decouple. All concurrent mkdir/creat thingies that were blocking will
> now bail because their parent is found DEAD.

Right, that's what I was talking about when I said:

> > Now look in detach_groups(). We drop the groups children before
> > marking them DEAD. Louis' plan, I think, is to perhaps mark a group
> > DEAD, disconnect it from the vfs, and then operate on its children. In
> > this fashion, perhaps we can unlock the trailing lock like a normal VFS
> > operation.
> > This will require some serious auditing, however, because now
> > vfs functions can get into the vfs objects behind us. And more vfs
> > changes affect us. Whereas the current locking relies on the vfs's
> > parent->child lock ordering only, something that isn't likely to be
> > changed.

That is, the vfs has already walked past this directory,
dropped i_mutex, and is in a child default group holding its i_mutex.
It wants to mkdir(2) down there. You're saying that, if mkdir(2) holds
su_mutex higher up, it can check S_DEAD and compare with us, and that's
exactly the scheme I mentioned in the first of the quoted paragraphs
(Louis proposed it a while back). Thus, though we don't hold i_mutex on that
child, it will eventually either a) have gotten su_mutex first, and will
cause rmdir to ENOTEMPTY or b) have gotten su_mutex second, will see
S_DEAD, and return -ENOENT. As far as preventing mkdir(2), I don't see
why it wouldn't work.
The issue is in that second quoted paragraph. We know that the
VFS can look up our children if we're not holding i_mutex. In fact,
cached_lookup() can find them without i_mutex. Now, we know that
mkdir(2) and rmdir(2) will block at su_mutex. But what about all other
file operations, both on the child directories *and* the attribute
files? For attribute files, we prevent access at creation time with a
flag. We can trust the flag because we hold i_mutex. This might hold
anyway because we're holding that toplevel i_mutex. At teardown time,
though, we know they can't be found because of i_mutex. Now we don't
have that protection for processes that are farther down the tree.
But the bigger issue is just the plain regular operations on our
directories. An example is ->readdir(). We currently lock out
->readdir() during rmdir(2) with our holding of i_mutex. However, if we
are not holding i_mutex, vfs_readdir() can call into our ->readdir()
right as we're tearing things down. We may not have gotten to S_DEAD
yet in the rmdir(2) path, and the two will race. We can't take su_mutex
in ->readdir(), because that sort of solution effectively says "we have
to take su_mutex for all operations", and we end up serializing all
operations on a subsustem.
Now, I can think of a way to make ->readdir() safe without
su_mutex. But what other operation is next? How do I know I have them
all? How do I notice when someone adds a new operation or code path to
the generic code that I have to protect against? With i_mutex, I *know*
that everyone has agreed that is the gatekeeper. Without it...
*That's* the big worry. That's what I'm worried about being
sure of. I'd love to hear a solution that we know will work, or at
least move towards one.

Joel

PS: And we haven't even talked about configfs_depend_item() yet.

--

Life's Little Instruction Book #252

"Take good care of those you love."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
 

Thread Tools




All times are GMT. The time now is 07:00 AM.

VBulletin, Copyright ©2000 - 2014, Jelsoft Enterprises Ltd.
Content Relevant URLs by vBSEO ©2007, Crawlability, Inc.
Copyright 2007 - 2008, www.linux-archive.org