12-19-2011, 04:47 PM
David Teigland

gfs2: dlm based recovery coordination

On Mon, Dec 19, 2011 at 01:07:38PM +0000, Steven Whitehouse wrote:
> > struct lm_lockstruct {
> > int ls_jid;
> > unsigned int ls_first;
> > - unsigned int ls_first_done;
> > unsigned int ls_nodir;
> Since ls_flags and ls_first are also only boolean flags, they could
> potentially be moved into the flags, though we can always do that later.

yes, I can use a flag in place of ls_first.

> > + int ls_recover_jid_done; /* read by gfs_controld */
> > + int ls_recover_jid_status; /* read by gfs_controld */
> ^^^^^^^^^^^ this isn't
> actually true any more. All recent gfs_controld versions take their cue
> from the uevents, so this is here only for backwards compatibility
> reasons and these two will be removed at some future date.

I'll add a longer comment saying something like that.

> > + /*
> > + * Other nodes need to do some work in dlm recovery and gfs2_control
> > + * before the recover_done and control_lock will be ready for us below.
> > + * A delay here is not required but often avoids having to retry.
> > + */
> > +
> > + msleep(500);
> Can we get rid of this then? I'd rather just wait for the lock, rather
> than adding delays of arbitrary time periods into the code.

I dislike arbitrary delays also, so I'm hesitant to add them.
The choices here are:
- removing NOQUEUE from the requests below, but with NOQUEUE you have a
much better chance of killing a mount command, which is a fairly nice
feature, I think.
- removing the delay, which results in nodes often doing fast+repeated
lock attempts, which could get rather excessive. I'd be worried about
having that kind of unlimited loop sitting there.
- using some kind of delay.

While I don't like the look of the delay, I like the other options less.
Do you have a preference, or any other ideas?
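
For illustration, a minimal sketch of the shape of the current approach
(NOQUEUE plus a delay), assuming helpers like the patch's control_lock()/
control_unlock() with a (sdp, mode, flags) form; this is illustrative only,
not the posted code:

static int control_mount_poll(struct gfs2_sbd *sdp)
{
	int error;

	for (;;) {
		/* NOQUEUE: fail with -EAGAIN instead of blocking, so a
		   killed mount command can get out of this loop */
		error = control_lock(sdp, DLM_LOCK_EX,
				     DLM_LKF_NOQUEUE | DLM_LKF_VALBLK);
		if (error != -EAGAIN)
			return error;

		if (signal_pending(current))
			return -EINTR;

		/* other nodes are still updating the lvb; retry after a
		   short delay rather than spinning */
		msleep_interruptible(500);
	}
}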


> > +static int control_first_done(struct gfs2_sbd *sdp)
> > +{
> > +	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
> > +	char lvb_bits[GDLM_LVB_SIZE];
> > +	uint32_t start_gen, block_gen;
> > +	int error;
> > +
> > +restart:
> > +	spin_lock(&ls->ls_recover_spin);
> > +	start_gen = ls->ls_recover_start;
> > +	block_gen = ls->ls_recover_block;
> > +
> > +	if (test_bit(DFL_BLOCK_LOCKS, &ls->ls_recover_flags) ||
> > +	    !test_bit(DFL_MOUNT_DONE, &ls->ls_recover_flags) ||
> > +	    !test_bit(DFL_FIRST_MOUNT, &ls->ls_recover_flags)) {
> > +		/* sanity check, should not happen */
> > +		fs_err(sdp, "control_first_done start %u block %u flags %lx\n",
> > +		       start_gen, block_gen, ls->ls_recover_flags);
> > +		spin_unlock(&ls->ls_recover_spin);
> > +		control_unlock(sdp);
> > +		return -1;
> > +	}
> > +
> > +	if (start_gen == block_gen) {
> > +		/*
> > +		 * Wait for the end of a dlm recovery cycle to switch from
> > +		 * first mounter recovery. We can ignore any recover_slot
> > +		 * callbacks between the recover_prep and next recover_done
> > +		 * because we are still the first mounter and any failed nodes
> > +		 * have not fully mounted, so they don't need recovery.
> > +		 */
> > +		spin_unlock(&ls->ls_recover_spin);
> > +		fs_info(sdp, "control_first_done wait gen %u\n", start_gen);
> > +		msleep(500);
> Again - I don't want to add arbitrary delays into the code. Why is this
> waiting for half a second? Why not some other length of time? We should
> figure out how to wait for the end of the first mounter recovery some
> other way if that is what is required.

This msleep slows down a rare loop to wake up a couple times vs once with
a proper wait mechanism. It's waiting for the next recover_done()
callback, which the dlm will call when it is done with recovery. We do
have the option here of using a standard wait mechanism, wait_on_bit() or
something. I'll see if any of those would work here without adding too
much to the code.
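
For example, a minimal sketch of what a wait-based version might look like,
assuming a hypothetical DFL_DLM_RECOVERY bit set by recover_prep() and cleared
by recover_done(), plus a hypothetical ls_recover_wait waitqueue (neither is
in the posted patch):

	/* in control_first_done(), instead of msleep(500); goto restart: */
	error = wait_event_interruptible(ls->ls_recover_wait,
			!test_bit(DFL_DLM_RECOVERY, &ls->ls_recover_flags));
	if (error)
		return error;
	goto restart;

	/* and recover_done() would then wake the waiter: */
	clear_bit(DFL_DLM_RECOVERY, &ls->ls_recover_flags);
	smp_mb__after_clear_bit();
	wake_up(&ls->ls_recover_wait);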


> > +static void gdlm_recovery_result(struct gfs2_sbd *sdp, unsigned int jid,
> > +				 unsigned int result)
> > +{
> > +	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
> > +
> > +	/* don't care about the recovery of own journal during mount */
> > +	if (jid == ls->ls_jid)
> > +		return;
> > +
> > +	/* another node is recovering the journal, give it a chance to
> > +	   finish before trying again */
> > +	if (result == LM_RD_GAVEUP)
> > +		msleep(1000);
> Again, lets put in a proper wait for this condition. If the issue is one
> of races between cluster nodes (thundering herd type problem), then we
> might need some kind of back off, but in that case, it should probably
> be for a random time period.

In this case, while one node is recovering a journal, the other nodes will
all try to recover the same journal (and fail), as quickly as they can. I
looked at using queue_delayed_work here, but couldn't tell if that was ok
with zero delay... I now see others use 0, so I'll try it.
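
For reference, a rough sketch of that approach, with hypothetical names
(recovery_wq, sd_control_work) standing in for whatever the patch ends up
using:

static void gdlm_requeue_recovery(struct gfs2_sbd *sdp, unsigned int result)
{
	unsigned long delay = 0;

	/* another node is recovering the journal; back off before retrying */
	if (result == LM_RD_GAVEUP)
		delay = HZ;

	/* a delay of 0 is fine: the work item is queued to run immediately */
	queue_delayed_work(recovery_wq, &sdp->sd_control_work, delay);
}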


> > +	error = dlm_new_lockspace(fsname, cluster, flags, GDLM_LVB_SIZE,
> > +				  &ops, &ls->ls_dlm);
> > +
> > +	if (error == -EOPNOTSUPP) {
> > +		/*
> > +		 * dlm does not support ops callbacks,
> > +		 * old dlm_controld/gfs_controld are used, try without ops.
> > +		 */
> > +		fs_info(sdp, "dlm lockspace ops not used %d\n", error);
> > +		free_recover_size(ls);
> > +
> > +		error = dlm_new_lockspace(fsname, cluster, flags, GDLM_LVB_SIZE,
> > +					  NULL, &ls->ls_dlm);
> > +		if (error)
> > +			fs_err(sdp, "dlm_new_lockspace error %d\n", error);
> > +		return error;
> > +	}
> > +
> Hmm. This is a bit complicated. Can't we just make it return 0 anyway?
> If we do need to know whether the dlm supports the recovery ops, then
> lets just make it signal that somehow (e.g. returns 1 so that >= 0 means
> success and -ve means error). It doesn't matter if we don't call
> free_recover_size until umount time I think, even if the dlm doesn't
> support that since the data structures are fairly small.

I went with this because I thought it was simpler than adding a second
return value for the ops status. It would also let us simply drop the
special case in the future. The alternative is:

int dlm_new_lockspace(const char *name, const char *cluster,
		      uint32_t flags, int lvblen,
		      struct dlm_lockspace_ops *ops, void *ops_arg,
		      int *ops_error, dlm_lockspace_t **lockspace);

I'm willing to try that if you think it's clearer to understand.
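
For what it's worth, the caller side with that prototype might look roughly
like this (the ops struct name and ops_arg here are placeholders, not
necessarily what the patch would use):

	int ops_error;

	error = dlm_new_lockspace(fsname, cluster, flags, GDLM_LVB_SIZE,
				  &gdlm_lockspace_ops, sdp, &ops_error,
				  &ls->ls_dlm);
	if (error)
		return error;

	if (ops_error == -EOPNOTSUPP) {
		/* old dlm without the recovery callbacks: fall back to the
		   gfs_controld-driven recovery path */
		fs_info(sdp, "dlm lockspace ops not supported\n");
		free_recover_size(ls);
	}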

Dave
 
12-20-2011, 06:16 PM
David Teigland

gfs2: dlm based recovery coordination

On Tue, Dec 20, 2011 at 10:39:08AM +0000, Steven Whitehouse wrote:
> > I dislike arbitrary delays also, so I'm hesitant to add them.
> > The choices here are:
> > - removing NOQUEUE from the requests below, but with NOQUEUE you have a
> > much better chance of killing a mount command, which is a fairly nice
> > feature, I think.
> > - removing the delay, which results in nodes often doing fast+repeated
> > lock attempts, which could get rather excessive. I'd be worried about
> > having that kind of unlimited loop sitting there.
> > - using some kind of delay.
> >
> > While I don't like the look of the delay, I like the other options less.
> > Do you have a preference, or any other ideas?
> >
> Well, I'd prefer to just remove the NOQUEUE command in that case, so
> that we don't spin here. The dlm request is async anyway, so we should
> be able to wait for it in an interruptible manner and send a cancel if
> required.

I won't do async+cancel here, that would make the code unnecessarily ugly
and complicated. There's really no reason to be so dogmatic about delays,
but since you refuse I'll just make it block, assuming I don't find any
new problems with that.

> > > Again - I don't want to add arbitrary delays into the code. Why is this
> > > waiting for half a second? Why not some other length of time? We should
> > > figure out how to wait for the end of the first mounter recovery some
> > > other way if that is what is required.
> >
> > This msleep slows down a rare loop to wake up a couple times vs once with
> > a proper wait mechanism. It's waiting for the next recover_done()
> > callback, which the dlm will call when it is done with recovery. We do
> > have the option here of using a standard wait mechanism, wait_on_bit() or
> > something. I'll see if any of those would work here without adding too
> > much to the code.
> >
> Ok. That would be a better option I think.

Only if it doesn't make things more (unnecessarily) complex.

Dave
 
12-20-2011, 08:04 PM
David Teigland

gfs2: dlm based recovery coordination

On Tue, Dec 20, 2011 at 02:16:43PM -0500, David Teigland wrote:
> On Tue, Dec 20, 2011 at 10:39:08AM +0000, Steven Whitehouse wrote:
> > > I dislike arbitrary delays also, so I'm hesitant to add them.
> > > The choices here are:
> > > - removing NOQUEUE from the requests below, but with NOQUEUE you have a
> > > much better chance of killing a mount command, which is a fairly nice
> > > feature, I think.
> > > - removing the delay, which results in nodes often doing fast+repeated
> > > lock attempts, which could get rather excessive. I'd be worried about
> > > having that kind of unlimited loop sitting there.
> > > - using some kind of delay.
> > >
> > > While I don't like the look of the delay, I like the other options less.
> > > Do you have a preference, or any other ideas?
> > >
> > Well, I'd prefer to just remove the NOQUEUE command in that case, so
> > that we don't spin here. The dlm request is async anyway, so we should
> > be able to wait for it in an interruptible manner and send a cancel if
> > required.
>
> I won't do async+cancel here, that would make the code unnecessarily ugly
> and complicated. There's really no reason to be so dogmatic about delays,
> but since you refuse I'll just make it block, assuming I don't find any
> new problems with that.

Now that I look at it, waiting vs blocking on the lock requests is not the
main issue; removing NOQUEUE doesn't really do anything. We're waiting
for the other nodes to finish their work and update the state in the lvb.
The only option is to periodically check the lvb, so the only choices are
to do that as fast as possible (not nice), or introduce a delay.
 
12-21-2011, 02:40 PM
David Teigland

gfs2: dlm based recovery coordination

On Wed, Dec 21, 2011 at 10:45:21AM +0000, Steven Whitehouse wrote:
> I don't think I understand whats going on in that case. What I thought
> should be happening was this:
>
> - Try to get mounter lock in EX
> - If successful, then we are the first mounter so recover all
> journals
> - Write info into LVB
> - Drop mounter lock to PR so other nodes can mount
>
> - If failed to get mounter lock in EX, then wait for lock in PR state
> - This will block until the EX lock is dropped to PR
> - Read info from LVB
>
> So a node with the mounter lock in EX knows that it is always the first
> mounter and will recover all journals before demoting the mounter lock
> to PR. A node with the mounter lock in PR may only recover its own
> journal (at mount time).

I previously used one lock similar to that, but had to change it a bit.
I had to split it across two separate locks, called control_lock and
mounted_lock. There need to be two because of two conflicting requirements.

The control_lock lvb is used to communicate the generation number and jid
bits. Writing the lvb requires an EX lock, and EX prevents others from
continually holding a PR lock. Without mounted nodes continually holding
a PR lock we can't use EX to indicate first mounter.

So, the mounted_lock (no lvb) is used to indicate the first mounter.
Here all mounted nodes continually hold a PR lock, and a mounting node
attempts to get an EX lock, so any node to get an EX lock is the first
mounter.

(I previously used control_lock with "zero lvb" to indicate first mounter,
but there are some fairly common cases where the lvb may not be zero when
we need a first mounter.)

Now back to the reason why we need to retry lock requests and can't just
block. It's not related to the first mounter case. When a node mounts,
it needs to wait for other (previously mounted) nodes to update the
control_lock lvb with the latest generation number, and then it also needs
to wait for any bits set in the lvb to be cleared. i.e. it needs to wait
for any unrecovered journals to be recovered before it finishes mounting.

To do this, it needs to wait in a loop reading the control_lock lvb. The
question is whether we want to add some sort of delay to that loop or not,
and how. msleep_interruptible(), schedule_timeout_interruptible(),
something else?
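
To make the two-lock split concrete, here is a rough paraphrase of the
mount-time flow, assuming mounted_lock()/control_lock() helpers of the form
(sdp, mode, flags); this is a sketch, not the patch code:

	/* first mounter detection via mounted_lock (no lvb): every mounted
	   node holds PR; a mounting node tries EX first */
	error = mounted_lock(sdp, DLM_LOCK_EX, DLM_LKF_NOQUEUE);
	if (!error) {
		/* we are the first mounter and will recover all journals */
		set_bit(DFL_FIRST_MOUNT, &ls->ls_recover_flags);
	} else if (error == -EAGAIN) {
		/* other nodes are already mounted; hold PR like they do */
		error = mounted_lock(sdp, DLM_LOCK_PR, 0);
		if (error)
			return error;
	} else {
		return error;
	}

	/* then poll the control_lock lvb (EX to write, PR to read) until the
	   previously mounted nodes have written the current generation number
	   and all jid bits are clear, i.e. all unrecovered journals have been
	   recovered, before letting the mount finish */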

Dave
 
12-22-2011, 08:23 PM
David Teigland

gfs2: dlm based recovery coordination

On Mon, Dec 19, 2011 at 12:47:38PM -0500, David Teigland wrote:
> On Mon, Dec 19, 2011 at 01:07:38PM +0000, Steven Whitehouse wrote:
> > > struct lm_lockstruct {
> > > int ls_jid;
> > > unsigned int ls_first;
> > > - unsigned int ls_first_done;
> > > unsigned int ls_nodir;
> > Since ls_flags and ls_first are also only boolean flags, they could
> > potentially be moved into the flags, though we can always do that later.
>
> yes, I can use a flag in place of ls_first.

I went back to ls_first after finding the flag broke the old code path for
gfs_controld, and making it work involved changing more of the old code
than I wanted to in this patch. We can go back and reorganize how some of
that old code works (and remove ls_first), in a subsequent patch.

> > > + /*
> > > + * Other nodes need to do some work in dlm recovery and gfs2_control
> > > + * before the recover_done and control_lock will be ready for us below.
> > > + * A delay here is not required but often avoids having to retry.
> > > + */
> > > +
> > > + msleep(500);

This is now msleep_interruptible(500); I couldn't get around adding some
sort of delay here. If we think of another way to delay this I'll be
happy to try it.

I've finished and tested the rest of the changes.
https://github.com/teigland/linux-dlm/commits/devel11
http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/teigland/linux-dlm.git;a=shortlog;h=refs/heads/next
 
01-05-2012, 02:08 PM
Bob Peterson

gfs2: dlm based recovery coordination

----- Original Message -----
| This new method of managing recovery is an alternative to
| the previous approach of using the userland gfs_controld.
|
| - use dlm slot numbers to assign journal id's
| - use dlm recovery callbacks to initiate journal recovery
| - use a dlm lock to determine the first node to mount fs
| - use a dlm lock to track journals that need recovery
|
| Signed-off-by: David Teigland <teigland@redhat.com>
| ---
| --- a/fs/gfs2/lock_dlm.c
| +++ b/fs/gfs2/lock_dlm.c
(snip)
| +#include <linux/gfs2_ondisk.h>
| #include <linux/gfs2_ondisk.h>

Hi,

Dave, are you going to post a replacement patch or addendum patch
that addresses Steve's concerns, such as the above?
I'd like to review this, but I want to review the latest/greatest.

Regards,

Bob Peterson
Red Hat File Systems
 
01-05-2012, 02:21 PM
David Teigland

gfs2: dlm based recovery coordination

On Thu, Jan 05, 2012 at 10:08:15AM -0500, Bob Peterson wrote:
> ----- Original Message -----
> | This new method of managing recovery is an alternative to
> | the previous approach of using the userland gfs_controld.
> |
> | - use dlm slot numbers to assign journal id's
> | - use dlm recovery callbacks to initiate journal recovery
> | - use a dlm lock to determine the first node to mount fs
> | - use a dlm lock to track journals that need recovery
> |
> | Signed-off-by: David Teigland <teigland@redhat.com>
> | ---
> | --- a/fs/gfs2/lock_dlm.c
> | +++ b/fs/gfs2/lock_dlm.c
> (snip)
> | +#include <linux/gfs2_ondisk.h>
> | #include <linux/gfs2_ondisk.h>
>
> Hi,
>
> Dave, are you going to post a replacement patch or addendum patch
> that addresses Steve's concerns, such as the above?
> I'd like to review this, but I want to review the latest/greatest.

I haven't resent the patches after making the changes (which were fairly
minor.) I'll resend them shortly for another check before a pull request.

Dave
 
01-05-2012, 03:16 PM
David Teigland

gfs2: dlm based recovery coordination

On Thu, Jan 05, 2012 at 03:40:09PM +0000, Steven Whitehouse wrote:
> I think it would be a good plan to not send this last patch for the
> current merge window and let it settle for a bit longer. Running things
> so fine with the timing makes me nervous bearing in mind the number of
> changes,

To allay your fears, keep in mind that the big new feature here does not
impact the current mode in the slightest. Everything should continue to
work just as it has before given the current versions of dlm_controld and
gfs_controld. So, there should be no reason for worry. A version of
dlm_controld under development will enable the new feature.

> and that three issues have been caught in the last few days.

The issues that arose from -next were completely inconsequential in
practice, so that's a moot point.

> Let's try and resolve the remaining points and then we can have something
> really solid ready for the next window.

I don't know of any remaining points.

> I don't think there is any particular rush to get it in at the moment.

I think people will want to start using the new cluster4 components (which
require these patches) before a 3.4 kernel would be in a Fedora release.
So the motivation is to avoid making everyone build their own kernels and
allow cluster4 versions of things to be put in Fedora.

Dave
 
01-05-2012, 03:45 PM
Bob Peterson

gfs2: dlm based recovery coordination

----- Original Message -----
| This new method of managing recovery is an alternative to
| the previous approach of using the userland gfs_controld.
|
| - use dlm slot numbers to assign journal id's
| - use dlm recovery callbacks to initiate journal recovery
| - use a dlm lock to determine the first node to mount fs
| - use a dlm lock to track journals that need recovery
|
| Signed-off-by: David Teigland <teigland@redhat.com>
| ---
Hi,

(snip)
| + if (!test_bit_le(i, lvb_bits+JID_BITMAP_OFFSET))
Nit, but I'd prefer " + " rather than "+" with no spaces.
| + continue;
| +
| + __clear_bit_le(i, lvb_bits+JID_BITMAP_OFFSET);
Nit, same.
(snip)
| + __set_bit_le(i, lvb_bits+JID_BITMAP_OFFSET);
Again, please globally change: s/+/ + / except in instances of "var++".
(snip)

Other than these nits, I have no concerns (other than what Steve pointed out).

Regards,

Bob Peterson
Red Hat File Systems
 
01-05-2012, 04:13 PM
David Teigland

gfs2: dlm based recovery coordination

On Thu, Jan 05, 2012 at 04:58:22PM +0000, Steven Whitehouse wrote:
> > + clear_bit(SDF_NOJOURNALID, &sdp->sd_flags);
> > + smp_mb__after_clear_bit();
> > + wake_up_bit(&sdp->sd_flags, SDF_NOJOURNALID);
> > + ls->ls_first = !!test_bit(DFL_FIRST_MOUNT, &ls->ls_recover_flags);
> > + return 0;
> > +
>
> This bit of code, which was correct last time you posted this patch,
> appears to have reverted to its previous incorrect state. ls_first must

Thanks, I'll move it back, I removed ls_first and put it back in the wrong
place. I keep forgetting about it because...

> be set before SDF_NOJOURNALID is cleared, otherwise the uninitialised
> value may be read,

in this case there can be no other reader, so it doesn't matter.
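
For completeness, the ordering being asked for just means publishing ls_first
before the bit is cleared, i.e. rearranging the quoted hunk along these lines:

	ls->ls_first = !!test_bit(DFL_FIRST_MOUNT, &ls->ls_recover_flags);

	/* make ls_first visible before waiters see SDF_NOJOURNALID cleared */
	clear_bit(SDF_NOJOURNALID, &sdp->sd_flags);
	smp_mb__after_clear_bit();
	wake_up_bit(&sdp->sd_flags, SDF_NOJOURNALID);
	return 0;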

Dave
 
