This is an updated version of the RCU glock patch which I'm hoping to
push into the -nmw tree as soon as the merge window is over, so it
will be queued for the following merge window.
This patch is pretty similar to the previous version of the patch
with some minor changes to include bit_spinlock.h, plus some lines
have moved due to the recent gfs2 merge.
I'm hoping that this will be the final form, assuming that no bugs
are discovered in the mean time.
The main aim of this patch is to simplify the locking relating to the
glock hash table and to reduce the possibility of lock contention.
This patch has a secondary advantage of being quite a nice clean up
and simplification of the glock code. Hopefully it will be a bit
easier to follow after this change. In particular the code for
the examine_bucket() function is greatly simplified since it does
not need to remove glocks from the list and can run under only
rcu_read_lock() protection.
I'm copying in Paul McKenney in case he can spot any howlers in the RCU
code. I think its all ok, but .....
[Paul, any hints and tips are greatly appreciated. I did look at your
docs carefully when working out how best to do this, but I may have
missed something]
I've been testing the patch by running postmark (setting "number" to a large
value to get lots of glocks) and then continually reading the debugfs
glock file at the same time. So far, so good.
Note that we could potentially use SLAB_RCU here, but it appears to add
a lot of extra complexity in the look up side of things, so I've not
done that at the moment. It would be a possible future enhancement.
I need to do some performance tests on this patch to see what we
get in terms of speed up. I have a suitable cluster (with a decent
number of cpu cores) to try it out. So I hope to have some results
in the not too distant future.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
-/*
- * Despite what you might think, the numbers below are not arbitrary :-)
- * They are taken from the ipv4 routing hash code, which is well tested
- * and thus should be nearly optimal. Later on we might tweek the numbers
- * but for now this should be fine.
- *
- * The reason for putting the locks in a separate array from the list heads
- * is that we can have fewer locks than list heads and save memory. We use
- * the same hash function for both, but with a different hash mask.
- */
-#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) ||
- defined(CONFIG_PROVE_LOCKING)
-
-#ifdef CONFIG_LOCKDEP
-# define GL_HASH_LOCK_SZ 256
-#else
-# if NR_CPUS >= 32
-# define GL_HASH_LOCK_SZ 4096
-# elif NR_CPUS >= 16
-# define GL_HASH_LOCK_SZ 2048
-# elif NR_CPUS >= 8
-# define GL_HASH_LOCK_SZ 1024
-# elif NR_CPUS >= 4
-# define GL_HASH_LOCK_SZ 512
-# else
-# define GL_HASH_LOCK_SZ 256
-# endif
-#endif
-
-/* We never want more locks than chains */
-#if GFS2_GL_HASH_SIZE < GL_HASH_LOCK_SZ
-# undef GL_HASH_LOCK_SZ
-# define GL_HASH_LOCK_SZ GFS2_GL_HASH_SIZE
-#endif
-
-static rwlock_t gl_hash_locks[GL_HASH_LOCK_SZ];
-
-static inline rwlock_t *gl_lock_addr(unsigned int x)
-{
- return &gl_hash_locks[x & (GL_HASH_LOCK_SZ-1)];
-}
-#else /* not SMP, so no spinlocks required */
-static inline rwlock_t *gl_lock_addr(unsigned int x)
-{
- return NULL;
-}
-#endif
-
/**
* gl_hash() - Turn glock number into hash bucket number
* @lock: The glock number
@@ -141,25 +91,30 @@ static unsigned int gl_hash(const struct gfs2_sbd *sdp,
return h;
}
-/**
- * glock_free() - Perform a few checks and then release struct gfs2_glock
- * @gl: The glock to release
- *
- * Also calls lock module to release its internal structure for this glock.
- *
- */
+static inline void spin_lock_bucket(unsigned int hash)
+{
+ struct hlist_bl_head *bl = &gl_hash_table[hash];
+ bit_spin_lock(0, (unsigned long *)bl);
+}
+
+static inline void spin_unlock_bucket(unsigned int hash)
+{
+ struct hlist_bl_head *bl = &gl_hash_table[hash];
+ __bit_spin_unlock(0, (unsigned long *)bl);
+}
+ /* assert_spin_locked(&gl->gl_spin); */
+
if (gl->gl_state == LM_ST_UNLOCKED)
return 0;
- if (!list_empty(&gl->gl_holders))
+ if (test_bit(GLF_LFLUSH, &gl->gl_flags))
+ return 0;
+ if ((gl->gl_name.ln_type != LM_TYPE_INODE) &&
+ !list_empty(&gl->gl_holders))
return 0;
if (glops->go_demote_ok)
return glops->go_demote_ok(gl);
return 1;
}
+
/**
- * gfs2_glock_schedule_for_reclaim - Add a glock to the reclaim list
+ * __gfs2_glock_schedule_for_reclaim - Add a glock to the reclaim list
* @gl: the glock
*
+ * If the glock is demotable, then we add it (or move it) to the end
+ * of the glock LRU list.
*/
On Sat, 2011-03-12 at 23:04 -0800, Paul E. McKenney wrote:
> On Mon, Jan 10, 2011 at 12:25:18PM +0000, Steven Whitehouse wrote:
> > This is an updated version of the RCU glock patch which I'm hoping to
> > push into the -nmw tree as soon as the merge window is over, so it
> > will be queued for the following merge window.
> >
> > This patch is pretty similar to the previous version of the patch
> > with some minor changes to include bit_spinlock.h, plus some lines
> > have moved due to the recent gfs2 merge.
> >
> > I'm hoping that this will be the final form, assuming that no bugs
> > are discovered in the mean time.
> >
> > The main aim of this patch is to simplify the locking relating to the
> > glock hash table and to reduce the possibility of lock contention.
> >
> > This patch has a secondary advantage of being quite a nice clean up
> > and simplification of the glock code. Hopefully it will be a bit
> > easier to follow after this change. In particular the code for
> > the examine_bucket() function is greatly simplified since it does
> > not need to remove glocks from the list and can run under only
> > rcu_read_lock() protection.
> >
> > I'm copying in Paul McKenney in case he can spot any howlers in the RCU
> > code. I think its all ok, but .....
> >
> > [Paul, any hints and tips are greatly appreciated. I did look at your
> > docs carefully when working out how best to do this, but I may have
> > missed something]
>
> This is ridiculously late, hopefully better late than never...
>
> Please accept my apologies -- lost it in the email heap. :-/
>
No problem, we've been testing the patch a lot in the mean time, and so
far no major issues. So far the only correction has been this:
http://git.kernel.org/?p=linux/kernel/git/steve/gfs2-2.6-nmw.git;a=commitdiff;h=fc0e38dae645f65424d1fb5d2a9 38aab8ce48a58
Which is more to do with our stats counting than the RCU side of things.
> > I've been testing the patch by running postmark (setting "number" to a large
> > value to get lots of glocks) and then continually reading the debugfs
> > glock file at the same time. So far, so good.
> >
> > Note that we could potentially use SLAB_RCU here, but it appears to add
> > a lot of extra complexity in the look up side of things, so I've not
> > done that at the moment. It would be a possible future enhancement.
> >
> > I need to do some performance tests on this patch to see what we
> > get in terms of speed up. I have a suitable cluster (with a decent
> > number of cpu cores) to try it out. So I hope to have some results
> > in the not too distant future.
> >
> > Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
> > Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>
> Please see below for one question (and some random commentary).
>
> Thanx, Paul
>
> > diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
> > index 08a8beb..c75d499 100644
> > --- a/fs/gfs2/glock.c
> > +++ b/fs/gfs2/glock.c
> > @@ -26,6 +26,9 @@
> > #include <linux/freezer.h>
> > #include <linux/workqueue.h>
> > #include <linux/jiffies.h>
> > +#include <linux/rcupdate.h>
> > +#include <linux/rculist_bl.h>
> > +#include <linux/bit_spinlock.h>
> >
> > #include "gfs2.h"
> > #include "incore.h"
> > @@ -41,10 +44,6 @@
> > #define CREATE_TRACE_POINTS
> > #include "trace_gfs2.h"
> >
> > -struct gfs2_gl_hash_bucket {
> > - struct hlist_head hb_list;
> > -};
> > -
> > struct gfs2_glock_iter {
> > int hash; /* hash bucket index */
> > struct gfs2_sbd *sdp; /* incore superblock */
> > @@ -54,7 +53,6 @@ struct gfs2_glock_iter {
> >
> > typedef void (*glock_examiner) (struct gfs2_glock * gl);
> >
> > -static int gfs2_dump_lockstate(struct gfs2_sbd *sdp);
> > static int __dump_glock(struct seq_file *seq, const struct gfs2_glock *gl);
> > #define GLOCK_BUG_ON(gl,x) do { if (unlikely(x)) { __dump_glock(NULL, gl); BUG(); } } while(0)
> > static void do_xmote(struct gfs2_glock *gl, struct gfs2_holder *gh, unsigned int target);
> > @@ -70,57 +68,9 @@ static DEFINE_SPINLOCK(lru_lock);
> > #define GFS2_GL_HASH_SIZE (1 << GFS2_GL_HASH_SHIFT)
> > #define GFS2_GL_HASH_MASK (GFS2_GL_HASH_SIZE - 1)
> >
> > -static struct gfs2_gl_hash_bucket gl_hash_table[GFS2_GL_HASH_SIZE];
> > +static struct hlist_bl_head gl_hash_table[GFS2_GL_HASH_SIZE];
> > static struct dentry *gfs2_root;
> >
> > -/*
> > - * Despite what you might think, the numbers below are not arbitrary :-)
> > - * They are taken from the ipv4 routing hash code, which is well tested
> > - * and thus should be nearly optimal. Later on we might tweek the numbers
> > - * but for now this should be fine.
> > - *
> > - * The reason for putting the locks in a separate array from the list heads
> > - * is that we can have fewer locks than list heads and save memory. We use
> > - * the same hash function for both, but with a different hash mask.
> > - */
> > -#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) ||
> > - defined(CONFIG_PROVE_LOCKING)
> > -
> > -#ifdef CONFIG_LOCKDEP
> > -# define GL_HASH_LOCK_SZ 256
> > -#else
> > -# if NR_CPUS >= 32
> > -# define GL_HASH_LOCK_SZ 4096
> > -# elif NR_CPUS >= 16
> > -# define GL_HASH_LOCK_SZ 2048
> > -# elif NR_CPUS >= 8
> > -# define GL_HASH_LOCK_SZ 1024
> > -# elif NR_CPUS >= 4
> > -# define GL_HASH_LOCK_SZ 512
> > -# else
> > -# define GL_HASH_LOCK_SZ 256
> > -# endif
> > -#endif
> > -
> > -/* We never want more locks than chains */
> > -#if GFS2_GL_HASH_SIZE < GL_HASH_LOCK_SZ
> > -# undef GL_HASH_LOCK_SZ
> > -# define GL_HASH_LOCK_SZ GFS2_GL_HASH_SIZE
> > -#endif
> > -
> > -static rwlock_t gl_hash_locks[GL_HASH_LOCK_SZ];
> > -
> > -static inline rwlock_t *gl_lock_addr(unsigned int x)
> > -{
> > - return &gl_hash_locks[x & (GL_HASH_LOCK_SZ-1)];
> > -}
> > -#else /* not SMP, so no spinlocks required */
> > -static inline rwlock_t *gl_lock_addr(unsigned int x)
> > -{
> > - return NULL;
> > -}
> > -#endif
> > -
> > /**
> > * gl_hash() - Turn glock number into hash bucket number
> > * @lock: The glock number
> > @@ -141,25 +91,30 @@ static unsigned int gl_hash(const struct gfs2_sbd *sdp,
> > return h;
> > }
> >
> > -/**
> > - * glock_free() - Perform a few checks and then release struct gfs2_glock
> > - * @gl: The glock to release
> > - *
> > - * Also calls lock module to release its internal structure for this glock.
> > - *
> > - */
> > +static inline void spin_lock_bucket(unsigned int hash)
> > +{
> > + struct hlist_bl_head *bl = &gl_hash_table[hash];
> > + bit_spin_lock(0, (unsigned long *)bl);
> > +}
> > +
> > +static inline void spin_unlock_bucket(unsigned int hash)
> > +{
> > + struct hlist_bl_head *bl = &gl_hash_table[hash];
> > + __bit_spin_unlock(0, (unsigned long *)bl);
> > +}
> >
> > -static void glock_free(struct gfs2_glock *gl)
> > +void gfs2_glock_free(struct rcu_head *rcu)
> > {
> > + struct gfs2_glock *gl = container_of(rcu, struct gfs2_glock, gl_rcu);
> > struct gfs2_sbd *sdp = gl->gl_sbd;
> > - struct address_space *mapping = gfs2_glock2aspace(gl);
> > - struct kmem_cache *cachep = gfs2_glock_cachep;
> >
> > - GLOCK_BUG_ON(gl, mapping && mapping->nrpages);
> > - trace_gfs2_glock_put(gl);
> > - if (mapping)
> > - cachep = gfs2_glock_aspace_cachep;
> > - sdp->sd_lockstruct.ls_ops->lm_put_lock(cachep, gl);
> > + if (gl->gl_ops->go_flags & GLOF_ASPACE)
> > + kmem_cache_free(gfs2_glock_aspace_cachep, gl);
> > + else
> > + kmem_cache_free(gfs2_glock_cachep, gl);
> > +
> > + if (atomic_dec_and_test(&sdp->sd_glock_disposal))
> > + wake_up(&sdp->sd_glock_wait);
> > }
> >
> > /**
> > @@ -185,34 +140,49 @@ static int demote_ok(const struct gfs2_glock *gl)
> > {
> > const struct gfs2_glock_operations *glops = gl->gl_ops;
> >
> > + /* assert_spin_locked(&gl->gl_spin); */
> > +
> > if (gl->gl_state == LM_ST_UNLOCKED)
> > return 0;
> > - if (!list_empty(&gl->gl_holders))
> > + if (test_bit(GLF_LFLUSH, &gl->gl_flags))
> > + return 0;
> > + if ((gl->gl_name.ln_type != LM_TYPE_INODE) &&
> > + !list_empty(&gl->gl_holders))
> > return 0;
> > if (glops->go_demote_ok)
> > return glops->go_demote_ok(gl);
> > return 1;
> > }
> >
> > +
> > /**
> > - * gfs2_glock_schedule_for_reclaim - Add a glock to the reclaim list
> > + * __gfs2_glock_schedule_for_reclaim - Add a glock to the reclaim list
> > * @gl: the glock
> > *
> > + * If the glock is demotable, then we add it (or move it) to the end
> > + * of the glock LRU list.
> > */
> >
> > -static void gfs2_glock_schedule_for_reclaim(struct gfs2_glock *gl)
> > +static void __gfs2_glock_schedule_for_reclaim(struct gfs2_glock *gl)
> > {
> > - int may_reclaim;
> > - may_reclaim = (demote_ok(gl) &&
> > - (atomic_read(&gl->gl_ref) == 1 ||
> > - (gl->gl_name.ln_type == LM_TYPE_INODE &&
> > - atomic_read(&gl->gl_ref) <= 2)));
> > - spin_lock(&lru_lock);
> > - if (list_empty(&gl->gl_lru) && may_reclaim) {
> > + if (demote_ok(gl)) {
> > + spin_lock(&lru_lock);
> > +
> > + if (!list_empty(&gl->gl_lru))
> > + list_del_init(&gl->gl_lru);
> > + else
> > + atomic_inc(&lru_count);
> > +
> > list_add_tail(&gl->gl_lru, &lru_list);
> > - atomic_inc(&lru_count);
> > + spin_unlock(&lru_lock);
> > }
> > - spin_unlock(&lru_lock);
> > +}
> > +
> > +void gfs2_glock_schedule_for_reclaim(struct gfs2_glock *gl)
> > +{
> > + spin_lock(&gl->gl_spin);
> > + __gfs2_glock_schedule_for_reclaim(gl);
> > + spin_unlock(&gl->gl_spin);
> > }
> >
> > /**
> > @@ -227,7 +197,6 @@ void gfs2_glock_put_nolock(struct gfs2_glock *gl)
> > {
> > if (atomic_dec_and_test(&gl->gl_ref))
> > GLOCK_BUG_ON(gl, 1);
> > - gfs2_glock_schedule_for_reclaim(gl);
> > }
> >
> > /**
> > @@ -236,30 +205,26 @@ void gfs2_glock_put_nolock(struct gfs2_glock *gl)
> > *
> > */
> >
> > -int gfs2_glock_put(struct gfs2_glock *gl)
> > +void gfs2_glock_put(struct gfs2_glock *gl)
> > {
> > - int rv = 0;
> > + struct gfs2_sbd *sdp = gl->gl_sbd;
> > + struct address_space *mapping = gfs2_glock2aspace(gl);
> >
> > - write_lock(gl_lock_addr(gl->gl_hash));
> > - if (atomic_dec_and_lock(&gl->gl_ref, &lru_lock)) {
> > - hlist_del(&gl->gl_list);
> > + if (atomic_dec_and_test(&gl->gl_ref)) {
> > + spin_lock_bucket(gl->gl_hash);
> > + hlist_bl_del_rcu(&gl->gl_list);
>
> OK, protected by spin_lock_bucket(gl->gl_hash).
>
> > + spin_unlock_bucket(gl->gl_hash);
> > + spin_lock(&lru_lock);
> > if (!list_empty(&gl->gl_lru)) {
> > list_del_init(&gl->gl_lru);
> > atomic_dec(&lru_count);
> > }
> > spin_unlock(&lru_lock);
> > - write_unlock(gl_lock_addr(gl->gl_hash));
> > GLOCK_BUG_ON(gl, !list_empty(&gl->gl_holders));
> > - glock_free(gl);
> > - rv = 1;
> > - goto out;
> > + GLOCK_BUG_ON(gl, mapping && mapping->nrpages);
> > + trace_gfs2_glock_put(gl);
> > + sdp->sd_lockstruct.ls_ops->lm_put_lock(gl);
> > }
> > - spin_lock(&gl->gl_spin);
> > - gfs2_glock_schedule_for_reclaim(gl);
> > - spin_unlock(&gl->gl_spin);
> > - write_unlock(gl_lock_addr(gl->gl_hash));
> > -out:
> > - return rv;
> > }
> >
> > /**
> > @@ -275,17 +240,15 @@ static struct gfs2_glock *search_bucket(unsigned int hash,
> > const struct lm_lockname *name)
> > {
> > struct gfs2_glock *gl;
> > - struct hlist_node *h;
> > + struct hlist_bl_node *h;
> >
> > - hlist_for_each_entry(gl, h, &gl_hash_table[hash].hb_list, gl_list) {
> > + hlist_bl_for_each_entry_rcu(gl, h, &gl_hash_table[hash], gl_list) {
>
> Presumably all callers hold rcu_read_lock() or the update-side lock...
>
> Looks that way, at least for calls appearing in the patch. And the
> mainline version has the same number of calls, so should be OK.
>
Yes, there are only two callers and they are both from the glock "get"
function. The first is the normal look up, and the second is the lookup
under the spin lock which is only done on the insert path to check that
we didn't race with another thread trying to create the same glock.
On Mon, 2011-03-14 at 06:07 -0700, Paul E. McKenney wrote:
> On Mon, Mar 14, 2011 at 10:03:16AM +0000, Steven Whitehouse wrote:
[snip]
> > > > +
> > > > +static inline struct gfs2_glock *glock_hash_next(struct gfs2_glock *gl)
> > > > +{
> > > > + return hlist_bl_entry(rcu_dereference_raw(gl->gl_list.next),
> > >
> > > Isn't this always called either with the update-side lock held or under
> > > rcu_read_lock() protection? If so, please use rcu_dereference_protected().
> > > This probable means applying lockdep to the hash-chain locks, but that
> > > should be doable.
> > >
> > There will always be rcu_read_lock() protection. These are used only by
> > the seq_file based "glocks" file which appears in debugfs. I can try and
> > create a patch to fix the dereference function, but I'm not sure what
> > I'll need to do about lockdep. I guess I'll test it and see whether I
> > get any messages from it.
>
> OK, if there is always rcu_read_lock() protection, then you can just use
> rcu_dereference(). If lockdep is enabled, it will the the appropriate
> checking.
>
Ok, cool. I've attached an incremental patch below. I'll send this to
Linus with the rest of the queue shortly,
Steve.
>From 99f4f4266bf3619216a4de386921fdd7530e6b1e Mon Sep 17 00:00:00 2001
From: Steven Whitehouse <swhiteho@redhat.com>
Date: Tue, 15 Mar 2011 08:32:14 +0000
Subject: [PATCH] GFS2: Don't use _raw version of RCU dereference
As per RCU glock patch review comments, don't use the _raw
version of this function here.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>