10-05-2011, 08:05 PM
David Teigland

DLM: Convert rsb data from linked list to rb_tree

On Wed, Oct 05, 2011 at 03:25:39PM -0400, Bob Peterson wrote:
> Hi,
>
> This upstream patch changes the way DLM keeps track of RSBs.
> Before, they were in a linked list off a hash table. Now,
> they're an rb_tree off the same hash table. This speeds up
> DLM lookups greatly.
>
> Today's DLM is faster than older DLMs (e.g. in RHEL5) for many
> file systems due to the larger hash table size. However,
> this rb_tree implementation scales much better. For my
> 1000-directories-with-1000-files test, the patch doesn't
> show much of an improvement. But when I scale the file system
> to 4000 directories with 4000 files (16 million files), it
> helps greatly. The time to do rm -fR /mnt/gfs2/* drops from
> 42.01 hours to 23.68 hours.

How many hash table buckets were you using in that test?
If it was the default (1024), I'd be interested to know how
16k compares.
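
(Rough arithmetic, assuming on the order of one rsb per file: 16 million
rsbs over 1024 buckets is roughly 16k entries per chain, so a linear scan
averages ~8k name comparisons, while an rb_tree over the same bucket needs
only ~log2(16k) = 14. With 16k buckets the chains drop to ~1k entries,
~500 comparisons on average, which is why a bigger table and a tree can
end up performing similarly at this scale.)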

> With this patch I believe we could also reduce the size of
> the hash table again or eliminate it completely, but we can
> evaluate and do that later.
>
> NOTE: Today's upstream DLM code has special code to
> pre-allocate RSB structures for faster lookup. This patch
> eliminates that step, since it doesn't have a resource name
> at that time for inserting new entries in the rb_tree.

We need to keep that; why do you say there's no resource name?
pre_rsb_struct() and get_rsb_struct() are specially designed to work
as they do because:

> @@ -367,28 +336,16 @@ static int get_rsb_struct(struct dlm_ls *ls, char *name, int len,
> struct dlm_rsb **r_ret)
> + r = dlm_allocate_rsb(ls);
> + if (!r)
> + return -ENOMEM;

That's not allowed here because a spinlock is held:

> spin_lock(&ls->ls_rsbtbl[bucket].lock);
>
> error = _search_rsb(ls, name, namelen, bucket, flags, &r);
> @@ -508,10 +492,6 @@ static int find_rsb(struct dlm_ls *ls, char *name, int namelen,
> goto out_unlock;
>
> error = get_rsb_struct(ls, name, namelen, &r);
> - if (error == -EAGAIN) {
> - spin_unlock(&ls->ls_rsbtbl[bucket].lock);
> - goto retry;
> - }
> if (error)
> goto out_unlock;

If you try to fix the problem above by releasing the spinlock between the
search and the malloc, then you have to repeat the search. Eliminating
the repeated search is the main reason for pre_rsb/get_rsb.
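
To make the constraint concrete, here is a minimal userspace sketch of the
pre-allocation pattern being defended; it is illustrative only, not the DLM
code, and the type and function names are made up. The node is allocated
with a blocking allocator before the spinlock is taken, so the critical
section never sleeps and the search never has to be repeated:

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

struct node { struct node *next; char name[64]; };

static struct node *head;
static pthread_spinlock_t lock; /* pthread_spin_init(&lock, 0) at startup */

static struct node *find_or_insert(const char *name)
{
        /* May block; that is fine because no lock is held yet. */
        struct node *prealloc = calloc(1, sizeof(*prealloc));
        struct node *n;

        if (!prealloc)
                return NULL;

        pthread_spin_lock(&lock); /* no allocation allowed past this point */
        for (n = head; n; n = n->next) {
                if (!strcmp(n->name, name)) {
                        pthread_spin_unlock(&lock);
                        free(prealloc); /* found: the spare is not needed */
                        return n;
                }
        }
        strncpy(prealloc->name, name, sizeof(prealloc->name) - 1);
        prealloc->next = head; /* not found: link in the spare */
        head = prealloc;
        pthread_spin_unlock(&lock);
        return prealloc;
}

The kernel version keeps a small per-lockspace stash (ls_new_rsb) rather
than a single local pointer, but the shape of the reasoning is the same.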


10-10-2011, 02:43 PM
David Teigland

DLM: Convert rsb data from linked list to rb_tree

On Sat, Oct 08, 2011 at 06:13:52AM -0400, Bob Peterson wrote:
> ----- Original Message -----
> | On Wed, Oct 05, 2011 at 03:25:39PM -0400, Bob Peterson wrote:
> | > Hi,
> | >
> | > This upstream patch changes the way DLM keeps track of RSBs.
> | > Before, they were in a linked list off a hash table. Now,
> | > they're an rb_tree off the same hash table. This speeds up
> | > DLM lookups greatly.
> | >
> | > Today's DLM is faster than older DLMs (e.g. in RHEL5) for many
> | > file systems due to the larger hash table size. However,
> | > this rb_tree implementation scales much better. For my
> | > 1000-directories-with-1000-files test, the patch doesn't
> | > show much of an improvement. But when I scale the file system
> | > to 4000 directories with 4000 files (16 million files), it
> | > helps greatly. The time to do rm -fR /mnt/gfs2/* drops from
> | > 42.01 hours to 23.68 hours.
> |
> | How many hash table buckets were you using in that test?
> | If it was the default (1024), I'd be interested to know how
> | 16k compares.
>
> Hi,
>
> Interestingly, on the stock 2.6.32-206.el6.x86_64 kernel
> and 16K hash buckets, the time was virtually the same as
> with my patch: 1405m46.519s (23.43 hours). So perhaps we
> should re-evaluate whether we should use the rb_tree
> implementation or just increase the hash buckets as needed.
> I guess the question is now mainly related to scaling and
> memory usage for all those hash tables.

I'm still interested in possibly using an rbtree with fewer hash buckets.

At the same time, I think the bigger problem may be why gfs2 is caching so
many locks in the first place, especially for millions of unlinked files
whose locks will never benefit you again.
 
10-10-2011, 03:51 PM
Steven Whitehouse

DLM: Convert rsb data from linked list to rb_tree

Hi,

On Mon, 2011-10-10 at 10:43 -0400, David Teigland wrote:
> On Sat, Oct 08, 2011 at 06:13:52AM -0400, Bob Peterson wrote:
> > ----- Original Message -----
> > | On Wed, Oct 05, 2011 at 03:25:39PM -0400, Bob Peterson wrote:
> > | > Hi,
> > | >
> > | > This upstream patch changes the way DLM keeps track of RSBs.
> > | > Before, they were in a linked list off a hash table. Now,
> > | > they're an rb_tree off the same hash table. This speeds up
> > | > DLM lookups greatly.
> > | >
> > | > Today's DLM is faster than older DLMs (e.g. in RHEL5) for many
> > | > file systems due to the larger hash table size. However,
> > | > this rb_tree implementation scales much better. For my
> > | > 1000-directories-with-1000-files test, the patch doesn't
> > | > show much of an improvement. But when I scale the file system
> > | > to 4000 directories with 4000 files (16 million files), it
> > | > helps greatly. The time to do rm -fR /mnt/gfs2/* drops from
> > | > 42.01 hours to 23.68 hours.
> > |
> > | How many hash table buckets were you using in that test?
> > | If it was the default (1024), I'd be interested to know how
> > | 16k compares.
> >
> > Hi,
> >
> > Interestingly, on the stock 2.6.32-206.el6.x86_64 kernel
> > and 16K hash buckets, the time was virtually the same as
> > with my patch: 1405m46.519s (23.43 hours). So perhaps we
> > should re-evaluate whether we should use the rb_tree
> > implementation or just increase the hash buckets as needed.
> > I guess the question is now mainly related to scaling and
> > memory usage for all those hash tables.
>
> I'm still interested in possibly using an rbtree with fewer hash buckets.
>
> At the same time, I think the bigger problem may be why gfs2 is caching so
> many locks in the first place, especially for millions of unlinked files
> whose locks will never benefit you again.
>

I doubt that it is caching locks for unlinked files for any great period
of time if there is any memory pressure. It is usually memory pressure
which instigates the dropping of glocks relating to inodes. If the
glock is unlocked then simply dropping the ref count on the glock will
cause the deallocation (dlm lock drop) to occur.

The reason why this tends to be seen at unlink time is just that
unlinking small files is a good way to iterate over a lot of inodes in a
relatively short time period, and thus locking can start to dominate the
timings if it isn't very low latency. I suspect there are lots of other
workloads which would see this (find perhaps?) as well. The faster the
array, the more noticeable the effect is likely to be.

Also, we do get some benefit if glocks are cached after inodes have been
unlinked. If an inode is then recreated in the same disk block, we'll
have all the glocks ready for it, without having to wait for them to be
set up from scratch.

I am slightly concerned that we ought to be testing with much greater
numbers of inodes for these kinds of tests. Considering the memory sizes
of modern servers, having 1m or even 10m files in cache is really not a
lot these days.

There is a second consideration in selecting the data structure for the
dlm lock/resource etc. tables, which is the locking. Using a simple hash
table does make it much easier to use RCU based locking which is a great
advantage if there are a lot of look up operations compared with the
number of insert/remove operations. That becomes a lot more difficult
(and requires a much greater lookup to insert/remove ratio) when trees
are involved.
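
For illustration, a hedged sketch of the lookup pattern being described
(kernel-style C; the glock type, its fields, and the table here are
invented for illustration rather than taken from gfs2):

#include <linux/types.h>
#include <linux/hash.h>
#include <linux/rculist.h>
#include <linux/atomic.h>

struct glock {
        struct hlist_node g_hash;
        atomic_t g_ref;
        u64 g_key;
};

static struct hlist_head glock_table[1 << 16]; /* one pointer per head */

/* Lockless read side: no spinlock, just an RCU read-side section. */
static struct glock *glock_lookup(u64 key)
{
        struct glock *gl;

        rcu_read_lock();
        hlist_for_each_entry_rcu(gl, &glock_table[hash_64(key, 16)], g_hash) {
                /* Take a reference before leaving the RCU section. */
                if (gl->g_key == key && atomic_inc_not_zero(&gl->g_ref)) {
                        rcu_read_unlock();
                        return gl;
                }
        }
        rcu_read_unlock();
        return NULL;
}

Insert and removal still take a per-bucket spinlock and publish with
hlist_add_head_rcu()/hlist_del_rcu(); that single-pointer publish is what
makes a flat chain so much easier to protect with RCU than a tree, where a
rebalance touches several nodes at once.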

We have run into the same kind of issues with GFS2's glock hash table.
We firstly reduced the size of the hash buckets to a single pointer so
that we could have a lot more hash heads for the given amount of memory.
Currently we have 2^16 hash table entries.

Also, we are now using RCU in order to reduce the locking overhead as
much as possible. We still have a spinlock for the lru list, but that is
not taken that much by comparison, and there is no locking in the lookup
path now except to disable preempt through the small critical section.

I'm not yet convinced that this is enough, even so. I would like to see
a tree based system for the glocks at some future point, but the
complexity of tree based RCU is currently putting me off, despite the
obvious benefits of a log N (where N is the number of items in the tree)
lookup compared with the N/2 average lookup time for a hash chain. We'd
need to add a seq lock on the read side which would take away some of
the benefit of using RCU in the first place.
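
To see what that seq lock would do to the read side, a minimal sketch
(illustrative only; it reuses the glock type from the sketch above, and
tree_lookup_rcu() is a hypothetical lockless tree walk, not an existing
function):

#include <linux/seqlock.h>

static seqcount_t glock_tree_seq; /* seqcount_init() at startup */

static struct glock *tree_lookup_rcu(u64 key); /* hypothetical walk */

static struct glock *glock_tree_lookup(u64 key)
{
        struct glock *gl;
        unsigned int seq;

        /* Retry the whole walk if a writer rebalanced the tree meanwhile. */
        do {
                seq = read_seqcount_begin(&glock_tree_seq);
                gl = tree_lookup_rcu(key);
        } while (read_seqcount_retry(&glock_tree_seq, seq));

        return gl;
}

Every reader now touches the shared seqcount cacheline and can be forced
to retry, which is exactly the cross-CPU traffic RCU was adopted to avoid.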

However, where we have scored real performance gains is simply by
avoiding looking up the glocks at all. Hence why we cache them in the
inodes, and this is also why the void pointer we pass to the dlm is a
pointer to the glock, so there are then no hash table lookups when we
receive dlm replies. I think there is scope to explore this option in
the dlm in a few places too.

Our most recent expansion of this concept in gfs2 was the addition of a
variable in the inode to cache the last resource group used. This
reduces the number of lookups of the resource group data structure
dramatically during block allocations and deallocations.
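
As a sketch of that idea (the cached field and the containment test are
assumptions for illustration; gfs2_blk2rgrpd() is the real slow-path
lookup, but the rest is not the actual gfs2 code):

/* Fast path: reuse the rgrp cached in the inode and fall back to the
 * tree search only when the block lands in a different rgrp. */
static struct gfs2_rgrpd *rgrp_for_block(struct gfs2_sbd *sdp,
                                         struct gfs2_inode *ip, u64 block)
{
        struct gfs2_rgrpd *rgd = ip->i_last_rgd; /* hypothetical cached field */

        if (rgd && rgd->rd_data0 <= block &&
            block < rgd->rd_data0 + rgd->rd_data) /* assumed extent fields */
                return rgd;

        rgd = gfs2_blk2rgrpd(sdp, block); /* the slow tree lookup */
        ip->i_last_rgd = rgd;
        return rgd;
}

Since most consecutive allocations for an inode fall in the same resource
group, the cached pointer short-circuits the lookup almost every time.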

Anyway, I just wanted to point out some of the tradeoffs of the possible
choices available in making gfs2/dlm scale to larger numbers of inodes.
I would very much like to have an RCU based rbtree (or similar) which
doesn't require the seq_lock if anybody has suggestions on how to
achieve that,

Steve.
 
10-10-2011, 05:01 PM
David Teigland

DLM: Convert rsb data from linked list to rb_tree

On Mon, Oct 10, 2011 at 04:51:01PM +0100, Steven Whitehouse wrote:
> Hi,
>
> On Mon, 2011-10-10 at 10:43 -0400, David Teigland wrote:
> > On Sat, Oct 08, 2011 at 06:13:52AM -0400, Bob Peterson wrote:
> > > ----- Original Message -----
> > > | On Wed, Oct 05, 2011 at 03:25:39PM -0400, Bob Peterson wrote:
> > > | > Hi,
> > > | >
> > > | > This upstream patch changes the way DLM keeps track of RSBs.
> > > | > Before, they were in a linked list off a hash table. Now,
> > > | > they're an rb_tree off the same hash table. This speeds up
> > > | > DLM lookups greatly.
> > > | >
> > > | > Today's DLM is faster than older DLMs (e.g. in RHEL5) for many
> > > | > file systems due to the larger hash table size. However,
> > > | > this rb_tree implementation scales much better. For my
> > > | > 1000-directories-with-1000-files test, the patch doesn't
> > > | > show much of an improvement. But when I scale the file system
> > > | > to 4000 directories with 4000 files (16 million files), it
> > > | > helps greatly. The time to do rm -fR /mnt/gfs2/* drops from
> > > | > 42.01 hours to 23.68 hours.
> > > |
> > > | How many hash table buckets were you using in that test?
> > > | If it was the default (1024), I'd be interested to know how
> > > | 16k compares.
> > >
> > > Hi,
> > >
> > > Interestingly, on the stock 2.6.32-206.el6.x86_64 kernel
> > > and 16K hash buckets, the time was virtually the same as
> > > with my patch: 1405m46.519s (23.43 hours). So perhaps we
> > > should re-evaluate whether we should use the rb_tree
> > > implementation or just increase the hash buckets as needed.
> > > I guess the question is now mainly related to scaling and
> > > memory usage for all those hash tables.
> >
> > I'm still interested in possibly using an rbtree with fewer hash buckets.
> >
> > At the same time, I think the bigger problem may be why gfs2 is caching so
> > many locks in the first place, especially for millions of unlinked files
> > whose locks will never benefit you again.
> >
>
> I doubt that it is caching locks for unlinked files for any great period
> of time if there is any memory pressure. It is usually memory pressure
> which instigates the dropping of glocks relating to inodes. If the
> glock is unlocked then simply dropping the ref count on the glock will
> cause the deallocation (dlm lock drop) to occur.
>
> The reason why this tends to be seen at unlink time is just that
> unlinking small files is a good way to iterate over a lot of inodes in a
> relatively short time period, and thus locking can start to dominate the
> timings if it isn't very low latency. I suspect there are lots of other
> workloads which would see this (find perhaps?) as well. The faster the
> array, the more noticeable the effect is likely to be.
>
> Also, we do get some benefit if glocks are cached after inodes have been
> unlinked. If an inode is then recreated in the same disk block, we'll
> have all the glocks ready for it, without having to wait for them to be
> set up from scratch.
>
> I am slightly concerned that we ought to be testing with much greater
> numbers of inodes for these kinds of tests. Considering the memory sizes
> of modern servers, having 1m or even 10m files in cache is really not a
> lot these days.

The fact remains that caching "as much as possible" tends to be harmful,
and some careful limiting would be a good investment.

> There is a second consideration in selecting the data structure for the
> dlm lock/resource etc. tables, which is the locking. Using a simple hash
> table does make it much easier to use RCU based locking which is a great
> advantage if there are a lot of look up operations compared with the
> number of insert/remove operations. That becomes a lot more difficult
> (and requires a much greater lookup to insert/remove ratio) when trees
> are involved.
>
> We have run into the same kind of issues with GFS2's glock hash table.
> We firstly reduced the size of the hash buckets to a single pointer so
> that we could have a lot more hash heads for the given amount of memory.
> Currently we have 2^16 hash table entries.
>
> Also, we are now using RCU in order to reduce the locking overhead as
> much as possible. We still have a spinlock for the lru list, but that is
> not taken that much by comparison, and there is no locking in the lookup
> path now except to disable preempt through the small critical section.
>
> I'm not yet convinced that this is enough, even so. I would like to see
> a tree based system for the glocks at some future point, but the
> complexity of tree based RCU is currently putting me off, despite the
> obvious benefits of a log N (where N is the number of items in the tree)
> lookup compared with the N/2 average lookup time for a hash chain. We'd
> need to add a seq lock on the read side which would take away some of
> the benefit of using RCU in the first place.
>
> However, where we have scored real performance gains is simply by
> avoiding looking up the glocks at all. Hence why we cache them in the
> inodes, and this is also why the void pointer we pass to the dlm is a
> pointer to the glock, so there are then no hash table lookups when we
> receive dlm replies. I think there is scope to explore this option in
> the dlm in a few places too.
>
> Our most recent expansion of this concept in gfs2 was the addition of a
> variable in the inode to cache the last resource group used. This
> reduces the number of lookups of the resource group data structure
> dramatically during block allocations and deallocations.
>
> Anyway, I just wanted to point out some of the tradeoffs of the possible
> choices available in making gfs2/dlm scale to larger numbers of inodes.
> I would very much like to have an RCU based rbtree (or similar) which
> doesn't require the seq_lock if anybody has suggestions on how to
> achieve that,

While you don't want to be blatantly inefficient, I suspect investing
time into things like RCU runs afoul of Amdahl's Law, i.e. you could
reduce the time of the data structure operations to 0 and not affect
real world performance.

When deciding where to focus improvements, I think cache limiting could
produce a big payoff, and data structure optimizations a small payoff.
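
For what it's worth, the kind of limiting being suggested could look like
this sketch (purely illustrative: the cached_lock type, the budget, and
free_cached_lock() are made up, and a real version would more likely hook
into the shrinker machinery):

#include <linux/list.h>
#include <linux/spinlock.h>

#define LOCK_CACHE_MAX 100000 /* hypothetical budget */

struct cached_lock {
        struct list_head lru;
        /* ... the lock itself ... */
};

static LIST_HEAD(lock_lru);
static unsigned long lock_lru_count;
static DEFINE_SPINLOCK(lock_lru_lock);

static void free_cached_lock(struct cached_lock *cl); /* drops the dlm lock */

/* Called when a lock becomes unused: cache it, but trim eagerly rather
 * than waiting for VM pressure to build. */
static void lock_cache_put(struct cached_lock *cl)
{
        struct cached_lock *victim;

        spin_lock(&lock_lru_lock);
        list_add(&cl->lru, &lock_lru);
        lock_lru_count++;
        while (lock_lru_count > LOCK_CACHE_MAX) {
                victim = list_last_entry(&lock_lru, struct cached_lock, lru);
                list_del(&victim->lru);
                lock_lru_count--;
                spin_unlock(&lock_lru_lock); /* don't free under the lock */
                free_cached_lock(victim);
                spin_lock(&lock_lru_lock);
        }
        spin_unlock(&lock_lru_lock);
}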
 
10-10-2011, 07:00 PM
Steven Whitehouse

DLM: Convert rsb data from linked list to rb_tree

Hi,

On Mon, 2011-10-10 at 13:01 -0400, David Teigland wrote:
> On Mon, Oct 10, 2011 at 04:51:01PM +0100, Steven Whitehouse wrote:
> > Hi,
> >
> > On Mon, 2011-10-10 at 10:43 -0400, David Teigland wrote:
> > > On Sat, Oct 08, 2011 at 06:13:52AM -0400, Bob Peterson wrote:
> > > > ----- Original Message -----
> > > > | On Wed, Oct 05, 2011 at 03:25:39PM -0400, Bob Peterson wrote:
> > > > | > Hi,
> > > > | >
> > > > | > This upstream patch changes the way DLM keeps track of RSBs.
> > > > | > Before, they were in a linked list off a hash table. Now,
> > > > | > they're an rb_tree off the same hash table. This speeds up
> > > > | > DLM lookups greatly.
> > > > | >
> > > > | > Today's DLM is faster than older DLMs (e.g. in RHEL5) for many
> > > > | > file systems due to the larger hash table size. However,
> > > > | > this rb_tree implementation scales much better. For my
> > > > | > 1000-directories-with-1000-files test, the patch doesn't
> > > > | > show much of an improvement. But when I scale the file system
> > > > | > to 4000 directories with 4000 files (16 million files), it
> > > > | > helps greatly. The time to do rm -fR /mnt/gfs2/* drops from
> > > > | > 42.01 hours to 23.68 hours.
> > > > |
> > > > | How many hash table buckets were you using in that test?
> > > > | If it was the default (1024), I'd be interested to know how
> > > > | 16k compares.
> > > >
> > > > Hi,
> > > >
> > > > Interestingly, on the stock 2.6.32-206.el6.x86_64 kernel
> > > > and 16K hash buckets, the time was virtually the same as
> > > > with my patch: 1405m46.519s (23.43 hours). So perhaps we
> > > > should re-evaluate whether we should use the rb_tree
> > > > implementation or just increase the hash buckets as needed.
> > > > I guess the question is now mainly related to scaling and
> > > > memory usage for all those hash tables.
> > >
> > > I'm still interested in possibly using an rbtree with fewer hash buckets.
> > >
> > > At the same time, I think the bigger problem may be why gfs2 is caching so
> > > many locks in the first place, especially for millions of unlinked files
> > > whose locks will never benefit you again.
> > >
> >
> > I doubt that it is caching locks for unlinked files for any great period
> > of time if there is any memory pressure. It is usually memory pressure
> > which instigates the dropping of glocks relating to inodes. If the
> > glock is unlocked then simply dropping the ref count on the glock will
> > cause the deallocation (dlm lock drop) to occur.
> >
> > The reason why this tends to be seen at unlink time is just that
> > unlinking small files is a good way to iterate over a lot of inodes in a
> > relatively short time period, and thus locking can start to dominate the
> > timings if it isn't very low latency. I suspect there are lots of other
> > workloads which would see this (find perhaps?) as well. The faster the
> > array, the more noticeable the effect is likely to be.
> >
> > Also, we do get some benefit if glocks are cached after inodes have been
> > unlinked. If an inode is then recreated in the same disk block, we'll
> > have all the glocks ready for it, without having to wait for them to be
> > set up from scratch.
> >
> > I am slightly concerned that we ought to be testing with much greater
> > numbers of inodes for these kinds of tests. Considering the memory sizes
> > of modern servers, having 1m or even 10m files in cache is really not a
> > lot these days.
>
> The fact remains that caching "as much as possible" tends to be harmful,
> and some careful limiting would be a good investment.
>
There is a limit. The point is that the limit is dynamic and depends on
memory pressure. The VM requests a reduction in the number of cached
objects when it wants to reduce the size of what we have cached. So the
larger the amount of RAM, the more inodes may be potentially cached.

I don't agree that caching as much as possible (given the constraints
just mentioned) is bad. The more we cache, the more disk I/O is avoided
so the better the performance will be.

> > There is a second consideration in selecting the data structure for the
> > dlm lock/resource etc. tables, which is the locking. Using a simple hash
> > table does make it much easier to use RCU based locking which is a great
> > advantage if there are a lot of look up operations compared with the
> > number of insert/remove operations. That becomes a lot more difficult
> > (and requires a much greater lookup to insert/remove ratio) when trees
> > are involved.
> >
> > We have run into the same kind of issues with GFS2's glock hash table.
> > We firstly reduced the size of the hash buckets to a single pointer so
> > that we could have a lot more hash heads for the given amount of memory.
> > Currently we have 2^16 hash table entries.
> >
> > Also, we are now using RCU in order to reduce the locking overhead as
> > much as possible. We still have a spinlock for the lru list, but that is
> > not taken that much by comparison, and there is no locking in the lookup
> > path now except to disable preempt through the small critical section.
> >
> > I'm not yet convinced that this is enough, even so. I would like to see
> > a tree based system for the glocks at some future point, but the
> > complexity of tree based RCU is currently putting me off, despite the
> > obvious benefits of a log N (where N is the number of items in the tree)
> > lookup compared with the N/2 average lookup time for a hash chain. We'd
> > need to add a seq lock on the read side which would take away some of
> > the benefit of using RCU in the first place.
> >
> > However, where we have scored real performance gains is simply by
> > avoiding looking up the glocks at all. Hence why we cache them in the
> > inodes, and this is also why the void pointer we pass to the dlm is a
> > pointer to the glock, so there are then no hash table lookups when we
> > receive dlm replies. I think there is scope to explore this option in
> > the dlm in a few places too.
> >
> > Our most recent expansion of this concept in gfs2 was the addition of a
> > variable in the inode to cache the last resource group used. This
> > reduces the number of lookups of the resource group data structure
> > dramatically during block allocations and deallocations.
> >
> > Anyway, I just wanted to point out some of the tradeoffs of the possible
> > choices available in making gfs2/dlm scale to larger numbers of inodes.
> > I would very much like to have an RCU based rbtree (or similar) which
> > doesn't require the seq_lock if anybody has suggestions on how to
> > achieve that,
>
> While you don't want to be blatantly inefficient, I suspect investing
> time into things like RCU runs afoul of Amdahl's Law, i.e. you could
> reduce the time of the data structure operations to 0 and not affect
> real world performance.
>
> When deciding where to focus improvements, I think cache limiting could
> produce a big payoff, and data structure optimizations a small payoff.
>

We already have a limit (as per comments above) and I'm not sure whether
you are proposing any further constraints here. The fact remains though
that we ought to be able to make maximum use of whatever RAM is
available on the server, and it makes no sense to artificially limit the
amount of RAM used by glocks/dlm locks (i.e. number of objects which are
cached) beyond that point.

As a consequence we need to ensure that locking operations do not slow
down too much as the number of locks increases. I had thought that we
both were agreed on that point, at least, even if not what to do about
it.

The point of RCU is to avoid the cache pollution associated with
reader/writer locks, and that can be a big win, depending on the access
pattern to the data structure. So it does depend a great deal on that
pattern as to what pay off you get, but it was certainly worth doing for
the glocks.

One side effect of this is that nothing can ever block a glock dump now.
That can potentially be useful for debugging, although it was not the
main reason for doing it.

I would like to use RCU for resource groups too - they are "read only"
from mount time to umount time (aside from when we grow the filesystem)
but the tree structure we have for resource groups rather puts me off.
Most of the performance gain we've got so far has been down to the
caching of the resource groups in the inode, and there may be little
left to gain from RCU. That said, we'll be testing it soon I hope, and
we'll see if it is worth doing or not,

Steve.
 
10-10-2011, 07:33 PM
David Teigland

DLM: Convert rsb data from linked list to rb_tree

On Mon, Oct 10, 2011 at 08:00:07PM +0100, Steven Whitehouse wrote:
> > The fact remains that caching "as much as possible" tends to be harmful,
> > and some careful limiting would be a good investment.
> >
> There is a limit. The point is that the limit is dynamic and depends on
> memory pressure.

the "as possible" part of "as much as possible".

> The VM requests a reduction in the number of cached
> objects when it wants to reduce the size of what we have cached. So the
> larger the amount of RAM, the more inodes may be potentially cached.
>
> I don't agree that caching as much as possible (given the constraints
> just mentioned) is bad.

the "tends to" part meant that it can be good or bad, depending.

> The more we cache, the more disk I/O is avoided so the better the
> performance will be.

You don't need to argue that more caching can be good. My point is it's
not universally true in practice, as Bob's tests show.
 
10-25-2011, 11:13 PM
David Teigland

DLM: Convert rsb data from linked list to rb_tree

Hi Bob, I've made a few minor/cosmetic changes and attached my current
version (not tested yet).

> static int shrink_bucket(struct dlm_ls *ls, int b)
> {
> + struct rb_node *n = NULL;
> struct dlm_rsb *r;
> int count = 0, found;
>
> for (;;) {
> found = 0;
> spin_lock(&ls->ls_rsbtbl[b].lock);
> - list_for_each_entry_reverse(r, &ls->ls_rsbtbl[b].toss,
> - res_hashchain) {
> + for (n = rb_first(&ls->ls_rsbtbl[b].toss); n;
> + n = rb_next(&r->res_hashnode)) {
> + r = rb_entry(n, struct dlm_rsb, res_hashnode);
> + if (unlikely(&r->res_hashnode == n)) {
> + spin_unlock(&ls->ls_rsbtbl[b].lock);
> + return 0;
> + }

Isn't that unlikely statement actually always true?
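
(It is: r has just been computed from n via rb_entry(n, struct dlm_rsb,
res_hashnode), so &r->res_hashnode == n holds by construction, and the
early return would fire on the first iteration.)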

> if (!time_after_eq(jiffies, r->res_toss_time +
> dlm_config.ci_toss_secs * HZ))
> continue;
> @@ -1108,7 +1167,7 @@ static int shrink_bucket(struct dlm_ls *ls, int b)
> }
>
> if (kref_put(&r->res_ref, kill_rsb)) {
> - list_del(&r->res_hashchain);
> + rb_erase(&r->res_hashnode, &ls->ls_rsbtbl[b].toss);
> spin_unlock(&ls->ls_rsbtbl[b].lock);
>
> if (is_master(r))


diff --git a/fs/dlm/debug_fs.c b/fs/dlm/debug_fs.c
index 5977923..a834f6b 100644
--- a/fs/dlm/debug_fs.c
+++ b/fs/dlm/debug_fs.c
@@ -393,6 +393,7 @@ static const struct seq_operations format3_seq_ops;

static void *table_seq_start(struct seq_file *seq, loff_t *pos)
{
+ struct rb_node *node;
struct dlm_ls *ls = seq->private;
struct rsbtbl_iter *ri;
struct dlm_rsb *r;
@@ -418,9 +419,10 @@ static void *table_seq_start(struct seq_file *seq, loff_t *pos)
ri->format = 3;

spin_lock(&ls->ls_rsbtbl[bucket].lock);
- if (!list_empty(&ls->ls_rsbtbl[bucket].list)) {
- list_for_each_entry(r, &ls->ls_rsbtbl[bucket].list,
- res_hashchain) {
+ if (!RB_EMPTY_ROOT(&ls->ls_rsbtbl[bucket].keep)) {
+ for (node = rb_first(&ls->ls_rsbtbl[bucket].keep); node;
+ node = rb_next(node)) {
+ r = rb_entry(node, struct dlm_rsb, res_hashnode);
if (!entry--) {
dlm_hold_rsb(r);
ri->rsb = r;
@@ -449,9 +451,8 @@ static void *table_seq_start(struct seq_file *seq, loff_t *pos)
}

spin_lock(&ls->ls_rsbtbl[bucket].lock);
- if (!list_empty(&ls->ls_rsbtbl[bucket].list)) {
- r = list_first_entry(&ls->ls_rsbtbl[bucket].list,
- struct dlm_rsb, res_hashchain);
+ if (!RB_EMPTY_ROOT(&ls->ls_rsbtbl[bucket].keep)) {
+ r = rb_entry(node, struct dlm_rsb, res_hashnode);
dlm_hold_rsb(r);
ri->rsb = r;
ri->bucket = bucket;
@@ -467,7 +468,7 @@ static void *table_seq_next(struct seq_file *seq, void *iter_ptr, loff_t *pos)
{
struct dlm_ls *ls = seq->private;
struct rsbtbl_iter *ri = iter_ptr;
- struct list_head *next;
+ struct rb_node *next;
struct dlm_rsb *r, *rp;
loff_t n = *pos;
unsigned bucket;
@@ -480,10 +481,10 @@ static void *table_seq_next(struct seq_file *seq, void *iter_ptr, loff_t *pos)

spin_lock(&ls->ls_rsbtbl[bucket].lock);
rp = ri->rsb;
- next = rp->res_hashchain.next;
+ next = rb_next(&rp->res_hashnode);

- if (next != &ls->ls_rsbtbl[bucket].list) {
- r = list_entry(next, struct dlm_rsb, res_hashchain);
+ if (next) {
+ r = rb_entry(next, struct dlm_rsb, res_hashnode);
dlm_hold_rsb(r);
ri->rsb = r;
spin_unlock(&ls->ls_rsbtbl[bucket].lock);
@@ -511,9 +512,9 @@ static void *table_seq_next(struct seq_file *seq, void *iter_ptr, loff_t *pos)
}

spin_lock(&ls->ls_rsbtbl[bucket].lock);
- if (!list_empty(&ls->ls_rsbtbl[bucket].list)) {
- r = list_first_entry(&ls->ls_rsbtbl[bucket].list,
- struct dlm_rsb, res_hashchain);
+ if (!RB_EMPTY_ROOT(&ls->ls_rsbtbl[bucket].keep)) {
+ next = rb_first(&ls->ls_rsbtbl[bucket].keep);
+ r = rb_entry(next, struct dlm_rsb, res_hashnode);
dlm_hold_rsb(r);
ri->rsb = r;
ri->bucket = bucket;
diff --git a/fs/dlm/dlm_internal.h b/fs/dlm/dlm_internal.h
index fe2860c..5685a9a 100644
--- a/fs/dlm/dlm_internal.h
+++ b/fs/dlm/dlm_internal.h
@@ -103,8 +103,8 @@ struct dlm_dirtable {
};

struct dlm_rsbtable {
- struct list_head list;
- struct list_head toss;
+ struct rb_root keep;
+ struct rb_root toss;
spinlock_t lock;
};

@@ -285,7 +285,10 @@ struct dlm_rsb {
unsigned long res_toss_time;
uint32_t res_first_lkid;
struct list_head res_lookup; /* lkbs waiting on first */
- struct list_head res_hashchain; /* rsbtbl */
+ union {
+ struct list_head res_hashchain;
+ struct rb_node res_hashnode; /* rsbtbl */
+ };
struct list_head res_grantqueue;
struct list_head res_convertqueue;
struct list_head res_waitqueue;
diff --git a/fs/dlm/lock.c b/fs/dlm/lock.c
index 83b5e32..1373f62 100644
--- a/fs/dlm/lock.c
+++ b/fs/dlm/lock.c
@@ -56,6 +56,7 @@
L: receive_xxxx_reply() <- R: send_xxxx_reply()
*/
#include <linux/types.h>
+#include <linux/rbtree.h>
#include <linux/slab.h>
#include "dlm_internal.h"
#include <linux/dlm_device.h>
@@ -380,6 +381,8 @@ static int get_rsb_struct(struct dlm_ls *ls, char *name, int len,

r = list_first_entry(&ls->ls_new_rsb, struct dlm_rsb, res_hashchain);
list_del(&r->res_hashchain);
+ /* Convert the empty list_head to a NULL rb_node for tree usage: */
+ memset(&r->res_hashnode, 0, sizeof(struct rb_node));
ls->ls_new_rsb_count--;
spin_unlock(&ls->ls_new_rsb_spin);

@@ -388,7 +391,6 @@ static int get_rsb_struct(struct dlm_ls *ls, char *name, int len,
memcpy(r->res_name, name, len);
mutex_init(&r->res_mutex);

- INIT_LIST_HEAD(&r->res_hashchain);
INIT_LIST_HEAD(&r->res_lookup);
INIT_LIST_HEAD(&r->res_grantqueue);
INIT_LIST_HEAD(&r->res_convertqueue);
@@ -400,14 +402,31 @@ static int get_rsb_struct(struct dlm_ls *ls, char *name, int len,
return 0;
}

-static int search_rsb_list(struct list_head *head, char *name, int len,
+static int rsb_cmp(struct dlm_rsb *r, const char *name, int nlen)
+{
+ char maxname[DLM_RESNAME_MAXLEN];
+
+ memset(maxname, 0, DLM_RESNAME_MAXLEN);
+ memcpy(maxname, name, nlen);
+ return memcmp(r->res_name, maxname, DLM_RESNAME_MAXLEN);
+}
+
+static int search_rsb_tree(struct rb_root *tree, char *name, int len,
unsigned int flags, struct dlm_rsb **r_ret)
{
+ struct rb_node **node = &tree->rb_node;
struct dlm_rsb *r;
int error = 0;
-
- list_for_each_entry(r, head, res_hashchain) {
- if (len == r->res_length && !memcmp(name, r->res_name, len))
+ int rc;
+
+ while (*node) {
+ r = rb_entry(*node, struct dlm_rsb, res_hashnode);
+ rc = rsb_cmp(r, name, len);
+ if (rc < 0)
+ node = &((*node)->rb_left);
+ else if (rc > 0)
+ node = &((*node)->rb_right);
+ else
goto found;
}
*r_ret = NULL;
@@ -420,22 +439,54 @@ static int search_rsb_list(struct list_head *head, char *name, int len,
return error;
}

+static int rsb_insert(struct dlm_rsb *rsb, struct rb_root *tree)
+{
+ struct rb_node **newn = &tree->rb_node;
+ struct rb_node *parent = NULL;
+ int rc;
+
+ while (*newn) {
+ struct dlm_rsb *cur = rb_entry(*newn, struct dlm_rsb,
+ res_hashnode);
+
+ parent = *newn;
+ rc = rsb_cmp(cur, rsb->res_name, rsb->res_length);
+ if (rc < 0)
+ newn = &parent->rb_left;
+ else if (rc > 0)
+ newn = &parent->rb_right;
+ else {
+ log_print("rsb_insert match");
+ dlm_dump_rsb(rsb);
+ dlm_dump_rsb(cur);
+ return -EEXIST;
+ }
+ }
+
+ rb_link_node(&rsb->res_hashnode, parent, newn);
+ rb_insert_color(&rsb->res_hashnode, tree);
+ return 0;
+}
+
static int _search_rsb(struct dlm_ls *ls, char *name, int len, int b,
unsigned int flags, struct dlm_rsb **r_ret)
{
struct dlm_rsb *r;
int error;

- error = search_rsb_list(&ls->ls_rsbtbl[b].list, name, len, flags, &r);
+ error = search_rsb_tree(&ls->ls_rsbtbl[b].keep, name, len, flags, &r);
if (!error) {
kref_get(&r->res_ref);
goto out;
}
- error = search_rsb_list(&ls->ls_rsbtbl[b].toss, name, len, flags, &r);
+ error = search_rsb_tree(&ls->ls_rsbtbl[b].toss, name, len, flags, &r);
if (error)
goto out;

- list_move(&r->res_hashchain, &ls->ls_rsbtbl[b].list);
+ rb_erase(&r->res_hashnode, &ls->ls_rsbtbl[b].toss);
+ error = rsb_insert(r, &ls->ls_rsbtbl[b].keep);
+ if (error)
+ return error;

if (dlm_no_directory(ls))
goto out;
@@ -527,8 +578,7 @@ static int find_rsb(struct dlm_ls *ls, char *name, int namelen,
nodeid = 0;
r->res_nodeid = nodeid;
}
- list_add(&r->res_hashchain, &ls->ls_rsbtbl[bucket].list);
- error = 0;
+ error = rsb_insert(r, &ls->ls_rsbtbl[bucket].keep);
out_unlock:
spin_unlock(&ls->ls_rsbtbl[bucket].lock);
out:
@@ -556,7 +606,8 @@ static void toss_rsb(struct kref *kref)

DLM_ASSERT(list_empty(&r->res_root_list), dlm_print_rsb(r););
kref_init(&r->res_ref);
- list_move(&r->res_hashchain, &ls->ls_rsbtbl[r->res_bucket].toss);
+ rb_erase(&r->res_hashnode, &ls->ls_rsbtbl[r->res_bucket].keep);
+ rsb_insert(r, &ls->ls_rsbtbl[r->res_bucket].toss);
r->res_toss_time = jiffies;
if (r->res_lvbptr) {
dlm_free_lvb(r->res_lvbptr);
@@ -1082,19 +1133,19 @@ static void dir_remove(struct dlm_rsb *r)
r->res_name, r->res_length);
}

-/* FIXME: shouldn't this be able to exit as soon as one non-due rsb is
- found since they are in order of newest to oldest? */
+/* FIXME: make this more efficient */

static int shrink_bucket(struct dlm_ls *ls, int b)
{
+ struct rb_node *n;
struct dlm_rsb *r;
int count = 0, found;

for (;;) {
found = 0;
spin_lock(&ls->ls_rsbtbl[b].lock);
- list_for_each_entry_reverse(r, &ls->ls_rsbtbl[b].toss,
- res_hashchain) {
+ for (n = rb_first(&ls->ls_rsbtbl[b].toss); n; n = rb_next(n)) {
+ r = rb_entry(n, struct dlm_rsb, res_hashnode);
if (!time_after_eq(jiffies, r->res_toss_time +
dlm_config.ci_toss_secs * HZ))
continue;
@@ -1108,7 +1159,7 @@ static int shrink_bucket(struct dlm_ls *ls, int b)
}

if (kref_put(&r->res_ref, kill_rsb)) {
- list_del(&r->res_hashchain);
+ rb_erase(&r->res_hashnode, &ls->ls_rsbtbl[b].toss);
spin_unlock(&ls->ls_rsbtbl[b].lock);

if (is_master(r))
@@ -4441,10 +4492,12 @@ int dlm_purge_locks(struct dlm_ls *ls)

static struct dlm_rsb *find_purged_rsb(struct dlm_ls *ls, int bucket)
{
+ struct rb_node *n;
struct dlm_rsb *r, *r_ret = NULL;

spin_lock(&ls->ls_rsbtbl[bucket].lock);
- list_for_each_entry(r, &ls->ls_rsbtbl[bucket].list, res_hashchain) {
+ for (n = rb_first(&ls->ls_rsbtbl[bucket].keep); n; n = rb_next(n)) {
+ r = rb_entry(n, struct dlm_rsb, res_hashnode);
if (!rsb_flag(r, RSB_LOCKS_PURGED))
continue;
hold_rsb(r);
diff --git a/fs/dlm/lockspace.c b/fs/dlm/lockspace.c
index a1d8f1a..3a1319c 100644
--- a/fs/dlm/lockspace.c
+++ b/fs/dlm/lockspace.c
@@ -457,8 +457,8 @@ static int new_lockspace(const char *name, int namelen, void **lockspace,
if (!ls->ls_rsbtbl)
goto out_lsfree;
for (i = 0; i < size; i++) {
- INIT_LIST_HEAD(&ls->ls_rsbtbl[i].list);
- INIT_LIST_HEAD(&ls->ls_rsbtbl[i].toss);
+ ls->ls_rsbtbl[i].keep.rb_node = NULL;
+ ls->ls_rsbtbl[i].toss.rb_node = NULL;
spin_lock_init(&ls->ls_rsbtbl[i].lock);
}

@@ -685,7 +685,7 @@ static int lockspace_busy(struct dlm_ls *ls, int force)
static int release_lockspace(struct dlm_ls *ls, int force)
{
struct dlm_rsb *rsb;
- struct list_head *head;
+ struct rb_node *n;
int i, busy, rv;

busy = lockspace_busy(ls, force);
@@ -746,26 +746,22 @@ static int release_lockspace(struct dlm_ls *ls, int force)
*/

for (i = 0; i < ls->ls_rsbtbl_size; i++) {
- head = &ls->ls_rsbtbl[i].list;
- while (!list_empty(head)) {
- rsb = list_entry(head->next, struct dlm_rsb,
- res_hashchain);
-
- list_del(&rsb->res_hashchain);
+ while ((n = rb_first(&ls->ls_rsbtbl[i].keep))) {
+ rsb = rb_entry(n, struct dlm_rsb, res_hashnode);
+ rb_erase(n, &ls->ls_rsbtbl[i].keep);
dlm_free_rsb(rsb);
}

- head = &ls->ls_rsbtbl[i].toss;
- while (!list_empty(head)) {
- rsb = list_entry(head->next, struct dlm_rsb,
- res_hashchain);
- list_del(&rsb->res_hashchain);
+ while ((n = rb_first(&ls->ls_rsbtbl[i].toss))) {
+ rsb = rb_entry(n, struct dlm_rsb, res_hashnode);
+ rb_erase(n, &ls->ls_rsbtbl[i].toss);
dlm_free_rsb(rsb);
}
}

vfree(ls->ls_rsbtbl);

+ /* Free up the new rsb list */
while (!list_empty(&ls->ls_new_rsb)) {
rsb = list_first_entry(&ls->ls_new_rsb, struct dlm_rsb,
res_hashchain);
list_del(&rsb->res_hashchain);
dlm_free_rsb(rsb);
}
diff --git a/fs/dlm/memory.c b/fs/dlm/memory.c
index da64df7..81b4802 100644
--- a/fs/dlm/memory.c
+++ b/fs/dlm/memory.c
@@ -64,6 +64,7 @@ struct dlm_rsb *dlm_allocate_rsb(struct dlm_ls *ls)
struct dlm_rsb *r;

r = kmem_cache_zalloc(rsb_cache, GFP_NOFS);
+ INIT_LIST_HEAD(&r->res_hashchain);
return r;
}

diff --git a/fs/dlm/recover.c b/fs/dlm/recover.c
index 1463823..50467ce 100644
--- a/fs/dlm/recover.c
+++ b/fs/dlm/recover.c
@@ -715,6 +715,7 @@ void dlm_recover_rsbs(struct dlm_ls *ls)

int dlm_create_root_list(struct dlm_ls *ls)
{
+ struct rb_node *n;
struct dlm_rsb *r;
int i, error = 0;

@@ -727,7 +728,8 @@ int dlm_create_root_list(struct dlm_ls *ls)

for (i = 0; i < ls->ls_rsbtbl_size; i++) {
spin_lock(&ls->ls_rsbtbl[i].lock);
- list_for_each_entry(r, &ls->ls_rsbtbl[i].list, res_hashchain) {
+ for (n = rb_first(&ls->ls_rsbtbl[i].keep); n; n = rb_next(n)) {
+ r = rb_entry(n, struct dlm_rsb, res_hashnode);
list_add(&r->res_root_list, &ls->ls_root_list);
dlm_hold_rsb(r);
}
@@ -741,7 +743,8 @@ int dlm_create_root_list(struct dlm_ls *ls)
continue;
}

- list_for_each_entry(r, &ls->ls_rsbtbl[i].toss, res_hashchain) {
+ for (n = rb_first(&ls->ls_rsbtbl[i].toss); n; n = rb_next(n)) {
+ r = rb_entry(n, struct dlm_rsb, res_hashnode);
list_add(&r->res_root_list, &ls->ls_root_list);
dlm_hold_rsb(r);
}
@@ -771,16 +774,18 @@ void dlm_release_root_list(struct dlm_ls *ls)

void dlm_clear_toss_list(struct dlm_ls *ls)
{
- struct dlm_rsb *r, *safe;
+ struct rb_node *n, *next;
+ struct dlm_rsb *rsb;
int i;

for (i = 0; i < ls->ls_rsbtbl_size; i++) {
spin_lock(&ls->ls_rsbtbl[i].lock);
- list_for_each_entry_safe(r, safe, &ls->ls_rsbtbl[i].toss,
- res_hashchain) {
- if (dlm_no_directory(ls) || !is_master(r)) {
- list_del(&r->res_hashchain);
- dlm_free_rsb(r);
+ for (n = rb_first(&ls->ls_rsbtbl[i].toss); n; n = next) {
next = rb_next(n);
+ rsb = rb_entry(n, struct dlm_rsb, res_hashnode);
+ if (dlm_no_directory(ls) || !is_master(rsb)) {
+ rb_erase(n, &ls->ls_rsbtbl[i].toss);
+ dlm_free_rsb(rsb);
}
}
spin_unlock(&ls->ls_rsbtbl[i].lock);
 
