GFS2: Add readahead to sequential directory traversal
Hi,
This patch adds read-ahead capability to GFS2's
directory hash table management. It greatly improves
performance for some directory operations. For example:
In one of my file systems that has 1000 directories, each
of which has 1000 files, time to execute a recursive
ls (time ls -fR /mnt/gfs2 > /dev/null) was reduced
from 2m2.814s on a stock kernel to 0m45.938s.
/**
* dir_e_read - Reads the entries from a directory into a filldir buffer
@@ -1406,6 +1437,8 @@ static int dir_e_read(struct inode *inode, u64 *offset, void *opaque,
if (IS_ERR(lp))
return PTR_ERR(lp);
GFS2: Add readahead to sequential directory traversal
----- Original Message -----
| Hi,
|
| Were we not intending to make this turn itself off in cases where it
| produces no benefit? I thought the plan was to track the state via
| the
| file descriptor in order to avoid readingahead the same blocks over
| and
| over again too,
|
| Steve.
Hi Steve,
Based on our discussion, I revised the patch to implement a
scheme whereby we don't duplicate read-ahead requests. However,
I decided against using the file descriptor in favor of a
simple bitmap that keeps track of where we've done read-ahead.
So now instead of a hash table in the inode, I have a structure
that contains two pointers: (1) hash table as before, and
(2) the bitmap to keep track of read-ahead requests.
I'm very happy with the results. I've got a file system with
500,000 files in a single directory. On my RHEL5 system, when
I perform the command: "time ls -fR /mnt/gfs2 > /dev/null" I
get these times:
Stock pre-release 5.8: 0m34.481s
Previously posted patch: 0m18.811s
With this patch: 0m9.843s
which is more than three times faster, and nearly twice as
fast as the version I previously posted. I haven't tried my
"1000x1000" test on it yet.
The upstream version of the patch is given below.
There is one noticeable difference between RHEL5 and upstream:
I changed kmalloc to vmalloc. Not sure your feelings on that.
Also, I have a feeling you're going to ask me to remove the
#define MAX_RA_BLOCKS and hard-code it to 32. What can I say?
Regards,
Bob Peterson
Red Hat File Systems
--
commit 19ec5ffb5eaf915944861f8ebd4c065818a238c5
Author: Bob Peterson <rpeterso@redhat.com>
Date: Thu Oct 6 10:34:05 2011 -0500
GFS2: Add readahead to sequential directory traversal
This patch adds read-ahead capability to GFS2's
directory hash table management. It greatly improves
performance for some directory operations. For example:
In one of my file systems that has 1000 directories, each
of which has 1000 files, time to execute a recursive
ls (time ls -fR /mnt/gfs2 > /dev/null) was reduced
from 2m2.814s on a stock kernel to 0m45.938s.
diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
index 2045d70..1c89ae0 100644
--- a/fs/gfs2/dir.c
+++ b/fs/gfs2/dir.c
@@ -329,48 +329,58 @@ fail:
* gfs2_dir_get_hash_table - Get pointer to the dir hash table
* @ip: The inode in question
*
- * Returns: The hash table or an error
+ * Returns: an error code
*/
GFS2: Add readahead to sequential directory traversal
----- Original Message -----
| Well it is not called directly from NFS though, only from our getname
| function. I'd suggest we do something along the following lines:
|
| Create a new structure, something like this:
|
| struct gfs2_dir_ra_state {
| /* Readahead state variables... */
| };
|
| and then pass that to gfs2_dir_read() in addition to passing the
| offset.
| The getname call is trivial, since it always starts at the beginning
| and
| reads through sequentially, so it just needs a gfs2_dir_ra_state to
| be
| set up to take account of that. The struct gfs2_dir_ra_state can be
| embedded into the struct gfs2_file, so that gfs2_readdir() just
| passes a
| pointer to that.
|
| That will allow us to reset the readahead state upon lseek(SEEK_SET,
| 0);
| and also flag whether the accesses become non-sequential.
|
| So I think keeping the state in the struct file and passing it to our
| gfs2_file_read() function should not be too tricky,
|
| Steve.
Hi Steve,
Thanks for the feedback. How about something like this (follows):
Instead of creating a new read-ahead structure gfs2_dir_ra_state
I'm making (non-standard) use of vfs's read-ahead struct in the
file.
Regards,
Bob Peterson
Red Hat File System
--
commit 14e64999c89c7e2801d69b9e877f2fa0bf10ad70
Author: Bob Peterson <rpeterso@redhat.com>
Date: Fri Oct 7 11:55:31 2011 -0500
GFS2: Add readahead to sequential directory traversal
This patch adds read-ahead capability to GFS2's
directory hash table management. It greatly improves
performance for some directory operations. For example:
In one of my file systems that has 1000 directories, each
of which has 1000 files, time to execute a recursive
ls (time ls -fR /mnt/gfs2 > /dev/null) was reduced
from 2m2.814s on a stock kernel to 0m45.938s.
+/* gfs2_dir_readahead - Issue read-ahead requests for leaf blocks.
+ *
+ * Note: we can't calculate each index like dir_e_read can because we don't
+ * have the leaf, and therefore we don't have the depth, and therefore we
+ * don't have the length. So we have to just read enough ahead to make up
+ * for the loss of information. */
+static void gfs2_dir_readahead(struct inode *inode, unsigned hsize, u32 index,
+ struct file_ra_state *f_ra)
+{
+ struct gfs2_inode *ip = GFS2_I(inode);
+ struct gfs2_glock *gl = ip->i_gl;
+ struct buffer_head *bh;
+ u64 blocknr = 0, last;
+ unsigned count;
+
+ /* First check if we've already read-ahead for the whole range. */
+ if (!f_ra || index + MAX_RA_BLOCKS < f_ra->start)
+ return;
+
+ f_ra->start = max((pgoff_t)index, f_ra->start);
+ for (count = 0; count < MAX_RA_BLOCKS; count++) {
+ if (f_ra->start >= hsize) /* if exceeded the hash table */
+ break;
+
+ last = blocknr;
+ blocknr = be64_to_cpu(ip->i_hash_cache[f_ra->start]);
+ f_ra->start++;
+ if (blocknr == last)
+ continue;
+
+ bh = gfs2_getbuf(gl, blocknr, 1);
+ if (trylock_buffer(bh)) {
+ if (buffer_uptodate(bh)) {
+ unlock_buffer(bh);
+ brelse(bh);
+ continue;
+ }
+ bh->b_end_io = end_buffer_read_sync;
+ submit_bh(READA | REQ_META, bh);
+ continue;
+ }
+ brelse(bh);
+ }
+}
/**
* dir_e_read - Reads the entries from a directory into a filldir buffer
@@ -1388,7 +1434,7 @@ out:
*/
GFS2: Add readahead to sequential directory traversal
----- Original Message -----
| Hi,
|
| That looks much better, but why not create & pass a file_ra_state in
| the
| NFS getname function so that you don't have to check for it being
| NULL?
| Since that function always reads everything up until the name it is
| looking for, it is an ideal place to do readahead as the access will
| always be sequential.
|
| Otherwise, it looks good to me,
|
| Steve.
Hi,
There isn't a file struct passed in to the nfs get_name function, but
I can do something like the following. Is this what you mean?
GFS2: Add readahead to sequential directory traversal
----- Original Message -----
| Hi,
|
| > Hi,
| >
| > There isn't a file struct passed in to the nfs get_name function,
| > but
| > I can do something like the following. Is this what you mean?
| >
| > diff --git a/fs/gfs2/export.c b/fs/gfs2/export.c
| > index a581cd2..70ba891 100644
| > --- a/fs/gfs2/export.c
| > +++ b/fs/gfs2/export.c
| > @@ -99,6 +99,7 @@ static int gfs2_get_name(struct dentry *parent,
| > char *name,
| > struct gfs2_holder gh;
| > u64 offset = 0;
| > int error;
| > + struct file_ra_state f_ra = { .start = 0 };
| >
| > if (!dir)
| > return -EINVAL;
| > @@ -118,7 +119,7 @@ static int gfs2_get_name(struct dentry *parent,
| > char *name,
| > if (error)
| > return error;
| >
| > - error = gfs2_dir_read(dir, &offset, &gnfd, get_name_filldir,
| > NULL);
| > + error = gfs2_dir_read(dir, &offset, &gnfd, get_name_filldir,
| > &f_ra);
| >
| > gfs2_glock_dq_uninit(&gh);
| >
| > Regards,
| >
| > Bob Peterson
| > Red Hat File Systems
|
| Yes, thats the kind of things. Looks like we can use
| file_ra_state_init() since it is exported too,
|
| Steve.
Hi,
I'd rather not call file_ra_state_init in this case.
Here is my latest patch for upstream GFS2 then.
Regards,
Bob Peterson
Red Hat File Systems
--
commit 17c4434902790347b4dd6abfdc158aa5a443f91b
Author: Bob Peterson <rpeterso@redhat.com>
Date: Fri Oct 7 11:55:31 2011 -0500
GFS2: Add readahead to sequential directory traversal
This patch adds read-ahead capability to GFS2's
directory hash table management. It greatly improves
performance for some directory operations. For example:
In one of my file systems that has 1000 directories, each
of which has 1000 files, time to execute a recursive
ls (time ls -fR /mnt/gfs2 > /dev/null) was reduced
from 2m2.814s on a stock kernel to 0m45.938s.
+/* gfs2_dir_readahead - Issue read-ahead requests for leaf blocks.
+ *
+ * Note: we can't calculate each index like dir_e_read can because we don't
+ * have the leaf, and therefore we don't have the depth, and therefore we
+ * don't have the length. So we have to just read enough ahead to make up
+ * for the loss of information. */
+static void gfs2_dir_readahead(struct inode *inode, unsigned hsize, u32 index,
+ struct file_ra_state *f_ra)
+{
+ struct gfs2_inode *ip = GFS2_I(inode);
+ struct gfs2_glock *gl = ip->i_gl;
+ struct buffer_head *bh;
+ u64 blocknr = 0, last;
+ unsigned count;
+
+ /* First check if we've already read-ahead for the whole range. */
+ if (!f_ra || index + MAX_RA_BLOCKS < f_ra->start)
+ return;
+
+ f_ra->start = max((pgoff_t)index, f_ra->start);
+ for (count = 0; count < MAX_RA_BLOCKS; count++) {
+ if (f_ra->start >= hsize) /* if exceeded the hash table */
+ break;
+
+ last = blocknr;
+ blocknr = be64_to_cpu(ip->i_hash_cache[f_ra->start]);
+ f_ra->start++;
+ if (blocknr == last)
+ continue;
+
+ bh = gfs2_getbuf(gl, blocknr, 1);
+ if (trylock_buffer(bh)) {
+ if (buffer_uptodate(bh)) {
+ unlock_buffer(bh);
+ brelse(bh);
+ continue;
+ }
+ bh->b_end_io = end_buffer_read_sync;
+ submit_bh(READA | REQ_META, bh);
+ continue;
+ }
+ brelse(bh);
+ }
+}
/**
* dir_e_read - Reads the entries from a directory into a filldir buffer
@@ -1388,7 +1434,7 @@ out:
*/
GFS2: Add readahead to sequential directory traversal
From: Bob Peterson <rpeterso@redhat.com>
This patch adds read-ahead capability to GFS2's
directory hash table management. It greatly improves
performance for some directory operations. For example:
In one of my file systems that has 1000 directories, each
of which has 1000 files, time to execute a recursive
ls (time ls -fR /mnt/gfs2 > /dev/null) was reduced
from 2m2.814s on a stock kernel to 0m45.938s.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
+/* gfs2_dir_readahead - Issue read-ahead requests for leaf blocks.
+ *
+ * Note: we can't calculate each index like dir_e_read can because we don't
+ * have the leaf, and therefore we don't have the depth, and therefore we
+ * don't have the length. So we have to just read enough ahead to make up
+ * for the loss of information. */
+static void gfs2_dir_readahead(struct inode *inode, unsigned hsize, u32 index,
+ struct file_ra_state *f_ra)
+{
+ struct gfs2_inode *ip = GFS2_I(inode);
+ struct gfs2_glock *gl = ip->i_gl;
+ struct buffer_head *bh;
+ u64 blocknr = 0, last;
+ unsigned count;
+
+ /* First check if we've already read-ahead for the whole range. */
+ if (!f_ra || index + MAX_RA_BLOCKS < f_ra->start)
+ return;
+
+ f_ra->start = max((pgoff_t)index, f_ra->start);
+ for (count = 0; count < MAX_RA_BLOCKS; count++) {
+ if (f_ra->start >= hsize) /* if exceeded the hash table */
+ break;
+
+ last = blocknr;
+ blocknr = be64_to_cpu(ip->i_hash_cache[f_ra->start]);
+ f_ra->start++;
+ if (blocknr == last)
+ continue;
+
+ bh = gfs2_getbuf(gl, blocknr, 1);
+ if (trylock_buffer(bh)) {
+ if (buffer_uptodate(bh)) {
+ unlock_buffer(bh);
+ brelse(bh);
+ continue;
+ }
+ bh->b_end_io = end_buffer_read_sync;
+ submit_bh(READA | REQ_META, bh);
+ continue;
+ }
+ brelse(bh);
+ }
+}
/**
* dir_e_read - Reads the entries from a directory into a filldir buffer
@@ -1388,7 +1434,7 @@ out:
*/