03-09-2010, 01:36 PM
Charles Riley

Fwd: problems with large directories?

Sorry, I meant to send this to the list, not just Ric.


----- Forwarded Message -----
From: "Charles Riley" <criley@erad.com>
To: "Ric Wheeler" <rwheeler@redhat.com>
Sent: Tuesday, March 9, 2010 9:34:25 AM GMT -05:00 US/Canada Eastern
Subject: Re: problems with large directories?




----- "Ric Wheeler" <rwheeler@redhat.com> wrote:

> On 03/08/2010 08:23 PM, Mitch Trachtenberg wrote:
> > Hi,
> >
> > I have an application that deals with 100,000 to 1,000,000 image files.
> >
> > I initially structured it to use multiple directories, so that file
> > 123456 would be stored in /12/34/123456. I'm now wondering if that's
> > pointless, as it would simplify things to simply store the file in /123456.
> >
> > Can anyone indicate whether I'm gaining anything by using smaller
> > directories in ext3/ext4? Thanks.
> >
> > Mitch
>
> I think that breaking up your files into subdirectories makes it easier
> to navigate the tree and find files from a human point of view. Even
> better if the bytes reflect something like year/month/day/hour/min
> (assuming your pathname has a date-based guid or similar encoding).
>
> You can have a million files in one large directory, but be careful to
> iterate and copy them in a sorted order (sorted by inode) to avoid nasty
> performance issues that are side effects of the way we hash file names
> in ext3/4.
>
> Good luck!
>
> Ric
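As an illustrative aside: a minimal sketch of the digit-split layout Mitch describes (file 123456 stored as /12/34/123456); the base directory and the two-character components here are assumptions for illustration only:

    #include <stdio.h>
    #include <string.h>

    /* Build the nested path for a numeric file id: "123456" under
     * base "/data/images" becomes "/data/images/12/34/123456". */
    static int split_path(const char *base, const char *id,
                          char *out, size_t outlen)
    {
        if (strlen(id) < 4)          /* need two 2-char components */
            return -1;
        int n = snprintf(out, outlen, "%s/%.2s/%.2s/%s",
                         base, id, id + 2, id);
        return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
    }

    int main(void)
    {
        char path[512];
        if (split_path("/data/images", "123456", path, sizeof path) == 0)
            printf("%s\n", path);    /* /data/images/12/34/123456 */
        return 0;
    }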

Hi Ric,

Can you elaborate on the performance issues you mention above?

We use RHEL4/ext3 on our PACS (medical imaging) servers.
We ran into ext3's 32K limit on subdirectories per directory a couple of years back when our first customer hit the 31,999th study, at which point we implemented a directory hashing algorithm. Now we store the images for a given patient's study in a path something like:

aa/ab/ac/1.2.3/

where 1.2.3 is the DICOM Study Instance UID (a world-wide unique identifier for a medical study)
and aa/ab/ac/ is the directory hash we derived from that Study Instance UID.

The above is a simplified example for illustration purposes only; 1.2.3 does not really hash to aa/ab/ac/.
Within aa/ab/ac/1.2.3/ there can be anywhere from three to a couple of thousand DICOM object files.
Images are initially created in a non-hashed temporary directory and then copied to their permanent home, e.g. aa/ab/ac/1.2.3/.
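The thread does not give the real hashing algorithm, but here is a minimal sketch of one way to derive such a fixed-depth prefix from a Study Instance UID, using FNV-1a purely as a stand-in hash:

    #include <stdio.h>
    #include <stdint.h>

    /* FNV-1a string hash; a stand-in, NOT the algorithm used above. */
    static uint32_t fnv1a(const char *s)
    {
        uint32_t h = 2166136261u;
        for (; *s; s++) {
            h ^= (uint8_t)*s;
            h *= 16777619u;
        }
        return h;
    }

    /* Derive a three-level prefix such as "3f/a1/9c" from a UID by
     * rendering three bytes of the hash as two-hex-digit directories. */
    static void uid_to_prefix(const char *uid, char out[9])
    {
        uint32_t h = fnv1a(uid);
        snprintf(out, 9, "%02x/%02x/%02x",
                 (unsigned)((h >> 16) & 0xff),
                 (unsigned)((h >> 8) & 0xff),
                 (unsigned)(h & 0xff));
    }

    int main(void)
    {
        char prefix[9];
        const char *uid = "1.2.3";          /* simplified example UID */
        uid_to_prefix(uid, prefix);
        printf("%s/%s/\n", prefix, uid);    /* e.g. 3f/a1/9c/1.2.3/ */
        return 0;
    }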

In this context, would we gain filesystem performance by sorting by inode before copying?
Do the performance issues you refer to only apply to the copy process itself, or do they contribute to long-term filesystem performance?

Thanks for any insight you can provide,

Charles

 
03-10-2010, 12:51 AM
Ric Wheeler

Fwd: problems with large directories?

On 03/09/2010 09:36 AM, Charles Riley wrote:

> [...]
>
> In this context, would we gain filesystem performance by sorting by inode before copying?
> Do the performance issues you refer to only apply to the copy process itself, or do they contribute to long-term filesystem performance?




Hi Charles,

The big issue with touching a lot of files (reading, stating, unlinking them) in
ext3/4 is that readdir gives us back a list in effectively random order. This
makes the accesses very seeky.

Not an issue with a handful of files (say a couple of hundred), but when you get
to thousands (or millions) of files, performance really tanks.

To avoid that, you can sort the list returned by readdir() into ascending order
by inode in reasonably large batches and get your performance up.

Several core tools have been looking at doing this automatically, but it is
important for any home-grown applications as well.
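A minimal sketch of that approach, assuming POSIX opendir()/readdir() and a plain qsort() on the dirent d_ino field; batching and error handling are pared down for brevity:

    #include <dirent.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>

    struct entry {
        ino_t ino;
        char  name[256];
    };

    /* Compare by inode number; written to avoid signed overflow. */
    static int by_inode(const void *a, const void *b)
    {
        ino_t ia = ((const struct entry *)a)->ino;
        ino_t ib = ((const struct entry *)b)->ino;
        return (ia > ib) - (ia < ib);
    }

    int main(int argc, char **argv)
    {
        const char *dirpath = argc > 1 ? argv[1] : ".";
        DIR *d = opendir(dirpath);
        if (!d) { perror("opendir"); return 1; }

        size_t n = 0, cap = 1024;
        struct entry *list = malloc(cap * sizeof *list);
        struct dirent *de;

        /* Collect (inode, name) pairs; readdir order is effectively random. */
        while ((de = readdir(d)) != NULL) {
            if (de->d_name[0] == '.')
                continue;                    /* skip ".", "..", dotfiles */
            if (n == cap)
                list = realloc(list, (cap *= 2) * sizeof *list);
            list[n].ino = de->d_ino;
            snprintf(list[n].name, sizeof list[n].name, "%s", de->d_name);
            n++;
        }
        closedir(d);

        /* Sort ascending by inode, then stat/read/copy in this order. */
        qsort(list, n, sizeof *list, by_inode);

        for (size_t i = 0; i < n; i++)
            printf("%llu\t%s\n", (unsigned long long)list[i].ino,
                   list[i].name);

        free(list);
        return 0;
    }

Visiting the files in ascending inode order turns the otherwise scattered seeks into a mostly forward scan of the inode tables, which is where the win comes from.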


In your scenario with the directory hierarchy, I suspect that you won't hit
this. If you had one very large directory, you certainly would.


Best regards,

Ric

_______________________________________________
Ext3-users mailing list
Ext3-users@redhat.com
https://www.redhat.com/mailman/listinfo/ext3-users
 
