 
Old 08-14-2012, 01:54 PM
Daniel Troeder
 
Fast file system for cache directory with lots of files

On 14.08.2012 11:46, Neil Bothwick wrote:
> On Tue, 14 Aug 2012 10:21:54 +0200, Daniel Troeder wrote:
>
>> There is also the possibility to write a really small daemon (less than
>> 50 lines of C) that registers with inotify for the entire fs and
>> journals the file activity to a sqlite-db.
>
> sys-process/incron ?
Uh... didn't know that one! ... very interesting

Have you used it?
How does it perform if there are lots of modifications going on?
Does it have a throttle against fork bombing?
must-read-myself-a-little.....

An incron line like
# sqlite3 /file.sql 'INSERT INTO table VALUES (filename, date)'
would be inefficient, because it spawns lots of processes, but it would
be very nice to simply test out the idea. Then a
# sqlite3 /file.sql 'SELECT filename FROM table WHERE date < date-30days ORDER BY date'
or something to get the files older than 30 days, and voilà.
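
For illustration, a hypothetical incrontab line plus helper script in
that spirit (the watch path, database path, table layout and helper name
are all made up; incron expands $@ to the watched directory and $# to
the event's file name):

/data/cache IN_CREATE /usr/local/bin/log-file-event $@/$#

#!/bin/bash
# /usr/local/bin/log-file-event (hypothetical helper): record the path
# incron hands us, with the current epoch time, in an SQLite table.
# Note: naive quoting; a filename containing ' would break the statement.
sqlite3 /var/lib/cachedb.db "INSERT INTO files VALUES ('$1', strftime('%s','now'))"

This still forks one sqlite3 process per event, so as said above it is
only good enough to test the idea.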
 
Old 08-14-2012, 02:00 PM
Florian Philipp
 
Fast file system for cache directory with lots of files

On 13.08.2012 20:18, Michael Hampicke wrote:
> On 13.08.2012 19:14, Florian Philipp wrote:
>> On 13.08.2012 16:52, Michael Mol wrote:
>>> On Mon, Aug 13, 2012 at 10:42 AM, Michael Hampicke
>>> <mgehampicke@gmail.com <mailto:mgehampicke@gmail.com>> wrote:
>>>
>>> Have you indexed your ext4 partition?
>>>
>>> # tune2fs -O dir_index /dev/your_partition
>>> # e2fsck -D /dev/your_partition
>>>
>>> Hi, the dir_index is active. I guess that's why delete operations
>>> take as long as they take (index has to be updated every time)
>>>
>>>
>>> 1) Scan for files to remove
>>> 2) disable index
>>> 3) Remove files
>>> 4) enable index
>>>
>>> ?
>>>
>>> --
>>> :wq
>>
>> Other things to think about:
>>
>> 1. Play around with data=journal/writeback/ordered. IIRC, data=journal
>> actually used to improve performance depending on the workload as it
>> delays random IO in favor of sequential IO (when updating the journal).
>>
>> 2. Increase the journal size.
>>
>> 3. Take a look at `man 1 chattr`. Especially the 'T' attribute. Of
>> course this only helps after re-allocating everything.
>>
>> 4. Try parallelizing. Ext4 requires relatively few locks nowadays (since
>> 2.6.39 IIRC). For example:
>> find $TOP_DIR -mindepth 1 -maxdepth 1 -print0 |
>> xargs -0 -n 1 -r -P 4 -I '{}' find '{}' -type f
>>
>> 5. Use a separate device for the journal.
>>
>> 6. Temporarily deactivate the journal with tune2fs similar to MM's idea.
>>
>> Regards,
>> Florian Philipp
>>
>
> Trying out different journals-/options was already on my list, but the
> manpage on chattr regarding the T attribute is an interesting read.
> Definitely worth trying.
>
> Parallelizing multiple finds was something I already did, but the only
> thing that increased was the IO wait. But now, having read all the
> suggestions in this thread, I might try it again.
>
> Separate device for the journal is a good idea, but not possible atm
> (machine is abroad in a data center)
>

Something else I just remembered. I guess it doesn't help you with your
current problem but it might come in handy when working with such large
cache dirs: I once wrote a script that sorts files by their starting
physical block. This improved reading them quite a bit (2 minutes
instead of 11 minutes for copying the portage tree).

It's a terrible kludge, will probably fail when crossing FS boundaries or
hitting a thousand other oddities, and requires root for some very scary
programs. I never had the time to finish an improved C version. Anyway,
maybe it helps you:

#!/bin/bash
#
# Example below copies /usr/portage/* to /tmp/portage.
# Replace /usr/portage with the input directory.
# Replace `cpio` with whatever does the actual work. Input is a
# \0-delimited file list.
#
FIFO=/tmp/$(uuidgen).fifo
mkfifo "$FIFO"
# find writes one debugfs "bmap" command per file (looking up block 0 of
# the file's inode) into the FIFO, and a \0-delimited file list to stdout.
find /usr/portage -type f -fprintf "$FIFO" 'bmap <%i> 0\n' -print0 |
tr '\0' '\n' |
# paste pairs each file's starting physical block (from debugfs) with
# its name, so the list can be sorted by on-disk position.
paste <(
debugfs -f "$FIFO" /dev/mapper/vg-portage |
grep -E '^[[:digit:]]+'
) - |
sort -k 1,1n |
cut -f 2- |
tr '\n' '\0' |
cpio -p0 --make-directories /tmp/portage/
unlink "$FIFO"
 
Old 08-14-2012, 03:09 PM
Florian Philipp
 
Fast file system for cache directory with lots of files

On 14.08.2012 15:54, Daniel Troeder wrote:
> On 14.08.2012 11:46, Neil Bothwick wrote:
>> On Tue, 14 Aug 2012 10:21:54 +0200, Daniel Troeder wrote:
>>
>>> There is also the possibility to write a really small daemon (less than
>>> 50 lines of C) that registers with inotify for the entire fs and
>>> journals the file activity to a sqlite-db.
>>
>> sys-process/incron ?
> Uh... didn't know that one! ... very interesting
>
> Have you used it?
> How does it perform if there are lots of modifications going on?
> Does it have a throttle against fork bombing?
> must-read-myself-a-little.....
>
> An incron line like
> # sqlite3 /file.sql 'INSERT INTO table VALUES (filename, date)'
> would be inefficient, because it spawns lots of processes, but it would
> be very nice to simply test out the idea. Then a
> # sqlite3 /file.sql 'SELECT filename FROM table WHERE date < date-30days ORDER BY date'
> or something to get the files older than 30 days, and voilà.
>
>

Maybe inotifywait is better for this kind of batch job.

Collecting events:
inotifywait -rm -e create,delete --timefmt '%s' \
--format "$(printf '%%T %%e %%w%%f')" /tmp > events.tbl
# the printf is there because inotifywait's format does not
# recognize common escapes like '\n'
# Output format:
# <seconds since epoch> <CREATE/DELETE> <file name>
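
For example, a create followed by a delete of the same (hypothetical)
file would land in events.tbl as:

1344958467 CREATE /tmp/cache/f1
1344958470 DELETE /tmp/cache/f1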


Filtering events:
sort --stable -k3 events.tbl |
awk '
function update() {
  line=$0; exists = ($2=="DELETE") ? 0 : 1; file=$3
}
NR==1 { update(); next }
{ if($3!=file && exists==1){ print line } update() }
END { if(NR>0 && exists==1) print line }'
# Sorts by file name while preserving temporal order (stable sort).
# Uses awk to suppress output of files that have been deleted.
# The END block flushes the record for the last file in the input.
# Output: last CREATE event for each existing file

Retrieving files created 30+ days ago:
awk -v newest=$(date -d -5seconds +%s) '
$1>newest{ next }
{ print $3 }'
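
Putting the three steps together, an untested end-to-end sketch of a
cleanup run (events.tbl and the cutoff are the placeholders from above,
and the final rm is destructive, so dry-run with echo first):

sort --stable -k3 events.tbl |
awk '
function update() {
  line=$0; exists = ($2=="DELETE") ? 0 : 1; file=$3
}
NR==1 { update(); next }
{ if($3!=file && exists==1){ print line } update() }
END { if(NR>0 && exists==1) print line }' |
awk -v newest=$(date -d -30days +%s) '
$1>newest{ next }
{ print $3 }' |
xargs -d '\n' rm -f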

Remarks:

The awk scripts need some improvement if you have to handle whitespace
in filenames, but with this input format they should be able to work with
everything except newlines.

inotifywait itself is utterly useless when dealing with newlines in file
names unless you want to put some serious effort into sanitizing the output.

Regards,
Florian Philipp
 
Old 08-14-2012, 03:33 PM
Florian Philipp
 
Fast file system for cache directory with lots of files

On 14.08.2012 17:09, Florian Philipp wrote:
>
> Retrieving files created 30+ days ago:
> awk -v newest=$(date -d -5seconds +%s) '
> $1>newest{ next }
> { print $3 }'
>

s/-5seconds/-30days/
 
Old 08-14-2012, 04:36 PM
Helmut Jarausch
 
Fast file system for cache directory with lots of files

On 08/14/2012 04:07:39 AM, Adam Carter wrote:

> > I think btrfs probably is meant to provide a lot of the modern
> > features like reiser4 or xfs
>
> Unfortunately btrfs is still generally slower than ext4, for example.
> Check out http://openbenchmarking.org/, e.g.
> http://openbenchmarking.org/s/ext4%20btrfs
>
> The OS will use any spare RAM for disk caching, so if there's not much
> else running on that box, most of your content will be served from
> RAM. It may be that whatever fs you choose won't make that much of a
> difference anyway.



If one can run a recent kernel (3.5.x), btrfs seems quite stable (it's
used by some distributions and by Oracle for real work).
Most benchmarks don't use compression since other filesystems can't use
it, but that's unfair. With compression one needs to read much less data
(my /usr partition takes less than 50% of the space it needed on ext4;
savings on the root partition are even higher).


I'm using the mount options
compress=lzo,noacl,noatime,autodefrag,space_cache, which require a
recent kernel.
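
For reference, a hypothetical /etc/fstab entry using those options
(device and mount point are placeholders):

/dev/sda3  /data  btrfs  compress=lzo,noacl,noatime,autodefrag,space_cache  0  0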


I'd give it a try.

Helmut.
 
Old 08-14-2012, 05:05 PM
Pandu Poluan
 
Fast file system for cache directory with lots of files

On Aug 14, 2012 11:42 PM, "Helmut Jarausch" <jarausch@igpm.rwth-aachen.de> wrote:
>
> On 08/14/2012 04:07:39 AM, Adam Carter wrote:
>>
>> > I think btrfs probably is meant to provide a lot of the modern
>> > features like reiser4 or xfs
>>
>> Unfortunately btrfs is still generally slower than ext4, for example.
>> Check out http://openbenchmarking.org/, e.g.
>> http://openbenchmarking.org/s/ext4%20btrfs
>>
>> The OS will use any spare RAM for disk caching, so if there's not much
>> else running on that box, most of your content will be served from
>> RAM. It may be that whatever fs you choose won't make that much of a
>> difference anyway.
>
> If one can run a recent kernel (3.5.x), btrfs seems quite stable (it's
> used by some distributions and by Oracle for real work).
> Most benchmarks don't use compression since other filesystems can't use
> it, but that's unfair. With compression one needs to read much less data
> (my /usr partition takes less than 50% of the space it needed on ext4;
> savings on the root partition are even higher).
>
> I'm using the mount options
> compress=lzo,noacl,noatime,autodefrag,space_cache, which require a
> recent kernel.
>
> I'd give it a try.
>
> Helmut.

Are the support tools for btrfs (fsck, defrag, etc.) already complete?

If so, I certainly would like to take it out for a spin...

Rgds,
 
Old 08-14-2012, 05:21 PM
Jason Weisberger
 
Fast file system for cache directory with lots of files

Sure, but wouldn't compression make write operations slower? And isn't he looking for performance?

On Aug 14, 2012 1:14 PM, "Pandu Poluan" <pandu@poluan.info> wrote:

> On Aug 14, 2012 11:42 PM, "Helmut Jarausch" <jarausch@igpm.rwth-aachen.de> wrote:
>>
>> On 08/14/2012 04:07:39 AM, Adam Carter wrote:
>>>
>>> > I think btrfs probably is meant to provide a lot of the modern
>>> > features like reiser4 or xfs
>>>
>>> Unfortunately btrfs is still generally slower than ext4, for example.
>>> Check out http://openbenchmarking.org/, e.g.
>>> http://openbenchmarking.org/s/ext4%20btrfs
>>>
>>> The OS will use any spare RAM for disk caching, so if there's not much
>>> else running on that box, most of your content will be served from
>>> RAM. It may be that whatever fs you choose won't make that much of a
>>> difference anyway.
>>
>> If one can run a recent kernel (3.5.x), btrfs seems quite stable (it's
>> used by some distributions and by Oracle for real work).
>> Most benchmarks don't use compression since other filesystems can't use
>> it, but that's unfair. With compression one needs to read much less data
>> (my /usr partition takes less than 50% of the space it needed on ext4;
>> savings on the root partition are even higher).
>>
>> I'm using the mount options
>> compress=lzo,noacl,noatime,autodefrag,space_cache, which require a
>> recent kernel.
>>
>> I'd give it a try.
>>
>> Helmut.
>
> Are the support tools for btrfs (fsck, defrag, etc.) already complete?
>
> If so, I certainly would like to take it out for a spin...
>
> Rgds,
 
Old 08-14-2012, 05:42 PM
Volker Armin Hemmann
 
Fast file system for cache directory with lots of files

On Tuesday, 14 August 2012, 13:21:35, Jason Weisberger wrote:
> Sure, but wouldn't compression make write operations slower? And isn't he
> looking for performance?

Not really, as long as the CPU can compress faster than the disk can
write.

More interesting: is btrfs trying to be smart and only compressing
compressible stuff?

--
#163933
 
Old 08-14-2012, 05:42 PM
Michael Hampicke
 
Fast file system for cache directory with lots of files

On 14.08.2012 16:00, Florian Philipp wrote:
> On 13.08.2012 20:18, Michael Hampicke wrote:
>> On 13.08.2012 19:14, Florian Philipp wrote:
>>> On 13.08.2012 16:52, Michael Mol wrote:
>>>> On Mon, Aug 13, 2012 at 10:42 AM, Michael Hampicke
>>>> <mgehampicke@gmail.com <mailto:mgehampicke@gmail.com>> wrote:
>>>>
>>>> Have you indexed your ext4 partition?
>>>>
>>>> # tune2fs -O dir_index /dev/your_partition
>>>> # e2fsck -D /dev/your_partition
>>>>
>>>> Hi, the dir_index is active. I guess that's why delete operations
>>>> take as long as they take (index has to be updated every time)
>>>>
>>>>
>>>> 1) Scan for files to remove
>>>> 2) disable index
>>>> 3) Remove files
>>>> 4) enable index
>>>>
>>>> ?
>>>>
>>>> --
>>>> :wq
>>>
>>> Other things to think about:
>>>
>>> 1. Play around with data=journal/writeback/ordered. IIRC, data=journal
>>> actually used to improve performance depending on the workload as it
>>> delays random IO in favor of sequential IO (when updating the journal).
>>>
>>> 2. Increase the journal size.
>>>
>>> 3. Take a look at `man 1 chattr`. Especially the 'T' attribute. Of
>>> course this only helps after re-allocating everything.
>>>
>>> 4. Try parallelizing. Ext4 requires relatively few locks nowadays (since
>>> 2.6.39 IIRC). For example:
>>> find $TOP_DIR -mindepth 1 -maxdepth 1 -print0 |
>>> xargs -0 -n 1 -r -P 4 -I '{}' find '{}' -type f
>>>
>>> 5. Use a separate device for the journal.
>>>
>>> 6. Temporarily deactivate the journal with tune2fs similar to MM's idea.
>>>
>>> Regards,
>>> Florian Philipp
>>>
>>
>> Trying out different journals-/options was already on my list, but the
>> manpage on chattr regarding the T attribute is an interesting read.
>> Definitely worth trying.
>>
>> Parallelizing multiple finds was something I already did, but the only
>> thing that increased was the IO wait. But now, having read all the
>> suggestions in this thread, I might try it again.
>>
>> Separate device for the journal is a good idea, but not possible atm
>> (machine is abroad in a data center)
>>
>
> Something else I just remembered. I guess it doesn't help you with your
> current problem but it might come in handy when working with such large
> cache dirs: I once wrote a script that sorts files by their starting
> physical block. This improved reading them quite a bit (2 minutes
> instead of 11 minutes for copying the portage tree).
>
> It's a terrible kludge, will probably fail when crossing FS boundaries or
> hitting a thousand other oddities, and requires root for some very scary
> programs. I never had the time to finish an improved C version. Anyway,
> maybe it helps you:
>
> #!/bin/bash
> #
> # Example below copies /usr/portage/* to /tmp/portage.
> # Replace /usr/portage with the input directory.
> # Replace `cpio` with whatever does the actual work. Input is a
> # \0-delimited file list.
> #
> FIFO=/tmp/$(uuidgen).fifo
> mkfifo "$FIFO"
> find /usr/portage -type f -fprintf "$FIFO" 'bmap <%i> 0\n' -print0 |
> tr '\0' '\n' |
> paste <(
> debugfs -f "$FIFO" /dev/mapper/vg-portage |
> grep -E '^[[:digit:]]+'
> ) - |
> sort -k 1,1n |
> cut -f 2- |
> tr '\n' '\0' |
> cpio -p0 --make-directories /tmp/portage/
> unlink "$FIFO"
>

No, I don't think that's practicable with the number of files in my
setup. To be honest, currently I am quite happy with the performance of
btrfs. Running through the directory tree only takes 1/10th of the time
it took with ext4, and deletes are pretty fast as well. I'm sure there's
still room for more improvement, but right now it's much better than it
was before.
 
Old 08-14-2012, 05:42 PM
Volker Armin Hemmann
 
Fast file system for cache directory with lots of files

On Wednesday, 15 August 2012, 00:05:40, Pandu Poluan wrote:

>
> Are the support tools for btrfs (fsck, defrag, etc.) already complete?

no

--
#163933
 
