04-25-2010, 11:18 AM
Angelo Arrifano

Utility to find orphaned files

Hello developers developers and developers,

Ever wondered how much crap is left on your X-year-old Gentoo box?

I just developed a Python utility to efficiently find orphaned files on
the system. By orphaned files I mean files that are present in system
directories but don't belong to any installed package.

The utility builds a virtual filesystem (cache) in RAM using Python
hash tables. It then uses the cache to look up the ownership of files
inside user-specified directories.
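
A minimal sketch of the cache idea described above; it is not the attached utility. It assumes Portage's installed-package database lives under /var/db/pkg with one CONTENTS file per package, and that its lines look like `dir <path>`, `obj <path> <md5> <mtime>` and `sym <path> -> <target> <mtime>`. The function names are made up for illustration.

```python
import os


def parse_contents_line(line):
    """Return (kind, path) for one CONTENTS line, or None for a blank line."""
    line = line.rstrip("\n")
    if not line:
        return None
    kind, _, rest = line.partition(" ")
    if kind == "obj":
        # obj <path> <md5> <mtime>: split from the right, the path may hold spaces.
        return kind, rest.rsplit(" ", 2)[0]
    if kind == "sym":
        # sym <path> -> <target> <mtime>: a path containing " -> " stays ambiguous.
        return kind, rest.split(" -> ", 1)[0]
    # dir <path> and any other entry type: the rest of the line is the path.
    return kind, rest


def build_cache(vdb_root="/var/db/pkg"):
    """Map every recorded path to the category/package that owns it."""
    owned = {}
    for category in sorted(os.listdir(vdb_root)):
        cat_dir = os.path.join(vdb_root, category)
        if not os.path.isdir(cat_dir):
            continue
        for pkg in sorted(os.listdir(cat_dir)):
            contents = os.path.join(cat_dir, pkg, "CONTENTS")
            if not os.path.isfile(contents):
                continue
            with open(contents, encoding="utf-8", errors="surrogateescape") as fh:
                for line in fh:
                    entry = parse_contents_line(line)
                    if entry:
                        owned[entry[1]] = category + "/" + pkg
    return owned


def find_orphans(owned, top):
    """Yield files under top that no CONTENTS file mentions."""
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if path not in owned:
                yield path
```

The dictionary lookup is what makes the ownership check cheap: the cost of scanning a directory grows with the number of files on disk, not with the number of installed packages.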

Building the cache takes less than 10 seconds here on a system with 1366
installed packages.

This is not intended to be a finished program yet; I'm looking forward
to your constructive comments.

[Attached]

Regards,
--
Angelo Arrifano AKA MiKNiX
Gentoo Embedded/OMAP850 Developer
Linwizard Developer
http://www.gentoo.org/~miknix
http://miknix.homelinux.com
 
04-25-2010, 11:45 AM
Brian Harring

On Sun, Apr 25, 2010 at 01:18:25PM +0200, Angelo Arrifano wrote:
> Hello developers developers and developers,
>
> Ever wondered how much crap is left in your X-years old Gentoo box?
>
> I just developed a python utility to efficiently find orphaned files in
> the system. By orphaned files I mean the files that are present on
> system directories and don't belong to any installed package.
>
> The package builds a virtual filesystem (cache) on the RAM using python
> hash tables. Then it uses the cache to find the ownership of files
> inside user-specified dirs.
>
> Building the cache takes less than 10 seconds here in a system with 1366
> installed packages.
>
> This is not intended to be a finished program yet, I'm looking forward
> for your constructive commentaries.

You're going to want to do realpathing here... also you'll need to
handle symlinks, and spaces are allowed in paths. I'd personally suggest
using one of the PM APIs for this.
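
As a rough illustration of the realpath point: if a directory such as /lib is itself a symlink (to lib64, say), the same file can be recorded under one path and found on disk under another. One possible approach, sketched here with a hypothetical helper name, is to resolve symlinks in the parent directories only, so that a symlink which is itself an installed entry is not replaced by its target. The sketch assumes the path has no trailing slash.

```python
import os


def canonical(path):
    """Resolve symlinks in the parent directories, but not in the last component."""
    parent, name = os.path.split(path)
    return os.path.join(os.path.realpath(parent), name)
```

Applying the same function to both the recorded paths and the paths found on disk puts them in one namespace before they are compared.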

Part of the reason I advise poking at the PM APIs is that they cover up
some of the nastier details with CONTENTS and with parsing. A simple
example:

python -c "
import sys
from pkgcore.config import load_config
from pkgcore.fs import contents, livefs

# Collect the contents of every installed package (the vdb) into one set.
owned = contents.contentsSet()
for pkg in load_config().get_default('domain').named_repos['vdb']:
    owned.update(pkg.contents)

# Whatever is on disk under the given path but owned by no package is an orphan.
stream = (x for x in livefs.iter_scan(sys.argv[1]) if x not in owned)
print('\n'.join(map(str, sorted(stream))))
" desired-path

Note also that this was written *very* quickly. I'd personally look at
serializing the sorted lists to disk for both streams (what CONTENTS
says is on disk vs. what is actually on disk), and then walking the
lists in lockstep; that way you can keep the memory usage down.
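
The lockstep walk could look roughly like the sketch below. It is plain Python rather than pkgcore code, and the function name is invented. Both inputs must be sorted with the same ordering; they can be any iterables, including lines read lazily from two files, so neither list has to fit in memory.

```python
def only_on_disk(owned_sorted, on_disk_sorted):
    """Yield entries of on_disk_sorted that do not appear in owned_sorted.

    Both arguments are iterables sorted in the same order.
    """
    owned_iter = iter(owned_sorted)
    done = object()  # marks the end of the owned stream
    owned = next(owned_iter, done)
    for path in on_disk_sorted:
        # Advance the owned stream until it catches up with the disk stream.
        while owned is not done and owned < path:
            owned = next(owned_iter, done)
        if owned is done or owned != path:
            yield path


orphans = list(only_on_disk(["/a", "/c", "/d"], ["/a", "/b", "/c", "/e"]))
# orphans == ["/b", "/e"]
```

Each stream is consumed once, front to back, so memory use stays constant regardless of how many files are involved.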

~harring
 
04-25-2010, 01:43 PM
Daniel Pielmeier

Angelo Arrifano wrote on 25.04.2010 13:18:
> Hello developers developers and developers,
>
> Ever wondered how much crap is left in your X-years old Gentoo box?
>
> I just developed a python utility to efficiently find orphaned files in
> the system. By orphaned files I mean the files that are present on
> system directories and don't belong to any installed package.
>
> The package builds a virtual filesystem (cache) on the RAM using python
> hash tables. Then it uses the cache to find the ownership of files
> inside user-specified dirs.
>
> Building the cache takes less than 10 seconds here in a system with 1366
> installed packages.
>
> This is not intended to be a finished program yet, I'm looking forward
> for your constructive commentaries.

What about searching the complete file system but using an exclude file
listing the directories and files which should not be searched? It is tedious
to give every path on the command line. Also, if you specify /lib, for
instance, it will also search under /lib/modules, and I am sure you do not
consider everything there unneeded.
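
A possible shape for the exclude-file approach, as a sketch only. The file format is an assumption (one absolute path per line, blank lines and `#` comments ignored), not the format of the attached sample, and the function names are hypothetical.

```python
import os


def load_excludes(path):
    """Read one absolute path per line, skipping blank lines and # comments."""
    excludes = set()
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#"):
                # Drop a trailing slash so "/lib/modules/" matches "/lib/modules".
                excludes.add(line.rstrip("/") or "/")
    return excludes


def walk_files(top, excludes):
    """Yield files under top, never descending into excluded directories."""
    if top in excludes:
        return
    for dirpath, dirnames, filenames in os.walk(top):
        # Pruning dirnames in place stops os.walk from entering those directories.
        dirnames[:] = [d for d in dirnames
                       if os.path.join(dirpath, d) not in excludes]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if path not in excludes:
                yield path
```

Pruning during the walk, rather than filtering the results afterwards, means an excluded tree such as /lib/modules is never read at all.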

You also need to consider that your tool will return other false positives, like
byte-compiled Python modules and Perl header files. In general this covers
everything an ebuild adds to the file system in phases whose files are not
recorded in CONTENTS (pkg_{pre,post}inst). At that point the files are needed but
not known to the package manager. If the ebuild does not take care of these
files when the package is removed (pkg_{pre,post}rm), they remain on the
file system and are then unneeded.

I have written something in Perl, which I recently tried to reimplement in Python
(it does not yet have the same functionality as the Perl version). I am not a good
Perl or Python programmer, but it fits my needs, especially the Perl version, as I
know a bit more Perl than Python.

I attach both versions and a sample exclude file. Maybe they will be of help.

--
Daniel Pielmeier
 
04-25-2010, 03:34 PM
Yuri Vasilevski

Hello,

On Sun, 25 Apr 2010 13:18:25 +0200
Angelo Arrifano <miknix@gentoo.org> wrote:

> Hello developers developers and developers,
>
> Ever wondered how much crap is left in your X-years old Gentoo box?
>
> I just developed a python utility to efficiently find orphaned files
> in the system. By orphaned files I mean the files that are present on
> system directories and don't belong to any installed package.
>
> The package builds a virtual filesystem (cache) on the RAM using
> python hash tables. Then it uses the cache to find the ownership of
> files inside user-specified dirs.
>
> Building the cache takes less than 10 seconds here in a system with
> 1366 installed packages.
>
> This is not intended to be a finished program yet, I'm looking forward
> for your constructive commentaries.

There is a tool that does that: qfile from app-portage/portage-utils.
Check the "-o, --orphans * List orphan files" option.

It's not as straightforward as it could be, as it only checks
files specified as arguments or read from a file.

But you can trivially use it like this:
# find /dir/you/want/to/check/for/orphans | qfile -o -f -

Best,
Yuri.
 
04-25-2010, 05:10 PM
Angelo Arrifano

On 25-04-2010 17:34, Yuri Vasilevski wrote:
> Hello,
>
> On Sun, 25 Apr 2010 13:18:25 +0200
> Angelo Arrifano <miknix@gentoo.org> wrote:
>
>> Hello developers developers and developers,
>>
>> Ever wondered how much crap is left in your X-years old Gentoo box?
>>
>> I just developed a python utility to efficiently find orphaned files
>> in the system. By orphaned files I mean the files that are present on
>> system directories and don't belong to any installed package.
>>
>> The package builds a virtual filesystem (cache) on the RAM using
>> python hash tables. Then it uses the cache to find the ownership of
>> files inside user-specified dirs.
>>
>> Building the cache takes less than 10 seconds here in a system with
>> 1366 installed packages.
>>
>> This is not intended to be a finished program yet, I'm looking forward
>> for your constructive commentaries.
>
> There is a tool that does that, qfile from app-portage/portage-utils.
> Check the "-o, --orphans * List orphan files" option.
>
> It's not as straight forward as it could be, as it checks only for
> files specified as arguments or read from file.
>
> But you can trivially use it like:
> # find /dir/you/want/to/check/for/orphans | qfile -o -f -
>
> Best,
> Yuri.
>

Based on the comments so far, I'll try to make my PoC a better tool.
My primary objective is to make this some kind of disk cleanup utility
for Gentoo boxes. I don't expect Gentoo systems to be *that* polluted,
but sometimes we all have to do ugly things to fix broken systems really
fast, if you know what I mean.

Other things came to mind too, like using stored hashes to
check the integrity of system files (as in security).
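
One way the integrity idea might be sketched, assuming the stored hashes are the MD5 digests that Portage records for `obj` entries in CONTENTS; the helper names are hypothetical. MD5 is fine for spotting accidental changes but is not collision resistant, and anyone able to modify system files can usually modify the recorded digests too, so this is a consistency check rather than a real security measure.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_modified(path, recorded_md5):
    """True if the file on disk no longer matches the recorded digest."""
    return md5_of(path) != recorded_md5.lower()
```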

My next steps for this utility will be:
* Follow harring's suggestion and use an available PM API.
* Make the application handle symlinks so we start getting more
informative output.
* Store the generated cache on disk and only regenerate it when needed.
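
For the last item, one possible staleness test is to compare the cache file's modification time with the newest directory mtime in the package database. This is a sketch under the assumption that every install or removal changes the mtime of a directory under the database root; `load_or_build` and `newest_mtime` are invented names, and `build` stands for whatever function produces a JSON-serializable cache.

```python
import json
import os


def newest_mtime(vdb_root):
    """Newest mtime among vdb_root and all directories below it."""
    newest = os.stat(vdb_root).st_mtime
    for dirpath, dirnames, _filenames in os.walk(vdb_root):
        for name in dirnames:
            newest = max(newest, os.stat(os.path.join(dirpath, name)).st_mtime)
    return newest


def load_or_build(cache_path, vdb_root, build):
    """Return the cached data if it is still fresh, else rebuild and store it."""
    try:
        if os.stat(cache_path).st_mtime >= newest_mtime(vdb_root):
            with open(cache_path) as fh:
                return json.load(fh)
    except (OSError, ValueError):
        pass  # missing or unreadable cache: fall through and rebuild
    cache = build(vdb_root)
    tmp_path = cache_path + ".tmp"
    with open(tmp_path, "w") as fh:
        json.dump(cache, fh)
    os.replace(tmp_path, cache_path)  # atomic swap, readers never see a partial file
    return cache
```

Comparing timestamps with `>=` can miss a change made in the same clock tick as the cache write; a stricter variant would store a checksum of the directory listing alongside the cache.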

Regards,
- Angelo
 
04-25-2010, 05:43 PM
Benedikt Böhm

On Sun, Apr 25, 2010 at 1:18 PM, Angelo Arrifano <miknix@gentoo.org> wrote:
> Hello developers developers and developers,
>
> Ever wondered how much crap is left in your X-years old Gentoo box?
>
> I just developed a python utility to efficiently find orphaned files in
> the system. By orphaned files I mean the files that are present on
> system directories and don't belong to any installed package.
>
> The package builds a virtual filesystem (cache) on the RAM using python
> hash tables. Then it uses the cache to find the ownership of files
> inside user-specified dirs.
>
> Building the cache takes less than 10 seconds here in a system with 1366
> installed packages.
>
> This is not intended to be a finished program yet, I'm looking forward
> for your constructive commentaries.

I refactored findcruft (search the forums) two years ago (see
http://git.xnull.de/cgit/findcruft2/). Maybe you can take a look at
it, especially the false-positive handling.

HTH,
Bene
 
04-30-2010, 04:24 PM
Enrico Weigelt

* Daniel Pielmeier <billie@gentoo.org> wrote:

> What about searching the complete file system but using an exclude file where
> you can put directories and files which should not be searched. It is tedious to
> tell every path on the command-line. Also for instance if you specify /lib it
> will also search under /lib/modules and I am sure you do not consider all
> contents there as unneeded.

Hmm, perhaps there's some way to assign these files to some package?

> You also need to consider that your tool will return other false positives like
> byte compiled python modules and perl header files. In general everything an
> ebuild does in phases where it adds files to file-system but files are not
> stored to CONTENTS (pkg_{pre,post}inst). At this point the files are needed but
> not recognized by the package manager. If the ebuild does not take care of this
> files when removing (pkg_{pre,post}rm) the package they will remain on the
> file-system and are now unneeded.

Assuming these files are not optional/temporary (i.e. ones that can be regenerated
on the fly), I see a generic design problem here: everything belonging to some
package (excluding content data and configs, of course) should be assigned
to the package.

The big question: how can we achieve this?


cu
--
---------------------------------------------------------------------
Enrico Weigelt == metux IT service - http://www.metux.de/
---------------------------------------------------------------------
Please visit the OpenSource QM Taskforce:
http://wiki.metux.de/public/OpenSource_QM_Taskforce
Patches / Fixes for a lot dozens of packages in dozens of versions:
http://patches.metux.de/
---------------------------------------------------------------------
 
05-03-2010, 01:34 PM
Peter Hjalmarsson

On Fri, 2010-04-30 at 18:24 +0200, Enrico Weigelt wrote:
> * Daniel Pielmeier <billie@gentoo.org> wrote:
>
> > What about searching the complete file system but using an exclude file where
> > you can put directories and files which should not be searched. It is tedious to
> > tell every path on the command-line. Also for instance if you specify /lib it
> > will also search under /lib/modules and I am sure you do not consider all
> > contents there as unneeded.
>
> hmm, perhaps there's some way to assign these files to some package ?
>

Eh, no, and there should not be, since the files in that directory are kernel
modules, and most of them are created by "cd /usr/src/linux &&
make" or genkernel or something similar, and it is supposed to be that way.
Looking at the contents of that directory, it is pretty easy to see whether a
directory there should be left alone or removed (as there is just one
directory per kernel: no longer running a kernel? Remove
the corresponding dir).
It is better to have the script not touch that directory at all, or at
most point out: "The directory contains directories for more kernels than
the one currently running (i.e. there is more than one dir) and it is
THIS big in total. You may want to check whether you have files from
older kernels that you no longer need."
That would leave it up to the user to figure out which kernel modules to
keep and what kernel to pount. Or you suggest autocleaning of /boot
and /usr/src/linux-* as well? Dangerous!
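
A report of that kind, which only lists and never deletes, might look like the sketch below. It assumes one directory per kernel release under /lib/modules and uses the running kernel's release string as the name of the directory to skip; the function names are made up for illustration.

```python
import os


def tree_size(top):
    """Total size in bytes of all files under top (symlinks are not followed)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            total += os.lstat(os.path.join(dirpath, name)).st_size
    return total


def other_kernel_dirs(modules_root="/lib/modules", running=None):
    """Return (release, size) pairs for module dirs not used by the running kernel."""
    if running is None:
        running = os.uname().release
    report = []
    for name in sorted(os.listdir(modules_root)):
        path = os.path.join(modules_root, name)
        if name != running and os.path.isdir(path) and not os.path.islink(path):
            report.append((name, tree_size(path)))
    return report
```

Printing the result and leaving the decision to the user keeps the tool on the safe side of the concern raised here.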
 
05-11-2010, 01:08 PM
Angelo Arrifano

On 03-05-2010 15:34, Peter Hjalmarsson wrote:
> On Fri, 2010-04-30 at 18:24 +0200, Enrico Weigelt wrote:
>> * Daniel Pielmeier <billie@gentoo.org> wrote:
>>
>>> What about searching the complete file system but using an exclude file where
>>> you can put directories and files which should not be searched. It is tedious to
>>> tell every path on the command-line. Also for instance if you specify /lib it
>>> will also search under /lib/modules and I am sure you do not consider all
>>> contents there as unneeded.
>>
>> hmm, perhaps there's some way to assign these files to some package ?
>>
>
> Eh, no and it should not be since files in that directory is kernel
> modules, and most of the files there is created by "cd /usr/src/linux &&
> make" or genkernel or something alike and it is supposed to be that way.

Indeed. /lib/firmware is another candidate.

> Looking at the contents of that directory is pretty easy to see if a
> directory there should be left alone or removed (as there is just one
> directory per kernel. not any longer running a kernel anymore? remove
> the corresponding dir).

That is dangerous. For example, I always keep the previous 2 kernels
just in case I detect some problem with the latest and I need to quickly
go back.

> It is better to have the script not tuch that directory at all or at
> most point out "the directory contains directories for more kernels then
> the currently running (i.e. there is more then one dir) and it is
> totally THIS big.

Sounds like a plan.

> You may want to take a look if you have files from
> older kernels that you do not longer need."
> That would leave up to the user to figure out what kernel modules to
> keep and what kernel to pount. Or you suggest autocleaning of /boot
> and /usr/src/linux-* as well? Dangerous!
>
>
>

I'm seeing that there is enough interest (including mine) in such a utility.
Since it is difficult to please everyone at the start, I'll first open a
project page on sf.net and develop a more powerful PoC that matches my
ideas. There were a lot of good ideas and observations here, so keep them
coming; I'll certainly read them.

When, and only if, the thing grows to a more mature state, I'll try to
open a Gentoo project by the appropriate means.

I don't have much free time lately, so I can't promise anything. But
as long as my interest in it doesn't die, I'll slowly keep working on it.

Regards,
- Angelo
 
