12-01-2008, 09:47 PM
Emma Strubell

search functionality in emerge

True, true. Like I said, I don't really use overlays, so excuse my ignorance.

On Mon, Dec 1, 2008 at 5:17 PM, René 'Necoro' Neumann <lists@necoro.eu> wrote:

Emma Strubell wrote:
> 2) does anyone really need to search an overlay anyway?

Of course. Take large (semi-)official overlays like sunrise. They can
easily be seen as a second portage tree.

 
12-01-2008, 11:20 PM
Tambet

search functionality in emerge

2008/12/2 Emma Strubell <emma.strubell@gmail.com>
> True, true. Like I said, I don't really use overlays, so excuse my ignorance.

Do you know the order of doing things?

Rules of Optimization:
Rule 1: Don't do it.
Rule 2 (for experts only): Don't do it yet.

What this actually means: functionality comes first, readability comes next, and optimization comes last. Unless you are creating a fancy 3D engine for a kung fu game.

If you exclude overlays, you are removing functionality, and absolutely has-to-be-there functionality at that, because no one would intuitively expect the search function to search only a subset of packages, however reasonable that subset might be. So you can't, just can't, put such a search into the portage base; you could only write yet another external search tool for portage.


I looked at the code a bit:

Portage's "__init__.py" contains the comment "# search functionality". After this comment there is a nice and simple search class. It also contains the method "def action_sync(...)", which holds the synchronization code.

The search class is initialized by setting up three databases, porttree, bintree and vartree, whatever those are. Those end up in the self._dbs array, and porttree also in self._portdb.

It contains some more methods:

_findname(...) returns the result of self._portdb.findname(...) with the same parameters, or None if it does not exist.
The other methods do similar things, each mapping onto one underlying method.
execute does the real search...

Now, "for package in self.portdb.cp_all()" is important here: it currently loops over the whole portage tree, and all kinds of matching are done inside. self.portdb obviously points to porttree.py (unless it points to a fake tree).

cp_all takes all porttrees and does a simple file search inside them. This is the method that should gain an optional index lookup:

    self.porttrees = [self.porttree_root] + \
        [os.path.realpath(t) for t in self.mysettings["PORTDIR_OVERLAY"].split()]

So self.porttrees contains the list of trees: the first is the root, the others are overlays.

What you have to do will not get any harder just because overlay search is included.

You have to create a method def cp_index(self), which returns a dictionary with package names as keys. For oroot... it will be "self.porttrees[1:]" rather than "self.porttrees", so that only overlays are searched. d = {} gets replaced with d = self.cp_index(). If the index is not there, the old version is used (so keep an internal porttrees variable that contains either all trees, or all except the first).
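For illustration, a minimal sketch of what such a cp_index() pair might look like, assuming the index is a plain dict mapping "category/package" names to the trees that contain them (the method names follow the outline above; the directory walk and the cache attribute are my assumptions):

    import os

    def build_cp_index(self, overlays_only=True):
        # Walk the porttrees once and collect "category/package" keys.
        # With overlays_only we skip self.porttrees[0], the main tree.
        trees = self.porttrees[1:] if overlays_only else self.porttrees
        d = {}
        for tree in trees:
            for category in os.listdir(tree):
                cat_dir = os.path.join(tree, category)
                # crude filter: real categories look like "app-editors"
                if "-" not in category or not os.path.isdir(cat_dir):
                    continue
                for pkg in os.listdir(cat_dir):
                    if os.path.isdir(os.path.join(cat_dir, pkg)):
                        d.setdefault(category + "/" + pkg, []).append(tree)
        self._cp_index_cache = d
        return d

    def cp_index(self):
        # Return the index, or None so callers fall back to cp_all().
        return getattr(self, "_cp_index_cache", None)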


The other methods used by search are xmatch and aux_get: the first is used several times, the last one to get the description. You have to cache the results of those specific queries and make the methods use your cache; as you can see, these parts of portage already handle overlays. So again, hook in at the beginning of those functions: create index_xmatch and index_aux_get methods, make xmatch and aux_get call them first, and return their results unless they are None (or some other sentinel, in case None is already a legal result). If they return None, the old code runs and does its job. If no index has been created, the result is None. In the index_* methods, just check whether the query is one you can answer, and if it is, answer it.
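Sketched as code, the hook could be as simple as this (index_aux_get is the name proposed above; the index layout, a dict of dicts keyed by cpv, and the _old_aux_get fallback name are my assumptions):

    def index_aux_get(self, cpv, wants):
        # Answer from the index if we can; None means "old code runs".
        if getattr(self, "_index", None) is None:
            return None
        entry = self._index.get(cpv)
        if entry is None or not all(w in entry for w in wants):
            return None  # a query the index cannot answer
        return [entry[w] for w in wants]

    def aux_get(self, cpv, wants):
        result = self.index_aux_get(cpv, wants)
        if result is not None:
            return result
        return self._old_aux_get(cpv, wants)  # the existing slow path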


Obviously, the simplest way to create the index is to delete it and then use those same methods to query all the necessary information; the fastest way would be to update the index directly during sync, which you could add later.


Please also add commands to turn the index on and off (the latter should also delete it to save disk space). The default should be off until it's fast, small and reliable. Also note that if the index is kept on the hard drive, it might be faster when compressed (gzip, for example): decompressing can take less time, at the cost of more processing power, than reading the full file.
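As a sketch of the on-disk form, a gzip-compressed pickle would be one simple option (the path is invented; cPickle and gzip are both in the standard library):

    import cPickle as pickle
    import gzip

    INDEX_PATH = "/var/cache/edb/searchindex.pickle.gz"  # hypothetical

    def save_index(index):
        f = gzip.open(INDEX_PATH, "wb")
        try:
            pickle.dump(index, f, pickle.HIGHEST_PROTOCOL)
        finally:
            f.close()

    def load_index():
        try:
            f = gzip.open(INDEX_PATH, "rb")
            try:
                return pickle.load(f)
            finally:
                f.close()
        except IOError:
            return None  # no index yet: callers fall back to the old path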


Good luck!


 
12-02-2008, 01:23 AM
Emma Strubell

search functionality in emerge

yes, yes, i know, you're right :]

and thanks a bunch for the outline! about the compression, I agree that it would be a good idea, but I don't know how to implement it. not that it would be difficult... I'm guessing there's a gzip module for python that would make it pretty straightforward? I think I'm getting ahead of myself, though. I haven't even implemented the suffix tree yet!
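There is indeed a gzip module in the standard library; a compressed file is just a different open call (a toy illustration, nothing Portage-specific):

    import gzip

    # write a compressed file, then read it back
    f = gzip.open("index.gz", "wb")
    f.write("some index data")
    f.close()

    print gzip.open("index.gz", "rb").read()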


Emma

 
12-02-2008, 09:21 AM
Alec Warner

search functionality in emerge

On Mon, Dec 1, 2008 at 4:20 PM, Tambet <qtvali@gmail.com> wrote:
> [...]
> Please also add commands to turn the index on and off (the latter should
> also delete it to save disk space). The default should be off until it's
> fast, small and reliable. Also note that if the index is kept on the hard
> drive, it might be faster when compressed (gzip, for example):
> decompressing can take less time, at the cost of more processing power,
> than reading the full file.

I'm pretty sure you're mistaken here, unless your index is stored on a
floppy or something really slow.

A disk read has 2 primary costs.

Seek Time: Time for the head to seek to the sector of disk you want.
Spin Time: Time for the platter to spin around such that the sector
you want is under the read head.

Spin Time is based on rpm: an average 7200 rpm drive does 7200 / 60 seconds = 120
rotations per second, so in the worst case (you just passed the sector you
need) you wait 1/120th of a second (about 8ms).

Seek Time is per hard drive, but most drives provide average seek
times under 10ms.

So it takes on average 18ms to get to your data, then you start
reading. The index will not be that large (my esearchdb is 2 megs,
but let's assume 10MB for this compressed index).

I took a 10MB sqlite database and compressed it with gzip (default
settings) down to 5 megs.
gzip -d on the database takes 300ms; catting the decompressed database
takes 88ms (average of 5 runs, dropping disk caches between runs).

I then tried my vdb_metadata.pickle from /var/cache/edb/vdb_metadata.pickle

1.3megs compresses to 390k.

36ms to decompress the 390k file, but 26ms to read the 1.3meg file from disk.

Your index would have to be very large or very fragmented on disk
(requiring more than one seek) to see a significant gain in file
compression (gzip scales linearly).

In short, don't compress the index ;p
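For anyone who wants to reproduce numbers like these, the comparison is easy to script (a rough sketch; the file names are placeholders, and the kernel's disk caches must be dropped between runs, e.g. echo 3 > /proc/sys/vm/drop_caches as root, or you measure RAM instead of disk):

    import gzip
    import time

    def timed_read(opener, path):
        start = time.time()
        opener(path, "rb").read()
        return (time.time() - start) * 1000.0  # milliseconds

    print "plain: %.1f ms" % timed_read(open, "esearchdb.py")
    print "gzip:  %.1f ms" % timed_read(gzip.open, "esearchdb.py.gz")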

 
12-02-2008, 11:42 AM
Tambet

search functionality in emerge

About zipping: default settings might not really be a good idea; I think that "fastest" might be even better. Considering that the portage tree contains the same words again and again (like "applications"), it needs only a pretty small dictionary to become much smaller. Decompression need not mean reading from disk, decompressing and writing back to disk, as it probably did in your case; try decompressing to a memory drive and you might get better numbers.

I have personally used compression in one C++ application, and with optimal settings it made things much faster. Those were files where I had, for example, 65536 16-byte integers, which could be zeros and mostly were; I didn't care about creating a better file format, but just compressed the whole thing.

I suggest you compress the esearch db, then decompress it to a memory drive and give us those numbers; it might be considerably faster.

http://www.python.org/doc/2.5.2/lib/module-gzip.html - Python gzip support. Try gzip.open versus a normal open on the esearch db; also compress with the same lib to get the right kind of file.

Anyway, maybe this compression should be added later, and be optional.
Tambet - technique evolves to art, art evolves to magic, magic evolves to just doing.



 
12-02-2008, 12:51 PM
Tambet

search functionality in emerge

Btw, one way of doing index updates would be (sketched below):

emerge --sync:
    Deletes the index if it exists
emerge --searchindexon:
    Turns the index on
emerge --searchindexoff:
    Turns the index off
    Deletes the index if it exists
emerge -s or emerge -S:
    Does a normal search if the index is off
    If the index is on and does not exist, creates the index on the fly
    If the index is on and does exist, uses it
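Sketched as code, that flag handling could look like this (the option names follow the list above; the index location and the search helpers are invented for illustration):

    import os

    INDEX_PATH = "/var/cache/edb/searchindex.pickle.gz"  # hypothetical

    def delete_index():
        if os.path.exists(INDEX_PATH):
            os.unlink(INDEX_PATH)

    def on_sync(settings):
        delete_index()  # index is stale after a sync

    def on_searchindexon(settings):
        settings["searchindex"] = True

    def on_searchindexoff(settings):
        settings["searchindex"] = False
        delete_index()  # also reclaim the disk space

    def on_search(settings, query, normal_search, indexed_search, build_index):
        # normal_search/indexed_search/build_index stand in for the
        # existing and new code paths.
        if not settings.get("searchindex"):
            return normal_search(query)
        if not os.path.exists(INDEX_PATH):
            build_index()  # create it on the fly
        return indexed_search(query)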
Tambet - technique evolves to art, art evolves to magic, magic evolves to just doing.



 
12-02-2008, 04:42 PM
Tambet

search functionality in emerge

To remove this fuzz, maybe some USE flags would be possible: for example, a flag that makes portage use SQLite. Its database could then be used for indexing, and portage searches could be implemented as SQL queries. SQLite then does what it can to make them fast, and the disk cache keeps it in memory across consecutive searches.
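A minimal sketch of that idea with the sqlite3 module (the table layout and file name are my invention; a real version would hang this off the USE flag):

    import sqlite3

    db = sqlite3.connect("/var/cache/edb/search.sqlite")  # hypothetical path
    db.execute("""CREATE TABLE IF NOT EXISTS packages (
                      cp TEXT PRIMARY KEY,   -- category/package
                      description TEXT)""")
    db.execute("INSERT OR REPLACE INTO packages VALUES (?, ?)",
               ("app-editors/vim", "Vim, an improved vi-style text editor"))
    db.commit()

    # a search is then a plain SQL query; the OS disk cache keeps it warm
    for cp, desc in db.execute(
            "SELECT cp, description FROM packages WHERE cp LIKE ?",
            ("%vim%",)):
        print cp, desc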



Just some thoughts on compression...

http://en.wikipedia.org/wiki/Hard_disk - 1.6Gbit/s
http://www.dewassoc.com/performance/memory/memory_speeds.htm - 1.6GB/s

So the fastest hard drives are 8 times slower than the fastest memory devices they are paired with. Anyway, we live in the real world:

http://www.amazon.com/Maxtor-L01P200-Internal-7200-Drive/dp/B00007KVHZ - up to 133MB/s

and, from the memory page above, "the peak bandwidth for PC-266 DDR is 266 MHz x 8 Bytes = 2.1GB/s".

Now, be fair: if you can read 133MB/s from the hard drive, a 2MB file takes about 15ms, ignoring seek times and partitioning. In that same time you can read/write about 31.5MB of data in memory, doing various operations on it. If your compression algorithm has a dictionary of repeated words that makes the actual file 0.5MB smaller, you win about 3.8ms on the read and lose about 1ms to seeking. Processor usage might be somewhat higher, but you really don't care about processor usage when searching packages, and it's not that big.
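Spelled out with the figures quoted above:

    disk_mb_per_s = 133.0   # quoted Maxtor peak transfer rate
    file_mb = 2.0           # uncompressed index
    saved_mb = 0.5          # size knocked off by compression

    full_read_ms = file_mb / disk_mb_per_s * 1000    # ~15.0 ms
    saved_ms = saved_mb / disk_mb_per_s * 1000       # ~3.8 ms won on the read
    print "full read: %.1f ms, saved: %.1f ms" % (full_read_ms, saved_ms)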


What it really gives us is the ability to make our file format highly inefficient in terms of disk space. For example, what I mentioned before was 65536 64-bit integers (I made a mistake earlier when I said they were 16 bytes; they were 8 bytes long). Most of them started with at least 32 bits of zeros, which the compression library wiped out, but which I needed for fast operations. It doesn't matter much how the data is stored; what matters for speed is that there are repeated patterns.


I really don't know whether pickle has a good file format or not. For common pickle files, I think the compression win is noticeable. Anyway, using a database, for example, would be a clever idea, and making keyword search fully SQL-query-based makes it clearly better and removes the need to compress.


Tambet - technique evolves to art, art evolves to magic, magic evolves to just doing.



 
12-02-2008, 06:54 PM
Alec Warner

search functionality in emerge

On Tue, Dec 2, 2008 at 4:42 AM, Tambet <qtvali@gmail.com> wrote:
> About zipping: default settings might not really be a good idea; I think
> that "fastest" might be even better. Considering that the portage tree
> contains the same words again and again (like "applications"), it needs
> only a pretty small dictionary to become much smaller. Decompression need
> not mean reading from disk, decompressing and writing back to disk, as it
> probably did in your case; try decompressing to a memory drive and you
> might get better numbers.

I ran gzip -d -c file.gz > /dev/null, which should not write to disk.

I tried again with gzip -1 and it still takes 29ms to decompress, where
a bare read takes 26ms. (I have a 2.6GHz X2, which is probably relevant
to gzip decompression speed.)

>
> I have personally used compression in one C++ application, and with
> optimal settings it made things much faster. Those were files where I
> had, for example, 65536 16-byte integers, which could be zeros and
> mostly were; I didn't care about creating a better file format, but
> just compressed the whole thing.

I'm not saying compression won't make the index smaller. I'm saying
making the index smaller does not improve performance. If you have a
10 meg file and you make it 1 meg, you do not increase performance,
because (on average) you are not saving enough time reading the
smaller file: you pay it back decompressing that smaller file later.

>
> I suggest you compress the esearch db, then decompress it to a memory
> drive and give us those numbers; it might be considerably faster.

gzip -d -c esearchdb.py.gz > /dev/null (compressed with gzip -1) takes,
on average over 6 trials with caches dropped between trials, 35.17ms.

cat esearchdb.py > /dev/null (uncompressed) takes, on average over 6 trials, 24ms.

The point is you use compression when you need to save space (sending
data over the network, or storing large amounts of data or a lot of
something). The index isn't going to be big (if it is bigger than 20
or 30 meg I'll be surprised), the index isn't going over the network
and there is only 1 index, not say a million indexes (where
compression might actually be useful for some kind of LRU subset of
indexes to meet disk requirements).

Anyway, this is all moot since, as you stated so well earlier,
optimization comes last; so stop trying to do it now.

-Alec

 
12-02-2008, 08:47 PM
Tambet

search functionality in emerge

It might be that your hard drive is not that much slower than memory then, but I really doubt it... or it could mean that reading the gzip binary itself from disk is much slower than just reading with cat, and that one is highly probable; I mean the file size of gzip.

Actually, it's elementary logic that decompressing can be faster than just loading. What I personally used was much faster than without compression, but that was also a C++ application that kept the zip always in memory, and the data was highly inefficiently stored in the first place.

I suggest a test so you understand me: make a file and write the character "a" into it about 10,000,000 times, then run the same comparison on that file. I think it's likely that the resulting file will decompress really fast.
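That test is quick to run (a throwaway sketch; the file names mean nothing):

    import gzip
    import time

    # build a pathological, highly compressible file: 10,000,000 "a"s
    open("aaa.txt", "wb").write("a" * 10000000)
    f = gzip.open("aaa.txt.gz", "wb")
    f.write(open("aaa.txt", "rb").read())
    f.close()

    for name, opener, path in [("plain", open, "aaa.txt"),
                               ("gzip ", gzip.open, "aaa.txt.gz")]:
        start = time.time()
        opener(path, "rb").read()
        print name, "%.1f ms" % ((time.time() - start) * 1000.0)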


Anyway, you have still made me think that at first no zip should be used, just because your tests brought in several new variables, like the speed of reading the decompression utility itself from disk.

Tambet - technique evolves to art, art evolves to magic, magic evolves to just doing.




2008/12/2 Alec Warner <antarus@gentoo.org>

On Tue, Dec 2, 2008 at 4:42 AM, Tambet <qtvali@gmail.com> wrote:

> About zipping.. Default settings might not really be good idea - i think

> that "fastest" might be even better. Considering that portage tree contains

> same word again and again (like "applications") it needs pretty small

> dictionary to make it much smaller. Decompressing will not be reading from

> disc, decompressing and writing back to disc as in your case probably - try

> decompression to memory drive and you might get better numbers.



I ran gzip -d -c file.gz > /dev/null, which should not write to disk.



I tried again with gzip -1 and it still takes 29ms to decompress (even

with gzip -1) where a bare read takes 26ms. *(I have a 2.6Ghz X2 which

is probably relevant to gzip decompression speed)



>

> I have personally used compression in one c++ application and with optimum

> settings, it made things much faster - those were files, where i had for

> example 65536 16-byte integers, which could be zeros and mostly were; I

> didnt care about creating better file format, but just compressed the whole

> thing.



I'm not saying compression won't make the index smaller. *I'm saying

making the index smaller does not improve performance. *If you have a

10 meg file and you make it 1 meg, you do not increase performance

because you (on average) are not saving enough time reading the

smaller file; since you pay it in decompressing the smaller file

later.



>

> I suggest you to compress esearch db, then decompress it to memory drive and

> give us those numbers - might be considerably faster.



gzip -d -c esearchdb.py.gz > /dev/null (compressed with gzip -1) takes

on average (6 trials, dropped caches between trials) takes 35.1666ms



cat esearchdb.py > /dev/null (uncompressed) takes on average of 6 trials, 24ms.



The point is you use compression when you need to save space (sending

data over the network, or storing large amounts of data or a lot of

something). *The index isn't going to be big (if it is bigger than 20

or 30 meg I'll be surprised), the index isn't going over the network

and there is only 1 index, not say a million indexes (where

compression might actually be useful for some kind of LRU subset of

indexes to meet disk requirements).



Anyway this is all moot since as you stated so well earlier,

optimization comes last, so stop trying to do it now



-Alec



>

> http://www.python.org/doc/2.5.2/lib/module-gzip.html - Python gzip support.

> Try open of that and normal open on esearch db; also compress with the same

> lib to get right kind of file.

>

> Anyway - maybe this compression should be later added and optional.

>

> Tambet - technique evolves to art, art evolves to magic, magic evolves to

> just doing.

>

>

> 2008/12/2 Alec Warner <antarus@gentoo.org>

>>

>> On Mon, Dec 1, 2008 at 4:20 PM, Tambet <qtvali@gmail.com> wrote:

>> > [...]

>> >

>> > The other methods used by search are xmatch and aux_get - the first is
>> > used several times and the last is used to get the description. You
>> > have to cache the results of those specific queries and make the
>> > methods use your cache - as you can see, those parts of portage are
>> > already able to use overlays. So put your code at the beginning of
>> > those functions: create index_xmatch and index_aux_get methods, have
>> > xmatch and aux_get call them, and return their results unless they are
>> > None (or some other sentinel, in case None is already a legal result) -
>> > if they return None, the old code runs and does its job. If the index
>> > has not been created, the result is None. In the index_* methods, just
>> > check whether the query is one you can answer, and if it is, answer it.
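
(A sketch of the wrapper pattern described above - the IndexedPortdb
class, index_xmatch and the _index attribute are made up for
illustration; xmatch itself is the real portdb method being wrapped:)

    # Answer from the index when possible; otherwise fall back (sketch).
    class IndexedPortdb:
        def __init__(self, portdb, index=None):
            self._portdb = portdb
            self._index = index  # e.g. a dict loaded from disk, or None

        def index_xmatch(self, level, atom):
            # Return a cached answer iff the index can serve this query.
            if self._index is None:
                return None
            return self._index.get(("xmatch", level, atom))  # None if unknown

        def xmatch(self, level, atom):
            result = self.index_xmatch(level, atom)
            if result is not None:
                return result
            # Slow path: defer to the real portdb method.
            return self._portdb.xmatch(level, atom)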

>> >

>> > Obviously, the simplest way to create your index is to delete the old
>> > index and then use those same methods to query all the necessary
>> > information - and the fastest way would be to update the index
>> > directly during sync, which you could do later.
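
(For the rebuild step, a rough sketch - regen_index and the pickle path
are made up; cp_all, xmatch and aux_get are real portdbapi methods, but
treat the calls here as an approximation, not portage's actual code:)

    # Rebuild a package-description index by querying portage itself.
    import cPickle as pickle  # Python 2.x, as portage used at the time

    def regen_index(portdb, path="/var/cache/edb/searchindex.pickle"):
        index = {}
        for cp in portdb.cp_all():  # every category/package in all trees
            best = portdb.xmatch("bestmatch-visible", cp)
            if not best:
                continue  # masked or keyworded away
            desc = portdb.aux_get(best, ["DESCRIPTION"])[0]
            index[cp] = desc
        out = open(path, "wb")
        pickle.dump(index, out, pickle.HIGHEST_PROTOCOL)
        out.close()
        return index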

>> >

>> > Please also add commands to turn the index on and off (the latter
>> > should also delete it, to save disk space). The default should be off
>> > until it is fast, small and reliable. Also note that if the index is
>> > kept on the hard drive, it might be faster if it is compressed (gz,
>> > for example) - decompressing takes less time, and more processing
>> > power, than reading the whole file out.

>>

>> I'm pretty sure you're mistaken here, unless your index is stored on a
>> floppy or something really slow.

>>

>> A disk read has 2 primary costs.
>>
>> Seek Time: time for the head to seek to the sector of disk you want.
>> Spin Time: time for the platter to spin around such that the sector
>> you want is under the read head.
>>
>> Spin Time is based on rpm: 7200 rpm / 60 seconds = 120 rotations per
>> second, so worst case (you just passed the sector you need) you wait
>> a full rotation, 1/120th of a second (about 8.3ms).
>>
>> Seek Time varies per drive, but most drives quote average seek times
>> under 10ms.
>>
>> So it takes on average ~18ms to get to your data, and then you start
>> reading. The index will not be that large (my esearchdb is 2 megs,
>> but let's assume 10MB for this compressed index).

>>

>> I took a 10MB sqlite database and compressed it with gzip (default
>> settings) down to 5 megs. gzip -d on the database takes 300ms;
>> catting the decompressed database takes 88ms (average of 5 runs,
>> disk caches dropped between runs).
>>
>> I then tried my vdb_metadata.pickle from
>> /var/cache/edb/vdb_metadata.pickle
>>
>> 1.3 megs compresses to 390k; 36ms to decompress the 390k file,
>> versus 26ms to read the 1.3 meg file from disk.
>>
>> Your index would have to be very large or very fragmented on disk
>> (requiring more than one seek) to see a significant gain from
>> compression (gzip scales linearly).
>>
>> In short, don't compress the index ;p

>>
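
(To make the break-even point concrete, a back-of-envelope sketch - the
throughput constants and the 40% compression ratio are assumptions, not
measurements:)

    # When would a gzipped index load faster than a plain one? (sketch)
    SEEK_MS = 18.0        # avg seek + rotational latency, from above
    DISK_MBPS = 60.0      # assumed sequential read throughput
    GUNZIP_MBPS = 80.0    # assumed gzip -d output throughput

    def load_ms(size_mb, compressed_mb=None):
        if compressed_mb is None:
            return SEEK_MS + size_mb / DISK_MBPS * 1000
        # Read the smaller file, then pay the decompression cost.
        return (SEEK_MS + compressed_mb / DISK_MBPS * 1000
                + size_mb / GUNZIP_MBPS * 1000)

    for size in (1.3, 10.0, 100.0):
        plain = load_ms(size)
        gz = load_ms(size, size * 0.4)  # assume gzip shrinks to 40%
        print("%6.1f MB: plain %7.1f ms, gzipped %7.1f ms" % (size, plain, gz))

With these (assumed) rates the plain read wins at every size, which is
consistent with the measurements above.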

>> >
>> > Have luck!
>> >
>> > [...]
 
Old 02-12-2009, 06:16 PM
René 'Necoro' Neumann
 
Default search functionality in emerge

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hey,

has your project resulted in anything?

Just curious about possible nice portage additions.

Regards,
Necoro

Emma Strubell schrieb:
> Hi everyone. My name is Emma, and I am completely new to this list. I've
> been using Gentoo since 2004, including Portage of course, and before I say
> anything else I'd like to say thanks to everyone for such a kickass package
> management system!!
>
> Anyway, for my final project in my Data Structures & Algorithms class this
> semester, I would like to modify the search functionality in emerge.
> Something I've always noticed about 'emerge -s' or '-S' is that, in general,
> it takes a very long time to perform the searches. (Although, lately it does
> seem to be running faster, specifically on my laptop as opposed to my
> desktop. Strangely, though, it seems that when I do a simple 'emerge -av
> whatever' on my laptop it takes a very long time for emerge to find the
> package and/or determine the dependencies - whatever it's doing behind that
> spinner. I can definitely go into more detail about this if anyone's
> interested. It's really been puzzling me!) So, as my final project I've
> proposed to improve the time it takes to perform a search using emerge. My
> professor suggested that I look into implementing indexing.
>
> However, I've started looking at the code, and I must admit I'm pretty
> overwhelmed! I don't know where to start. I was wondering if anyone on here
> could give me a quick overview of how the search function currently works,
> an idea as to what could be modified or implemented in order to improve the
> running time of this code, or any tip really as to where I should start or
> what I should start looking at. I'd really appreciate any help or advice!!
>
> Thanks a lot, and keep on making my Debian-using professor jealous :]
> Emma
>

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkmUdZIACgkQ4UOg/zhYFuDRQQCfeVXb6uy+wBSKll4MHq54MiyX
VawAn0TWrTBVKuxAPFWpQMvvO3yED5Fs
=dBni
-----END PGP SIGNATURE-----
 
