FAQ Search Today's Posts Mark Forums Read
» Video Reviews

» Linux Archive

Linux-archive is a website aiming to archive linux email lists and to make them easily accessible for linux users/developers.


» Sponsor

» Partners

» Sponsor

Go Back   Linux Archive > Gentoo > Gentoo Portage Developer

 
 
LinkBack Thread Tools
 
Old 11-23-2008, 11:17 AM
"Emma Strubell"
 
Default search functionality in emerge

Hi everyone. My name is Emma, and I am completely new to this list. I've been using Gentoo since 2004, including Portage of course, and before I say anything else I'd like to say thanks to everyone for such a kickass package management system!!


Anyway, for my final project in my Data Structures & Algorithms class this semester, I would like to modify the search functionality in emerge. Something I've always noticed about 'emerge -s' or '-S' is that, in general, it takes a very long time to perform the searches. (Although, lately it does seem to be running faster, specifically on my laptop as opposed to my desktop. Strangely, though, it seems that when I do a simple 'emerge -av whatever' on my laptop it takes a very long time for emerge to find the package and/or determine the dependecies -* whatever it's doing behind that spinner. I can definitely go into more detail about this if anyone's interested. It's really been puzzling me!) So, as my final project I've proposed to improve the time it takes to perform a search using emerge. My professor suggested that I look into implementing indexing.


However, I've started looking at the code, and I must admit I'm pretty overwhelmed! I don't know where to start. I was wondering if anyone on here could give me a quick overview of how the search function currently works, an idea as to what could be modified or implemented in order to improve the running time of this code, or any tip really as to where I should start or what I should start looking at. I'd really appreciate any help or advice!!


Thanks a lot, and keep on making my Debian-using professor jealous :]
Emma
 
Old 11-23-2008, 01:01 PM
tvali
 
Default search functionality in emerge

Try esearch.

emerge esearch
esearch ...

2008/11/23 Emma Strubell <emma.strubell@gmail.com>

Hi everyone. My name is Emma, and I am completely new to this list. I've been using Gentoo since 2004, including Portage of course, and before I say anything else I'd like to say thanks to everyone for such a kickass package management system!!



Anyway, for my final project in my Data Structures & Algorithms class this semester, I would like to modify the search functionality in emerge. Something I've always noticed about 'emerge -s' or '-S' is that, in general, it takes a very long time to perform the searches. (Although, lately it does seem to be running faster, specifically on my laptop as opposed to my desktop. Strangely, though, it seems that when I do a simple 'emerge -av whatever' on my laptop it takes a very long time for emerge to find the package and/or determine the dependecies -* whatever it's doing behind that spinner. I can definitely go into more detail about this if anyone's interested. It's really been puzzling me!) So, as my final project I've proposed to improve the time it takes to perform a search using emerge. My professor suggested that I look into implementing indexing.



However, I've started looking at the code, and I must admit I'm pretty overwhelmed! I don't know where to start. I was wondering if anyone on here could give me a quick overview of how the search function currently works, an idea as to what could be modified or implemented in order to improve the running time of this code, or any tip really as to where I should start or what I should start looking at. I'd really appreciate any help or advice!!



Thanks a lot, and keep on making my Debian-using professor jealous :]
Emma



--
tvali

Kuskilt foorumist: http://www.cooltests.com - kui inglise keelt oskad. Muide, üle 120 oled väga tark, üle 140 oled geenius, mingi 170 oled ju mingi täica pea nagu prügikast...
 
Old 11-23-2008, 01:33 PM
Pacho Ramos
 
Default search functionality in emerge

El dom, 23-11-2008 a las 16:01 +0200, tvali escribió:
> Try esearch.
>
> emerge esearch
> esearch ...
>
> 2008/11/23 Emma Strubell <emma.strubell@gmail.com>
> Hi everyone. My name is Emma, and I am completely new to this
> list. I've been using Gentoo since 2004, including Portage of
> course, and before I say anything else I'd like to say thanks
> to everyone for such a kickass package management system!!
>
> Anyway, for my final project in my Data Structures &
> Algorithms class this semester, I would like to modify the
> search functionality in emerge. Something I've always noticed
> about 'emerge -s' or '-S' is that, in general, it takes a very
> long time to perform the searches. (Although, lately it does
> seem to be running faster, specifically on my laptop as
> opposed to my desktop. Strangely, though, it seems that when I
> do a simple 'emerge -av whatever' on my laptop it takes a very
> long time for emerge to find the package and/or determine the
> dependecies - whatever it's doing behind that spinner. I can
> definitely go into more detail about this if anyone's
> interested. It's really been puzzling me!) So, as my final
> project I've proposed to improve the time it takes to perform
> a search using emerge. My professor suggested that I look into
> implementing indexing.
>
> However, I've started looking at the code, and I must admit
> I'm pretty overwhelmed! I don't know where to start. I was
> wondering if anyone on here could give me a quick overview of
> how the search function currently works, an idea as to what
> could be modified or implemented in order to improve the
> running time of this code, or any tip really as to where I
> should start or what I should start looking at. I'd really
> appreciate any help or advice!!
>
> Thanks a lot, and keep on making my Debian-using professor
> jealous :]
> Emma
>
>
>
> --
> tvali
>
> Kuskilt foorumist: http://www.cooltests.com - kui inglise keelt oskad.
> Muide, üle 120 oled väga tark, üle 140 oled geenius, mingi 170 oled ju
> mingi täica pea nagu prügikast...

I use eix:
emerge eix

;-)
 
Old 11-23-2008, 01:43 PM
"Emma Strubell"
 
Default search functionality in emerge

Thanks for the replies! I know there are a couple programs out there
that basically already do what I'm looking to do... Unfortunately I
wasn't aware of these pre-existing utilities until after I submitted my
project proposal to my professor. So, I'm looking to implement a better
search myself. Preferably by editing the existing portage code, not
writing a separate program. So if anyone can offer any help regarding
the actual implementation of search in portage, I would greatly
appreciate it!



Or, if anyone has an idea for a more productive/useful project I could
work on relating to portage (about the same difficulty, preferably at
least a little bit data structure related), please, let me know! Thanks
again guys,



Emma

On Sun, Nov 23, 2008 at 9:33 AM, Pacho Ramos <pacho@condmat1.ciencias.uniovi.es> wrote:

El dom, 23-11-2008 a las 16:01 +0200, tvali escribió:

> Try esearch.

>

> emerge esearch

> esearch ...

>

> 2008/11/23 Emma Strubell <emma.strubell@gmail.com>

> * * * * Hi everyone. My name is Emma, and I am completely new to this

> * * * * list. I've been using Gentoo since 2004, including Portage of

> * * * * course, and before I say anything else I'd like to say thanks

> * * * * to everyone for such a kickass package management system!!

>

> * * * * Anyway, for my final project in my Data Structures &

> * * * * Algorithms class this semester, I would like to modify the

> * * * * search functionality in emerge. Something I've always noticed

> * * * * about 'emerge -s' or '-S' is that, in general, it takes a very

> * * * * long time to perform the searches. (Although, lately it does

> * * * * seem to be running faster, specifically on my laptop as

> * * * * opposed to my desktop. Strangely, though, it seems that when I

> * * * * do a simple 'emerge -av whatever' on my laptop it takes a very

> * * * * long time for emerge to find the package and/or determine the

> * * * * dependecies - *whatever it's doing behind that spinner. I can

> * * * * definitely go into more detail about this if anyone's

> * * * * interested. It's really been puzzling me!) So, as my final

> * * * * project I've proposed to improve the time it takes to perform

> * * * * a search using emerge. My professor suggested that I look into

> * * * * implementing indexing.

>

> * * * * However, I've started looking at the code, and I must admit

> * * * * I'm pretty overwhelmed! I don't know where to start. I was

> * * * * wondering if anyone on here could give me a quick overview of

> * * * * how the search function currently works, an idea as to what

> * * * * could be modified or implemented in order to improve the

> * * * * running time of this code, or any tip really as to where I

> * * * * should start or what I should start looking at. I'd really

> * * * * appreciate any help or advice!!

>

> * * * * Thanks a lot, and keep on making my Debian-using professor

> * * * * jealous :]

> * * * * Emma

>

>

>

> --

> tvali

>

> Kuskilt foorumist: http://www.cooltests.com - kui inglise keelt oskad.

> Muide, üle 120 oled väga tark, üle 140 oled geenius, mingi 170 oled ju

> mingi täica pea nagu prügikast...



I use eix:

emerge eix



;-)
 
Old 11-23-2008, 01:56 PM
"Douglas Anderson"
 
Default search functionality in emerge

Emma,

It would be great it you could speed search up a bit!

As these other guys have pointed out, we do have some indexing tools in Gentoo already. Most users don't understand why that kind of functionality isn't built directly into Portage, but IIRC it has something to do with the fact that these fast search indexes aren't able to be written to by more than one process at the same time, so for example if you had two emerges finishing at the same time, Portage's current flat hash file can handle that, but the faster db-based indexes can't.


Anyways, that's the way I, as a curious user, understood the problem.

You might be interested in reading this, very old forum thread about a previous attempt:
http://forums.gentoo.org/viewtopic-t-261580-postdays-0-postorder-asc-start-0.html


On Sun, Nov 23, 2008 at 11:33 PM, Pacho Ramos <pacho@condmat1.ciencias.uniovi.es> wrote:

El dom, 23-11-2008 a las 16:01 +0200, tvali escribió:

> Try esearch.

>

> emerge esearch

> esearch ...

>

> 2008/11/23 Emma Strubell <emma.strubell@gmail.com>

> * * * * Hi everyone. My name is Emma, and I am completely new to this

> * * * * list. I've been using Gentoo since 2004, including Portage of

> * * * * course, and before I say anything else I'd like to say thanks

> * * * * to everyone for such a kickass package management system!!

>

> * * * * Anyway, for my final project in my Data Structures &

> * * * * Algorithms class this semester, I would like to modify the

> * * * * search functionality in emerge. Something I've always noticed

> * * * * about 'emerge -s' or '-S' is that, in general, it takes a very

> * * * * long time to perform the searches. (Although, lately it does

> * * * * seem to be running faster, specifically on my laptop as

> * * * * opposed to my desktop. Strangely, though, it seems that when I

> * * * * do a simple 'emerge -av whatever' on my laptop it takes a very

> * * * * long time for emerge to find the package and/or determine the

> * * * * dependecies - *whatever it's doing behind that spinner. I can

> * * * * definitely go into more detail about this if anyone's

> * * * * interested. It's really been puzzling me!) So, as my final

> * * * * project I've proposed to improve the time it takes to perform

> * * * * a search using emerge. My professor suggested that I look into

> * * * * implementing indexing.

>

> * * * * However, I've started looking at the code, and I must admit

> * * * * I'm pretty overwhelmed! I don't know where to start. I was

> * * * * wondering if anyone on here could give me a quick overview of

> * * * * how the search function currently works, an idea as to what

> * * * * could be modified or implemented in order to improve the

> * * * * running time of this code, or any tip really as to where I

> * * * * should start or what I should start looking at. I'd really

> * * * * appreciate any help or advice!!

>

> * * * * Thanks a lot, and keep on making my Debian-using professor

> * * * * jealous :]

> * * * * Emma

>

>

>

> --

> tvali

>

> Kuskilt foorumist: http://www.cooltests.com - kui inglise keelt oskad.

> Muide, üle 120 oled väga tark, üle 140 oled geenius, mingi 170 oled ju

> mingi täica pea nagu prügikast...



I use eix:

emerge eix



;-)
 
Old 11-23-2008, 03:56 PM
"Lucian Poston"
 
Default search functionality in emerge

> Thanks for the replies! I know there are a couple programs out there that
> basically already do what I'm looking to do... Unfortunately I wasn't aware
> of these pre-existing utilities until after I submitted my project proposal
> to my professor. So, I'm looking to implement a better search myself.
> Preferably by editing the existing portage code, not writing a separate
> program. So if anyone can offer any help regarding the actual implementation
> of search in portage, I would greatly appreciate it!

Most of the search implementation is in
/usr/lib/portage/pym/_emerge/__init__.py in class search. The class's
execute() method simply iterates over all packages (and descriptions
and package sets) and matches against the searchkey. You might need
to look into pym/portage/dbapi/porttree.py for portdbapi as well.

If you intend to index and support fast regex lookup, then you need to
do some fancy indexing, which I'm not terribly familiar with. You
could follow in the steps of eix[1] or other indexed search utilities
and design some sort of index layout, which is easier than the
following thought. You might consider implementing a suffix trie or
similar that has sublinear regexp lookup and marshalling the structure
for the index. I couldn't find a python implementation for something
like this, but here is a general trie class[2] that you might start
with if you go that route. There is a perl module[3],
Tie::Hash::Regex, that does that, but properly implementing that in
python would be a chore.

That project sounds interesting and fun. Good luck!

Lucian Poston

[1] https://projects.gentooexperimental.org/eix/wiki/IndexFileLayout
[2] http://www.koders.com/python/fid7B6BC1651A9E8BBA547552FE3F039479A4DECC45.aspx
[3] http://search.cpan.org/~davecross/Tie-Hash-Regex-1.02/lib/Tie/Hash/Regex.pm
 
Old 11-23-2008, 05:49 PM
"Emma Strubell"
 
Default search functionality in emerge

Wow, that's extremely helpful!! I happen to particularly enjoy tries, so the suffix trie sounds like a great idea. The trie class example is really helpful too, because this will be my first time programming in Python, and it's a bit easier to figure out what's going on syntax-wise in that simple trie class than in the middle of the portage source code!


Seriously, thanks again :]

On Sun, Nov 23, 2008 at 11:56 AM, Lucian Poston <lucianposton@gmail.com> wrote:

> Thanks for the replies! I know there are a couple programs out there that

> basically already do what I'm looking to do... Unfortunately I wasn't aware

> of these pre-existing utilities until after I submitted my project proposal

> to my professor. So, I'm looking to implement a better search myself.

> Preferably by editing the existing portage code, not writing a separate

> program. So if anyone can offer any help regarding the actual implementation

> of search in portage, I would greatly appreciate it!



Most of the search implementation is in

/usr/lib/portage/pym/_emerge/__init__.py in class search. *The class's

execute() method simply iterates over all packages (and descriptions

and package sets) and matches against the searchkey. *You might need

to look into pym/portage/dbapi/porttree.py for portdbapi as well.



If you intend to index and support fast regex lookup, then you need to

do some fancy indexing, which I'm not terribly familiar with. *You

could follow in the steps of eix[1] or other indexed search utilities

and design some sort of index layout, which is easier than the

following thought. *You might consider implementing a suffix trie or

similar that has sublinear regexp lookup and marshalling the structure

for the index. *I couldn't find a python implementation for something

like this, but here is a general trie class[2] that you might start

with if you go that route. *There is a perl module[3],

Tie::Hash::Regex, that does that, but properly implementing that in

python would be a chore.



That project sounds interesting and fun. Good luck!



Lucian Poston



[1] https://projects.gentooexperimental.org/eix/wiki/IndexFileLayout

[2] http://www.koders.com/python/fid7B6BC1651A9E8BBA547552FE3F039479A4DECC45.aspx

[3] http://search.cpan.org/~davecross/Tie-Hash-Regex-1.02/lib/Tie/Hash/Regex.pm
 
Old 11-23-2008, 07:00 PM
tvali
 
Default search functionality in emerge

Yes if it would be a low-level implementation to portage, speeding up it's native code for searching and using indexes, then it would make everything faster, including emerge (because emerge does search first for package relations). I have actually wanted to do it myself several years ago, so reacting here to have my ideas discussed, too.


Douglas Anderson 16:46 reply is about locks and I think that it would need to rethink portages locking methods - what, when and why it locks. This is probably quite hard task by itself. Anyway, as portage actually lets user make two emerges at the same time, locks might be OK, too.


I think that the best thing would be bottom-up refactoring - first to make a list of lowest-level functions, which have to do with reading data from portage tree or writing into it; then making indexer class, which will be used by all of those low-level functions.


To have it OOP, it should be implemented in such way:
Low-level portage tree handler does everything with portage tree, no function in portage uses it directly.
Tree handler has several needed and several optional methods - so that implementing new handler would be easy, but creating things like native regex search would be possible.
One could implement a new tree handler with SQLite or other interface instead of filesystem and do other tricks through this interface; for example, boost it.So, nice way to go would be:
Implementing portage tree handler and it's proxy, which uses current portage tree in non-indexed way and simply gives methods for the same kind of access, as currently implemented one.
Refactoring portage to rely only on portage tree handler and use direct portage tree nowhere. To test if it is so, portage tree should be moved to another directory, about which only this handler knows, and check if portage works well. Indexing all places, where portage uses it's tree handler (by std. comment, for example) and making clear, which methods would contain all boostable code of it.

Implementing those methods in proxy, which could simulate fast regex search and other stuff using simplest possible interface of portage tree handler (smth. like four methods add, remove, get, list). Proxy should be able to use handler's methods if they are implemented.
Refactoring portage to use advanced methods in proxy.Now, having taken all code together into one place and looking this nice and readable code, real optimizations could be discussed here, for example. Ideally, i think, portage would have such tree handlers:
Filesystem handler - fast searches over current portage tree structure.SQL handler - rewrite of tree functions into SQL queries.Network-based handler - maybe sometimes it would nice to have portage tree only in one machine of cluster, for example if I want to have 100 really small computers with fast connection with mother-computer and portage tree is too big to be wholly copied to all of them.

Memory-buffered handler with daemon, which is actually proxy to some other handler - daemon, which reads whole tree (from filesystem or SQL) into memory on boot or first use, creates really fast index (because now it does matter to have better indexing) and optionally deletes some [less needed] parts of it's index from memory when it's becoming full and behaves as really simple proxy if it stays full. This should be implemented after critical parts of filesystem or SQL handler.

2008/11/23 Emma Strubell <emma.strubell@gmail.com>

Wow, that's extremely helpful!! I happen to particularly enjoy tries, so the suffix trie sounds like a great idea. The trie class example is really helpful too, because this will be my first time programming in Python, and it's a bit easier to figure out what's going on syntax-wise in that simple trie class than in the middle of the portage source code!



Seriously, thanks again :]

On Sun, Nov 23, 2008 at 11:56 AM, Lucian Poston <lucianposton@gmail.com> wrote:


> Thanks for the replies! I know there are a couple programs out there that

> basically already do what I'm looking to do... Unfortunately I wasn't aware

> of these pre-existing utilities until after I submitted my project proposal

> to my professor. So, I'm looking to implement a better search myself.

> Preferably by editing the existing portage code, not writing a separate

> program. So if anyone can offer any help regarding the actual implementation

> of search in portage, I would greatly appreciate it!



Most of the search implementation is in

/usr/lib/portage/pym/_emerge/__init__.py in class search. *The class's

execute() method simply iterates over all packages (and descriptions

and package sets) and matches against the searchkey. *You might need

to look into pym/portage/dbapi/porttree.py for portdbapi as well.



If you intend to index and support fast regex lookup, then you need to

do some fancy indexing, which I'm not terribly familiar with. *You

could follow in the steps of eix[1] or other indexed search utilities

and design some sort of index layout, which is easier than the

following thought. *You might consider implementing a suffix trie or

similar that has sublinear regexp lookup and marshalling the structure

for the index. *I couldn't find a python implementation for something

like this, but here is a general trie class[2] that you might start

with if you go that route. *There is a perl module[3],

Tie::Hash::Regex, that does that, but properly implementing that in

python would be a chore.



That project sounds interesting and fun. Good luck!



Lucian Poston



[1] https://projects.gentooexperimental.org/eix/wiki/IndexFileLayout

[2] http://www.koders.com/python/fid7B6BC1651A9E8BBA547552FE3F039479A4DECC45.aspx

[3] http://search.cpan.org/~davecross/Tie-Hash-Regex-1.02/lib/Tie/Hash/Regex.pm







--
tvali

Kuskilt foorumist: http://www.cooltests.com - kui inglise keelt oskad. Muide, üle 120 oled väga tark, üle 140 oled geenius, mingi 170 oled ju mingi täica pea nagu prügikast...
 
Old 11-23-2008, 08:20 PM
Mike Auty
 
Default search functionality in emerge

Hiya Emma,
Good luck on your project. A couple of things to be weary of are disk
I/O, metadata cache backends and overlays.
Disk I/O can be a significant bottleneck. Loading up a lot of files
from disk (be it the metadata cache or whatever) can take a long time
initially, but then be cached in RAM and so be much faster to access in
the future.
Portage allows for its internal metadata cache to be stored in a
variety of formats, as long as there's a backend to support it. This
means simple speedups can be achieved using cdb or sqlite (if you google
these and portage you'll get gentoo-wiki tips, which unfortunately
you'll have to read from google's cache at the moment). It also means
that if you want to make use of this metadata from within portage,
you'll have to rely on the API to tell the backend to get you all the
data (and it may be difficult to speed up without writing your own backend).
Finally there are overlays, and since these can change outside of an
"emerge --sync" (as indeed can the main tree), you'll have to reindex
these before each search request, or give the user stale data until they
manually reindex.
If you're interesting in implementing this in python, you may be
interested in another package manager that can handle the main tree,
also implemented in python, called pkgcore. From what I understand,
it's a similar code-base to portage, but its internal architecture may
have changed a lot.
I hope some of that helps, and isn't off putting. I look forward to
seeing the results! 5

Mike 5
 
Old 11-23-2008, 08:59 PM
René 'Necoro' Neumann
 
Default search functionality in emerge

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Mike Auty schrieb:
> Finally there are overlays, and since these can change outside of an
> "emerge --sync" (as indeed can the main tree), you'll have to reindex
> these before each search request, or give the user stale data until they
> manually reindex.

Determining whether there has been a change to the ebuild system is a
major point in the whole thing. What does a great index serves you, if
it does not notice the changes the user made in his own local overlay?
Manually re-indexing is not a good choice I think...

If somebody comes up here with a good (and fast) solution, this would be
a nice thing (need it myself).

Regards,
René
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkkp0kAACgkQ4UOg/zhYFuAhTACfYDxNeQQG6dysgU5TrNEZGOiH
3CoAn2wV6g8/8uj+T99cxJGdQBxTtZjI
=2I2j
-----END PGP SIGNATURE-----
 

Thread Tools




All times are GMT. The time now is 03:07 PM.

VBulletin, Copyright ©2000 - 2014, Jelsoft Enterprises Ltd.
Content Relevant URLs by vBSEO ©2007, Crawlability, Inc.
Copyright ©2007 - 2008, www.linux-archive.org