Linux Archive > Gentoo > Gentoo Portage Developer

 
 
 
Old 11-23-2008, 11:53 PM
tvali
 
Default search functionality in emerge

There is a daemon which notices filesystem changes - http://pyinotify.sourceforge.net/ would be a good choice.

In case many different applications use the portage tree directly without going through any portage API (a bad choice, I think, and one that should be deprecated), there is a kind of "hack": using http://www.freenet.org.nz/python/lufs-python/ to create a new filesystem (I wish I had some time to join this game). I hope it can be built everywhere Gentoo is supposed to work, but it is no problem if it can't - you can implement things so that it is not required. I fully agree that the filesystem is a bottleneck, but this suffix trie would check directories first, I guess. With such a custom filesystem, which serves the portage tree like some odd kind of API, you keep backwards compatibility and can still build your own thing.


Having such classes (numbers show implementation order; whether the proxies are abstract classes, base classes or something else is not specified here - the list just shows relations between some imaginary objects):

1. PortageTreeApi - proxy for different portage tree backends (filesystem, SQL or other).
2. PortageTreeCachedApi - same as the previous, but with a memory cache for speed. It should be able to save its state, which simply means writing its internal variables to a file.
3. PortageTreeDaemon - has an interface compatible with PortageTreeApi; this daemon serves the portage tree to PortageTreeFS and to portage itself. In practice it should be a base class of PortageTreeApi and PortageTreeCachedApi, so that both can be used directly as daemons. When the cached API is used as a daemon, it must be able to detect filesystem changes - so implementations should provide change-trigger callbacks.
4. PortageTreeFS - a filesystem which maps any of the above onto the filesystem, connectable to PortageTreeApi or PortageTreeDaemon. The filesystems it creates provide backwards compatibility. It cannot be used on architectures that do not implement lufs-python or an equivalent.
5. PortageTreeImplementationAsSqlDb
6. PortageTreeServer - a server which serves data from PortageTreeDaemon, PortageTreeCachedApi or PortageTreeApi to another computer. Implementations can be proxied through PortageTreeApi, PortageTreeCachedApi or PortageTreeDaemon.

Concrete implementations (numbered by the step they belong to):
1. PortageTreeImplementationAsFilesystem
3. PortageTreeImplementationAsDaemon - a client, actually.
6. PortageTreeImplementationAsServer - a client, too.
So, step 1 - creating PortageTreeApi and PortageTreeImplementationAsFilesystem - is at first a pure refactoring task. Adding more advanced functions to PortageTreeApi afterwards is basically refactoring, too. PortageTreeApi should not become too complex or contain any advanced tasks that are not purely db-specific; a common base class could implement the higher-level things.

Then, step 2 - this finishes your schoolwork, but not yet in the most powerful way, since at that point we only have an index and the first search is still slow. Initially this cache cannot report changes in the portage tree (that could be implemented with some versioning once this new API is the only place where updates happen), so it should have an index-update command and be used only for search.

Then, step 3 - having a portage tree daemon means things can really be cached now, and the cache can be kept in memory; it also means updates on filesystem changes.
Then, step 4 - having PortageTreeFS means you can easily implement the portage tree on a faster medium without losing backwards compatibility.

Now, step 5 - an implementation as an SQL database is logical, since SQL is the standardized, common language for building fast databases.
Eventually, step 6 - this has really nothing to do with speeding up search, but on a fast network it could still speed up emerge by removing the need for emerge --sync on local networks.
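As a minimal Python sketch of the relations above (the class names come from the list; the methods, and the dict standing in for a real tree backend, are my own assumptions, not portage code):

```python
class PortageTreeApi:
    """Proxy hiding the concrete backend (filesystem, SQL, ...)."""
    def __init__(self, backend):
        self.backend = backend
    def packages(self):
        return self.backend.packages()
    def metadata(self, package):
        return self.backend.metadata(package)

class PortageTreeCachedApi(PortageTreeApi):
    """Same interface, but memoizes answers in memory."""
    def __init__(self, backend):
        super().__init__(backend)
        self._cache = {}
    def metadata(self, package):
        if package not in self._cache:
            self._cache[package] = self.backend.metadata(package)
        return self._cache[package]
    def invalidate(self, package):
        # change-trigger callback: drop stale entries on filesystem changes
        self._cache.pop(package, None)

class PortageTreeImplementationAsFilesystem:
    """Backend reading a plain on-disk tree (step 1 in the list).
    Here a dict stands in for the real directory tree."""
    def __init__(self, tree):
        self.tree = tree
    def packages(self):
        return sorted(self.tree)
    def metadata(self, package):
        return self.tree[package]
```

The point of the proxy layering is that PortageTreeCachedApi can be swapped in anywhere a PortageTreeApi is expected, and its invalidate() is where the change-trigger callbacks from step 3 would hook in.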


I think synchronization should then also be considered in these classes - CachedApi almost needs it to be faster with server-client connections. After that, ImplementationAsSync and ImplementationAsWebRsSync could be added and a sync server built onto this daemon. As emerge --sync currently seems rather slow - I see no reason why fetching a few new items should take as long as it currently does - this would also speed up another life-critical part of portage.


So, I hope that helps a bit - good luck!

2008/11/23 René 'Necoro' Neumann <lists@necoro.eu>

Mike Auty schrieb:

> * Finally there are overlays, and since these can change outside of an
> "emerge --sync" (as indeed can the main tree), you'll have to reindex
> these before each search request, or give the user stale data until they
> manually reindex.

Determining whether there has been a change to the ebuild system is a
major point in the whole thing. What good does a great index do you if
it does not notice the changes the user made in his own local overlay?
Manually re-indexing is not a good choice, I think...

If somebody comes up with a good (and fast) solution, that would be
a nice thing (I need it myself).

Regards,
René





--
tvali

From some forum: http://www.cooltests.com - if you know English. By the way, over 120 you are very smart, over 140 you are a genius, and at around 170 your head is like some total trash can...
 
Old 11-24-2008, 02:12 AM
Marius Mauch
 
Default search functionality in emerge

On Sun, 23 Nov 2008 07:17:40 -0500
"Emma Strubell" <emma.strubell@gmail.com> wrote:

> However, I've started looking at the code, and I must admit I'm pretty
> overwhelmed! I don't know where to start. I was wondering if anyone
> on here could give me a quick overview of how the search function
> currently works, an idea as to what could be modified or implemented
> in order to improve the running time of this code, or any tip really
> as to where I should start or what I should start looking at. I'd
> really appreciate any help or advice!!

Well, it depends how much effort you want to put into this. The current
interface doesn't actually provide a "search" interface, but merely
functions to
1) list all package names - dbapi.cp_all()
2) list all package names and versions - dbapi.cpv_all()
3) list all versions for a given package name - dbapi.cp_list()
4) read metadata (like DESCRIPTION) for a given package name and
version - dbapi.aux_get()

One of the main performance problems of --search is that there is no
persistent cache for functions 1, 2 and 3, so if you're "just"
interested in performance aspects you might want to look into that.
The issue with implementing a persistent cache is that you have to
consider both cold and hot filesystem cache cases: Loading an index
file with package names and versions might improve the cold-cache case,
but slow things down when the filesystem cache is populated.
As has been mentioned, keeping the index updated is the other major
issue, especially as it has to be portable and should require little or
no configuration/setup for the user (so no extra daemons or special
filesystems running permanently in the background). The obvious
solution would be to generate the cache after `emerge --sync` (and other
sync implementations) and hope that people don't modify their tree and
search for the changes in between (that's what all the external tools
do). I don't know if there is actually a way to do online updates while
still improving performance and not relying on custom system daemons
running in the background.
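A persistent index along these lines might be sketched as follows (a sketch only: `dbapi` is a stand-in for the interface described above, the cache location is invented, and validating by a single tree timestamp is my assumption, not an existing portage mechanism):

```python
import os
import pickle

INDEX_FILE = "/var/cache/edb/pkgindex.pickle"  # hypothetical location

def write_index(dbapi, tree_timestamp, path=INDEX_FILE):
    """Offline step (e.g. after `emerge --sync`): dump all package
    names and versions together with a timestamp of the tree."""
    index = {
        "timestamp": tree_timestamp,
        "cp_all": dbapi.cp_all(),
        "cp_list": {cp: dbapi.cp_list(cp) for cp in dbapi.cp_all()},
    }
    with open(path, "wb") as f:
        pickle.dump(index, f)

def read_index(tree_timestamp, path=INDEX_FILE):
    """Search-time step: return the cached index, or None if it is
    missing or stale, so the caller falls back to the slow path."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        index = pickle.load(f)
    if index["timestamp"] != tree_timestamp:
        return None  # tree changed since the index was generated
    return index
```

This illustrates the trade-off mentioned above: loading one pickle beats thousands of directory reads on a cold filesystem cache, but may lose to them on a hot one.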

As for --searchdesc, one problem is that dbapi.aux_get() can only
operate on a single package-version on each call (though it can read
multiple metadata variables). So for description searches the control
flow is like this (obviously simplified):

result = []
# iterate over all packages
for package in dbapi.cp_all():
    # determine the current version of each package; this is
    # another performance issue.
    version = get_current_version(package)
    # read package description from metadata cache
    description = dbapi.aux_get(version, ["DESCRIPTION"])[0]
    # check if the description matches
    if matches(description, searchkey):
        result.append(package)

There you see the three bottlenecks: the lack of a pregenerated package
list, the version lookup for *each* package and the actual metadata
read. I've already talked about the first, so let's look at the other
two. The core problem there is that DESCRIPTION (like all standard
metadata variables) is version specific, so to access it you need to
determine a version to use, even though in almost all cases the
description is the same (or very similar) for all versions. So the
proper solution would be to make the description a property of the
package name instead of the package version, but that's a _huge_ task
you're probably not interested in. What _might_ work here is to add
support for an optional package-name->description cache that can be
generated offline and includes those packages where all versions have
the same description, and fall back to the current method if the
package is not included in the cache. (Don't think about caching the
version lookup, that's system dependent and therefore not suitable for
caching).
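That optional package-name->description cache with fallback could be sketched like this (my sketch; `dbapi` and `get_current_version` are stand-ins for the interfaces described above):

```python
def build_description_cache(dbapi):
    """Offline step: keep only the packages where all versions
    share the same DESCRIPTION."""
    cache = {}
    for package in dbapi.cp_all():
        descriptions = {
            dbapi.aux_get(version, ["DESCRIPTION"])[0]
            for version in dbapi.cp_list(package)
        }
        if len(descriptions) == 1:
            cache[package] = descriptions.pop()
    return cache

def description(package, cache, dbapi, get_current_version):
    """Search-time lookup: cache hit, or fall back to the current
    slow path (version lookup plus aux_get)."""
    if package in cache:
        return cache[package]
    version = get_current_version(package)
    return dbapi.aux_get(version, ["DESCRIPTION"])[0]
```

The version lookup is deliberately only done on the fallback path, in line with the warning above that it is system dependent and not cacheable.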

Hope it has become clear that while the actual search algorithm might
be simple and not very efficient, the real problem lies in getting the
data to operate on.

That and the somewhat limited dbapi interface.

Disclaimer: The stuff below involves extending and redesigning some
core portage APIs. This isn't something you can do on a weekend, only
work on this if you want to commit yourself to portage development
for a long time.

The functions listed above are the bare minimum to
perform queries on the package repositories, but they're very
low-level. That means that whenever you want to select packages by
name, description, license, dependencies or other variables you need
quite a bit of custom code, more if you want to combine multiple
searches, and much more if you want to do it efficiently and flexibly.
See http://dev.gentoo.org/~genone/scripts/metalib.py and
http://dev.gentoo.org/~genone/scripts/metascan for a somewhat flexible,
but very inefficient search tool (might not work anymore due to old
age).

Ideally repository searches could be done without writing any
application code using some kind of query language, similar to how SQL
works for generic database searches (obviously not that complex).
But before thinking about that we'd need a query API that actually
a) allows tools to assemble queries without having to worry about
implementation details, and
b) runs them efficiently without bothering the API user.

Simple example: Find all package-versions in the sys-apps category that
are BSD-licensed.

Currently that would involve something like:

result = []
for package in dbapi.cp_all():
    if not package.startswith("sys-apps/"):
        continue
    for version in dbapi.cp_list(package):
        license = dbapi.aux_get(version, ["LICENSE"])[0]
        # for simplicity perform an equivalence check; in reality you'd
        # have to account for complex license definitions
        if license == "BSD":
            result.append(version)

Not very friendly to maintain, and not very efficient (we'd only need
to iterate over packages in the 'sys-apps' category, but the interface
doesn't allow that).
And now how it might look with an extensive query interface:

query = AndQuery()
query.add(CategoryQuery("sys-apps", FullStringMatch()))
query.add(MetadataQuery("BSD", FullStringMatch()))
result = repository.selectPackages(query)

Much nicer, don't you think?
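To make the idea concrete, such a query interface could be layered over the low-level functions as thin wrappers (purely a sketch: the class names come from the example above, not from any existing portage code, and `dbapi` is again a stand-in):

```python
class FullStringMatch:
    def matches(self, value, pattern):
        return value == pattern

class CategoryQuery:
    """Match on the category part of a category/package-version string."""
    def __init__(self, category, matcher):
        self.category, self.matcher = category, matcher
    def accepts(self, dbapi, version):
        return self.matcher.matches(version.split("/")[0], self.category)

class MetadataQuery:
    """Match on a metadata variable (LICENSE here, for the example)."""
    def __init__(self, value, matcher, variable="LICENSE"):
        self.value, self.matcher, self.variable = value, matcher, variable
    def accepts(self, dbapi, version):
        return self.matcher.matches(
            dbapi.aux_get(version, [self.variable])[0], self.value)

class AndQuery:
    def __init__(self):
        self.parts = []
    def add(self, query):
        self.parts.append(query)
    def accepts(self, dbapi, version):
        return all(q.accepts(dbapi, version) for q in self.parts)

class Repository:
    def __init__(self, dbapi):
        self.dbapi = dbapi
    def selectPackages(self, query):
        # naive wrapper: still iterates everything, which is exactly
        # the efficiency limitation the text warns about
        return [v for p in self.dbapi.cp_all()
                for v in self.dbapi.cp_list(p)
                if query.accepts(self.dbapi, v)]
```

Note that this wrapper cannot skip non-matching categories before enumerating them, which is why a native query engine, rather than wrappers over the current interface, would be needed for real gains.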

As said, implementing such a thing would be a huge amount of work, even
if just implemented as wrappers on top of the current interface (which
would prevent many efficiency improvements), but if you (or anyone else
for that matter) are truly interested in this contact me off-list,
maybe I can find some of my old design ideas and (incomplete)
prototypes to give you a start.

Marius
 
Old 11-24-2008, 04:01 AM
devsk
 
Default search functionality in emerge

> not relying on custom system daemons running in the background.

Why is a portage daemon such a bad thing? Or hard to do? I would very much like a daemon running on my system which I can configure to sync the portage tree once a week (or month if I am lazy), give me a summary of hot fixes, security fixes in a nice email, push important announcements and of course, sync caches on detecting changes (which should be trivial with notify daemons all over the place) etc. Why is it such a bad thing?

It's crazy to think that security updates need to be pulled in Linux.

-devsk



 
Old 11-24-2008, 05:25 AM
Marius Mauch
 
Default search functionality in emerge

On Sun, 23 Nov 2008 21:01:40 -0800 (PST)
devsk <funtoos@yahoo.com> wrote:

> > not relying on custom system daemons running in the background.
>
> Why is a portage daemon such a bad thing? Or hard to do? I would very
> much like a daemon running on my system which I can configure to sync
> the portage tree once a week (or month if I am lazy), give me a
> summary of hot fixes, security fixes in a nice email, push important
> announcements and of course, sync caches on detecting changes (which
> should be trivial with notify daemons all over the place) etc. Why is
> it such a bad thing?

Well, as an opt-in solution it might work (though most of what you
described is IMO just stuff for cron, no need to reinvent the wheel).

What I was saying is that _relying_ on custom system
daemons/filesystems for a _core subsystem_ of portage is the wrong
way, simply because it adds a substantial amount of complexity to the
whole package management architecture. It's one more thing that can
(and will) break, one more layer to take into account for any design
decisions, one more component that has to be secured, one more obstacle
to overcome when you want to analyze/debug things.
And special care must be taken if it requires special kernel support
and/or external packages. Do you want to make inotify support mandatory
to use portage efficiently? (btw, looks like inotify doesn't really
work with NFS mounts, which would already make such a daemon completely
useless for people using a NFS-shared repository)

And finally, if you look at the use cases, a daemon is simply overkill
for most cases, as the vast majority of people only use emerge
--sync (or wrappers) and maybe layman to change the tree, usually once
per day or less often. Do you really want to push another system daemon
on users that isn't of use to them?

> Its crazy to think that security updates need to be pulled in Linux.

That's IMO better handled via an applet (bug #190397 has some code),
or by just checking for updates after a sync (as syncing is the only
way for updates to become available at this time). Maybe a message
could be added after sync if there are pending GLSAs, now that the GLSA
support code is in portage.

Marius
 
Old 11-24-2008, 05:47 AM
Duncan
 
Default search functionality in emerge

devsk <funtoos@yahoo.com> posted
396349.98307.qm@web31708.mail.mud.yahoo.com, excerpted below, on Sun, 23
Nov 2008 21:01:40 -0800:

> Why is a portage daemon such a bad thing? Or hard to do? I would very
> much like a daemon running on my system which I can configure to sync
> the portage tree once a week (or month if I am lazy), give me a summary
> of hot fixes, security fixes in a nice email, push important
> announcements and of course, sync caches on detecting changes (which
> should be trivial with notify daemons all over the place) etc. Why is it
> such a bad thing?
>
> Its crazy to think that security updates need to be pulled in Linux.

Well, this is more a user list discussion than a portage development
discussion, but...

For one thing, it's terribly inefficient to keep a dozen daemons running,
each checking only a single thing once a week, when we have a cron
scheduling daemon. It's both efficient and The Unix Way (R) to set up
a script that does whatever you need, and then have the cron daemon
run each of a dozen different scripts once a week, instead of having
those dozen daemons running constantly when they're only active
once a week.

IOW, it only requires a manual pull if you've not already set up cron to
invoke an appropriate script once a week, and that involves only a single
constantly running daemon: the cron daemon of your choice.

Now, perhaps it can be argued that there should be a package that
installs such a pre-made script. For all I know, maybe there is one
already. And perhaps it can be argued that said script, if optional,
should at least be mentioned in the handbook. I couldn't argue with the
logic of either of those. But there's no reason to run yet another
daemon constantly, when (1) it's not needed constantly, and (2), there's
already a perfectly functional way of scheduling something to run when
it /is/ needed, complete with optional results mailing, etc, if it's
scripted to do that.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
 
Old 11-24-2008, 08:34 AM
René 'Necoro' Neumann
 
Default search functionality in emerge


tvali schrieb:
> There is daemon, which notices about filesystem changes -
> http://pyinotify.sourceforge.net/ would be a good choice.

Disadvantage: it has to run all the time (I can already see some people
crying: "oh noez, not yet another daemon..."). There is also a problem
with offline changes (which might be overcome by a one-time check on
daemon startup ... but that would really increase the startup time).

I have built an algorithm which does something like (files() here is a
helper yielding the stat results of everything below an overlay):

for overlay in OVERLAYS + [PORTDIR]:
    mtimes = "".join(str(f.st_mtime) for f in files(overlay))
    db[overlay] = hashlib.md5(mtimes.encode()).hexdigest()

and then compares the MD5 values on later runs.
This is fast if the portage tree is already in the filesystem cache -
otherwise it is quite slow. Another disadvantage is that it does not
know WHICH changes have occurred, and thus has to re-read the complete
overlay.

I like the filesystem idea more than the daemon one. Write a new FS
(using FUSE, for example - LUFS is deprecated) which provides a
logfile. This logfile can either just contain the time of the latest
change in the complete subtree, or even some kind of log stating WHICH
files have been changed.

I think this should even be possible if the tree is not on its own
partition.

Of course, this should clearly be an opt-in solution: if the user does
not modify the trees by hand, or does so only seldom, the "create index
after sync" approach (similar to 'eix-sync') is sufficient.

Regards,
René
 
Old 11-24-2008, 08:48 AM
Fabian Groffen
 
Default search functionality in emerge

On 24-11-2008 10:34:28 +0100, René 'Necoro' Neumann wrote:
> tvali schrieb:
> > There is daemon, which notices about filesystem changes -
> > http://pyinotify.sourceforge.net/ would be a good choice.
>
> Disadvantage: Has to run all the time (I see already some people crying:
> "oh noez. not yet another daemon...").

... and it is Linux only, which spoils the fun.


--
Fabian Groffen
Gentoo on a different level
 
Old 11-24-2008, 01:30 PM
tvali
 
Default search functionality in emerge

So, mornings are smarter than evenings (it's an Estonian saying)... At night I thought more about this filesystem idea and found that it actually answers all the needs. I have now read some of the messages here and thought about how it could be made really simple, at least as I understand that word. Yesterday I searched for custom filesystems with custom functionality and did not find any, so I wrote that list with a big bunch of classes, which, as I think now, may be overkill.


First, about that indexing - if you create neither a daemon nor a filesystem, you can add the commands "emerge --indexon", "emerge --indexoff" and "emerge --indexrenew". The index is then renewed on "emerge --sync" and the like, but when the user changes files manually, she has to renew the index manually - not much to ask, is it? Someone who opens the cover of her computer takes on the responsibility of knowing some basic things about electricity, and of changing something in the BIOS after adding or removing parts. Maybe it should even be "emerge --commithandmadechanges", which would index and do whatever else is needed after handmade changes. More such things might emerge in the future, I guess.


But about filesystem...

Consider that with a filesystem you might have a directory which you cannot list, but from which you can read files. Imagine a function which can encode and decode queries into filesystem paths.


If you have a function search(packagename, "dependencies"), you can write it as a file path:
/cgi-bin/search/packagename/dependencies - where packagename can be encoded by replacing some characters with codes and splitting long strings with /. You could also have an API with one file in a directory from which you read a temporary filename, then write your query to that file and read the result from the same (or a similarly named) file with a different extension. So the FS provides ways to create custom queries - the idea actually came from the LUFS page, which mentions creating an FS as a CGI server; hence the "cgi-bin" prefix here, for simplicity. I think it is similar to how files in the /dev/ directory behave - you open a file and start writing and reading, but the file is actually zero-sized and contains nothing.
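The query-to-path encoding could be as simple as percent-escaping the argument (my sketch; the /cgi-bin/search layout is just the example from above, and nothing here corresponds to a real interface):

```python
from urllib.parse import quote, unquote

def encode_query(function, argument, field):
    """Map e.g. search(packagename, "dependencies") onto a file path.
    The '/' inside package names must be escaped so it does not
    create an extra path component."""
    return "/cgi-bin/%s/%s/%s" % (function, quote(argument, safe=""), field)

def decode_query(path):
    """Inverse mapping, as the filesystem side would perform it."""
    _, _, function, argument, field = path.split("/")
    return function, unquote(argument), field
```

A client would then simply open the encoded path and read the result, which is what makes the scheme transparent to tools that only know how to read files.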


In that case, the API could be written to provide this filesystem and nothing more. If the custom-mapped filesystem is there, it provides search and similar directories, which portage and others can use. If not, everything works as it used to.


So, the filesystem would contain things like this (I call the subdirectory "dev" here):
/dev/search - write your query here and read the result.
/dev/search/searchstring - another way for a user to read some listings with her own custom script.
/portage/directory/category/packagename/depslist.dev - contains the dynamic list of package dependencies.
/dev/version - an integer which grows every time any change to the portage tree is made.

Other functions would then be added over time.

Now, to keep things simple:
1. Create a standard filesystem which can contain the portage tree.
2. Add all necessary notifications to change and update files.
3. Mount this filesystem over the directory where the actual files live - if it is not mounted, portage will hardly notice (so in an emergency, things are just slower). I am not on a Linux box right now, but if I remember correctly, you can navigate into a directory and still use the real files after mounting something else over it.
4. Create indexes and the other stuff.
 
Old 11-24-2008, 02:14 PM
tvali
 
Default search functionality in emerge

There is one clear problem:
1. Some other app opens some portage file.
2. The tree is mounted and indexed.
3. The other app changes this file.
4. The index is out of date.

To prevent this, it should first be suggested that all scripts change the portage tree only after mount. As a defence against those that don't follow the suggestion, portage should simply not use the altered data - portage should rely entirely on its internal index, and when you change some file without updating the index, your change is simply lost. Does this make the portage tree twice as big?


I guess not, because:
- Use flags can be indexed and referred to by number.
- Licence, homepage and similar data need not be duplicated.

Also, since overlay directories are the suggested way anyway, is it necessary to check all files for updates at all? I think that when someone does something wrong, it's OK if everything goes boom; and if someone has update scripts which don't use overlays and the other suggested mechanisms, then adding one more thing that breaks is not bad. Hashing those few files isn't a bad idea, and keeping an internal duplicate of the overlay directory is not so bad either - then you need to run "emerge --commithandmadeupdates" and that's all.


Some things which could be used to boost performance:
- Dependency searches are saved - so that "emerge -p pck1 pck2 pck3" saves data about the deps of those three packages.
- The package name list is saved.
- All packages are given an integer ID.
- A list of all words in package descriptions is saved and connected to the packages' internal IDs. This could be used to make a smaller index file: when I search for "al", all words containing those characters (like "all") are considered, and the -S search runs only on those packages.
- A hash file of the whole portage tree is saved, to detect whether it changed after the last remount.

2008/11/24 tvali <qtvali@gmail.com>

So, mornings are smarter than evenings (an Estonian saying)... At night I thought more about this filesystem idea and found that it actually answers all the needs. Now I have read some messages here and thought about how it could be made really simple, at least as I understand that word. Yesterday I searched for whether custom filesystems can have custom functionality and found nothing, so I wrote that list with a big bunch of classes, which, as I think now, might be overkill.



First, about indexing: if you create neither a daemon nor a filesystem, you can create the commands "emerge --indexon", "emerge --indexoff" and "emerge --indexrenew". The index is then renewed on "emerge --sync" and the like, but when users change files manually, they have to renew the index manually - not much to ask, is it? Someone who opens the cover of her computer takes on the responsibility of knowing some basic things about electricity, and of changing something in the BIOS after adding or removing parts. Maybe it should even be "emerge --commithandmadechanges", which would rebuild the index and do whatever else is needed after handmade changes. More such things might emerge in the future, I guess.



But about filesystem...

Consider this: with a custom filesystem you could have a directory which you cannot list, but from which you can read files. Imagine a function which can encode and decode queries to and from filesystem paths.



If you have a function search(packagename, "dependencies"), you can write it as a file path: /cgi-bin/search/packagename/dependencies - where packagename is encoded by replacing some characters with codes and splitting long strings with /. You could also have an API with one file in a directory: read a temporary filename from it, write your query to that file, and read the result from the same (or a similarly named) file with a different extension. So the FS provides several ways to create custom queries. (The idea came from the LUFS page, which suggests building a filesystem as a CGI server - the "/cgi-bin" prefix here is just to keep the example familiar.) I think it's similar to how files in the /dev/ directory behave: you open a file and start writing and reading, but the file itself is zero-sized and contains nothing.
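The query-as-path encoding could look like this; the /cgi-bin prefix and the helper names are just for illustration, and percent-encoding stands in for the character-replacement scheme described above:

```python
try:
    from urllib.parse import quote, unquote  # Python 3
except ImportError:
    from urllib import quote, unquote        # Python 2

def query_to_path(function, *args):
    """Encode search("app-editors/vim", "dependencies") as a virtual
    path; '/' inside an argument is escaped so it cannot be mistaken
    for a path separator."""
    parts = [quote(str(a), safe="") for a in args]
    return "/cgi-bin/%s/%s" % (function, "/".join(parts))

def path_to_query(path):
    """Decode the virtual path back into (function, args)."""
    parts = path.strip("/").split("/")
    assert parts[0] == "cgi-bin"
    return parts[1], [unquote(p) for p in parts[2:]]
```

Round-tripping is the point: the filesystem daemon can recover exactly the query the caller meant, no matter what characters the package name contains.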



In that case, an API could be written to provide this filesystem and nothing more. If it is a custom-mapped filesystem, it can provide search and similar directories, which can be used by portage and others. If not, everything works as it used to.



So, the filesystem would contain entries like these (I call the subdir "dev" here):
- /dev/search - write your query here and read the result.
- /dev/search/searchstring - another way for a user to read listings from her own custom script.
- /portage/directory/category/packagename/depslist.dev - a dynamic list of the package's dependencies.
- /dev/version - an integer which grows every time any change to the portage tree is made.
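That special-file layout can be prototyped as a plain dispatch table before committing to LUFS or FUSE; `VirtualTree` and its handlers are illustrative names, and the fall-through to the real on-disk tree is only indicated by the exception:

```python
class VirtualTree:
    """Route reads of 'magic' paths to handler functions; any other
    path would fall through to the real portage tree on disk."""
    def __init__(self):
        self.version = 0   # bumped on every change to the tree
        self.handlers = {}

    def register(self, prefix, handler):
        self.handlers[prefix] = handler

    def read(self, path):
        for prefix, handler in self.handlers.items():
            if path == prefix or path.startswith(prefix + "/"):
                # hand the handler whatever follows the prefix
                return handler(path[len(prefix):].lstrip("/"))
        raise IOError("not a virtual file: %s" % path)  # real FS would serve this

tree = VirtualTree()
tree.register("/dev/version", lambda rest: str(tree.version))
tree.register("/dev/search", lambda query: "results for %r" % query)
```

Reading /dev/version then returns the current change counter, and /dev/search/vim runs the (stubbed) search - the same calls a LUFS backend would make from its read() callback.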


Then, other functions would be added eventually.

Now, the simple plan:
1. Create a standard filesystem which can contain the portage tree.
2. Add all necessary notifications to change and update files.
3. Mount this filesystem over the same directory where the actual files are placed - if it's not mounted, portage will hardly notice (so in an emergency, things are just slower). I am not on a Linux box right now, but if I remember correctly, a process can keep using files in the real directory after something else has been mounted on top of it.
4. Create indexes and other stuff.
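The "create indexes" step ties back to the description-word index mentioned earlier in the thread (search for "al" touching only packages whose descriptions contain a matching word). A rough stdlib-only sketch - `build_word_index` and `candidates` are invented names, not portage functions:

```python
from collections import defaultdict

def build_word_index(descriptions):
    """Map each word in a package description to the integer IDs of
    the packages whose description contains it."""
    index = defaultdict(set)
    for pkg_id, text in descriptions.items():
        for word in text.lower().split():
            index[word].add(pkg_id)
    return index

def candidates(index, fragment):
    """Packages with a description word containing `fragment`; only
    these need the expensive full -S search."""
    fragment = fragment.lower()
    hits = set()
    for word, pkg_ids in index.items():
        if fragment in word:
            hits |= pkg_ids
    return hits
```

The word list is much smaller than the tree itself, so scanning it first narrows a -S run from every package down to a handful.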





--
tvali

From some forum: http://www.cooltests.com - if you know English. By the way, over 120 you are very smart, over 140 a genius, and at some 170 your head is just like a total trash can...
 
Old 11-24-2008, 02:15 PM
René 'Necoro' Neumann
 
Default search functionality in emerge


tvali schrieb:
> But about filesystem...
>
> [... snip lots of stuff ...]

What you mentioned for the filesystem might be a nice thing (actually I
started something like this some time ago [1], though it is now dead),
but it does not help with the index/change-detection issue. It is just
another API.

Perhaps "index after sync" is sufficient for most of the userbase - but
especially those who often deal with their own local overlays (like me)
do not want to re-index manually, especially if re-indexing takes a long
time. The best solution would be for portage to detect a) THAT something
has changed and b) WHAT has changed, so that it only has to update those
parts of the index and does not become annoying for the users (remember
the "Generate Metadata" step (or whatever it was called) in older
portage versions, which alone seemed to take longer than the rest of the
sync process).

Regards,
René

[1] https://launchpad.net/catapultfs
 
