Linux Archive > Gentoo > Gentoo User

06-04-2010, 10:52 PM
Harry Putnam

Any users of sphinx here

I've been looking for a Perl-based search tool that uses some kind of
indexing to index and render searchable my home library of software
manuals and the like. Quite a few HTML pages are involved, maybe
15,000-16,000.

Webglimpse is something I've worked with before and know a bit about
but thought I might like to see what else is available.

Googling led me to a tool called Sphinx that apparently is coupled with
a database tool like MySQL. It is advertised as the kind of search
tool I'm after and has a Perl front-end that is also available in portage
(dev-perl/Sphinx-Search).

The trouble is I haven't been able to figure out the first thing about
using it. The overview and introduction, like a lot of such
documents, fail to give a really basic idea of what the tool does.

They call it a `full text search engine', but never really say what
that means.

There are 12-15 FEATURES listed, and none appear to describe sensibly
what they really do.

The FAQ is a string of questions about using SQL... really.

So far I haven't found a good statement of what the darn thing really
does or how to aim it at data.

The manual is probably great if you already know a lot about using
sphinx but very thin for my case.

I've not even been able to get a rough idea of how to aim the darn
thing at the desired (local LAN) web site.

Or, to show how thin it really is or how dumb I really am, I've been
unable to tell if it can even do what I want to do.

I've posted on a Sphinx list on Gmane, but it appears to be only
moderately active and I haven't gotten any replies.

I hoped someone here might be familiar with Sphinx and willing to coach
me a bit or at least let me know if it can even do what I want to do.

I'd also be interested to know about any other Perl-based search tools
that involve indexing and some kind of versatile search query
capability, such as regular expressions.
 
06-06-2010, 04:11 AM
Brandon Vargo

If you can put your HTML pages into a database, Sphinx might be able to
help you with your issue. Basically, what Sphinx does is let you search
databases. You specify one or more SQL sources of data and associated
queries, and Sphinx provides an API (or an emulated SQL server) that
makes searching easy. Sphinx is for full-text database searching; it
does not index files or websites directly. (Note that this is not
strictly true; it can search XML files directly, but you still specify
XML attributes instead of database columns, etc., so it is treating the
XML as a data store and not as a generic document.) I recall reading
that Craigslist uses Sphinx to search their database of listings.

As an example of how it works, suppose I am making a news website and
have a bunch of news posts, each of which has an author, category, and
text. With Sphinx, I can set up a source -- let's call it news_catalog --
that will index this data. news_catalog will be associated with an SQL
query that will allow Sphinx to access the data it needs to index. Let's
use "SELECT id, author, category, text FROM catalog" as our query. Note
that catalog is a table or view in your database, though this query can
also use complex joins, etc, as long as the database supports it. Via
the Sphinx API, I can say I want to search for "Europe | America" and it
will return a list of news articles containing the terms Europe,
America, or both, as a pipe is the OR operator. It actually returns a
list of ids which correspond to the id I specified in my query; a unique
key is always the first column in the query. My application is
responsible for fetching the actual data from the original database
using that id and presenting the data in a useful way to the user.
Extended query syntax allows for other boolean operators, searching
specific fields, strict order, exact match, field start/end, etc. The
documentation has lots of examples; look at
http://www.sphinxsearch.com/docs/current.html for the current reference
manual.
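
To make the "search returns ids, the application fetches the rows" flow concrete, here is a minimal sketch in Python. It is not Sphinx and does not use the Sphinx API: it assumes a throwaway in-memory SQLite table named catalog (with made-up rows) and a toy word index, purely to illustrate the division of labour described above. The helper names build_index, search_or and fetch_rows are hypothetical.

```python
import sqlite3

# Hypothetical stand-in for the "catalog" table from the example query.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE catalog (id INTEGER PRIMARY KEY, author TEXT, category TEXT, text TEXT)"
)
db.executemany(
    "INSERT INTO catalog VALUES (?, ?, ?, ?)",
    [
        (1, "alice", "world", "Trade talks resume in Europe"),
        (2, "bob", "sports", "Local team wins the cup"),
        (3, "carol", "world", "America and Europe sign accord"),
    ],
)


def build_index(conn):
    """Map each lowercased word to the set of row ids that contain it."""
    index = {}
    for row_id, author, category, text in conn.execute(
        "SELECT id, author, category, text FROM catalog"
    ):
        for word in f"{author} {category} {text}".lower().split():
            index.setdefault(word, set()).add(row_id)
    return index


def search_or(index, query):
    """Return sorted ids matching any term of an 'a | b' style query."""
    ids = set()
    for term in query.split("|"):
        ids |= index.get(term.strip().lower(), set())
    return sorted(ids)


def fetch_rows(conn, ids):
    """The application's job: turn the returned ids back into rows."""
    if not ids:
        return []
    marks = ",".join("?" * len(ids))
    return conn.execute(
        f"SELECT id, author, text FROM catalog WHERE id IN ({marks}) ORDER BY id",
        ids,
    ).fetchall()


index = build_index(db)
hits = search_or(index, "Europe | America")
for row in fetch_rows(db, hits):
    print(row)
```

The point of the sketch is only the shape of the exchange: the index knows ids and words, and everything the user actually sees comes from a second query against the original database.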

If you have a bunch of HTML files on a disk or website that you want to
index and search, I do not think Sphinx is the software you want. Yes,
you could load your data into a database and then use Sphinx, but that
does not seem like the best solution. Sphinx provides the API for use in
your application; it does not provide a user interface. As an
alternative, I recommend you look at something like ht://Dig
(htdig.org), which will search HTML pages directly in addition to PDF,
Word, Excel, Powerpoint, etc with the help of external converters. It
also includes a user interface. After glancing at webglimpse, with which
I am not familiar, it looks like it does something similar to ht://Dig.

Regards,

Brandon Vargo
 
06-06-2010, 08:37 PM
Harry Putnam

Brandon Vargo <brandon.vargo@gmail.com> writes:

> As an example of how it works, suppose I am making a news website and
> have a bunch of news posts, each of which has an author, category, and

Thank you, Brandon, for such a nice thorough answer... Yeah, looks like
I'm barking up the wrong tree.

I know about htdig, though not much. As far as I remember it didn't have
much in the way of a search interface... something like Google. Whereas
webglimpse has a rich set of search terms, including some regular
expressions and regular-expression-like operators... all the same
tools as glimpse (and agrep). So many, in fact, that it can be a bit
daunting to try to become proficient with them.

Maybe you can enlighten me about htdig... it's been years since I tried
it.

Even webglimpse fails, though, when it comes to searching for
snippets of code like Perl or C. Nobody wants the sloth and CPU
overhead of serious regular expression searching, and that may be the
only (good) way to search for things like /,{,$,(,[,!,@ etc. that
one would need in order to find types of code snippets. Also, I guess it
would be pretty hard to build an index with that in mind.

I keep thinking some good developer will come out with a tool aimed at
websites like those found on a home LAN (in scope)... where regular
expression searching wouldn't be so far out.

Or maybe there just is no herd of people who are competent in regular
expression searching, and hence no audience for such a tool.
 
06-08-2010, 05:07 AM
Brandon Vargo

On Sun, 2010-06-06 at 15:37 -0500, Harry Putnam wrote:
> Brandon Vargo <brandon.vargo@gmail.com> writes:
>
> > As an example of how it works, suppose I am making a news website and
> > have a bunch of news posts, each of which has an author, category, and
>
> Thank you, Brandon, for such a nice thorough answer... Yeah, looks like
> I'm barking up the wrong tree.
>
> I know about htdig, though not much. As far as I remember it didn't have
> much in the way of a search interface... something like Google. Whereas
> webglimpse has a rich set of search terms, including some regular
> expressions and regular-expression-like operators... all the same
> tools as glimpse (and agrep). So many, in fact, that it can be a bit
> daunting to try to become proficient with them.
>
> Maybe you can enlighten me about htdig... it's been years since I tried
> it.

Sorry, it has been a while since I have used it as well.

> Even webglimpse fails, though, when it comes to searching for
> snippets of code like Perl or C. Nobody wants the sloth and CPU
> overhead of serious regular expression searching, and that may be the
> only (good) way to search for things like /,{,$,(,[,!,@ etc. that
> one would need in order to find types of code snippets. Also, I guess it
> would be pretty hard to build an index with that in mind.

Certainly it is a hard problem to index for arbitrary regular
expressions. Even Google's code search [1] is not terribly good at it.
However, I also do not think it is something most people will want to
do. When I go to find code that I have written, I do not remember
variable names, lines of code, etc that I can match with a regular
expression. Thus, that kind of search is pointless for me. I remember
what the code does, the project for which I wrote the code, and
approximately where the code is located within the project. I remember
function calls for libraries that I probably used. If I cannot find what
I am looking for, I use grep on the name of a function call I remember,
or I have a ctags file containing all the information I need about
function definitions.

I suggest, for code, you just organize whatever you have in a sane
directory structure. Or, even better, you can put your code in a central
place using a version control system (SVN, git, hg, CVS, etc), where it
is organized in a way that makes sense to you. After all, it sounds like
this is for your personal use, so use something that makes you happy.
Personally, I have a series of git repositories that I use to keep track
of my code and some of my documents.

> I keep thinking some good developer will come out with a tool aimed at
> websites like those found on a home LAN (in scope)... where regular
> expression searching wouldn't be so far out.
>
> Or maybe there just is no herd of people who are competent in regular
> expression searching, and hence no audience for such a tool.

I do not think the problem is a lack of people with knowledge of regular
expressions, but rather the lack of a need for such a product. Many
people, at least those I know, do not think "Oh, I want to search for
xyz; I'll write a regular expression to search for what I want across
all my data." Instead, they have a directory structure of organized
documents that makes finding that particular document or series of
documents on xyz easy. When that fails, there are the find and locate
commands for terminal users, which support regex searching in filenames,
desktop search tools such as Beagle [2], and of course grep.

Certainly it would be really nice to have a search tool that would
produce results for "show me all the code on this computer used for
validating HTTP POST requests in Python for a submitted HTML form,
preferably using Django." If you find one, let me know, as I would love
to try it. In the meantime, `grep -RE 'form|POST'
projects/python/django/project_xyz` works fairly well once I figure out
that what I want is probably in that directory. (grep -E, or egrep,
supports extended regular expressions; -R is recursive.) Or, I just go
search through the documentation, if available.
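
For anyone who would rather script this than remember grep flags, the same kind of recursive regex search can be sketched in Python. This is only an approximation under a few assumptions: Python's re syntax is close to, but not identical to, grep's extended regular expressions; the function name grep_recursive is made up; and the demo runs against a temporary directory with invented files rather than a real project tree.

```python
import os
import re
import tempfile


def grep_recursive(pattern, root):
    """Yield (path, line_number, line) for each line under root matching pattern."""
    regex = re.compile(pattern)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if regex.search(line):
                            yield path, number, line.rstrip("\n")
            except OSError:
                continue  # unreadable file: skip it and keep going


# Small self-contained demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "app"))
    with open(os.path.join(root, "app", "views.py"), "w", encoding="utf-8") as f:
        f.write("def handle(request):\n    if request.method == 'POST':\n        pass\n")
    with open(os.path.join(root, "notes.txt"), "w", encoding="utf-8") as f:
        f.write("nothing relevant here\n")
    results = list(grep_recursive(r"form|POST", root))
    for path, number, line in results:
        print(f"{os.path.relpath(path, root)}:{number}:{line}")
```

Like grep, this reads every file on every search, so it only stays pleasant once the search has been narrowed to a likely directory.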

Maybe someone here can suggest something better for code searching.
For everything else, use Beagle/something similar or a web-based search
engine you can install locally if you really want to be able to search
through your documents. Maybe there is something better for that too; I
do not know. I still use directories and git repositories in said
directories, where appropriate, as it is more efficient for me. Of
course your mileage may vary.

[1]: http://www.google.com/codesearch
[2]: http://beagle-project.org/

Regards,

Brandon Vargo
 
06-12-2010, 10:18 PM
Harry Putnam

Brandon Vargo <brandon.vargo@gmail.com> writes:

> do. When I go to find code that I have written, I do not remember
> variable names, lines of code, etc that I can match with a regular
> expression. Thus, that kind of search is pointless for me. I remember
> what the code does, the project for which I wrote the code, and
> approximately where the code is located within the project. I remember
> function calls for libraries that I probably used. If I cannot find what
> I am looking for, I use grep on the name of a function call I remember,
> or I have a ctags file containing all the information I need about
> function definitions.

Again, thanks for a thorough answer... just a note on the above
comment.

I often find myself searching for a technique... NOT variable names or
sub function names because who knows what I might call stuff in any
particular script.

For example... I once was shown how to compile an element of @ARGV
as a regular expression in Perl, in one step:

my $what_re = qr/@{[shift]}/;

I liked that and have used it many times... but only recently could I
remember at a moment's notice how to write it.

I used `grep -r' or `egrep -r' as you've mentioned; now I use
my own Perl script (recently written [since posting the original query])
that uses regex and File::Find, where the user feeds the regex and the
approximate location to begin the search on the command line.

In my case that would be an NFS share, /projects/reader/perl, which is
kept in my ENV as $perlp.

So:
script.pl 'qr/.*?@' $perlp

Will find a number of examples of using that particular technique.

What prompted my query here, was looking for a way to search several
thousand html pages that are a collection of Perl books on CD.

These are 2 of the O'Reilly Perl CDbooks. (I spent $150 for the first
one, and I think the second was a little cheaper; it was years ago.) The
books on CD have built-in search tools, but those only work on a
Windows OS and aren't up to much anyway.

I've since downloaded the data from the CDs onto an OpenSolaris ZFS
server and access them through NFS.

I was attempting to use `webglimpse'
(http://webglimpse.net/download.php) for the task, hence the interest
in indexing. But I suspect a search for a particular technique I read
about, but have forgotten how to code, would be best searched for
using regular expressions. This would be long after I've forgotten
which section or even which book I read about it in.

The tool I've written can be made to strip HTML if necessary and can
be made to include (by regex) only certain kinds of filenames, but it
uses no index and so is pretty slow... but it is still very useful
and is fully Perl-regex capable.
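
The two options mentioned here, stripping HTML and including files by regex, can be sketched roughly as follows. This is a Python illustration, not the Perl script itself; the names TextExtractor, strip_html and wanted, the sample page and the filenames are all invented for the example.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def strip_html(markup):
    """Return the text content of an HTML string with tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


def wanted(filename, include_re):
    """True if the filename matches the user-supplied include regex."""
    return re.search(include_re, filename) is not None


page = "<html><body><h1>Hashes</h1><p>Use <b>%ENV</b> &amp; friends.</p></body></html>"
text = strip_html(page)
print(text)

names = ["ch01_01.htm", "index.html", "cover.gif"]
print([n for n in names if wanted(n, r"\.html?$")])
```

Using a real parser rather than a regex to drop the tags is a design choice: it also decodes entities such as &amp;, so a search for a literal & still finds the text.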

It returns up to 4 lines of context, 2 above the line with the hit,
and 1 below (where possible), along with the page number and the
absolute filename where the hit was found.
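
The context window described here (two lines above the hit, one below, clipped at the ends of the file) is simple to express. A minimal Python sketch follows; search_with_context is a hypothetical name, and the sample lines are abridged from the index page that appears in this post's sample output.

```python
import re


def search_with_context(lines, pattern, before=2, after=1):
    """Return one block per hit: a list of (line_number, text) pairs."""
    regex = re.compile(pattern)
    blocks = []
    for i, line in enumerate(lines):
        if regex.search(line):
            start = max(0, i - before)              # clip at top of file
            stop = min(len(lines), i + after + 1)   # clip at end of file
            blocks.append([(n + 1, lines[n]) for n in range(start, stop)])
    return blocks


sample = [
    "dereferencing with",
    "modulus operator",
    "prototype symbol (hash)",
    "%= (assignment) operator",
    "unrelated line",
]
for block in search_with_context(sample, r"hash"):
    for number, text in block:
        print(number, text)
    print("---")
```

Each hit gets its own block, so two hits on neighbouring lines produce overlapping context, which is the simplest behaviour rather than necessarily what the original script does.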

Here is an example search being timed:
------- --------- ---=--- --------- --------
(I purposely picked something that would be found many times)

time ./pgrep3 /var/www/localhost/htdocs/lcweb/cdbk+/AllPerl/ hash

(So above we are searching a collection from the O'Reilly CDbooks for
the term `hash'.)

(Just one example of the thousands of lines returned)
[...]

/var/www/localhost/htdocs/lcweb/cdbk+/AllPerl/perlnut/index/idx_p.htm
135 dereferencing with : [104]4.8.2. Dereferencing
136 modulus operator : [105]4.5.3. Arithmetic Operators
137 prototype symbol (hash) : [106]4.7.5. Prototypes
138 %= (assignment) operator : [107]4.5.6. Assignment Operators
---

[...]

Total files searched: 522
Total lines searched: 431689
real 1m48.344s
user 1m25.234s
sys 0m14.336s

------- --------- ---=--- --------- --------
Almost 2 minutes to search 431689 lines

So it is slow, maybe even very slow by comparison to tools using an
indexed search.
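
To put a number on "slow", the figures reported by time can be turned into a throughput. The values are copied from the output in this post; reading the CPU share as "the search is mostly CPU-bound" is an inference, not something measured separately.

```python
# Figures copied from the timed run reported in this post.
lines_searched = 431689
real_s = 60 + 48.344   # real 1m48.344s
user_s = 60 + 25.234   # user 1m25.234s
sys_s = 14.336         # sys  0m14.336s

lines_per_second = lines_searched / real_s
cpu_fraction = (user_s + sys_s) / real_s

print(f"{lines_per_second:.0f} lines/s")
print(f"CPU busy for {cpu_fraction:.0%} of wall-clock time")
```

That works out to roughly 4,000 lines per second, with the CPU busy for about 92% of the elapsed time, which suggests the regex matching itself, rather than waiting on the disk or network, accounts for most of the delay.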

I don't really mind the sloth, but of course it would not scale
much beyond the scope of what I'm doing with it. I do like the
precision search capability and plenty of context. All of the above is
also possible with grep, egrep... and friends too, of course, but only
with quite a lot more command-line manipulation and piping.

I'm currently working on using something like this basic search script
to return URLs linking to the page and lines found, and working the
whole thing into something that can be carried out with a web browser.

Something pretty similar to webglimpse, I guess, but without the
benefit of indexing.

Also, webglimpse relies on glimpse, which is not capable of full regex
search but does have a rich mixture of regex, regex-like and boolean
query capability.
 
06-12-2010, 11:05 PM
Harry Putnam

Brandon Vargo <brandon.vargo@gmail.com> writes:

> [1]: http://www.google.com/codesearch
> [2]: http://beagle-project.org/

Acckk, I forgot to thank you for the URLs you posted... thanks.
 
