07-13-2008, 09:05 AM
"j t"

Tool to show maximal repeating patterns / structure in (text?) data

Hi all,

Does anyone know of a tool which will analyse a block of data and find
structure / repeating patterns in it, and then somehow show that
structure to the user?

As an example, pretend I give it the following paragraph of text (but
I don't tell it that the following paragraph contains a string
repeated 4 times):

<snip>
Support for Debian users who Support for Debian users who Support for
Debian users who Support for Debian users who
</snip>

I'd like this tool to tell me that the previous paragraph contains the
string "Support for Debian users who " 4 times (and I'd like the tool
to have worked that out on its own).

I realize that this example is trivial. I'd also like this tool to do
things which are more complicated, but since I can't find anything
that even helps me with my previous example, that will do for the time
being.
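For the trivial example above, one rough way to have a program work the answer out on its own is a brute-force search for back-to-back ("tandem") repeats over whitespace-separated words. This is only a sketch, not an existing tool: the name `find_tandem_repeat` is made up for illustration, the search is slow on large inputs, and it assumes that line breaks can be treated as ordinary spaces.

```python
def find_tandem_repeat(tokens, min_count=2):
    """Return (start, unit, count) for the back-to-back repeat that covers
    the most tokens, or None if nothing repeats at least min_count times.
    Ties go to the shortest unit. Hypothetical helper, brute force."""
    best = None
    best_covered = 0
    n = len(tokens)
    for period in range(1, n // min_count + 1):
        for start in range(0, n - period * min_count + 1):
            unit = tokens[start:start + period]
            count = 1
            # Extend while the next block of `period` tokens equals the unit.
            while tokens[start + count * period:start + (count + 1) * period] == unit:
                count += 1
            covered = period * count
            if count >= min_count and covered > best_covered:
                best, best_covered = (start, unit, count), covered
    return best


text = """Support for Debian users who Support for Debian users who Support for
Debian users who Support for Debian users who"""

result = find_tandem_repeat(text.split())
if result:
    start, unit, count = result
    print(f'"{" ".join(unit)}" repeated {count} times, starting at word {start}')
```

On the snippet above this reports the five-word unit repeated 4 times. Working on words rather than characters sidesteps the wrapped line in the middle of the paragraph.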

To preemptively answer the question "why do you want it / what is it
you're trying to achieve", I have a log of a dhcp conversation which
contains what I think is a repeated DHCPDISCOVER stanza. Rather than
the manual copy/paste/diff cycle, I'd like this tool to look at the
log and tell me: "Yup, you've got a stanza/paragraph repeated 4
times".
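For the log case, a minimal sketch of the "tell me a stanza is repeated" idea, assuming stanzas are separated by blank lines and that repeats are character-for-character identical (no differing timestamps). The function name `repeated_stanzas` and the sample text are placeholders, not real dhcp output.

```python
import re
from collections import Counter


def repeated_stanzas(log_text):
    """Split on blank lines and return {stanza: count} for every stanza
    that appears more than once."""
    stanzas = [s.strip() for s in re.split(r"\n\s*\n", log_text) if s.strip()]
    counts = Counter(stanzas)
    return {stanza: n for stanza, n in counts.items() if n > 1}


# Placeholder input standing in for a real log.
sample = (
    "stanza A line 1\nstanza A line 2\n\n"
    "stanza B\n\n"
    "stanza A line 1\nstanza A line 2\n"
)

for stanza, n in repeated_stanzas(sample).items():
    print(f"repeated {n} times:\n{stanza}\n")
```

If the stanzas differ only in timestamps or transaction IDs, those fields would have to be stripped or masked first.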

I might be butting up against the edge of what's theoretically
possible ("computer science"-wise) but I think that my requirements
have something to do with lossless compression algorithms. Perhaps I
should start reading the source code for gzip/bzip2...?

Thanks for your help, Jaime :-)


--
To UNSUBSCRIBE, email to debian-user-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
 
07-13-2008, 11:14 AM
"Javier Barroso"

On Sun, Jul 13, 2008 at 11:05 AM, j t <mark473@gmail.com> wrote:
> To preemptively answer the question "why do you want it / what is it
> you're trying to achieve", I have a log of a dhcp conversation which
> contains what I think is a repeated DHCPDISCOVER stanza. Rather than
> the manual copy/paste/diff cycle, I'd like this tool to look at the
> log and tell me: "Yup, you've got a stanza/paragraph repeated 4
> times".

You may want to take a look at the logcheck package, or write your own
perl/awk program.
 
07-13-2008, 02:26 PM
Dave Sherohman

On Sun, Jul 13, 2008 at 10:05:23AM +0100, j t wrote:
> I might be butting up against the edge of what's theoretically
> possible ("computer science"-wise) but I think that my requirements
> have something to do with lossless compression algorithms. Perhaps I
> should start reading the source code for gzip/bzip2...?

You're on the right track here, at least for getting as far as detecting
maximal-length identical strings. As I recall, Huffman encoding should
be what you're looking for.

Another place to look would be search indexing algorithms. I used to
know a guy who'd done graduate work in that area and, from talking to
him about it, it sounded like this is one of their key techniques.

Although, if you're just looking for identical log entries (rather than
arbitrary repeated segments in freeform text), using awk/sed to strip
out timestamps, then feeding the result through `sort | uniq -cd` should
handle that case. (There are already standard log analysis packages
which do essentially this, but I can't think of any names at the
moment.)
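The same strip-timestamps-then-count idea can be sketched in Python for anyone who would rather not chain awk/sed. The timestamp pattern below assumes a syslog-style prefix such as "Jul 13 09:05:01 "; that format, the name `duplicate_lines`, and the sample lines are all assumptions for illustration, so adjust them to the real log.

```python
import re
from collections import Counter

# Assumed syslog-style prefix, e.g. "Jul 13 09:05:01 "; adjust for the real log.
TIMESTAMP = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+")


def duplicate_lines(lines):
    """Strip the leading timestamp from each line, then return (count, line)
    pairs for lines seen more than once, ordered by line text, roughly what
    `sort | uniq -cd` would print."""
    counts = Counter(TIMESTAMP.sub("", line.rstrip("\n")) for line in lines)
    dupes = [(n, line) for line, n in counts.items() if n > 1]
    return sorted(dupes, key=lambda pair: pair[1])


# Placeholder lines, not real log output.
sample = [
    "Jul 13 09:05:01 message A",
    "Jul 13 09:05:04 message B",
    "Jul 13 09:05:09 message A",
    "Jul 13 09:05:17 message A",
]

for n, line in duplicate_lines(sample):
    print(f"{n:7d} {line}")
```

This only catches whole lines that become identical once the timestamp is gone; it says nothing about repeated segments inside a line.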

--
News aggregation meets world domination. Can you see the fnews?
http://seethefnews.com/


 
07-13-2008, 06:09 PM
"j t"

On Sun, Jul 13, 2008 at 3:26 PM, Dave Sherohman <dave@sherohman.org> wrote:
> You're on the right track here, at least for getting as far as detecting
> maximal-length identical strings. As I recall, Huffman encoding should
> be what you're looking for.
>
> Another place to look would be search indexing algorithms. I used to
> know a guy who'd done graduate work in that area and, from talking to
> him about it, it sounded like this is one of their key techniques.

Dave,

I've briefly scanned Wikipedia's pages on Huffman coding, DEFLATE,
LZ77 and LZ78, LZW, etc., and that's definitely what I'm looking for.
Wikipedia's entry on "Dictionary coder" even contains interesting
example algorithms. I can sense a fascinating project coming on...
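As a taste of what a dictionary coder does, here is a minimal LZ78-style sketch, written from the general description of the algorithm rather than taken from any particular library. Each output pair points at a previously seen phrase plus one new character, so repeated text shows up as references to ever longer dictionary entries. The function names are made up, and this is nowhere near what gzip or bzip2 actually implement.

```python
def lz78_encode(text):
    """Return a list of (phrase_index, next_char) pairs.
    Index 0 stands for the empty phrase."""
    dictionary = {"": 0}
    output = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch  # keep extending the longest known phrase
        else:
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:  # flush a trailing phrase that is already in the dictionary
        output.append((dictionary[phrase], ""))
    return output


def lz78_decode(pairs):
    """Rebuild the original text from the pairs produced by lz78_encode."""
    phrases = [""]
    out = []
    for index, ch in pairs:
        entry = phrases[index] + ch
        phrases.append(entry)
        out.append(entry)
    return "".join(out)


text = "Support for Debian users who " * 4
pairs = lz78_encode(text)
print(len(text), "characters ->", len(pairs), "pairs")
assert lz78_decode(pairs) == text
```

The dictionary that the encoder builds along the way is the interesting part for pattern hunting: its longest entries are the longest substrings that had already occurred earlier in the input.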

Thank you for the pointers, Jaime


 
07-14-2008, 12:09 AM
"Sam Kuper"

You might also want to try something like ANTLR or LEX/YACC (FLEX/BISON if using FLOSS). Might be overkill, though.

Sam
 
