Old 06-28-2012, 06:30 PM
Sean Carolan
 
Default Optimizing grep, sort, uniq for speed

This snippet of code pulls an array of hostnames from some log files.
It has to parse around 3GB of log files, so I'm keen on making it as
efficient as possible. Can you think of any way to optimize this to
run faster?

HOSTS=()
for host in $(grep -h -o "[-.0-9a-z][-.0-9a-z]*.com" ${TMPDIR}/* | sort | uniq); do
    HOSTS+=("$host")
done
_______________________________________________
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos
 
Old 06-28-2012, 06:57 PM
 
Default Optimizing grep, sort, uniq for speed

Sean Carolan wrote:
> This snippet of code pulls an array of hostnames from some log files.
> It has to parse around 3GB of log files, so I'm keen on making it as
> efficient as possible. Can you think of any way to optimize this to
> run faster?
>
> HOSTS=()
> for host in $(grep -h -o "[-.0-9a-z][-.0-9a-z]*.com" ${TMPDIR}/* |
> sort | uniq); do
> HOSTS+=("$host")
> done

For one, do the sort in one step: sort -u. For another, are the hostnames
always the same field? For example, if they're all /var/log/messages, I'd
do awk '{print $4;}' | sort -u
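As a rough, self-contained sketch of that idea — assuming a hypothetical syslog-style layout where the hostname is field 4 (the sample lines below stand in for /var/log/messages):

```shell
# Hypothetical sample lines standing in for /var/log/messages.
cat > /tmp/sample.log <<'EOF'
Jun 28 12:00:01 host1.example.com sshd[123]: accepted
Jun 28 12:00:02 host2.example.com sshd[124]: accepted
Jun 28 12:00:03 host1.example.com sshd[125]: accepted
EOF
# Print field 4 (the hostname in this layout); "sort -u" sorts and
# de-duplicates in one pass, replacing the separate sort | uniq stages.
awk '{print $4;}' /tmp/sample.log | sort -u
```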

mark


 
Old 06-28-2012, 07:15 PM
Gordon Messmer
 
Default Optimizing grep, sort, uniq for speed

On 06/28/2012 11:30 AM, Sean Carolan wrote:
> Can you think of any way to optimize this to run faster?
>
> HOSTS=()
> for host in $(grep -h -o "[-.0-9a-z][-.0-9a-z]*.com" ${TMPDIR}/* |
> sort | uniq); do
> HOSTS+=("$host")
> done

You have two major performance problems in this script. First, UTF-8
processing is slow. Second, wildcards are EXTREMELY SLOW!

You'll get a small performance improvement by using a C locale, *if* you
know that all of your text will be ascii (hostnames will be). You can
set LANG either for the whole script or just for grep/sort:

---
$ export LANG=C
---
$ env LANG=C grep ... | env LANG=C sort
---

I don't think you'll get much from running uniq in a C locale.

You'll get a HUGE performance boost from prefixing your search with some
known prefix to your regex. As it is written, your regex will iterate
over every character in each line. If that character is a member of the
first set, grep will then iterate over all of the following characters
until it finds one that isn't a match, then check for ".com". That
second loop increases the processing load tremendously. If you know the
prefix, use it, and cut it out in a subsequent stage.

$ grep 'host: [-.0-9a-z][-.0-9a-z]*.com' ${TMPDIR}/*
$ egrep '(host: |hostname: |from )[-.0-9a-z][-.0-9a-z]*.com' ${TMPDIR}/*
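Combining the two tips — C locale plus a known literal prefix — a self-contained sketch (the "host: " marker and the sample log are hypothetical):

```shell
# Hypothetical sample data; in the real case ${TMPDIR} already exists.
TMPDIR=$(mktemp -d)
printf 'host: web1.example.com up\nhost: web1.example.com up\nunrelated noise\n' > "$TMPDIR/a.log"
# The literal "host: " prefix lets grep reject most lines cheaply, and
# LANG=C avoids UTF-8 processing; sed then strips the prefix back off.
LANG=C grep -h -o 'host: [-.0-9a-z][-.0-9a-z]*\.com' "$TMPDIR"/* |
    sed 's/^host: //' | LANG=C sort -u
```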
 
Old 06-28-2012, 07:22 PM
Gordon Messmer
 
Default Optimizing grep, sort, uniq for speed

On 06/28/2012 12:15 PM, Gordon Messmer wrote:
> You have two major performance problems in this script. First, UTF-8
> processing is slow. Second, wildcards are EXTREMELY SLOW!

Naturally, you should test both on your own data. I'm amused to admit
that I tested my own advice against my mail log and got more improvement
from the LANG setting than from the string prefix. The combination of
the two reduced the time to run your pattern against my mail logs by
about 90%.

 
Old 06-28-2012, 07:27 PM
Sean Carolan
 
Default Optimizing grep, sort, uniq for speed

Thank you Mark and Gordon. The hostnames I needed to collect are in
the same field, at least in the lines of the file that are important,
so I ended up using suggestions from both of you; the code looks like
this now. The egrep is there to make sure whatever is in the 9th
field looks like a domain name.

for host in $(awk '{ print $9 }' ${TMPDIR}/* |
        egrep "[-.0-9a-z][-.0-9a-z]*.com" | sort -u); do
    HOSTS+=("$host")
done

Original script:
real 28m11.488s
user 26m57.043s
sys 0m30.634s

Using awk instead of grepping the entire batch:
real 6m14.949s
user 5m0.629s
sys 0m26.914s

Using awk with export LANG=C:
real 2m50.611s
user 1m20.849s
sys 0m27.366s

Awesome, thanks for the tips!



> For one, do the sort in one step: sort -u. For another, are the hostnames
> always the same field? For example, if they're all /var/log/messages, I'd
> do awk '{print $4;}' | sort -u

> You have two major performance problems in this script. First, UTF-8
> processing is slow. Second, wildcards are EXTREMELY SLOW!

> You'll get a HUGE performance boost from prefixing your search with some
> known prefix to your regex.
 
Old 06-28-2012, 07:37 PM
 
Default Optimizing grep, sort, uniq for speed

Sean Carolan wrote:
> Thank you Mark and Gordon. Since the hostnames I needed to collect
> are in the same field, at least in the lines of the file that are
> important. I ended up using suggestions from both of you, the code is
> like this now. The egrep is there to make sure whatever is in the 9th
> field looks like a domain name.
>
> for host in $(awk '{ print $9 }' ${TMPDIR}/* | egrep
> "[-.0-9a-z][-.0-9a-z]*.com" | sort -u); do
> HOSTS+=("$host")
> done
>
*sigh*
awk is not "cut". What you want is
awk '{if (/[-.0-9a-z][-.0-9a-z]*.com/) { print $9;}}' | sort -u

No grep needed; awk looks for what you want *first* this way.

mark, who learned awk in the very early nineties, writing
100-200 line awk scripts....

 
Old 06-28-2012, 07:50 PM
Sean Carolan
 
Default Optimizing grep, sort, uniq for speed

> *sigh*
> awk is not "cut". What you want is
> awk '{if (/[-.0-9a-z][-.0-9a-z]*.com/) { print $9;}}' | sort -u
>
> No grep needed; awk looks for what you want *first* this way.

Thanks, Mark. This is cleaner code, but it benchmarked slower than the
awk-then-egrep version:

real 3m35.550s
user 2m7.186s
sys 0m27.793s

I'll run it a few more times to make sure that it wasn't some other
process slowing it down.

I really need to brush up some more on my awk skills!
 
Old 06-28-2012, 08:07 PM
Woodchuck
 
Default Optimizing grep, sort, uniq for speed

On Thu, Jun 28, 2012 at 01:30:33PM -0500, Sean Carolan wrote:
> This snippet of code pulls an array of hostnames from some log files.
> It has to parse around 3GB of log files, so I'm keen on making it as
> efficient as possible. Can you think of any way to optimize this to
> run faster?

If the key phrase is *as efficient as possible*, then I would say
you want a compiled pattern search. Lex is the tool for this, and
for this job is not hard. Lex will generate a specific scanner(*)
in C or C++ (depending on what flavor of lex you use). It will probably
be table-based. Grep and awk, in contrast, generate scanners on the
fly, and specifying complicated regular expressions is somewhat
clumsier in grep and awk.

(*) strictly speaking, you are *scanning* not *parsing*. Parsing
involves a grammar, and there's no grammar here. If it develops that
these domain names are context sensitive, then you will need a grammar.

The suggestions of others -- setting LANG, cutting a specific field,
and so on, are all very valuable, and may be *practically* more valuable
than writing a scanner with lex, or could be used in conjunction
with a "proper" scanner.

Note that lex will allow you to use a much better definition for
"domain name" -- such as more than one suffix, names of arbitrary
complexity, names that may violate RFC, numeric type names, case
sensitivity, names that match certain special templates, like
"*.cn" or "goog*.*" and so on.

If you are unfamiliar with lex, note that it is the front end for
many a compiler.

BTW, you could easily incorporate a sorting function in lex that
would eliminate the need for an external sort. This might be done in awk,
too, but in lex it would be more natural. You simply would not
enter duplicates in the tree. When the run is over, traverse the
tree and out come the unique hostnames. I'm assuming you'll have
many collisions. (You could even keep a count of collisions, if you're
interested in which hosts are "popular".) Consider btree(3) for this
or hash(3).
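The collision-counting idea can be approximated without lex — here is a hedged awk sketch (the piped-in sample lines are hypothetical) that tallies how often each hostname in field 9 appears during the scan, then lists the "popular" hosts first:

```shell
# Count occurrences of field 9 in one pass; no external uniq needed.
# The input below is a made-up stand-in for the real log lines.
printf 'a b c d e f g h web1.example.com\na b c d e f g h web2.example.com\na b c d e f g h web1.example.com\n' |
awk '{ count[$9]++ } END { for (h in count) print count[h], h }' |
sort -rn
```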

Dave
--
Programming is tedious, but it is still fun after all these years.
 
Old 06-28-2012, 08:39 PM
 
Default Optimizing grep, sort, uniq for speed

Woodchuck wrote:
> On Thu, Jun 28, 2012 at 01:30:33PM -0500, Sean Carolan wrote:
>> This snippet of code pulls an array of hostnames from some log files.
>> It has to parse around 3GB of log files, so I'm keen on making it as
>> efficient as possible. Can you think of any way to optimize this to
>> run faster?
>
> If the key phrase is *as efficient as possible*, then I would say
> you want a compiled pattern search. Lex is the tool for this, and

That, to me, would be a Big Deal.
<snip>
> BTW, you could easily incorporate a sorting function in lex that
> would eliminate the need for an external sort. This might be done in awk,
> too, but in lex it would be more natural. You simply would not
<snip>
Hello, mark, wake up.

Of course, there's an even easier way, just using awk:

awk '{if (/[-.0-9a-z][-.0-9a-z]*.com/) { hostarray[$9] = 1;}}
     END { for (i in hostarray) { print i;}}'

This dumps the hostnames into an associative array - that's one whose
indices are strings - so duplicates are collapsed as you scan, with no
external sort | uniq needed. (Note that "for (i in hostarray)" iterates
in an unspecified order, so pipe through sort if you need the output
ordered.)
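A runnable sketch of that associative-array approach, with hypothetical sample input piped in (a trailing sort is appended since awk's in-array iteration order is not guaranteed):

```shell
# Each matching line marks its 9th field in the array; END prints the
# unique keys. The printf input is a made-up stand-in for real logs.
printf 'a b c d e f g h web2.example.com\na b c d e f g h web1.example.com\na b c d e f g h web1.example.com\n' |
awk '{if (/[-.0-9a-z][-.0-9a-z]*\.com/) { hostarray[$9] = 1;}}
     END { for (i in hostarray) { print i;}}' | sort
```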

mark

 
Old 06-28-2012, 09:04 PM
Sean Carolan
 
Default Optimizing grep, sort, uniq for speed

>> *sigh*
>> awk is not "cut". What you want is
>> awk '{if (/[-.0-9a-z][-.0-9a-z]*.com/) { print $9;}}' | sort -u

I ended up using this construct in my code; this one fetches out
servers that are having issues checking in with puppet:

awk '{if (/Could not find default node or by name with/) {
    print substr($15, 2, length($15)-2);}}' ${TMPDIR}/* | sort -u
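As a tiny illustration of what that substr() call does — it drops one character from each end of a field, e.g. surrounding quotes (the hostname below is hypothetical):

```shell
# substr(s, 2, length(s)-2) strips the first and last character,
# turning 'db1.example.com' (with quotes) into db1.example.com.
echo "'db1.example.com'" | awk '{ print substr($1, 2, length($1)-2) }'
```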

Thanks again; your knowledge and helpfulness are much appreciated.
 