05-27-2008, 02:49 PM
"John O'Hagan"

Bash, sed: extracting regex subexpressions

Hi,

I've been looking for a command I can use in bash scripts that will do
something like this:

$COMMAND(n[,m...]) (REGEX-1)(REGEX-2)[...] <($FILE)

(MATCH-n)[(MATCH-m)...]

In other words, to output only the parts of a regular expression match which
match specified subexpressions.

As a trivial example:

ifconfig | $COMMAND(2) '(inet addr:)([^ ]+)( .*)'

192.168.1.10

Some invocations of grep, awk and sed use backreferences, but AFAIK you can't
get just the backreferences as output. It would be simple if grep -o could have
subexpression indices, like:

egrep -o(2) '(foo)(.*)(bar)'

to get the matches for (.*); or if awk did something like this:

mawk '/(foo)(.*)(bar)/ {print \2}'

in other words, treating backreferences as pseudo-variables, but it doesn't,
AFAIK.

What I wanted can be done with grep plus sed, or multiple greps, or
awk using regexes as field separators, etc. but I wondered if there was a
neat way to do it with one command and without having to repeat regexes. It's
something that comes up from time to time in admin scripts and I've seen
posts here and there asking this kind of question.
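
For comparison, here is a rough sketch of the "awk using regexes as field separators" workaround. The sample line and the function name extract_ip_awk are made up for illustration, and it assumes the address is followed by "Bcast:" as in old net-tools ifconfig output. Note how the pattern has to be repeated, which is exactly the annoyance described above.

```shell
# Hypothetical sample line in the style of old net-tools ifconfig output.
sample='          inet addr:192.168.1.10  Bcast:192.168.1.255  Mask:255.255.255.0'

# Split each line on either "inet addr:" or the spaces before "Bcast:",
# so the address ends up alone in field 2. The /inet addr:/ guard
# repeats part of the regex to skip lines without an address.
extract_ip_awk() {
    awk -F 'inet addr:| +Bcast:' '/inet addr:/ { print $2 }'
}

printf '%s\n' "$sample" | extract_ip_awk
```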

Anyway, I found one:

sed -nr 's/(foo)(.*)(bar)/\2/p'

The -n stops the lines which don't match the regex from being printed,
backreferences in the replacement let you choose subexpressions and the p
flag at the end prints them.

Not as neat as the imaginary grep or awk features above, because you
have to match the whole line, even what you don't want, but wildcards
make that possible.
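
Applied to the ifconfig example, that might look like the sketch below. The sample line and the function name extract_ip_sed are assumptions for illustration; -E is used in place of -r (GNU sed accepts both). The leading .* and the trailing ( .*) are the wildcards that swallow the unwanted parts of the line.

```shell
# Hypothetical sample line in the style of old net-tools ifconfig output.
sample='          inet addr:192.168.1.10  Bcast:192.168.1.255  Mask:255.255.255.0'

# Match the whole line, then replace it with the second subexpression only.
# -n suppresses non-matching lines; the p flag prints the substituted ones.
extract_ip_sed() {
    sed -nE 's/.*(inet addr:)([^ ]+)( .*)/\2/p'
}

printf '%s\n' "$sample" | extract_ip_sed
```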

Any better ideas?

John


--
To UNSUBSCRIBE, email to debian-user-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
 
05-27-2008, 03:27 PM
"Javier Barroso"

On Tue, May 27, 2008 at 4:49 PM, John O'Hagan <johnmohagan@gmail.com> wrote:
> Hi,
>
> I've been looking for a command I can use in bash scripts that will do
> something like this:
>
> $COMMAND(n[,m...]) (REGEX-1)(REGEX-2)[...] <($FILE)
>
> (MATCH-n)[(MATCH-m)...]

A recent article talks about this in bash 3.0:

if [[ "a,b,c" =~ ^(.).(.) ]]; then echo ok ${BASH_REMATCH[1]} ${BASH_REMATCH[2]} ; fi
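
The same idea applied to the ifconfig example could look roughly like this. It needs bash (not plain sh); the function name extract_ip and the sample line are hypothetical. The regex is kept in a variable because bash 3.2 and later treat a quoted pattern after =~ as a literal string.

```shell
# Print the first parenthesised subexpression of a match, if any.
# BASH_REMATCH[0] holds the whole match, BASH_REMATCH[n] the n-th group.
extract_ip() {
    local line=$1
    local re='inet addr:([^ ]+)'
    if [[ $line =~ $re ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}"
    fi
}

# Hypothetical sample line in the style of old net-tools ifconfig output.
extract_ip '          inet addr:192.168.1.10  Bcast:192.168.1.255'
```

Unlike the sed version, the regex here does not have to cover the whole line.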
 
05-27-2008, 05:27 PM
"Todd A. Jacobs"

On Tue, May 27, 2008 at 02:49:59PM +0000, John O'Hagan wrote:

> in other words, treating backreferences as pseudo-variables, but it
> doesn't, AFAIK.

Use the right tool for the job. If you want to treat grouped matches as
variables, use perl because perl explicitly supports this:

$ echo foobarbaz |
perl -ne '$_ =~ /(foo)(bar)(baz)/; print "$3, $2, $1\n"'
baz, bar, foo

If you don't want to use perl, you'll have to use multiple invocations
of grep's -o flag and build up your own capture variables. Good luck!
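
For what it's worth, the chained grep -o route might look like the following sketch. The function name extract_ip_grep and the sample line are made up for illustration; the second grep assumes the address itself contains no colon.

```shell
# Hypothetical sample line in the style of old net-tools ifconfig output.
sample='          inet addr:192.168.1.10  Bcast:192.168.1.255  Mask:255.255.255.0'

# First grep keeps only "inet addr:<address>"; the second keeps only
# what follows the last colon, i.e. the address.
extract_ip_grep() {
    grep -oE 'inet addr:[^ ]+' | grep -oE '[^:]+$'
}

printf '%s\n' "$sample" | extract_ip_grep
```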

--
"Oh, look: rocks!"
-- Doctor Who, "Destiny of the Daleks"


 
05-28-2008, 09:23 AM
John O'Hagan

> On Tue, May 27, 2008 at 4:49 PM, John O'Hagan <johnmohagan@gmail.com> wrote:
> > Hi,
> >
> > I've been looking for a command I can use in bash scripts that will do
> > something like this:
> >
> > $COMMAND(n[,m...]) (REGEX-1)(REGEX-2)[...] <($FILE)
> >
> > (MATCH-n)[(MATCH-m)...]
>

Thanks for the tips; they all work.

I tried each approach for a time-intensive task: finding palindromes within
words in a dictionary file $DICT, using an identical regex in each case.
Below are the expressions used and the times they took to execute:

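# Keep the regex in a variable: bash 3.2 and later treat a quoted
# right-hand side of =~ as a literal string.
re='(.*((.)(.?)((.)\6?)\4\3).*)'
while read -r i ; do
    [[ $i =~ $re ]] && echo "$BASH_REMATCH ${BASH_REMATCH[2]}"
done < "$DICT"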

#real 1m41.239s
#user 1m17.383s
#sys 0m0.474s

--------


sed -nr 's/(.*((.)(.?)((.)\6?)\4\3).*)/\1 \2/p' $DICT

#real 1m6.151s
#user 0m46.763s
#sys 0m0.151s

-------


perl -ne '$_ =~ /(.*((.)(.?)((.)\6?)\4\3).*)/; print "$1, $2\n"' < $DICT


#real 0m16.381s
#user 0m4.660s
#sys 0m0.482s

--------

So I guess Perl is the clear winner, unless the above comparison is somehow
unfair?
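
To sanity-check the regex itself before timing anything, it can be run on a tiny word list, as in this sketch. The function name find_palindromes and the three test words are made up for illustration; -E stands in for -r, and backreferences inside an extended regex are a GNU sed extension, so this assumes GNU sed.

```shell
# Print each word that contains a palindrome of 3 to 6 letters,
# followed by the palindrome (subexpression 2).
find_palindromes() {
    sed -nE 's/(.*((.)(.?)((.)\6?)\4\3).*)/\1 \2/p'
}

# "hello" contains no such palindrome and should produce no output.
printf '%s\n' noon hello data | find_palindromes
```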

Regards,

John

