Hi all, this is a long-ish post, but I hope you will enjoy it.
The other night I was discussing with persia, apachelogger and
norsetto on #ubuntu-motu, where they were using their awesome MOTU
powers to demolish my poor little theseus package.
While apachelogger was disassembling and devastating the files in
debian/ one by one, persia was vehemently attacking a little awk
script that I used in the package. To make a long story short, I
needed to extract the upstream version number from the
debian/changelog file from inside debian/rules.
My approach when scripting is always to try to use a little hammer
first. If that doesn't work, I'll use a bigger hammer, and if _that_
doesn't work, I'll use a giant hammer.
The first line of the changelog file, which I am interested in, looks
like this:
theseus (1.1.5-0ubuntu1) hardy; urgency=low
Due to the strict format of the changelog file, it will always look
like that, but of course the version numbers etc. can vary. I am
interested in extracting the string "1.1.5". So, first, I pulled out
my little hammer, which consists of a pipeline of standard shell
tools, such as head, tail, cut, sort, etc. The following little
hammer solves the problem.
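A minimal sketch of such a little hammer, assuming nothing but head and cut. This is not the pipeline from the original post, and the function name upstream_version_pipeline is made up for illustration:

```shell
# Illustrative sketch only; not the original pipeline.
# Reads a changelog on stdin and prints the upstream version.
upstream_version_pipeline() {
    head -n 1 |        # keep only the first line
    cut -d'(' -f2 |    # drop everything up to and including the '('
    cut -d'-' -f1      # drop the first dash and everything after it
}

# Sample line taken from the post.
printf '%s\n' 'theseus (1.1.5-0ubuntu1) hardy; urgency=low' |
    upstream_version_pipeline
```

Inside debian/rules the input would presumably come from debian/changelog rather than from printf.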
What's wrong with that? Well nothing, it's just, kinda ugly. We can
do better. I pulled out the bigger hammer, sed, but within a minute
or two it grew sour on me and I took out my favourite big hammer,
awk. Awk is indeed awesome, it's an incredible tool, and if you don't
know it, you're missing out. Awk is extremely powerful, and very easy
to understand. An awk script is basically a series of patterns and
actions, like so:
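A minimal sketch of that shape, with a made-up pattern and made-up sample input (not the example from the original post):

```shell
# General shape of an awk program:   /pattern/ { action }
# Illustration only: for every line that contains "urgency",
# print the second whitespace-separated field.
printf '%s\n' \
    'theseus (1.1.5-0ubuntu1) hardy; urgency=low' \
    '  * a line that does not match the pattern' |
    awk '/urgency/ { print $2 }'
```

Only the first input line matches the pattern, so the action runs once.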
If the pattern - an ordinary regexp - matches a line, the action is
performed on that line. The stuff inside the curly brackets is very
reminiscent of C syntax, so if you're familiar with that, you're off
to the races. In fact, awk is so powerful, that Henry Spencer has
written an nroff formatter, called awf, in the language (sic!). Henry
writes he can't believe he wrote it. Neither can anyone else :-)
There are several flavours of awk. I like gawk, which is the GNU one.
It contains several extensions to the original language. So, here is
the gawk oneliner that extracts the version:
As you see, only the action is used here. We call the function match,
which actually takes over the regexp matching job normally carried
out by the pattern. Let's dissect the regexp.
First, it will try to match the initial left parenthesis, that is
what the \( is for. The next part is (.*), here the parentheses are
not escaped, so they have a special meaning, namely a grouping.
Inside the grouping, we look for an arbitrary run of characters. This
run ends when a dash is encountered. But now the grouping becomes
important, because the match function will place the matched pattern
in arr[0] - this is "(1.1.5-" in this case, and the groupings in the
following array elements. So arr[1] contains the desired string "1.1.5".
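Putting the description above together, the one-liner can be sketched roughly as follows. This is a reconstruction from the prose, not the original text; the function name upstream_version_gawk is invented, and the three-argument form of match is a gawk extension, so gawk is assumed to be installed:

```shell
# Reconstruction from the description; requires GNU awk (gawk).
# Reads a changelog on stdin and prints the upstream version.
upstream_version_gawk() {
    gawk '{
        # arr[0] receives the whole match, arr[1] the first grouping
        match($0, /\((.*)-/, arr)
        print arr[1]
        exit    # only the first line is of interest
    }'
}

# Only run the demonstration when gawk is actually available.
if command -v gawk >/dev/null 2>&1; then
    printf '%s\n' 'theseus (1.1.5-0ubuntu1) hardy; urgency=low' |
        upstream_version_gawk
fi
```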
Well, as mentioned, persia didn't like that too well. You have to
Build-depend on gawk, he said. You can use mawk, said norsetto, it's
part of the basic build environment. Granted, the gawk binary is
293K, and mawk is only 93K. It's saving valuable resources!
Unfortunately, the "match" function syntax was not accepted by mawk,
so I got a syntax error. But, not to worry, of course it can be done.
StevenK said: Why don't you just do: dpkg-parsechangelog | grep
Version | cut -d' ' -f2 ?
Well, at this stage, we were into optimization, finding the very last
CPU cycle and the very last bit of RAM. It was becoming a dogma-film
like situation: we value the minimalist creative ideal. And
dpkg-parsechangelog is a Perl script. Yeeechh.
In this awk dialect, the match function only records where the
matching string begins and how long it is (in the built-in variables
RSTART and RLENGTH). It will not deal with groupings, so the '()'
surrounding the .* are gone. Another function, substr, is used to
extract the wanted version string from the input string ($0).
Mission accomplished. Success!
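A sketch of the match/substr approach just described, using only features that mawk and other POSIX awks share. It is not the original script; the function name upstream_version_awk is made up:

```shell
# Illustrative sketch; not the original script.
# Reads a changelog on stdin and prints the upstream version.
upstream_version_awk() {
    awk '{
        # match() sets RSTART (start index) and RLENGTH (length)
        # of the text matching "(" ... "-"
        if (match($0, /\(.*-/)) {
            # skip the leading "(" and leave out the trailing "-"
            print substr($0, RSTART + 1, RLENGTH - 2)
        }
        exit    # only the first line is of interest
    }'
}

printf '%s\n' 'theseus (1.1.5-0ubuntu1) hardy; urgency=low' |
    upstream_version_awk
```

On the sample line the match is "(1.1.5-", so trimming one character at each end leaves the version.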
But no, no, no. Persia was still not happy. "I'll accept gawk, but
couldn't resist your last comment", he said, referring to a comment I
had made about efficiency. Persia pushed me back to using sed. He said:
"Isn't it just something like sed /^theseuss([d.]*)-.*/1/p |
head -1 "? And indeed, the size of /bin/sed is only 40K. A huge
saving of resources compared to gawk!
I copy-pasted it, but it didn't quite work. Hmm. Back to the drawing
board. Then I came up with another suggestion:
sed 's/.*(//; s/-.*//;q' < changelog
Let's examine the sed script. It is a series of statements, separated
by semicolons: two "substitute" statements and a "quit". The first
substitution deletes everything up to, and including, the '('
(strictly the last '(' on the line, since .* is greedy, but there is
only one here). The next deletes from the first dash to end-of-line.
The third statement quits the program after the first line.
But persia was still not happy. He was using his MOTU powers, driving
me forward, at every step, for perfection! I started to look at
persia's oneliner again, and finally got it twisted so it worked for me:
sed 's/.*(\(.*\)-.*).*/\1/;q' < changelog
Let's analyse the regexp again. We are using "grouping" again, but
unlike awk (unfortunately) sed has a reversed interpretation of
parentheses. In sed, they have to be escaped to signify a grouping.
Inside the first pair of /'s is the regexp that recognizes the whole
first line. There is a grouping around the characters between the '('
and the '-' in that line, in other words, the version. The sed
statement is thus a substitution, where the whole line is replaced by
grouping 1, which is referenced as an escaped 1 (\1). Voila!
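To see how the two sed commands behave beyond the sample line, a small comparison can be sketched. The function names are invented for this sketch, and the second input line is a hypothetical changelog entry whose upstream version itself contains a dash:

```shell
# The two sed commands from the post, wrapped in helper functions
# (the function names are made up for this sketch).
version_two_substitutions() { sed 's/.*(//; s/-.*//;q'; }
version_one_grouping()      { sed 's/.*(\(.*\)-.*).*/\1/;q'; }

# First line: the sample from the post.
# Second line: hypothetical, with a dash inside the upstream version.
for line in 'theseus (1.1.5-0ubuntu1) hardy; urgency=low' \
            'foo (1.0-rc1-0ubuntu1) hardy; urgency=low'
do
    a=$(printf '%s\n' "$line" | version_two_substitutions)
    b=$(printf '%s\n' "$line" | version_one_grouping)
    printf '%s | %s\n' "$a" "$b"
done
```

Both commands agree on the sample line. On the second line the first command cuts at the first dash, while the grouping version, because .* is greedy, keeps everything up to the last dash before the ')'.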
Finally, persia, that relentless seeker of perfection, was satisfied!
The package was uploaded to REVU, quickly sponsored by apachelogger
and norsetto, and is now already accepted for Universe.
So, what can all we MOTU-hopefuls learn from this story? Well, be
patient when you work on your package, don't get frustrated! Have
some fun on the irc channel, show the MOTUs what you can do, and
learn from them! You may even teach them a trick or two ;-)
PS: The entire #ubuntu-motu session can be viewed at http://
Ubuntu-motu mailing list
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-motu
12-05-2007, 03:37 PM
Fun with oneliners
On Wed, Dec 05, 2007 at 04:31:55PM +0100, Kjeldgaard Morten wrote:
1. perl's startup cost compared to awk isn't that high, unless
you require a whole slew of modules. In addition, you should
prefer whatever's more likely to have been recently used, as that's
what already is paged in. If dpkg.* avoids .*awk and sed, but uses
perl throughout, perl's suddenly VERY cheap.
> board. Then I came up with another suggestion:
> sed 's/.*(//; s/-.*//;q' < changelog
> Let's examine the regexp again. It is a series of "substitute"
> statements, separated by semicolons. These are executed on every line
> in the file. The first deletes everything up to, and including, the
> first '('. The next deletes from the dash to end-of-line. The third
> statement quits the program after the first line.
> But persia was still not happy. He was using his MOTU powers, driving
> me forward, at every step, for perfection! I started to look at
> persia's oneliner again, and finally got it twisted so it worked for me:
> sed 's/.*(\(.*\)-.*).*/\1/;q' < changelog
> Let's analyse the regexp again. We are using "grouping" again, but
> unlike awk (unfortunately) sed has a reversed interpretation of
> parentheses. In sed, they have to be escaped to signify a grouping.
> Inside the first pair of /'s is the regexp that recognized the whole
> first line. There is a grouping around the characters between the '('
> and the '-' in that line, in other words, the version. The sed
> statement thus a substitution, where the whole line is replaced by
> grouping 1, which is referenced as an escaped nr. 1. Voila!
you DID notice that the semantics of the 2 sed statements have
a very significant difference for some nice version strings. Plus
nice failures. Such as