Monday, December 28, 2009

Grep 2.5.4 breaks regular expression syntax

Backwards compatibility is for chumps, apparently. GNU grep 2.5.4 fundamentally changes regular expression syntax relative to 2.5.3 and earlier. The transcripts below show the breakage between 2.5.3 (on box1) and 2.5.4 (on box2).

todb@box1:~$ grep --version
GNU grep 2.5.3

Copyright (C) 1988, 1992-2002, 2004, 2005 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

todb@box1:~$ for i in cat parrot dog monkey
> do echo $i | egrep -v '^(cat|dog)'
> done
parrot
monkey
todb@box1:~$

### Meanwhile, on a system with grep 2.5.4 ###

todb@box2:~$ grep --version
GNU grep 2.5.4

Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.


todb@box2:~$ for i in cat parrot dog monkey
> do echo $i | egrep -v '^(cat|dog)'
> done
cat
parrot
dog
monkey
root@box2:~$

The second run fails because the special regex characters, parentheses and pipe, lose their grouping and alternation meanings in 2.5.4. Escaping them restores those meanings, so this works on 2.5.4:
todb@box2:~$ for i in cat parrot dog monkey
> do echo $i | egrep -v '^\(cat\|dog\)'
> done
parrot
monkey

But the same does not work for 2.5.3:
todb@box1:~$ for i in cat parrot dog monkey
> do echo $i | egrep -v '^\(cat\|dog\)'
> done
cat
parrot
dog
monkey
todb@box1:~$

What this all boils down to is that scripts that rely on egrep are going to break pretty horribly and somewhat mysteriously when the underlying grep package gets updated. Even better, there's no common spelling between the two versions that ensures you get what you expect from a regular expression that involves grouping or alternation.
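For this particular filter, one possible way around the problem is to avoid grouping and alternation altogether and pass each alternative as its own -e pattern. The sketch below has not been tried on box1 or box2, and the function name drop_cat_dog is made up for illustration; it only relies on the standard -v and -e options.

```shell
#!/bin/sh
# Hypothetical helper: drop lines that start with "cat" or "dog".
# Each alternative is a separate -e pattern, so the regex needs no
# parentheses or pipe, escaped or otherwise.
drop_cat_dog() {
    grep -v -e '^cat' -e '^dog'
}

for i in cat parrot dog monkey
do echo "$i" | drop_cat_dog
done
```

This should print parrot and monkey. It does not help when a pattern genuinely needs a group, for example one followed by a repetition operator.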

Naughty, naughty, grep maintainers. I'm off to submit a bug report now, but since grep 2.5.4 was released way back in February 2009, I suspect the damage is going to be somewhat unavoidable.

If you know of a way to write a regex that works in both contexts, I'd love to hear it. Switching between single and double quotes makes no difference, so for my purposes, I have to wrap my grep functions up in a check of grep's own version (grep --version | sed 's/[^0-9]*//' | head -1, for the curious).
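A variation on the version check is to probe the behaviour directly instead of parsing a version number: feed the command a known input and see which spelling it honours. This is only a sketch. The function names regex_spelling and cat_dog_pattern are made up for illustration, and it calls grep -E where the transcripts above use egrep, on the assumption that the two behave alike on a given system; substitute egrep if that is what your scripts call.

```shell
#!/bin/sh
# Hypothetical helper: report which spelling of grouping and alternation
# the given grep command honours. Usage: regex_spelling grep -E
regex_spelling() {
    if echo cat | "$@" '^(cat|dog)$' >/dev/null 2>&1; then
        echo plain       # ( ) and | are special, as seen on box1
    elif echo cat | "$@" '^\(cat\|dog\)$' >/dev/null 2>&1; then
        echo escaped     # \( \) and \| are special, as seen on box2
    else
        echo unknown
    fi
}

# Hypothetical helper: choose the pattern spelling for that command.
cat_dog_pattern() {
    case "$(regex_spelling "$@")" in
        plain)   printf '%s\n' '^(cat|dog)' ;;
        escaped) printf '%s\n' '^\(cat\|dog\)' ;;
        *)       return 1 ;;
    esac
}

pattern=$(cat_dog_pattern grep -E) || exit 1
for i in cat parrot dog monkey
do echo "$i" | grep -E -v "$pattern"
done
```

The probe costs two extra grep invocations per script run, but it keeps working if some other build turns out to behave like box2 under a different version number.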


1 Comments:

Blogger A guy said...

The reason most people haven't noticed it is that grep is typically configured with "--without-included-regex" and is instead linked against libpcre. At least that's how it is on the Gentoo and Ubuntu systems I have access to.

11:48 AM  
