Sam's Blog
Anchoring Regexps
Date: Thursday, 22 April 2010, 15:14.
Categories: perl, ironman, regexp, craft, optimization, basic, tutorial.
Part of the series: Better Regexps.
A common mistake I find whenever I look at someone else's regexps, is a failure to anchor the regexp.
This is often, in my experience, the single biggest thing you can do to improve the performance of a regexp: it's one of those things you should learn to do in every regexp where applicable, which should be almost every regexp unless you're specifically looking for "something somewhere in the middle but I don't know where".
So, what is anchoring, and why does it have such a big impact?
Anchoring gets its name because it fixes the regexp to an unmoving position within the target string, much like a physical anchor fixes the position of a ship.
In particular it means anchoring to either the start of the
string or the end of the string using the ^
and $
tokens.
(If you're using /m
mode then ^$
match the start or
end of a line, but for simplicity I'll assume you're using the
default behaviour of //
for the rest of this article.)
So, by way of some examples:
'123abc456' =~ /123/; # Matches
'123abc456' =~ /^123/; # Also matches
'123abc456' =~ /abc/; # Matches
'123abc456' =~ /^abc/; # Doesn't match - abc isn't at the start of the string
'123abc456' =~ /456/; # Matches
'123abc456' =~ /456$/; # Also matches
'123abc456' =~ /abc/; # Matches
'123abc456' =~ /abc$/; # Doesn't match - abc isn't at the end of the string
'123abc456' =~ /def/; # Doesn't match
'123abc456' =~ /^def/; # Also doesn't match
'123abc456' =~ /def$/; # Still doesn't match
Now, you can see from those examples that the anchoring lets you make sure that your match is in the position you expected it to be, and that's certainly one important use of anchoring: if you've got a pattern that you want to match at the start or end of a string you should always anchor it to be certain that that's the place you matched.
That's a bit of a mouthful, an example should make it clear of how an unanchored regexp can go unintentionally wrong.
If we wanted to extract the version number from a list of packages:
'perl version:5.12.0' =~ /version:(\S+)/;
print "$1\n";
# Gives: 5.12.0
So far so good, except... whoops:
'version::contrived.pm version:1.0.3' =~ /version:(\S+)/;
print "$1\n";
# Gives: :contrived.pm
Oh dear, that's not what we wanted at all. Because we didn't anchor our regexp, it matched the first "version" it found, when we wanted the last one. We didn't realise it, but our first regexp was ambiguous.
If we had anchored our regexp it wouldn't have been ambiguous:
'perl version:5.12.0' =~ /version:(\S+)$/;
print "$1\n";
# Gives: 5.12.0
'version::contrived.pm version:1.0.3' =~ /version:(\S+)$/;
print "$1\n";
# Gives: 1.0.3
So, to return to my point, anchoring's primary purpose is to say "I only want to match if the pattern is at the start or end of the string, I don't want to match it at any just any old random place in the middle of the string".
That isn't the whole story however.
Unfortunately the natural human inclination is to focus on what that statement says the regexp does, since our focus on writing a program in the first place is to "do something", and it's very easy to miss an important point.
The biggest behavioural change in the regexp is actually what it doesn't do:
It does only match at the start or end of the string.
It doesn't match anywhere except the start or end of the string.
Now, those might seem like they're saying the same thing, and in fact they actually are, but the second phrasing is the key to understanding why anchoring a regexp is important even when you don't care about disambiguating the pattern.
If you have a 30-character long string then the unanchored regexp can potentially match at 30 places along that string, and the computer has to check each one.
Of those 30 characters, only one is at the start, only one is at the end and 28 are "in the middle".
Simply put, if you can say "It doesn't match anywhere except the start of the string" you're saying 29 out of 30 possibilities don't need to even be looked at.
If your string is 100 characters long, you're talking 99/100, and so on.
Now, I don't know about you, but I'll take a "does 99% less work" solution any time I can get it.
The real picture is slightly more complicated than that, a few example points:
You don't necessarily have 30 possible matching points in a 30 character string, it's more like "30 minus the minimum number of characters the pattern can match", and in some situations the perl regexp compiler is smart enough to figure this out.
This doesn't significantly alter the point though: on a three character long pattern you're still talking 27 possibilities on a 30 character string, and that 27 being reduced to 1 - still a clear win.
You're optimizing the "failure case", so if the first thing the regexp does is find a match and it's the right match, you're not getting any improvement.
However, for this "no gain" to happen you need to be matching at the very start of the string and you need to be matching on every string you test.
Secondly, the "failure case" you're optimizing is "failure of the regexp engine to find a match at that point in the string", not simply failure to find a match in the entire string - if you're looking for a pattern at the end of the string then even on a successful pattern match the regexp engine will have tried and failed at every point in the string up until the one it succeeded at, right at the end of the string.
If you'd anchored it with
$
the regexp engine would have just leapt to the end and tried there.
Those points don't change the overall conclusion in any way: if you can anchor, you should anchor.
(Within reason: don't stick pointless chunks of wild-card matches on the front or end of your pattern just so you can anchor it.)
Even in the small handful of situations where it will not gain you anything, it won't harm anything, and it's a good habit to form.
And in the two specific situations of you largely failing to
find a match on your input (think grep
, it only matches
a few lines of the thousands it looks at), or of you wanting to
match at the end of your input; you're probably getting a 90%+
speed improvement.
Anchored regexps are your friend.
Footnote
This article was partly inspired by Bryce Verdier's recent blog "Getting all installed software, and their versions, on Debian/Ubuntu", and normally I'd mention this prominently at the top of the article.
I didn't want to seem like I was singling him out specifically for criticism however (or anyone for that matter), as I've said this is pretty common and Bryce just happened to provide inspiration while I was flailing around for an idea of what to write this week.
Thanks Bryce. ;)