Sam's Blog
Readable Regexps: Why you should use /x
Date: Wednesday, 2 June 2010, 09:53.
Categories: perl, ironman, regexp, craft, basic, tutorial.
Part of the series: Better Regexps.
Regexps are one of Perl's strongest features, but they're also one of the causes of Perl's greatest criticism: that it looks like line noise.
If you've ever had to examine someone else's regexp, or worse, debug one, you'll probably agree that there's some merit in that criticism.
It doesn't have to be that way, however: there are some simple steps you can take to make your regexps more readable and more maintainable, and this week we look at one of them, the /x modifier.
Let's start with a trio of moderately complicated regexps used by Template::Sandbox, written in the traditional Perl manner:
$single_quoted_text_regexp = qr/\'(?:(?>[^\'\\]+)|(?>(?:\\\\)*\\.))*\'/s;
$double_quoted_text_regexp = qr/\"(?:(?>[^\"\\]+)|(?>(?:\\\\)*\\.))*\"/s;
$matching_round_brackets_regexp =
qr/\((?:(?>[^\(\)\"\']+)|(??{$double_quoted_text_regexp})|(??{$single_quoted_text_regexp})|(??{$matching_round_brackets_regexp}))*\)/s;
Without the nice variable names, it'd take me a minute or two to figure out just what those regexps were trying to do, let alone how they did it. This is despite being the author.
I wouldn't like to make a guess at how long it'd take me to feel confident that they worked, if ever.
This is where the /x modifier comes in handy: it allows you to use white-space for layout in your regexps, rather than having it count as literal white-space characters to be matched.
An example probably explains it better... here are our previous three regexps in rather more readable form:
$single_quoted_text_regexp = qr/
    \'
    (?:
        # Quoteless, backslashless text.
        (?> [^\'\\]+ )
        |
        # Escaped characters.
        (?> (?:\\\\)* \\ . )
    )*
    \'
    /sx;
$double_quoted_text_regexp = qr/
    \"
    (?:
        # Quoteless, backslashless text.
        (?> [^\"\\]+ )
        |
        # Escaped characters.
        (?> (?:\\\\)* \\ . )
    )*
    \"
    /sx;
$matching_round_brackets_regexp = qr/
    \(
    (?:
        # Bracketless, quoteless subtext.
        (?> [^\(\)\"\']+ )
        |
        # Quoted text.
        (??{ $double_quoted_text_regexp }) |
        (??{ $single_quoted_text_regexp })
        |
        # Expression containing sub-brackets.
        (??{ $matching_round_brackets_regexp })
    )*
    \)
    /sx;
As you can see, I've used the /x behaviour to indent the regexps, pairing up brackets and grouping alternative branches in a logical manner.
Each logical section of the regexp is visually isolated, giving you the chance to see it as a self-contained unit, which makes it easier to see what it's doing, whether there are any errors, and how it combines into the entire expression.
Even better, because /x also treats an unescaped # as the start of a comment, there are now comments in the regexps, explaining what each tricky bit is trying to do.
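One caveat worth knowing before converting your own regexps: under /x, unescaped white-space and # in the pattern no longer match themselves, so any that you want matched literally need to be escaped or placed in a character class. Here's a minimal sketch of that behaviour, using made-up sample strings rather than anything from Template::Sandbox:

```perl
use strict;
use warnings;

my $text = 'hello world';    # hypothetical sample input

# Without /x, the space in the pattern is a literal space.
my $plain = $text =~ /hello world/ ? 1 : 0;

# With /x, the unescaped space is layout only, so this pattern
# behaves like /helloworld/ and fails to match.
my $layout_only = $text =~ /hello world/x ? 1 : 0;

# Escaping the space, or putting it in a character class,
# makes it literal again.
my $escaped = $text =~ /hello \  world/x   ? 1 : 0;
my $classed = $text =~ /hello [ ] world/x ? 1 : 0;

# Likewise a literal '#' has to be escaped, otherwise it starts a comment.
my $number = 'item #42' =~ / item [ ] \# (\d+) /x ? $1 : undef;

print "plain=$plain layout_only=$layout_only ",
      "escaped=$escaped classed=$classed number=$number\n";
```

The three regexps above don't need to match any literal white-space or # characters, which is why they could be reformatted without any extra escaping.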
I know which set of regexps I'd rather maintain. Do you?
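If you're converting an existing regexp to /x form, it's cheap to check that the two versions still agree before throwing the old one away. The following is only a sketch: the variable names $compact and $expanded and the sample strings are my own invention for illustration, and a handful of samples is a sanity check, not proof of equivalence.

```perl
use strict;
use warnings;

# Compact form, as in the traditional one-liner.
my $compact = qr/\'(?:(?>[^\'\\]+)|(?>(?:\\\\)*\\.))*\'/s;

# The same regexp laid out with /x.
my $expanded = qr/
    \'
    (?:
        # Quoteless, backslashless text.
        (?> [^\'\\]+ )
        |
        # Escaped characters.
        (?> (?:\\\\)* \\ . )
    )*
    \'
    /sx;

# Hypothetical sample inputs, some of which should match and some not.
my @samples = (
    q{'plain'},
    q{'it\'s escaped'},
    q{'trailing backslash \\\\'},
    q{'unterminated},
    q{no quotes at all},
);

my $mismatches = 0;
for my $sample (@samples) {
    # Anchor both ends so the whole string has to match.
    my $old = $sample =~ /\A$compact\z/  ? 1 : 0;
    my $new = $sample =~ /\A$expanded\z/ ? 1 : 0;
    $mismatches++ if $old != $new;
    printf "%-28s compact=%d expanded=%d\n", $sample, $old, $new;
}
print $mismatches ? "MISMATCH\n" : "both forms agree\n";
```

Any line where the two columns differ points at white-space or a # that was significant in the original and got swallowed by /x.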
This blog entry is part of the series: Better Regexps.
- Anchoring Regexps
- Did you mean +, not *, in that regexp?
- Readable Regexps: Why you should use /x