Sam's Blog

Readable Regexps: Why you should use /x

Date: Wednesday, 2 June 2010, 09:53.

Categories: perl, ironman, regexp, craft, basic, tutorial.

Part of the series: Better Regexps.

Regexps are one of Perl's strongest features, but they're also behind one of the most common criticisms of Perl: that it looks like line noise.

If you've ever had to examine someone else's regexp, or worse, debug one, you'll probably agree that there's some merit in that criticism.

It doesn't have to be that way, however. There are some simple steps you can take to make your regexps more readable and more maintainable, and this week we look at one of them: the /x modifier.

Let's start with a trio of moderately complicated regexps used by Template::Sandbox, written in the traditional Perl manner:

Code:
$single_quoted_text_regexp = qr/\'(?:(?>[^\'\\]+)|(?>(?:\\\\)*\\.))*\'/s;
$double_quoted_text_regexp = qr/\"(?:(?>[^\"\\]+)|(?>(?:\\\\)*\\.))*\"/s;
$matching_round_brackets_regexp =
    qr/\((?:(?>[^\(\)\"\']+)|(??{$double_quoted_text_regexp})|(??{$single_quoted_text_regexp})|(??{$matching_round_brackets_regexp}))*\)/s;

Without the nice variable names, it'd take me a minute or two to figure out just what those regexps were trying to do, let alone how they did it. This is despite being the author.

I wouldn't like to make a guess at how long it'd take me to feel confident that they worked, if ever.

This is where the /x modifier comes in handy: it allows you to use whitespace for layout in your regexps, rather than having it count as literal whitespace characters to be matched.

An example probably explains it better. Here are our previous three regexps in rather more readable form:

Code:
$single_quoted_text_regexp = qr/
    \'
    (?:
        #  Quoteless, backslashless text.
        (?> [^\'\\]+ )
        |
        #  Escaped characters.
        (?> (?:\\\\)* \\ . )
    )*
    \'
    /sx;

$double_quoted_text_regexp = qr/
    \"
    (?:
        #  Quoteless, backslashless text.
        (?> [^\"\\]+ )
        |
        #  Escaped characters.
        (?> (?:\\\\)* \\ . )
    )*
    \"
    /sx;

$matching_round_brackets_regexp = qr/
    \(
    (?:
        #  Bracketless, quoteless subtext.
        (?> [^\(\)\"\']+ )
        |
        #  Quoted text.
        (??{ $double_quoted_text_regexp }) |
        (??{ $single_quoted_text_regexp })
        |
        #  Expression containing sub-brackets.
        (??{ $matching_round_brackets_regexp })
    )*
    \)
    /sx;

As you can see, I've used the /x behaviour to indent the regexps, pairing up brackets and grouping alternative branches in a logical manner.

Each logical section of the regexp is visually isolated, giving you the chance to see it as a self-contained unit, which makes it easier to see what it's doing, whether there are any errors, and how it combines into the entire expression.
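
Since /x only changes the layout, the compact and expanded forms should accept exactly the same input. Here is a rough sanity check along those lines, using the single-quoted regexp from above. The sample strings and the variable names $compact and $expanded are made up for this sketch and don't come from Template::Sandbox:

```perl
use strict;
use warnings;

# Compact form, as in the first listing.
my $compact = qr/\'(?:(?>[^\'\\]+)|(?>(?:\\\\)*\\.))*\'/s;

# Expanded form, as in the /x listing.
my $expanded = qr/
    \'
    (?:
        #  Quoteless, backslashless text.
        (?> [^\'\\]+ )
        |
        #  Escaped characters.
        (?> (?:\\\\)* \\ . )
    )*
    \'
    /sx;

# Sample inputs are assumptions for illustration only.
my @samples = (
    q{'plain text'},
    q{'it\'s escaped'},
    q{'unterminated},
    q{no quotes at all},
);

my @results;
for my $sample (@samples) {
    # Anchor both ends so the whole string has to be one quoted chunk.
    my $c = $sample =~ /\A$compact\z/  ? 1 : 0;
    my $e = $sample =~ /\A$expanded\z/ ? 1 : 0;
    push @results, [ $sample, $c, $e ];
    printf "%-20s compact=%d expanded=%d\n", $sample, $c, $e;
}
```

If the two columns ever disagree, something went wrong in the conversion, and a handful of strings like these makes a cheap regression test to keep next to the regexp.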

Even better, there are now comments in the regexps, explaining what each tricky bit is trying to do.
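
One thing to watch for when converting an existing regexp to /x: any whitespace or # that was meant to match literally now has to be escaped (or placed inside a bracketed character class), otherwise it silently turns into layout or a comment. A minimal sketch of the trap, using a made-up "item #42" pattern rather than anything from Template::Sandbox:

```perl
use strict;
use warnings;

# Hypothetical original: "item", a space, a hash, then digits.
my $plain = qr/item #\d+/;

# Naive conversion: the space is now ignored and "#\d+" becomes a comment,
# so this pattern is effectively just "item".
my $naive = qr/
    item #\d+
    /x;

# Careful conversion: backslash-escape the space and the hash.
my $escaped = qr/
    item \  \# \d+    # "item", a literal space, a literal hash, digits
    /x;

# Compare how each pattern treats the same whole-string inputs.
for my $input ("item #42", "item") {
    printf "%-10s plain=%d naive=%d escaped=%d\n",
        $input,
        ($input =~ /\A$plain\z/   ? 1 : 0),
        ($input =~ /\A$naive\z/   ? 1 : 0),
        ($input =~ /\A$escaped\z/ ? 1 : 0);
}
```

The naive version disagrees with the original on both inputs, while the escaped version agrees on both. Perl raises no warning for the naive one, which is why it's worth rechecking any literal spaces or hashes when you add /x to an old regexp.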

I know which set of regexps I'd rather maintain. Do you?

This blog entry is part of the series: Better Regexps.

  1. Anchoring Regexps
  2. Did you mean +, not *, in that regexp?
  3. Readable Regexps: Why you should use /x

© 2009-2013 Sam Graham, unless otherwise noted. All rights reserved.