Sam's Blog

Regression Benchmarks with Template::Benchmark

Date: Wednesday, 7 April 2010, 11:04.

Categories: perl, ironman, template-benchmark, benchmarking, regression, qa, development-environment.

As part of my development environment for Template::Sandbox, I maintain a suite of regression benchmarks, using Template::Benchmark against all previous versions of the distribution.

While a crude tool, it's something I find useful, and I thought I'd use this week's column to share how I automated away as much of the pain as I could.

First of all, Adam Kennedy makes a good argument for the use of regression benchmarking; I suggest you go read that, because he sums up nearly everything I'd say if I were to explain it here.

My setup is fairly simple: I have a directory full of files, one for each benchmark run, each containing a JSON data structure with the benchmark details inside:

$ ls ~/projects/Template-Sandbox/benchmarks/
Template-Sandbox-1.00-full.json         Template-Sandbox-1.01_07-full.json
Template-Sandbox-1.00-standard.json     Template-Sandbox-1.01_07-standard.json
Template-Sandbox-1.00_01-full.json      Template-Sandbox-1.01_08-full.json
Template-Sandbox-1.00_01-standard.json  Template-Sandbox-1.01_08-standard.json
Template-Sandbox-1.00_02-full.json      Template-Sandbox-1.01_09-full.json
Template-Sandbox-1.00_02-standard.json  Template-Sandbox-1.01_09-standard.json
Template-Sandbox-1.00_03-full.json      Template-Sandbox-1.01_10-full.json
Template-Sandbox-1.00_03-standard.json  Template-Sandbox-1.01_10-standard.json
Template-Sandbox-1.01-full.json         Template-Sandbox-1.01_11-full.json
Template-Sandbox-1.01-standard.json     Template-Sandbox-1.01_11-standard.json
Template-Sandbox-1.01_01-full.json      Template-Sandbox-1.02-full.json
Template-Sandbox-1.01_01-standard.json  Template-Sandbox-1.02-standard.json
Template-Sandbox-1.01_02-full.json      Template-Sandbox-1.02_01-full.json
Template-Sandbox-1.01_02-standard.json  Template-Sandbox-1.02_01-standard.json
Template-Sandbox-1.01_03-full.json      Template-Sandbox-1.02_02-full.json
Template-Sandbox-1.01_03-standard.json  Template-Sandbox-1.02_02-standard.json
Template-Sandbox-1.01_04-full.json      Template-Sandbox-1.03-full.json
Template-Sandbox-1.01_04-standard.json  Template-Sandbox-1.03-standard.json
Template-Sandbox-1.01_05-full.json      Template-Sandbox-backdev-full.json
Template-Sandbox-1.01_05-standard.json  Template-Sandbox-backdev-standard.json
Template-Sandbox-1.01_06-full.json      Template-Sandbox-dev-full.json
Template-Sandbox-1.01_06-standard.json  Template-Sandbox-dev-standard.json

The naming convention should be fairly obvious: "dev" refers to my active development copy, and "backdev" is the previous dev benchmark. It's oddly named so that it sorts alphabetically before dev. Yes, I was too lazy to Do It Right. Sue me.

"Full" versions include all supported template features (ie, fancy syntax stuff: hash loops, expressions, functions, and so on), whereas "standard" is just the default "lowest common denominator" features enabled by Template::Benchmark (ie, token replacement, array loops and the other basic features implemented by nearly all template engines).

I generate this spew of files with a ts_full_regression_benchmark script:

$ ts_full_regression_benchmark
Running benchmarks for Template-Sandbox-1.00.
  Skipping: /home/illusori/projects/Template-Sandbox/benchmarks/Template-Sandbox-1.00-standard.json exists.
  Skipping: /home/illusori/projects/Template-Sandbox/benchmarks/Template-Sandbox-1.00-full.json exists.
Running benchmarks for Template-Sandbox-1.00_01.
  Skipping: /home/illusori/projects/Template-Sandbox/benchmarks/Template-Sandbox-1.00_01-standard.json exists.
  Skipping: /home/illusori/projects/Template-Sandbox/benchmarks/Template-Sandbox-1.00_01-full.json exists.
Running benchmarks for Template-Sandbox-1.00_02.
  Skipping: /home/illusori/projects/Template-Sandbox/benchmarks/Template-Sandbox-1.00_02-standard.json exists.
  Skipping: /home/illusori/projects/Template-Sandbox/benchmarks/Template-Sandbox-1.00_02-full.json exists.
... much MUCH more of this ...
Running benchmarks for Template-Sandbox-1.02_02.
  Skipping: /home/illusori/projects/Template-Sandbox/benchmarks/Template-Sandbox-1.02_02-standard.json exists.
  Skipping: /home/illusori/projects/Template-Sandbox/benchmarks/Template-Sandbox-1.02_02-full.json exists.
Running benchmarks for Template-Sandbox-1.03.
  Skipping: /home/illusori/projects/Template-Sandbox/benchmarks/Template-Sandbox-1.03-standard.json exists.
  Skipping: /home/illusori/projects/Template-Sandbox/benchmarks/Template-Sandbox-1.03-full.json exists.
Running benchmarks for devel version.

As you can see, it skips generation of any of the "proper" distribution benchmarks if they already exist, but always runs a new benchmark for the dev copy (renaming the old one to backdev).

This means I only spend time running the benchmarks I actually need, and if for some reason I want to regenerate the lot, it's as easy as an rm *.json.

The ts_full_regression_benchmark script to generate the files:

#!/bin/sh

PROJECTS='/home/illusori/projects'

BENCHMARK_DIR="${PROJECTS}/Template-Sandbox/benchmarks"
BENCHMARK_SCRIPT="${PROJECTS}/Template-Benchmark/src/script/benchmark_template_engines"

BENCHMARK_DURATION="-d 60"
BENCHMARK_TYPES="--notypes --uncached_string --memory_cache --instance_reuse"
BENCHMARK_PLUGINS="--onlyplugin TemplateSandbox"

COMMON_SWITCHES="--json $BENCHMARK_DURATION $BENCHMARK_TYPES $BENCHMARK_PLUGINS"

mkdir -p /tmp/ts_full_regression

#  Extract every released dist tarball.
for dist_file in ${PROJECTS}/released/Template-Sandbox*.tar.gz;
do
    tar -xzf "$dist_file" -C /tmp/ts_full_regression
done

#  Benchmark each extracted dist, skipping any results we already have.
for dist_dir in /tmp/ts_full_regression/*;
do
    dist_name=`basename "$dist_dir"`
    echo "Running benchmarks for $dist_name."
    out="${BENCHMARK_DIR}/${dist_name}-standard.json"
    if [ -s "$out" ]; then
       echo "  Skipping: $out exists."
    else
       $BENCHMARK_SCRIPT $COMMON_SWITCHES -I "$dist_dir/lib" >"$out"
    fi
    out="${BENCHMARK_DIR}/${dist_name}-full.json"
    if [ -s "$out" ]; then
       echo "  Skipping: $out exists."
    else
       $BENCHMARK_SCRIPT $COMMON_SWITCHES --allfeatures -I "$dist_dir/lib" >"$out"
    fi
done

rm -rf /tmp/ts_full_regression

#  Always overwrite the devel benchmarks.
echo "Running benchmarks for devel version."
out="${BENCHMARK_DIR}/Template-Sandbox-dev-standard.json"
if [ -s "$out" ]; then
   mv -f "$out" "${BENCHMARK_DIR}/Template-Sandbox-backdev-standard.json"
fi
$BENCHMARK_SCRIPT $COMMON_SWITCHES -I "$PROJECTS/Template-Sandbox/src/lib" >"$out"
out="${BENCHMARK_DIR}/Template-Sandbox-dev-full.json"
if [ -s "$out" ]; then
   mv -f "$out" "${BENCHMARK_DIR}/Template-Sandbox-backdev-full.json"
fi
$BENCHMARK_SCRIPT $COMMON_SWITCHES --allfeatures -I "$PROJECTS/Template-Sandbox/src/lib" >"$out"

Yes it's a shell script. Yes I know this is a perl blog. Get over it.

This script loops through the directory where I keep my CPAN release tarballs, looking for Template::Sandbox dists; it extracts each one, runs benchmarks using the script provided with Template::Benchmark (which has a handy JSON mode), and dumps the output into an appropriately named file.
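For reference, here's a minimal sketch of the shape of the JSON that ends up in those files, reduced to just the fields my display script below relies on. The sample values and the engine label are invented, and the exact schema depends on your Template::Benchmark version, so treat this as illustration only:

```perl
#!/usr/bin/perl

use strict;
use warnings;

use JSON::PP;  # core JSON module; my real scripts go through JSON::Any

#  Hypothetical fragment of one benchmark file: an array of benchmark
#  results, each with a type and a cmpthese-style comparison chart.
my $sample = {
    benchmarks => [
        {
            type       => 'memory_cache',
            comparison => [
                [ '',                'Rate' ],     #  header row
                [ 'TemplateSandbox', '20.10/s' ],  #  engine row
            ],
        },
    ],
};

#  Round-trip through JSON, as if reading one of the archived files.
my $json   = JSON::PP->new();
my $result = $json->decode( $json->encode( $sample ) );

#  The rate lives in the second cell of the second comparison row,
#  with a trailing "/s" that needs stripping to get a bare number.
my $timing = $result->{ benchmarks }[ 0 ]{ comparison }[ 1 ][ 1 ];
$timing =~ s{/s$}{};
print "memory_cache $timing\n";  # memory_cache 20.10
```
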

Yep, it extracts the tarballs even for versions it's going to skip running benchmarks for; it's a quick-n-dirty hack.

Of course, having all that benchmark data in near-unreadable JSON files isn't much good without a way to display it neatly; this is where ts_old_benchmarks comes in:

$ ts_old_benchmarks
1.00     full     uncached_string  4.46 memory_cache    13.50
1.00_01  full     uncached_string  4.42 memory_cache    13.70
1.00_02  full     uncached_string  4.50 memory_cache    13.50
1.00_03  full     uncached_string  4.44 memory_cache    13.50
1.01     full     uncached_string  4.49 memory_cache    13.70
1.01_01  full     uncached_string  4.53 memory_cache    13.50
1.01_02  full     uncached_string  3.54 memory_cache    13.60
1.01_03  full     uncached_string  3.53 memory_cache    13.50
1.01_04  full     uncached_string  3.54 memory_cache    16.00
1.01_05  full     uncached_string  3.56 memory_cache    16.30
1.01_06  full     uncached_string  3.64 memory_cache    16.40
1.01_07  full     uncached_string  4.19 memory_cache    19.50
1.01_08  full     uncached_string  4.20 memory_cache    19.60
1.01_09  full     uncached_string  4.18 memory_cache    19.60
1.01_10  full     uncached_string  4.16 memory_cache    19.50
1.01_11  full     uncached_string  4.20 memory_cache    19.70 instance_reuse  28.40
1.02     full     uncached_string  4.19 memory_cache    19.70 instance_reuse  28.40
1.02_01  full     uncached_string  4.11 memory_cache    19.70 instance_reuse  28.30
1.02_02  full     uncached_string  4.46 memory_cache    20.20 instance_reuse  29.70
1.03     full     uncached_string  4.47 memory_cache    20.10 instance_reuse  29.70
backdev  full     uncached_string  4.49 memory_cache    20.30 instance_reuse  30.00
dev      full     uncached_string  4.47 memory_cache    20.20 instance_reuse  30.00
1.00     standard uncached_string  2.75 memory_cache    42.20
1.00_01  standard uncached_string  2.68 memory_cache    42.10
1.00_02  standard uncached_string  2.72 memory_cache    41.00
1.00_03  standard uncached_string  2.66 memory_cache    42.10
1.01     standard uncached_string  2.66 memory_cache    40.90
1.01_01  standard uncached_string  2.50 memory_cache    41.00
1.01_02  standard uncached_string 13.20 memory_cache    40.70
1.01_03  standard uncached_string 13.20 memory_cache    41.50
1.01_04  standard uncached_string 13.30 memory_cache    52.20
1.01_05  standard uncached_string 13.30 memory_cache    52.10
1.01_06  standard uncached_string 13.70 memory_cache    53.00
1.01_07  standard uncached_string 14.60 memory_cache    62.10
1.01_08  standard uncached_string 14.60 memory_cache    62.10
1.01_09  standard uncached_string 14.70 memory_cache    61.70
1.01_10  standard uncached_string 14.50 memory_cache    61.60
1.01_11  standard uncached_string 14.50 memory_cache    62.00 instance_reuse  85.30
1.02     standard uncached_string 14.40 memory_cache    60.10 instance_reuse  83.30
1.02_01  standard uncached_string 14.20 memory_cache    61.70 instance_reuse  86.30
1.02_02  standard uncached_string 15.10 memory_cache    62.60 instance_reuse  88.40
1.03     standard uncached_string 15.20 memory_cache    62.30 instance_reuse  88.60
backdev  standard uncached_string 15.20 memory_cache    62.80 instance_reuse  87.70
dev      standard uncached_string 15.10 memory_cache    62.10 instance_reuse  88.60

ts_old_benchmarks kinda sucks for a name, but I think I've mentioned before that this is a quick-n-dirty hack.

"instance_reuse" benchmarks only show up for later versions, because earlier versions of Template::Sandbox had undefined behaviour (i.e. they probably broke horribly) if you reused an instance.

Please don't ask how performance increased five-fold between versions 1.01_01 and 1.01_02; it's deeply embarrassing.

And here's the script to generate that output:

#!/usr/bin/perl -w

use strict;
use warnings;

use JSON::Any;
use File::Slurp;

my $archive_dir = '/home/illusori/projects/Template-Sandbox/benchmarks';

my ( @entries, $json, @full, @standard );

opendir( my $dir_handle, $archive_dir ) or
    die "Unable to opendir '$archive_dir': $!";
@entries = grep /^Template-Sandbox.*\.json$/, readdir( $dir_handle );
closedir( $dir_handle );

$json = JSON::Any->new();

@full = @standard = ();
foreach my $file ( sort( @entries ) )
{
    my ( $content, $result, $line, $name, $type );

    eval { $content = read_file( $archive_dir . '/' . $file ); };
    if( $@ )
    {
        warn "Unable to read $archive_dir/$file: $@";
        next;
    }
        
    next unless $content;

    eval { $result = $json->decode( $content ); };
    if( $@ or not $result )
    {
        warn "Unable to decode content of $file: " . ( $@ || 'empty result' );
        next;
    }

    ( $name, $type ) = $file =~ /^Template-Sandbox-(.*)-(.*)\.json$/;
    next unless defined $type;
    $line = sprintf( '%-8s %-8s', $name, $type );
    foreach my $benchmark ( @{$result->{ benchmarks }} )
    {
        my ( $timing );

        next unless @{$benchmark->{ comparison }} > 1;
        #  TODO: grab from timings when working.

        $timing = $benchmark->{ comparison }->[ 1 ]->[ 1 ];
        $timing =~ s/\/s$//;
        $line .= sprintf( ' %-15s %5.2f', $benchmark->{ type },
            $timing );
    }
    if( $type eq 'full' )
    {
        push @full, "$line\n";
    }
    else
    {
        push @standard, "$line\n";
    }
}

print @full, @standard;

Yes, this one's perl. Yes, I know the other was a shell script. Sue me. Again.

This script mangles the timing data out of the human-readable comparison chart. It should really pull it from the timings section of the data structure, but because allow_blessed seems to be inconsistently supported by JSON back-ends, even with JSON::Any, a bunch of my old benchmarks have big blanks there.

Version 0.99_11 of Template::Benchmark, which should be hitting a mirror near you about the time this column is published, fixes this (or hacks around it anyway), but I've not recreated my benchmark "database" yet.

That's basically it.

When I'm done working on a revision, I run ts_full_regression_benchmark just after my ./Build test passes, then compare the results with ts_old_benchmarks to make sure I haven't caused a hideous performance regression.

Because it only needs to run the benchmarks for the current version, it lets me run a longer benchmark and get a less variable result.

I should point out that, like a test suite, this only partially helps you in fixing any problems that occur. It does, however, like a test suite, provide a mostly-automated way to see whether something has gone wrong or not.
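That "has something gone wrong?" check could itself be semi-automated. This isn't part of my setup, but here's a sketch of what flagging a regression might look like: compare the dev rates against the backdev rates and complain about anything that slowed down by more than some tolerance. The rates and the 10% threshold below are made up for illustration:

```perl
#!/usr/bin/perl

use strict;
use warnings;

#  Flag benchmark types whose new rate has dropped more than $tolerance
#  (as a fraction) below the old rate.  Rates are "higher is better".
sub flag_regressions
{
    my ( $old, $new, $tolerance ) = @_;

    return( grep
        {
            exists( $new->{ $_ } ) and
            $new->{ $_ } < $old->{ $_ } * ( 1 - $tolerance )
        }
        sort( keys( %{$old} ) ) );
}

#  Invented example figures in the style of the ts_old_benchmarks output.
my %backdev = ( uncached_string => 15.20, memory_cache => 62.80 );
my %dev     = ( uncached_string => 11.00, memory_cache => 62.10 );

my @bad = flag_regressions( \%backdev, \%dev, 0.10 );
print "BAD: $_\n" foreach @bad;  # BAD: uncached_string
```

Here uncached_string is flagged (11.00 is more than 10% below 15.20) while memory_cache's small wobble is within tolerance.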

Now there's definite scope for improvement here; some items on my "when I get around to it" list:

  • Build benchmarks for each template feature option individually, to specifically spot regressions in each area of functionality.

  • Command-line tool to just present the recent most-relevant results rather than the whole spam.

  • Run with nightly build/regression tests.

  • Pretty HTML reports, with GOOD/BAD colour coding so I don't need to engage brain to read them.

  • I'm running the perl code from the dists directly from their source dir; I really ought to run a build first and run from the build dir. Thankfully Template::Sandbox is pure-perl so it still works, but if I wanted to run this against other template engines I'd probably have problems with XS code.

  • Bundle all that together and there are the glimmerings of some manner of potentially-useful community website similar to www.cpantesters.org (on a rather more modest scale).

  • World Domination.
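As a taste of the "recent most-relevant results" item on that list, here's a sketch of a filter that trims the ts_old_benchmarks output down to the last few versions of each benchmark type. The function name, sample rows and keep-count are all invented for illustration, and it leans on the fact that the output is already sorted oldest-to-newest within each type:

```perl
#!/usr/bin/perl

use strict;
use warnings;

#  Keep only the trailing $keep lines of each benchmark type,
#  assuming input lines of the form "version type ...".
sub recent_lines
{
    my ( $keep, @lines ) = @_;
    my ( %by_type, @output );

    foreach my $line ( @lines )
    {
        my ( undef, $type ) = split( ' ', $line );
        push @{$by_type{ $type }}, $line;
    }
    foreach my $type ( sort( keys( %by_type ) ) )
    {
        my @group = @{$by_type{ $type }};
        #  Oldest versions sort first, so keep the tail of each group.
        splice( @group, 0, @group - $keep ) if @group > $keep;
        push @output, @group;
    }
    return( @output );
}

#  Invented sample rows in the style of the ts_old_benchmarks output.
my @sample = (
    "1.02     full     uncached_string  4.19\n",
    "1.03     full     uncached_string  4.47\n",
    "backdev  full     uncached_string  4.49\n",
    "dev      full     uncached_string  4.47\n",
    );

#  Prints the 1.03, backdev and dev rows only.
print recent_lines( 3, @sample );
```
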

I strongly doubt I'll get further than pretty HTML reports; for me this is a useful tool but still just a means to an end rather than an end in itself. But maybe someone out there has had a light bulb go off above their head while reading this.

© 2009-2013 Sam Graham, unless otherwise noted. All rights reserved.