I was curious how the c++11 std::regex interface compared to the C posix regular expression library. The c++11 interfaces are almost as easy to use as perl. Suppose we have some space separated fields that we wish to manipulate, showing an order switch and the original:
my @strings = ( "hi bye", "hello world", "why now", "one two" ) ; foreach ( @strings ) { s/(\S+)\s+(\S+)/'$&' -> '$2 $1'/ ; print "$_\n" ; }
The C++ equivalent is
const char * strings[] { "hi bye", "hello world", "why now", "one two" } ; std::regex re( R"((\S+)\s+(\S+))" ) ; for ( auto s : strings ) { std::cout << regex_replace( s, re, "'$&' -> '$2 $1'\n" ) ; }
We have one additional step with the C++ code, compiling the regular expression. Precompilation of perl regular expressions is also possible, but that is usually just as performance optimization.
The posix equivalent requires precompilation too
void posixre_error( regex_t * pRe, int rc ) { char buf[ 128 ] ; regerror( rc, pRe, buf, sizeof(buf) ) ; fprintf( stderr, "regerror: %s\n", buf ) ; exit( 1 ) ; } void posixre_compile( regex_t * pRe, const char * expression ) { int rc = regcomp( pRe, expression, REG_EXTENDED ) ; if ( rc ) { posixre_error( pRe, rc ) ; } }
but the transform requires more work:
void posixre_transform( regex_t * pRe, const char * input ) { constexpr size_t N{3} ; regmatch_t m[N] {} ; int rc = regexec( pRe, input, N, m, 0 ) ; if ( rc && (rc != REG_NOMATCH) ) { posixre_error( pRe, rc ) ; } if ( !rc ) { printf( "'%s' -> ", input ) ; int len ; len = m[2].rm_eo - m[2].rm_so ; printf( "'%.*s ", len, &input[ m[2].rm_so ] ) ; len = m[1].rm_eo - m[1].rm_so ; printf( "%.*s'\n", len, &input[ m[1].rm_so ] ) ; } }
To get at the capture expressions we have to pass an array of regmatch_t’s. The first element of that array is the entire match expression, and then we get the captures after that. The awkward thing to deal with is that the regmatch_t is a structure containing the start end end offset within the string.
If we want more granular info from the c++ matcher, it can also provide an array of capture info. We can also get info about whether or not the match worked, something we can do in perl easily
my @strings = ( "hi bye", "helloworld", "why now", "onetwo" ) ; foreach ( @strings ) { if ( s/(\S+)\s+(\S+)/$2 $1/ ) { print "$_\n" ; } }
This only prints the transformed line if there was a match success. To do this in C++ we can use regex_match
const char * pattern = R"((\S+)\s+(\S+))" ; std::regex re( pattern ) ; for ( auto s : strings ) { std::cmatch m ; if ( regex_match( s, m, re ) ) { std::cout << m[2] << ' ' << m[1] << '\n' ; } }
Note that we don’t have to mess around with offsets as was required with the Posix C interface, and also don’t have to worry about the size of the capture match array, since that is handled under the covers. It’s not too hard to do wrap the posix C APIs in a C++ wrapper that makes it about as easy to use as the C++ regex code, but unless you are constrained to using pre-C++11 code and can also live with a Unix only restriction. There are also portability issues with the posix APIs. For example, the perl-style regular expressions like:
work fine with the Linux regex API, but that appears to be an exception. To make code using that regex work on Mac, I had to use strict posix syntax
Actually using the Posix C interface, with a portability constraint that avoids the Linux regex extensions, would be horrendous.