I was curious how the C++11 std::regex interface compared to the C POSIX regular expression library. The C++11 interfaces are almost as easy to use as Perl. Suppose we have some space-separated fields that we wish to manipulate, printing both the original and a version with the order of the fields switched:
    my @strings = ( "hi bye", "hello world", "why now", "one two" ) ;

    foreach ( @strings )
    {
       s/(\S+)\s+(\S+)/'$&' -> '$2 $1'/ ;

       print "$_\n" ;
    }
The C++ equivalent is
    const char * strings[] { "hi bye", "hello world", "why now", "one two" } ;

    std::regex re( R"((\S+)\s+(\S+))" ) ;

    for ( auto s : strings )
    {
       std::cout << regex_replace( s, re, "'$&' -> '$2 $1'\n" ) ;
    }
We have one additional step with the C++ code: compiling the regular expression. Precompilation of Perl regular expressions is also possible, but that is usually just a performance optimization.
The POSIX equivalent requires precompilation too:
    void posixre_error( regex_t * pRe, int rc )
    {
       char buf[ 128 ] ;

       regerror( rc, pRe, buf, sizeof(buf) ) ;

       fprintf( stderr, "regerror: %s\n", buf ) ;
       exit( 1 ) ;
    }

    void posixre_compile( regex_t * pRe, const char * expression )
    {
       int rc = regcomp( pRe, expression, REG_EXTENDED ) ;
       if ( rc )
       {
          posixre_error( pRe, rc ) ;
       }
    }
but the transform requires more work:
    void posixre_transform( regex_t * pRe, const char * input )
    {
       constexpr size_t N{3} ;
       regmatch_t m[N] {} ;

       int rc = regexec( pRe, input, N, m, 0 ) ;

       if ( rc && (rc != REG_NOMATCH) )
       {
          posixre_error( pRe, rc ) ;
       }

       if ( !rc )
       {
          printf( "'%s' -> ", input ) ;
          int len ;
          len = m[2].rm_eo - m[2].rm_so ; printf( "'%.*s ", len, &input[ m[2].rm_so ] ) ;
          len = m[1].rm_eo - m[1].rm_so ; printf( "%.*s'\n", len, &input[ m[1].rm_so ] ) ;
       }
    }
To get at the captures we have to pass an array of regmatch_t's. The first element of that array describes the entire match, and the captures follow it. The awkward thing to deal with is that regmatch_t is a structure containing only the start and end offsets within the string.
If we want more granular info from the C++ matcher, it can also provide an array of capture info. We can also find out whether or not the match succeeded, something that is easy to do in Perl:
    my @strings = ( "hi bye", "helloworld", "why now", "onetwo" ) ;

    foreach ( @strings )
    {
       if ( s/(\S+)\s+(\S+)/$2 $1/ )
       {
          print "$_\n" ;
       }
    }
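This only prints the transformed line if the match succeeded. To do this in C++ we can use regex_match, which succeeds only when the pattern matches the entire string (regex_search is the closer analogue of Perl's unanchored match, but the two agree for these inputs):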
    const char * pattern = R"((\S+)\s+(\S+))" ;

    std::regex re( pattern ) ;

    for ( auto s : strings )
    {
       std::cmatch m ;

       if ( regex_match( s, m, re ) )
       {
          std::cout << m[2] << ' ' << m[1] << '\n' ;
       }
    }
Note that we don’t have to mess around with offsets as was required with the POSIX C interface, and also don’t have to worry about the size of the capture match array, since that is handled under the covers. It’s not too hard to wrap the POSIX C APIs in a C++ wrapper that makes them about as easy to use as the C++ regex code, but there is little reason to do so unless you are constrained to pre-C++11 code and can also live with a Unix-only restriction. There are also portability issues with the POSIX APIs. For example, Perl-style escapes like the \S and \s in the pattern used above work fine with the Linux regex API, but that appears to be an exception. To make code using that regex work on Mac, I had to use strict POSIX syntax.
Actually using the POSIX C interface, with a portability constraint that avoids the Linux regex extensions, would be horrendous.