Peeter Joot's Blog » 2016 » July

Playing with c++11 and posix regular expression libraries

I was curious how the c++11 std::regex interface compared to the C posix regular expression library. The c++11 interfaces are almost as easy to use as perl. Suppose we have some space separated fields that we wish to manipulate, showing an order switch and the original:

my @strings = ( "hi bye", "hello world", "why now", "one two" ) ;

foreach ( @strings )
{
   s/(\S+)\s+(\S+)/'$&' -> '$2 $1'/ ;

   print "$_\n" ;
}

The C++ equivalent is

   const char * strings[] { "hi bye", "hello world", "why now", "one two" } ;

   std::regex re( R"((\S+)\s+(\S+))" ) ;

   for ( auto s : strings )
   {
      std::cout << regex_replace( s, re, "'$&' -> '$2 $1'\n" )  ;
   }

We have one additional step with the C++ code, compiling the regular expression. Precompilation of perl regular expressions is also possible, but that is usually just as performance optimization.

The posix equivalent requires precompilation too

void posixre_error( regex_t * pRe, int rc )
{
   char buf[ 128 ] ;

   regerror( rc, pRe, buf, sizeof(buf) ) ;

   fprintf( stderr, "regerror: %s\n", buf ) ;
   exit( 1 ) ;
}

void posixre_compile( regex_t * pRe, const char * expression )
{
   int rc = regcomp( pRe, expression, REG_EXTENDED ) ;
   if ( rc )
   { 
      posixre_error( pRe, rc ) ;
   }
}

but the transform requires more work:

void posixre_transform( regex_t * pRe, const char * input )
{
   constexpr size_t N{3} ;
   regmatch_t m[N] {} ;

   int rc = regexec( pRe, input, N, m, 0 ) ;

   if ( rc && (rc != REG_NOMATCH) )
   {
      posixre_error( pRe, rc ) ;
   }

   if ( !rc )
   { 
      printf( "'%s' -> ", input ) ;
      int len ;
      len = m[2].rm_eo - m[2].rm_so ; printf( "'%.*s ", len, &input[ m[2].rm_so ] ) ;
      len = m[1].rm_eo - m[1].rm_so ; printf( "%.*s'\n", len, &input[ m[1].rm_so ] ) ;
   }
}

To get at the capture expressions we have to pass an array of regmatch_t’s. The first element of that array is the entire match expression, and then we get the captures after that. The awkward thing to deal with is that the regmatch_t is a structure containing the start end end offset within the string.

If we want more granular info from the c++ matcher, it can also provide an array of capture info. We can also get info about whether or not the match worked, something we can do in perl easily

my @strings = ( "hi bye", "helloworld", "why now", "onetwo" ) ;

foreach ( @strings )
{
   if ( s/(\S+)\s+(\S+)/$2 $1/ )
   {
      print "$_\n" ;
   }
}

This only prints the transformed line if there was a match success. To do this in C++ we can use regex_match

const char * pattern = R"((\S+)\s+(\S+))" ;

std::regex re( pattern ) ;

for ( auto s : strings )
{ 
   std::cmatch m ;

   if ( regex_match( s, m, re ) )
   { 
      std::cout << m[2] << ' ' << m[1] << '\n' ;
   }
}

Note that we don’t have to mess around with offsets as was required with the Posix C interface, and also don’t have to worry about the size of the capture match array, since that is handled under the covers. It’s not too hard to do wrap the posix C APIs in a C++ wrapper that makes it about as easy to use as the C++ regex code, but unless you are constrained to using pre-C++11 code and can also live with a Unix only restriction. There are also portability issues with the posix APIs. For example, the perl-style regular expressions like:

   R"((\S+)(\s+)(\S+))" ) ;

work fine with the Linux regex API, but that appears to be an exception. To make code using that regex work on Mac, I had to use strict posix syntax

   R"(([^[:space:]]+)([[:space:]]+)([^[:space:]]+))"

Actually using the Posix C interface, with a portability constraint that avoids the Linux regex extensions, would be horrendous.

M	T	W	T	F	S	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31

M	T	W	T	F	S	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31

Day: July 24, 2016

Playing with c++11 and posix regular expression libraries

Like this:

Pages

Categories

Recent Posts

Archives

Meta

Categories

Recent Posts

Archives

Day: July 24, 2016

Playing with c++11 and posix regular expression libraries

Share this:

Like this:

Pages

Categories

Recent Posts

Tags

Archives

Meta

Categories

Recent Posts

Tags

Archives