perl

Playing with c++11 and posix regular expression libraries

July 24, 2016 C/C++ development and debugging. , , , , , , , , ,

I was curious how the c++11 std::regex interface compared to the C posix regular expression library. The c++11 interfaces are almost as easy to use as perl. Suppose we have some space separated fields that we wish to manipulate, showing an order switch and the original:

my @strings = ( "hi bye", "hello world", "why now", "one two" ) ;

foreach ( @strings )
{
   s/(\S+)\s+(\S+)/'$&' -> '$2 $1'/ ;

   print "$_\n" ;
}

The C++ equivalent is

   const char * strings[] { "hi bye", "hello world", "why now", "one two" } ;

   std::regex re( R"((\S+)\s+(\S+))" ) ;

   for ( auto s : strings )
   {
      std::cout << regex_replace( s, re, "'$&' -> '$2 $1'\n" )  ;
   }

We have one additional step with the C++ code, compiling the regular expression. Precompilation of perl regular expressions is also possible, but that is usually just as performance optimization.

The posix equivalent requires precompilation too

void posixre_error( regex_t * pRe, int rc )
{
   char buf[ 128 ] ;

   regerror( rc, pRe, buf, sizeof(buf) ) ;

   fprintf( stderr, "regerror: %s\n", buf ) ;
   exit( 1 ) ;
}

void posixre_compile( regex_t * pRe, const char * expression )
{
   int rc = regcomp( pRe, expression, REG_EXTENDED ) ;
   if ( rc )
   { 
      posixre_error( pRe, rc ) ;
   }
}

but the transform requires more work:

void posixre_transform( regex_t * pRe, const char * input )
{
   constexpr size_t N{3} ;
   regmatch_t m[N] {} ;

   int rc = regexec( pRe, input, N, m, 0 ) ;

   if ( rc && (rc != REG_NOMATCH) )
   {
      posixre_error( pRe, rc ) ;
   }

   if ( !rc )
   { 
      printf( "'%s' -> ", input ) ;
      int len ;
      len = m[2].rm_eo - m[2].rm_so ; printf( "'%.*s ", len, &input[ m[2].rm_so ] ) ;
      len = m[1].rm_eo - m[1].rm_so ; printf( "%.*s'\n", len, &input[ m[1].rm_so ] ) ;
   }
}

To get at the capture expressions we have to pass an array of regmatch_t’s. The first element of that array is the entire match expression, and then we get the captures after that. The awkward thing to deal with is that the regmatch_t is a structure containing the start end end offset within the string.

If we want more granular info from the c++ matcher, it can also provide an array of capture info. We can also get info about whether or not the match worked, something we can do in perl easily

my @strings = ( "hi bye", "helloworld", "why now", "onetwo" ) ;

foreach ( @strings )
{
   if ( s/(\S+)\s+(\S+)/$2 $1/ )
   {
      print "$_\n" ;
   }
}  

This only prints the transformed line if there was a match success. To do this in C++ we can use regex_match

const char * pattern = R"((\S+)\s+(\S+))" ;

std::regex re( pattern ) ;

for ( auto s : strings )
{ 
   std::cmatch m ;

   if ( regex_match( s, m, re ) )
   { 
      std::cout << m[2] << ' ' << m[1] << '\n' ;
   }
}

Note that we don’t have to mess around with offsets as was required with the Posix C interface, and also don’t have to worry about the size of the capture match array, since that is handled under the covers. It’s not too hard to do wrap the posix C APIs in a C++ wrapper that makes it about as easy to use as the C++ regex code, but unless you are constrained to using pre-C++11 code and can also live with a Unix only restriction. There are also portability issues with the posix APIs. For example, the perl-style regular expressions like:

   R"((\S+)(\s+)(\S+))" ) ;

work fine with the Linux regex API, but that appears to be an exception. To make code using that regex work on Mac, I had to use strict posix syntax

   R"(([^[:space:]]+)([[:space:]]+)([^[:space:]]+))"

Actually using the Posix C interface, with a portability constraint that avoids the Linux regex extensions, would be horrendous.

Why does touch include a utimensat() syscall?

February 25, 2015 perl and general scripting hackery , , , , , , , , ,

I’m seeing odd time sequencing of files when using clearcase version 8 dynamic views, which makes me wonder about an aspect of the (gnu) touch command. Running:

> cat u
rm -f touchedEarlier touchedLater

perl -e "open (F, '> touchedEarlier') || die"
touch touchedLater

ls --full-time touchedEarlier touchedLater

produces:

> ./u
-rw-r--r-- 1 peeterj pdxdb2 0 2015-02-25 11:42:05.833044000 -0500 touchedEarlier
-rw-r--r-- 1 peeterj pdxdb2 0 2015-02-25 11:42:05.000000000 -0500 touchedLater

Notice that the file that is touched by doing a perl “open” ends up with a later time, despite the fact that it was done logically earlier than the touch.

Running this command outside of a clearcase dynamic view shows zeros only in the subsecond times (also the behaviour of clearcase V7). Needless to say, this difference in file times from their creation sequence wreaks havoc on make.

I was curious how the two touch methods differed, and stracing them shows that the touch differs by including a utimesat() syscall. The perl touch is:

open(“touchedEarlier”, O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fff29cd29f0) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR) = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=0, …}) = 0
fcntl(3, F_SETFD, FD_CLOEXEC) = 0
close(3) = 0
exit_group(0) = ?

whereas the touch command has:

open(“touchedLater”, O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = 3
dup2(3, 0) = 0
close(3) = 0
dup2(0, 0) = 0
utimensat(0, NULL, NULL, 0) = 0
close(0) = 0
close(1) = 0
close(2) = 0
exit_group(0) = ?

It appears that the touch command explicitly zeros the subsecond portion of the files’ timestamp.

I also see that perl’s File::Touch module does the same thing, but uses a different mechanism. I see the following in a strace of such a Touch() call:

stat(“xxyyzz”, 0x656060)                = -1 ENOENT (No such file or directory)
open(“xxyyzz”, O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = 3
ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fff16bb3c60) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR)                   = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=0, …}) = 0
fcntl(3, F_SETFD, FD_CLOEXEC)           = 0
close(3)                                = 0
utimes(“xxyyzz”, {{1424884553, 0}, {1424884553, 0}}) = 0

I am very curious why touch and perl’s File::Touch() both explicitly zero the subsecond modification time for the file (using utimensat() or utimes() syscalls)?