## Playing with c++11 and posix regular expression libraries

I was curious how the c++11 std::regex interface compared to the C posix regular expression library. The c++11 interfaces are almost as easy to use as perl. Suppose we have some space separated fields that we wish to manipulate, printing the original alongside a version with the two fields swapped:

my @strings = ( "hi bye", "hello world", "why now", "one two" ) ;

foreach ( @strings )
{
s/(\S+)\s+(\S+)/'$&' -> '$2 $1'/ ; print "$_\n" ;
}


The C++ equivalent is

   const char * strings[] { "hi bye", "hello world", "why now", "one two" } ;

std::regex re( R"((\S+)\s+(\S+))" ) ;

for ( auto s : strings )
{
std::cout << regex_replace( s, re, "'$&' -> '$2 $1'\n" ) ;
}

We have one additional step with the C++ code, compiling the regular expression. Precompilation of perl regular expressions is also possible, but that is usually just a performance optimization. The posix equivalent requires precompilation too

void posixre_error( regex_t * pRe, int rc )
{
char buf[ 128 ] ;

regerror( rc, pRe, buf, sizeof(buf) ) ;

fprintf( stderr, "regerror: %s\n", buf ) ;
exit( 1 ) ;
}

void posixre_compile( regex_t * pRe, const char * expression )
{
int rc = regcomp( pRe, expression, REG_EXTENDED ) ;
if ( rc )
{
posixre_error( pRe, rc ) ;
}
}

but the transform requires more work:

void posixre_transform( regex_t * pRe, const char * input )
{
constexpr size_t N{3} ;
regmatch_t m[N] {} ;

int rc = regexec( pRe, input, N, m, 0 ) ;

if ( rc && (rc != REG_NOMATCH) )
{
posixre_error( pRe, rc ) ;
}

if ( !rc )
{
printf( "'%s' -> ", input ) ;
int len ;
len = m[2].rm_eo - m[2].rm_so ; printf( "'%.*s ", len, &input[ m[2].rm_so ] ) ;
len = m[1].rm_eo - m[1].rm_so ; printf( "%.*s'\n", len, &input[ m[1].rm_so ] ) ;
}
}

To get at the capture expressions we have to pass an array of regmatch_t’s. The first element of that array is the entire match expression, and then we get the captures after that. The awkward thing to deal with is that the regmatch_t is a structure containing the start and end offsets within the string.

If we want more granular info from the c++ matcher, it can also provide an array of capture info. We can also get info about whether or not the match worked, something we can do in perl easily

my @strings = ( "hi bye", "helloworld", "why now", "onetwo" ) ;

foreach ( @strings )
{
if ( s/(\S+)\s+(\S+)/$2 $1/ )
{
print "$_\n" ;
}
}


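This only prints the transformed line if the match succeeded. To do this in C++ we can use regex_match (which, unlike the perl substitution, requires the pattern to match the whole string; regex_search is the closer analogue when it might not)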

const char * pattern = R"((\S+)\s+(\S+))" ;

std::regex re( pattern ) ;

for ( auto s : strings )
{
std::cmatch m ;

if ( regex_match( s, m, re ) )
{
std::cout << m[2] << ' ' << m[1] << '\n' ;
}
}


Note that we don’t have to mess around with offsets as was required with the Posix C interface, and also don’t have to worry about the size of the capture match array, since that is handled under the covers. It’s not too hard to wrap the posix C APIs in a C++ wrapper that makes them about as easy to use as the C++ regex code, but there is little reason to do so unless you are constrained to using pre-C++11 code and can also live with a Unix only restriction. There are also portability issues with the posix APIs. For example, perl-style regular expressions like:

   R"((\S+)(\s+)(\S+))"


work fine with the Linux regex API, but that appears to be an exception. To make code using that regex work on Mac, I had to use strict posix syntax

   R"(([^[:space:]]+)([[:space:]]+)([^[:space:]]+))"


Actually using the Posix C interface, with a portability constraint that avoids the Linux regex extensions, would be horrendous.
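As a rough illustration of the kind of wrapper mentioned above, here is a minimal sketch. The class name PosixRegex and its match() method are made up for this example (they are not part of any library), and it assumes a POSIX `<regex.h>` is available. It sticks to the strict posix bracket syntax so that it doesn’t rely on the Linux extensions.

```cpp
#include <regex.h>

#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical RAII wrapper around regcomp/regexec/regfree (illustration only).
class PosixRegex
{
    regex_t re_ ;

public:
    explicit PosixRegex( const char * expression )
    {
        int rc = regcomp( &re_, expression, REG_EXTENDED ) ;
        if ( rc )
        {
            char buf[ 128 ] ;
            regerror( rc, &re_, buf, sizeof(buf) ) ;
            throw std::runtime_error( buf ) ;
        }
    }

    ~PosixRegex() { regfree( &re_ ) ; }

    PosixRegex( const PosixRegex & ) = delete ;
    PosixRegex & operator=( const PosixRegex & ) = delete ;

    // Returns the whole match followed by each capture, or an empty vector if
    // there was no match.  The capture array is sized from re_nsub, so the
    // caller never has to guess how many regmatch_t slots are needed.
    std::vector<std::string> match( const std::string & input ) const
    {
        std::vector<regmatch_t> m( re_.re_nsub + 1 ) ;
        std::vector<std::string> out ;

        if ( regexec( &re_, input.c_str(), m.size(), m.data(), 0 ) == 0 )
        {
            for ( const auto & g : m )
            {
                if ( g.rm_so < 0 )
                {
                    out.push_back( std::string{} ) ;   // group did not participate
                }
                else
                {
                    out.push_back( input.substr( static_cast<std::size_t>( g.rm_so ),
                                                 static_cast<std::size_t>( g.rm_eo - g.rm_so ) ) ) ;
                }
            }
        }

        return out ;
    }
} ;
```

With this, `PosixRegex re( "([^[:space:]]+)[[:space:]]+([^[:space:]]+)" )` followed by `re.match( "hello world" )` gives back the whole match and the two captures as strings, which hides the offset arithmetic in one place.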

## More C++11 notes from reading Stroustrup: nothrow, try, inline & unnamed namespace, initialized new

Here are more notes from reading Stroustrup’s “The C++ Programming Language, 4th edition”.

## throw() as noexcept equivalent

throw() without any exception types can be used as an equivalent to the new noexcept keyword. Stroustrup also mentions that explicit throw() clauses

void foo() throw( e1, e2 ) ;


haven’t worked out well in practice, and are deprecated.

## try scopes as function body

It turns out that try clauses can be used as function bodies, as in

void foo( void )
try {
}
catch ( ... )
{
}


This can also be done for constructor and destructor bodies as in

X::X( T1 v, T2 w )
try
: f1( v )
, f2( w )
{
}
catch ( ... )
{
}


so that a throw in the class field member construction can also be caught.
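Here is a small self-contained sketch of the constructor form. The types field and holder and the helper construction_throws are invented for this illustration. One thing worth knowing: the handler of a constructor’s function try block cannot swallow the exception. If it reaches its end, the exception is rethrown, so the caller still sees it.

```cpp
#include <stdexcept>

// Hypothetical member type whose constructor can throw.
struct field
{
    explicit field( int v )
    {
        if ( v < 0 )
        {
            throw std::invalid_argument( "negative" ) ;
        }
    }
} ;

struct holder
{
    static int caught ;   // counts how often the handler below ran
    field f1 ;

    explicit holder( int v )
    try
        : f1( v )
    {
    }
    catch ( const std::invalid_argument & )
    {
        ++caught ;
        // Reaching the end of a constructor's handler rethrows the exception.
    }
} ;

int holder::caught = 0 ;

// Returns true if constructing a holder from v threw.
bool construction_throws( int v )
{
    try
    {
        holder h( v ) ;
        (void)h ;
        return false ;
    }
    catch ( const std::invalid_argument & )
    {
        return true ;
    }
}
```

So the handler is a place to log or translate the failure, not a way to end up with a half constructed object.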

## Inline (default) namespace

There is a mechanism for namespace versioning. Suppose that you want a new V2 namespace to be the default, you can do:

namespace myproject
{
inline namespace V2
{
struct X {
int x ;
int y ;
} ;
void foo( const X & ) ;
}

namespace V1
{
struct X {
int x ;
} ;

void foo( const X & ) ;
}
}


Existing callers of the library that are using V1 interfaces can continue to work unmodified, but new callers will use the V2::X and V2::foo interfaces, and the library can provide both interfaces, one for compatibility and another for new code:

void myproject::V2::foo( const myproject::V2::X & )
{
// ...
}

void myproject::V1::foo( const myproject::V1::X & )
{
// ...
}
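A stripped down sketch of the lookup behaviour, using a made up namespace mylib and a function version() that are only for illustration: names in the inline namespace are found as if they were members of the enclosing namespace, while the older version stays reachable by its explicit name.

```cpp
namespace mylib
{
    inline namespace V2
    {
        inline int version() { return 2 ; }
    }

    namespace V1
    {
        inline int version() { return 1 ; }
    }
}

// Unqualified-by-version callers get the inline (default) namespace.
inline int default_version() { return mylib::version() ; }

// Callers that need the old behaviour can still ask for it explicitly.
inline int legacy_version() { return mylib::V1::version() ; }
```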


## Unnamed namespaces.

I’d once seen unnamed namespaces as a modern C++ (more general) replacement for static functions. To see if such namespace functions are optimized away in the same fashion as a static function, I tried

#include <stdio.h>

namespace
{
void foo()
{
printf( "ns:foo\n" ) ;
}
}

int main()
{
foo() ;

return 0 ;
}


This example uses printf and not std::cout because I wanted to look at the assembly listing and cout’s listing, at least on a mac, was completely abysmal. foo() was optimized away, but that’s a lot easier to see in the C printf listing:

$ make
c++ -o n -std=c++11 -O2 n.cc
$ otool -tV n | less
n:
(__TEXT,__text) section
_main:
0000000100000f70        pushq   %rbp
0000000100000f71        movq    %rsp, %rbp
0000000100000f74        leaq    0x2b(%rip), %rdi        ## literal pool for: "ns:foo"
0000000100000f7b        callq   0x100000f84             ## symbol stub for: _puts
0000000100000f80        xorl    %eax, %eax
0000000100000f82        popq    %rbp
0000000100000f83        retq
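The “more general” part is that an unnamed namespace can also make types local to a translation unit, which static cannot do. A minimal sketch, with hypothetical names counter, helper and doubled_next:

```cpp
namespace
{
    // A type that is only visible in this translation unit.
    struct counter
    {
        int n = 0 ;
        int next() { return ++n ; }
    } ;

    // Equivalent in effect to declaring this function static.
    int helper( int v )
    {
        return v * 2 ;
    }
}

// Externally visible function that uses the file-local pieces.
int doubled_next()
{
    static counter c ;
    return helper( c.next() ) ;
}
```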


## at_quick_exit

There’s now also a mechanism to exit and avoid global destructors and atexit routines from being evaluated. Here’s an example

#include <cstdlib>
#include <iostream>

extern "C"
void normalexit()
{
std::cout << "normalexit\n" ;
}

extern "C"
void quickCexit()
{
std::cout << "quickCexit\n" ;
}

void quickCPPexit()
{
std::cout << "quickCPPexit\n" ;
}

class X
{
public:
~X()
{
std::cout << "X::~X()\n" ;
}
} x ;

int main( int argc, char ** argv )
{
atexit( normalexit ) ;
std::at_quick_exit( quickCexit ) ;
std::at_quick_exit( quickCPPexit ) ;

if ( argc == 1 )
{
std::quick_exit( 3 ) ;
}

return 0 ;
}

When run without arguments (argc == 1), we get

$ ./at
quickCPPexit
quickCexit

whereas if the normal exit processing is allowed to complete we see global destructors and regular atexit calls

$ ./at 1
normalexit
X::~X()


Observe that at_quick_exit can take functions with both C and C++ linkage. The C++ standard actually declares both overloads for std::atexit too, so the extern “C” on normalexit above is not strictly required.

## Enum default

It was not obvious to me what the default value for an enum class (or enum) should be (the first value, an invalid value, zero, …)? It turns out that the default is zero, as printed by the following fragment

#include <iostream>

enum class x { v = 1, w } ;
enum y { vv = 1, ww } ;

int main()
{
x e1 = {} ;
y e2 = {} ;
std::cout << (int)e1 << '\n' ;
std::cout << e2 << '\n' ;

return 0 ;
}


Note that an explicit cast is required for enum class values, but not for plain enum values, which are implicitly convertible to int.
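One consequence, sketched below with a made up enum level and helper is_named: a value-initialized enum can hold a value that matches none of its enumerators, so code that switches on the enum may want to treat that case explicitly.

```cpp
enum class level { low = 1, high } ;

// Value initialization gives the underlying value 0, which is not an
// enumerator of level.
constexpr level unset{} ;
static_assert( static_cast<int>( unset ) == 0, "value-initialized enum is zero" ) ;

// Returns true only for values that correspond to a declared enumerator.
bool is_named( level v )
{
    return ( v == level::low ) || ( v == level::high ) ;
}
```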

## default initialization with new

The uniform initializer syntax can also be used with new expressions. Here’s an example with an uninitialized double allocation and a zero (value) initialized one

#include <stdio.h>

int main()
{
double * d1 = new double ;
double * d2 = new double{} ;

printf( "%g %g\n", *d1, *d2 ) ;

return 0 ;
}


Observe that we get nice garbage values for *d1, but *d2 is always 0.0:

$ ./d
-1.49167e-154 0
$ ./d
0 0
$ ./d
1.72723e-77 0
$ ./d
-2.68156e+154 0
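The same empty-brace form works for array allocations. A small sketch (the function all_zero is just for illustration, and it uses unique_ptr so that nothing leaks):

```cpp
#include <cstddef>
#include <memory>

// Allocates n doubles with an empty brace initializer and checks that every
// element was zeroed.
bool all_zero( std::size_t n )
{
    std::unique_ptr<double[]> p( new double[n]{} ) ;

    for ( std::size_t i = 0 ; i < n ; ++i )
    {
        if ( p[i] != 0.0 )
        {
            return false ;
        }
    }

    return true ;
}
```

Without the braces the elements would be left uninitialized, and reading them is formally undefined behaviour rather than just “garbage”.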


## initializer_list

I remember really wanting a feature like this eons ago when I first wrote a matrix template class in 1st year. Here’s a sample of how it could be used

#include <iostream>
#include <vector>
#include <string>

template <unsigned r, unsigned c>
class m
{
std::vector<double> mat ;

public:
class bad_init {} ;

m() : mat(r*c) {}

m( std::initializer_list<double> i ) : mat( r * c )
{
if ( i.size() > ( r * c ) )
{
throw bad_init{} ;
}

int p{} ;
for ( auto v : i )
{
mat[ p++ ] = v ;
}
}

void dump( const std::string & n ) const
{
const char * sep = ": " ;
std::cout << n ;

for ( auto v : mat )
{
std::cout << sep << v ;
sep = ", " ;
}

std::cout << '\n' ;
}
} ;

int main()
{
m< 3, 2 > v1 ;
m< 3, 2 > v2{ 0., 1., 2., 3., 4. } ;

v1.dump( "v1" ) ;
v2.dump( "v2" ) ;

m< 3, 2 > v3{ 0., 1., 2., 3., 4., 5., 6., 7. } ;

return 0 ;
}


This produces the two dumps and the expected std::terminate call for the wrong (too many) parameters on the third construction attempt

$ ./i
v1: 0, 0, 0, 0, 0, 0
v2: 0, 1, 2, 3, 4, 0
libc++abi.dylib: terminating with uncaught exception of type m<3u, 2u>::bad_init
Abort trap: 6

## Notes for Stroustrup’s “The C++ Programming Language, 4th Ed.”: nothrow new, noexcept, noreturn, static cons, initializer_list

I recently purchased Stroustrup’s C++11 book [1], after borrowing it a number of times from the Markham public library (it’s very popular, and only offered for short term loan). Here are some notes of some bits and pieces that were new to me for this round of reading.

## nothrow new

In DB2 we used to have to compile with -fcheck-new or similar, because we had lots of code that predated new throwing on error (c++98). There is a form of new that explicitly doesn’t throw:

void * operator new( size_t sz, const nothrow_t &) noexcept ;

I don’t know if this was introduced in c++11. If this was a c++98 addition, then it should be used in almost all the codebases new calls. When I left DB2 there were still some platform compilers (i.e. AIX xlC which doesn’t use the clang front end like linuxppcle64 xlC) that were not c++11 capable, so if this explicit nothrow isn’t c++98, it probably can’t be used.

## Unnamed function parameters

It is common to see function prototypes without named parameters, such as

void foo( int, int ) ;

I did not realize that is also possible in the function definition, as in code like the following where a parameter has been dropped or left as a placeholder for future use

void foo( int x, int )
{
printf( "%d\n", x ) ;
}

Not naming the parameter is probably a good way to get rid of unused parameter warnings. This is very likely not a c++11 addition. I just didn’t realize the language allowed for it, and had never seen it done.

## No return attribute

Looks like __attribute__ extensions are being baked right into the language, as in

[[noreturn]] void exit( int ) ;

I wonder if this is also in the plan for C?

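Here is a minimal sketch of where [[noreturn]] helps, using hypothetical functions fail and checked_div. Marking fail as noreturn tells the compiler that control never comes back from it, so it does not need to complain about, or generate code for, the path after the call. (On the C question: C11 did add a `_Noreturn` function specifier for the same purpose.)

```cpp
#include <stdexcept>

// Never returns normally: it always throws.
[[noreturn]] void fail( const char * msg )
{
    throw std::runtime_error( msg ) ;
}

// Integer division that reports a zero divisor through fail().
int checked_div( int a, int b )
{
    if ( b == 0 )
    {
        fail( "divide by zero" ) ;
    }

    return a / b ;
}
```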
## Thread safe static constructors

C++11 explicitly requires that function-local static variables are initialized using a “call-once” mechanism

class x
{
public:
x() ;
} ;

void foo( void )
{
static x v ;
}

Here there is no data race if foo() is executed concurrently in a number of threads. I remember seeing DB2 code that did this (and opening a defect to have it “fixed”), since I had no idea if it would work. We didn’t (and couldn’t yet) use -std=c++11, so it’s anybody’s guess what that does without that option and on older pre c++11 compilers.

## Implied type initializer lists.

In a previous post I mentioned the c++11 uniform initialization syntax. The basic idea is that instead of

int x(1) ;
int y(0) ;

or

int x = 1 ;
int y = 0 ;

c++11 now allows

int x{1} ;
int y{} ;

Here the variables are initialized with values 1, and 0 (the default). The motivation for this was to provide an initializer syntax that could be used with container classes. Here’s another variation on the initializer list initialization

int x = int{} ;
int y = int{3} ;

which can be reduced to

int x = {} ;
int y = {3} ;

where the types of the lists are implied. I don’t see much value in using this equals-list syntax in the examples above. Where this might be useful is in templated code to provide defaults

template <typename T>
void foo( T x, T v = {} ) ;

## Runtime values for default arguments.

I don’t know if this is new to C++11, but the book points out that default arguments can be runtime determined values. Initially, my thought on this was that it is good that this is not well known, since it would be confusing. I did, however, come up with a scenario where this could be useful. I wrote some code like the following the other day

extern bool g ;

inline int foo( )
{
int res = 0 ;

if ( g )
{
// first option
}
else
{
// second option
}

return res ;
}

The global g was precomputed at the shared library startup point (effectively const without being marked so).
My unit test of this code modified the value of g, which was a hack and I admit ugly. It looked like

BOOST_AUTO_TEST_CASE( basicTest )
{
for ( auto b : {false, true} )
{
g = b ;

int res = foo() ;

BOOST_REQUIRE( res >= 0 ) ;
}
}

This has a side effect of potentially changing the global. A different way to do this would have been

extern bool g ;

inline int foo( bool internalOverrideOfGlobalForTesting = g )
{
int res = 0 ;

if ( internalOverrideOfGlobalForTesting )
{
// first option
}
else
{
// second option
}

return res ;
}

The test code could then be rewritten as

BOOST_AUTO_TEST_CASE( basicTest )
{
for ( auto b : {false, true} )
{
int res = foo( b ) ;

BOOST_REQUIRE( res >= 0 ) ;
}
}

This doesn’t touch the global (an internal value), but still would have allowed for testing both codepaths. The fact that this “feature” exists may not actually be useful in this case, since my interface was a C interface. Does a

## noexcept

Functions that intend to provide a C interface can use the noexcept keyword. That allows the compiler to enforce the fact that such functions should provide a firewall that doesn’t let any exceptions through. Example:

// foo.h
#if defined __cplusplus
#define EXTERNC extern "C"
#define NOEXCEPT noexcept
#else
#define EXTERNC
#define NOEXCEPT
#endif

EXTERNC int foo(void) NOEXCEPT ;

// foo.cc
#include "foo.h"

int foo( void ) NOEXCEPT
{
int rc = 0 ;

try {
//
}
catch ( ... )
{
// handle error
rc = 1 ;
}

return rc ;
}

If foo does not catch all exceptions, then the use of noexcept will drive std::terminate(), like a throw from a destructor does on some platforms.

# References

[1] Bjarne Stroustrup. The C++ Programming Language, 4th Edition. Addison-Wesley, 2014.

## extern vs const in C++ and C code.

We now build DB2 on linux ppcle with the IBM xlC 13.1.2 compiler. This version of the compiler is a hybrid compared to any previous compilers, retaining the IBM xlC backend for power, but using the clang front end.
Because of this we are exposed to a large number of warnings that we don’t see with many other compilers (well we probably do for our MacOSX port, but we do not really have active development on that platform at the moment), and I’ve been trying to take down those counts to manageable levels. Header files that produce warnings have been my first target since they introduce the most repeated noise.

One message that I was seeing hundreds of was

warning: 'extern' variable has an initializer [-Wextern-initializer]

This seemed to be coming from headers that did something like:

#if defined FOO_INITIALIZE_IT_IN_SOME_SOURCE_FILE
extern const TYPE foo[] = { ... } ;
#else
extern const TYPE foo[] ;
#endif

where FOO_INITIALIZE_IT_IN_SOME_SOURCE_FILE is defined at the top of a source file that explicitly includes this header. My attempt to handle the messages was to remove the ‘extern’ from the initialization case, but I was surprised to see link errors as a result of some of those changes. It turns out that there are some subtle differences between different variations of const and extern with an array declaration of this sort. Here’s a bit of sample code:

// t.h
extern const int x[] ;
extern int y[] ;
extern int z[] ;

// t.cc
#if defined WANT_LINK_ERROR
const int x[] = { 42 } ;
#else
extern const int x[] = { 42 } ;
#endif

extern int y[] = { 42 } ;
int z[] = { 42 } ;

When WANT_LINK_ERROR isn’t defined, this produces just one clang warning message

t.cc:8:12: warning: 'extern' variable has an initializer [-Wextern-initializer]
extern int y[] = { 42 } ;
^

Note that the ‘extern const’ has no such warning, nor does the non-const symbol that’s been declared ‘extern’ in the header. However, removing the extern from the const case (via -DWANT_LINK_ERROR) results in no symbol ‘x’ available to other consumers. The extern is required for const symbols, but generates a warning for non-const symbols. It appears that this is also C++ specific.
A const symbol in C compiled code is available for external use, regardless of whether extern is used:

$ clang -c t.c
t.c:5:18: warning: 'extern' variable has an initializer [-Wextern-initializer]
extern const int x[] = { 42 } ;
^
t.c:8:12: warning: 'extern' variable has an initializer [-Wextern-initializer]
extern int y[] = { 42 } ;
^
2 warnings generated.

$ nm t.o
0000000000000000 R x
0000000000000000 D y
0000000000000004 D z
$ clang -c -DWANT_LINK_ERROR t.c
t.c:8:12: warning: 'extern' variable has an initializer [-Wextern-initializer]
extern int y[] = { 42 } ;
^
1 warning generated.
$ nm t.o
0000000000000000 R x
0000000000000000 D y
0000000000000004 D z

whereas that same symbol requires extern if it is const in C++:

$ clang++ -c t.cc
t.cc:8:12: warning: 'extern' variable has an initializer [-Wextern-initializer]
extern int y[] = { 42 } ;
^
1 warning generated.
$ nm t.o
0000000000000000 R x
0000000000000000 D y
0000000000000004 D z
$ clang++ -c -DWANT_LINK_ERROR t.cc
t.cc:8:12: warning: 'extern' variable has an initializer [-Wextern-initializer]
extern int y[] = { 42 } ;
^
1 warning generated.
$ nm t.o
0000000000000000 D y
0000000000000004 D z



I hadn’t expected the const to interact this way with extern. In C++ a namespace scope const variable has internal linkage by default, so the compiler need not generate an external symbol for it unless you ask for one with extern (or an earlier extern declaration is in scope), whereas with C you get the symbol like it or not. This particular message from the clang front end is only issued for non-const extern initializations, so you can’t do an across the board removal of extern from the initializers in a given file without first ensuring that each symbol isn’t const. It looks like dealing with this will have to be done much more carefully than I first tried.
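One way to sidestep both the warning and the link error, sketched here under the assumption that the defining source file can include the header first: once the extern declaration has been seen, a later definition without extern still gets external linkage. The names lookup_table and lookup_table_count are hypothetical, and the header and source are shown in one block for brevity.

```cpp
// t.h (hypothetical header contents)
extern const int lookup_table[] ;
extern const unsigned lookup_table_count ;

// t.cc, after including t.h.  No 'extern' on the definitions, so there is no
// -Wextern-initializer warning, and the earlier declarations give the const
// objects external linkage.
const int lookup_table[] = { 42 } ;
const unsigned lookup_table_count = sizeof( lookup_table ) / sizeof( lookup_table[0] ) ;
```

This does not help a source file that defines the array without ever including the header, which is the situation the t.cc sample above demonstrates.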

## Thoughts on the “new” C++ style cast operators.

I happen to maintain the DB2 coding standards, which are mostly concerned with portability, and not style.  I’ve joked that I was given that job since I had “broken the build” more than anybody else, so was most qualified to let others know how not to do so.

In our coding standards we have a prohibition against the use of exceptions.  This is a historical restriction because we’ve built with compilation flags like -qnoeh (no exception handling) on some platforms to get a bit of additional performance.  These days the compilers do much better at not degrading performance when exception handling is allowed and not used, but since our performance folks will sell their kids for a 1% improvement, we’ve kept using flags like this and the associated restriction.  Components that must (or want) to use exception handling must “firewall” any exceptions, not letting them get thrown to external code (and also explicitly enable exceptions for their code).

We had a note in the coding standards not to use RTTI (Run Time Type Identification), because exceptions are required.  That was a confusing and incomplete statement to include in our standards.  It was interpreted by one developer as meaning that none of:

dynamic_cast<>()
reinterpret_cast<>()
static_cast<>()
const_cast<>()


were allowed. However, only the dynamic_cast is a RTTI operation, and only the dynamic_cast can throw an exception when the cast doesn’t match the underlying type (std::bad_cast for a failed reference cast; a failed pointer cast just returns a null pointer).

I’ve now fixed up our coding standards. It now references dynamic_cast instead of RTTI.

The DB2 code is still very C’ish, compiled with a C++ compiler. I’d say the bulk of the casts in our code are old style C casts, and that most of our developers (including myself) don’t even know when to use the “new” cast operations. Here are some thoughts on these:

• I’ve seen a fair amount of const_cast use in our code, and I know personally how this can be very useful.  An example:

volatile int x ;
void foo( int * y ) ;

foo( const_cast<short *>(&x) ) ; // compilation error
foo( const_cast<int *>(&x) ) ; // allowed.  Just strips the volatile (or const) off the pointer type.

• A second nice use of const_cast<>() is to enforce type checking in macros.  If the macro parameter is supposed to be a pointer to type T, you can enforce that by using a const_cast<T*>.  This assumes that you don’t actually care about the const-ness of the pointer, and will force a compile error if the macro is used with any other type.
• In general I don’t think it’s a bad idea to use the new cast operators, since they form a hierarchy of casts that are each weaker than a C style cast.  You can also search for cast operations by name in a given file if they are used, which is much harder to do with a C style cast.
• I’m not sure how much use of static_cast<> and reinterpret_cast<> we have in the code.

Declaring my own stupidity, without looking them up, I didn’t personally know how to use the “new style” static_cast and reinterpret_cast operations correctly. I use a C cast by habit unless I’m trying to strip const or volatile attributes (or enforce a type in a macro).

If DB2 coders start using these, it will likely confuse old guys like me for a while, but I figure I can learn about these.

As a step in this direction, I see some helpful looking info in the C++ reference page on explicit_cast.

My take on the reference page is roughly: if there is a place to use a C cast, then you can use:

1. const_cast if it compiles.  If not you can use:
2. static_cast.  If that doesn’t compile you can use:
3. reinterpret_cast.  If that doesn’t compile you can use:
4. (In code where exceptions are allowed:) dynamic_cast.  If that doesn’t work or is not allowed you can use:
5. A C style cast.
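To make the list a bit more concrete, here is a small sketch with one typical use of each operator. All of the type and function names (Base, Derived, Other, narrow, address_of, read_through, is_derived) are invented for the illustration. Note that a C style cast never performs a dynamic_cast, so that one is really a separate tool for checked downcasts rather than a step on the way to a C cast.

```cpp
#include <cstdint>

struct Base { virtual ~Base() {} } ;
struct Derived : Base { int v = 7 ; } ;
struct Other : Base {} ;

// static_cast: a conversion the language already knows how to do,
// here double to int (truncating toward zero).
int narrow( double d )
{
    return static_cast<int>( d ) ;
}

// reinterpret_cast: reinterpret a pointer value as an integer.
std::uintptr_t address_of( const void * p )
{
    return reinterpret_cast<std::uintptr_t>( p ) ;
}

// const_cast: strip const (or volatile) and nothing else.  Only reading
// through the result here, so this is safe even for a truly const object.
int read_through( const int * p )
{
    int * q = const_cast<int *>( p ) ;
    return *q ;
}

// dynamic_cast: runtime checked downcast.  The pointer form returns a null
// pointer on a mismatch instead of throwing.
bool is_derived( Base * b )
{
    return dynamic_cast<Derived *>( b ) != nullptr ;
}
```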