Month: July 2016

Playing with c++11 and posix regular expression libraries

July 24, 2016 C/C++ development and debugging.

I was curious how the c++11 std::regex interface compared to the C posix regular expression library. The c++11 interfaces are almost as easy to use as perl. Suppose we have some space separated fields that we wish to manipulate, showing an order switch and the original:

my @strings = ( "hi bye", "hello world", "why now", "one two" ) ;

foreach ( @strings )
{
   s/(\S+)\s+(\S+)/'$&' -> '$2 $1'/ ;

   print "$_\n" ;
}

The C++ equivalent is

   const char * strings[] { "hi bye", "hello world", "why now", "one two" } ;

   std::regex re( R"((\S+)\s+(\S+))" ) ;

   for ( auto s : strings )
   {
      std::cout << regex_replace( s, re, "'$&' -> '$2 $1'\n" )  ;
   }

We have one additional step with the C++ code, compiling the regular expression. Precompilation of perl regular expressions is also possible, but that is usually just a performance optimization.

The posix equivalent requires precompilation too

void posixre_error( regex_t * pRe, int rc )
{
   char buf[ 128 ] ;

   regerror( rc, pRe, buf, sizeof(buf) ) ;

   fprintf( stderr, "regerror: %s\n", buf ) ;
   exit( 1 ) ;
}

void posixre_compile( regex_t * pRe, const char * expression )
{
   int rc = regcomp( pRe, expression, REG_EXTENDED ) ;
   if ( rc )
   { 
      posixre_error( pRe, rc ) ;
   }
}

but the transform requires more work:

void posixre_transform( regex_t * pRe, const char * input )
{
   constexpr size_t N{3} ;
   regmatch_t m[N] {} ;

   int rc = regexec( pRe, input, N, m, 0 ) ;

   if ( rc && (rc != REG_NOMATCH) )
   {
      posixre_error( pRe, rc ) ;
   }

   if ( !rc )
   { 
      printf( "'%s' -> ", input ) ;
      int len ;
      len = m[2].rm_eo - m[2].rm_so ; printf( "'%.*s ", len, &input[ m[2].rm_so ] ) ;
      len = m[1].rm_eo - m[1].rm_so ; printf( "%.*s'\n", len, &input[ m[1].rm_so ] ) ;
   }
}

To get at the capture expressions we have to pass an array of regmatch_t’s. The first element of that array is the entire match expression, and then we get the captures after that. The awkward thing to deal with is that the regmatch_t is a structure containing the start and end offsets within the string.

If we want more granular info from the c++ matcher, it can also provide an array of capture info. We can also get info about whether or not the match worked, something we can do in perl easily

my @strings = ( "hi bye", "helloworld", "why now", "onetwo" ) ;

foreach ( @strings )
{
   if ( s/(\S+)\s+(\S+)/$2 $1/ )
   {
      print "$_\n" ;
   }
}  

This only prints the transformed line if there was a match success. To do this in C++ we can use regex_match

const char * pattern = R"((\S+)\s+(\S+))" ;

std::regex re( pattern ) ;

for ( auto s : strings )
{ 
   std::cmatch m ;

   if ( regex_match( s, m, re ) )
   { 
      std::cout << m[2] << ' ' << m[1] << '\n' ;
   }
}
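One difference from the perl substitution is worth noting: regex_match only succeeds when the pattern covers the whole input, whereas s/// looks for a match anywhere in the string. The closer C++ analogue is regex_search. Here is a minimal sketch of that flavour; the helper name swap_first_two is made up for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: swap the first two whitespace separated fields found
// anywhere in the input. Returns an empty string if there is no match.
std::string swap_first_two( const std::string & s )
{
   static const std::regex re( R"((\S+)\s+(\S+))" ) ;
   std::smatch m ;

   // regex_search accepts a match anywhere in s; regex_match would require
   // the pattern to cover the entire string.
   if ( !std::regex_search( s, m, re ) )
   {
      return {} ;
   }

   return m[2].str() + ' ' + m[1].str() ;
}
```

With this helper, an input with leading whitespace or a third field still yields the swapped first two fields, while regex_match would reject the same input outright.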

Note that we don’t have to mess around with offsets as was required with the Posix C interface, and also don’t have to worry about the size of the capture match array, since that is handled under the covers. It’s not too hard to wrap the posix C APIs in a C++ wrapper that makes them about as easy to use as the C++ regex code, but there is little reason to do so unless you are constrained to pre-C++11 code and can also live with a Unix-only restriction. There are also portability issues with the posix APIs. For example, perl-style regular expressions like:

   R"((\S+)(\s+)(\S+))"

work fine with the Linux regex API, but that appears to be an exception. To make code using that regex work on Mac, I had to use strict posix syntax

   R"(([^[:space:]]+)([[:space:]]+)([^[:space:]]+))"

Actually using the Posix C interface, with a portability constraint that avoids the Linux regex extensions, would be horrendous.
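For what it’s worth, here is a rough sketch of the sort of C++ wrapper around the posix APIs mentioned above. The class name posix_regex and its match() method are hypothetical, not part of any library; only regcomp, regexec, regerror and regfree are the real posix calls. It hides the regmatch_t offsets by returning the whole match and the captures as strings.

```cpp
#include <regex.h>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical RAII wrapper around the posix regcomp/regexec/regfree calls.
class posix_regex
{
   regex_t re ;

public:
   explicit posix_regex( const char * expression )
   {
      int rc = regcomp( &re, expression, REG_EXTENDED ) ;
      if ( rc )
      {
         char buf[ 128 ] ;
         regerror( rc, &re, buf, sizeof(buf) ) ;
         throw std::runtime_error( buf ) ;
      }
   }

   ~posix_regex() { regfree( &re ) ; }

   posix_regex( const posix_regex & ) = delete ;
   posix_regex & operator = ( const posix_regex & ) = delete ;

   // Returns the whole match followed by each capture. The result is empty
   // if regexec reports no match (or any other failure).
   std::vector<std::string> match( const std::string & input ) const
   {
      std::vector<regmatch_t> m( re.re_nsub + 1 ) ;
      std::vector<std::string> out ;

      if ( 0 == regexec( &re, input.c_str(), m.size(), m.data(), 0 ) )
      {
         for ( const auto & r : m )
         {
            // rm_so is -1 for a capture group that did not participate.
            out.push_back( r.rm_so < 0 ? std::string{}
                                       : input.substr( r.rm_so, r.rm_eo - r.rm_so ) ) ;
         }
      }

      return out ;
   }
} ;
```

Using the strict posix pattern, posix_regex re( "([^[:space:]]+)[[:space:]]+([^[:space:]]+)" ) followed by re.match( "hello world" ) gives back three strings: the full match, then "hello" and "world".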

Notes on “memory and resources” of Stroustrup’s “The C++ Programming Language”.

July 21, 2016 C/C++ development and debugging.

Some chapter 34 notes.

array

There’s a fixed size array type designed to replace raw C style arrays. It doesn’t appear that it is bounds checked by default, and the Xcode7 (clang) compiler doesn’t do bounds checking for it right now. Here’s an example

#include <array>

using a10 = std::array<int, 10> ;

void foo( a10 & a )
{
   a[3] = 7 ;
   a[13] = 7 ;
}

void bar( int * a )
{
   a[3] = 7 ;
   a[13] = 7 ;
}

The generated asm for both of these is identical

$ gobjdump -d --reloc -C --no-show-raw-insn d.o

d.o:     file format mach-o-x86-64

Disassembly of section .text:

0000000000000000 <foo(std::__1::array<int, 10ul>&)>:
   0:   push   %rbp
   1:   mov    %rsp,%rbp
   4:   movl   $0x7,0xc(%rdi)
   b:   movl   $0x7,0x34(%rdi)
  12:   pop    %rbp
  13:   retq   
  14:   data16 data16 nopw %cs:0x0(%rax,%rax,1)

0000000000000020 <bar(int*)>:
  20:   push   %rbp
  21:   mov    %rsp,%rbp
  24:   movl   $0x7,0xc(%rdi)
  2b:   movl   $0x7,0x34(%rdi)
  32:   pop    %rbp
  33:   retq   
  34:   data16 data16 nopw %cs:0x0(%rax,%rax,1)

The foo() function here is also not compile-time bounds checked if the out of bounds access is changed to

   a.at(13) = 7 ;

however, this does at least generate an out of bounds error

$ ./d
libc++abi.dylib: terminating with uncaught exception of type std::out_of_range: array::at
Abort trap: 6

Even though we don’t get compile-time bounds checking (at least with the current clang compiler), array has the nice advantage of knowing its own size, so you can’t screw it up:

void blah( a10 & a )
{
   a[0] = 1 ;

   for ( int i{1} ; i < a.size() ; i++ )
   {
      a[i] = 2 * a[i-1] ;
   }
}
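Two forms of checked access are available with std::array. The following is a minimal sketch with made up helper names: at() gives a run-time check that can be caught, and std::get with a constant index is checked at compile time.

```cpp
#include <array>
#include <cstddef>
#include <stdexcept>

using a10 = std::array<int, 10> ;

// Hypothetical helper: returns true if the checked store succeeded.
bool checked_store( a10 & a, std::size_t i, int v )
{
   try
   {
      a.at( i ) = v ;   // throws std::out_of_range if i >= a.size()
      return true ;
   }
   catch ( const std::out_of_range & )
   {
      return false ;
   }
}

// Hypothetical helper: stores through a compile-time constant index.
void constant_index_store( a10 & a )
{
   std::get<3>( a ) = 7 ;      // index checked at compile time
   // std::get<13>( a ) = 7 ;  // rejected by the compiler: index out of bounds
}
```

So compile-time checking is available, but only when the index is a constant expression and the access is spelled with std::get instead of operator [].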

bitset and vector bool

The bitset class provides a fixed size bit array that appears to be formed from an array of register sized words. On a 64-bit platform (mac+xcode 7) I’m seeing that sizeof() == 8 for <= 64 bits, and doubles after that for <= 128 bits.

The code for something like the following (set two bits) is pretty decent, basically a single or-immediate instruction:

using b70 = std::bitset<70> ;

void foo( b70 & v )
{
   v[3] = 1 ;
   v[13] = 1 ;
}

Array access operators are provided to access each bit position:

   for ( int i{} ; i < v.size() ; i++ )
   {
      char sep{ ' ' } ;
      if ( ((i+1) % 8) == 0 )
      {
         sep = '\n' ;
      }

      std::cout << v[i] << sep ;
   }
   std::cout << '\n' ;

There is no range-for support built in for this class. I was able to add that support using a wrapper class

template <int N>
struct iter ;

template <int N>
struct mybits : public std::bitset<N>
{
   using T = std::bitset<N> ;

   using T::T ;
   using T::size ;

   inline iter<N> begin( ) ;

   inline iter<N> end( ) ;
} ;

and a helper iterator

template <int N>
struct iter
{
   unsigned pos{} ;
   const mybits<N> & b ;

   iter( const mybits<N> & bits, unsigned p = {} ) : pos{p}, b{bits} {}

   const iter & operator++()
   {
      pos++ ;

      return *this ;
   }

   bool operator != ( const iter & i ) const
   { 
      return pos != i.pos ;
   }

   int operator*() const
   { 
      return b[ pos ] ;
   }
} ;

plus the begin and end function bodies required for the loop

template <int N>
inline iter<N> mybits<N>::begin( )
{
   return iter<N>( *this ) ;
}

template <int N>
inline iter<N> mybits<N>::end( )
{
   return iter<N>( *this, size() ) ;
}

I’m not sure what the rationale for not including such range for support is, when std::vector<bool> has exactly that? vector<bool> is a vector specialization that is also supposed to be compact, but unlike bitset, allows for a variable sized bit array.

bitset also has a number of handy type conversion operators that vector<bool> does not (to string, and string to integer)
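A small sketch of those differences follows. The helper names are made up for illustration, and an 8 bit bitset is assumed just to keep the strings short: bitset converts from and to text and to an integer, while vector<bool> supports range-for directly.

```cpp
#include <bitset>
#include <cstddef>
#include <string>
#include <vector>

using b8 = std::bitset<8> ;

// Hypothetical helper: text of 0s and 1s -> bitset -> integer.
unsigned long from_binary_text( const std::string & text )
{
   return b8( text ).to_ulong() ;
}

// Hypothetical helper: count the set bits of a vector<bool> with range-for.
int count_set( const std::vector<bool> & v )
{
   int n{} ;
   for ( bool b : v )
   {
      n += b ;
   }
   return n ;
}

// Hypothetical helper: copy a bitset into a (resizable) vector<bool>.
std::vector<bool> to_vector( const b8 & b )
{
   std::vector<bool> v( b.size() ) ;
   for ( std::size_t i{} ; i < b.size() ; i++ )
   {
      v[i] = b[i] ;
   }
   return v ;
}
```

For example, from_binary_text( "00001010" ) is 10, and b8( 10 ).to_string() goes back the other way.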

tuple

The std::tuple type generalizes std::pair, allowing for easy structures of N different types.

I saw that tuple has a companion std::tie function that allows it to behave very much like a perl array assignment. Such an assignment looks like

#!/usr/bin/perl

my ($a, $b, $c) = foo() ;

printf( "%0.1f $b $c\n", $a ) ;

exit 0 ;

sub foo
{
   return (1.0, "blah", 3) ;
}

A similar C++ equivalent is more verbose

#include <tuple>
#include <stdio.h>

using T = std::tuple<float, const char *, int> ;

T foo()
{
   return std::make_tuple( 1.0, "blah", 3 ) ;
}

int main()
{
   float f ;
   const char * k ;
   int i ;

   std::tie( f, k, i ) = foo() ;

   printf("%f %s %d\n", f, k, i ) ;

   return 0 ;
}
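Two related facilities are worth a quick sketch: std::ignore, which lets tie skip a field (much like undef in a perl list assignment), and std::get, which reads one field by index without any tie. The helper names describe() and first_field() are made up for illustration.

```cpp
#include <string>
#include <tuple>

using T = std::tuple<float, const char *, int> ;

T foo()
{
   return std::make_tuple( 1.0f, "blah", 3 ) ;
}

// Hypothetical helper: pick out only the fields of interest.
std::string describe()
{
   const char * k ;
   int i ;

   // std::ignore discards the float field.
   std::tie( std::ignore, k, i ) = foo() ;

   return std::string( k ) + ":" + std::to_string( i ) ;
}

// Fields can also be read by index, without any tie.
float first_field()
{
   return std::get<0>( foo() ) ;
}
```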

I was curious how the code that accepts a tuple return using tie, using different variables (as above), and using a structure return differed

struct S
{
   float f ;
   const char * s ;
   int i ;
} ;

S bar()
{
   return { 1.0, "blah", 3 } ;
}

In each case, using -O2 and the Xcode 7 compiler (clang), a printf function similar to the above ends up looking pretty much uniformly like:

$ gobjdump -d --reloc -C --no-show-raw-insn u.o 
...

0000000000000110 <h()>:
 110:   push   %rbp
 111:   mov    %rsp,%rbp
 114:   sub    $0x20,%rsp
 118:   lea    -0x18(%rbp),%rdi
 11c:   callq  121 <h()+0x11>
                        11d: BRANCH32   foo()
 121:   mov    -0x10(%rbp),%rsi
 125:   mov    -0x8(%rbp),%edx
 128:   movss  -0x18(%rbp),%xmm0
 12d:   cvtss2sd %xmm0,%xmm0
 131:   lea    0xd(%rip),%rdi        # 145 <h()+0x35>
                        134: DISP32     .cstring-0x145
 138:   mov    $0x1,%al
 13a:   callq  13f <h()+0x2f>
                        13b: BRANCH32   printf
 13f:   add    $0x20,%rsp
 143:   pop    %rbp
 144:   retq   

The generated code is pretty much dominated by the argument setup required for the printf call. I used printf here instead of std::cout because the generated code for std::cout is so crappy looking (and verbose).

shared_ptr

Reading the section on shared_ptr, it wasn’t obvious that it was a thread safe interface. I wondered if some sort of specialization was required to make the reference counting thread safe. It appears that thread safety of the reference counting is built in.

This can also be seen in the debugger (assuming the gcc libstdc++ is representative):

Breakpoint 1, main () at sharedptr.cc:33
33    std::shared_ptr<T> p = std::make_shared<T>() ;
Missing separate debuginfos, use: debuginfo-install libgcc-4.8.5-4.el7.x86_64 libstdc++-4.8.5-4.el7.x86_64
(gdb) n
35    foo( p ) ;
(gdb) s
std::shared_ptr<T>::shared_ptr (this=0x7fffffffe060) at /usr/include/c++/4.8.2/bits/shared_ptr.h:103
103         shared_ptr(const shared_ptr&) noexcept = default;
(gdb) s
std::__shared_ptr<T, (__gnu_cxx::_Lock_policy)2>::__shared_ptr (this=0x7fffffffe060) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:779
779         __shared_ptr(const __shared_ptr&) noexcept = default;
(gdb) s
std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count (this=0x7fffffffe068, __r=...)
    at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:550
550         : _M_pi(__r._M_pi)
(gdb) s
552      if (_M_pi != 0)
(gdb) s
553        _M_pi->_M_add_ref_copy();
(gdb) s
std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_add_ref_copy (this=0x607010) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:131
131         { __gnu_cxx::__atomic_add_dispatch(&_M_use_count, 1); }

This was looking at a call of the following form

using Tp = std::shared_ptr<T> ;

void foo( Tp p ) ;

int main()
{
   std::shared_ptr<T> p = std::make_shared<T>() ;

   foo( p ) ;

   return 0 ;
}
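The reference count change that the debugger trace steps through can also be observed from code with use_count(). This is only a sketch: the struct T here is a stand-in, since the original T isn’t shown, and the helper names are made up. Note that the atomic count only protects the shared bookkeeping, not concurrent access to the pointed-to object itself.

```cpp
#include <memory>

struct T { int v{} ; } ;   // stand-in type, assumed for illustration

using Tp = std::shared_ptr<T> ;

// Pass by value: the copy constructor bumps the shared reference count.
long count_inside_by_value( Tp p )
{
   return p.use_count() ;
}

// Pass by reference: no copy is made, so the count is unchanged.
long count_inside_by_ref( const Tp & p )
{
   return p.use_count() ;
}
```

With a single owner in the caller, the by-value helper sees a count of 2 for the duration of the call, and the count drops back to 1 once the parameter is destroyed.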

some c++11 standard library notes

July 9, 2016 C/C++ development and debugging.

Some notes on Chapter 31, 32 (standard library, STL) of Stroustrup’s “The C++ Programming Language, 4th edition”.

Emplace

I’d never heard the word emplace before, but it turns out that it isn’t a word made up for c++, but is also a dictionary word, meaning to “put into place or position”.

c++11 defines some emplace functions. Here’s an example for vector

#include <vector>
#include <iostream>

int main()
{
   using pair = std::pair<int, int> ;
   using vector = std::vector< pair > ;

   vector v ;

   pair p{ 1, 2 } ;
   v.push_back( p ) ;
   v.push_back( {2, 3} ) ;
   v.emplace_back( 3, 4 ) ;

   for ( auto e : v )
   {
      std::cout << e.first << ", " << e.second << '\n' ;
   }

   return 0 ;
}

The emplace_back is like the push_back function, but does not require that a constructed object be created first, either explicitly as in the object p above, or implicitly as done with the {2, 3} pair initializer list.
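To make that difference visible, here is a sketch using a made up instrumented type (tracked is hypothetical) that counts how its objects come into existence. The reserve call is there so that vector reallocation doesn’t add copies of its own.

```cpp
#include <vector>

// Hypothetical instrumented type: counts how its objects are created.
struct tracked
{
   static int constructed ;   // built from two ints
   static int copied ;        // copy constructed
   static int moved ;         // move constructed

   int a, b ;

   tracked( int x, int y ) : a{x}, b{y} { constructed++ ; }
   tracked( const tracked & t ) : a{t.a}, b{t.b} { copied++ ; }
   tracked( tracked && t ) noexcept : a{t.a}, b{t.b} { moved++ ; }
} ;

int tracked::constructed{} ;
int tracked::copied{} ;
int tracked::moved{} ;

void fill( std::vector<tracked> & v )
{
   v.reserve( 3 ) ;           // avoid reallocation copies muddying the counts

   tracked p{ 1, 2 } ;        // one construction
   v.push_back( p ) ;         // one copy
   v.push_back( {2, 3} ) ;    // one construction of a temporary, one move
   v.emplace_back( 3, 4 ) ;   // one construction, directly in the vector
}
```

After fill() runs on an empty vector there have been three constructions from ints, one copy and one move; the emplace_back element accounts for neither the copy nor the move.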

multimap

I’d written some perl code the other day when I wanted a hash that had multiple entries per key. Since my hashed elements were simple, I just strung them together as comma separated entries (I could have also used a hash of array references). It looks like c++11 builds exactly the construct that I wanted into STL, and has both a multimap and unordered_multimap. Here’s an example of the latter

#include <unordered_map>
#include <string>
#include <iostream>

int main()
{
   std::unordered_multimap< int, std::string > m ;

   m.emplace( 3, "hi" ) ;
   m.emplace( 3, "bye" ) ;
   m.emplace( 4, "wow" ) ;

   for ( auto & v : m )
   {
      std::cout << v.first << ": " << v.second << '\n' ;
   }
  
   for ( auto f{ m.find(3) } ; f != m.end() ; ++f )
   {
      std::cout << "find: " << f->first << ": " << f->second << '\n' ;
   }
   
   return 0 ;
} 

Running this gives me

$ ./a.out 
4: wow
3: hi
3: bye
find: 3: hi
find: 3: bye

Observe how nice auto is here. I don’t have to care what the typename for the unordered_multimap find result is. According to gdb that type is:

(gdb) whatis f
type = std::__1::__hash_map_iterator<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::__hash_value_type<int, std::__1::basic_string<char> >, void*>*> >

Yikes!
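One caveat about the find loop above: iterating from find(3) to m.end() visits everything that follows the first key 3 entry in the container’s iteration order, so it only printed key 3 entries here because the key 4 entry happened to come first. A sketch of a bounded lookup using equal_range follows; the helper name values_for and the alias multi are made up for illustration.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

using multi = std::unordered_multimap< int, std::string > ;

// Hypothetical helper: collect every value stored under one key.
std::vector<std::string> values_for( const multi & m, int key )
{
   std::vector<std::string> out ;

   // equal_range returns a [first, second) pair of iterators bounding
   // exactly the entries whose key compares equal to 'key'.
   auto r = m.equal_range( key ) ;

   for ( auto f = r.first ; f != r.second ; ++f )
   {
      out.push_back( f->second ) ;
   }

   return out ;
}
```

Once again auto saves having to spell out the iterator pair type.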

STL

The STL chapter outlines lots of different algorithms. One new powerful feature in c++11 is that the Lambdas can be used instead of predicate function objects, which is so much cleaner. I used that capability in a scientific computing programming assignment earlier this year with partial_sort.
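As an illustration of a lambda used in place of a predicate function object, here is a minimal partial_sort sketch. This is not the code from that assignment; the helper name largest and its interface are made up.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: return the n largest values, biggest first.
std::vector<double> largest( std::vector<double> v, std::size_t n )
{
   n = std::min( n, v.size() ) ;

   // The lambda replaces a hand written comparison function object.
   std::partial_sort( v.begin(),
                      v.begin() + n,
                      v.end(),
                      []( double a, double b ){ return a > b ; } ) ;

   v.resize( n ) ;

   return v ;
}
```

Only the first n positions end up sorted, which is the point of partial_sort when just the top few values are needed.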

The find_if_not algorithm caught my eye, because I just manually coded exactly that sort of loop translating intel assembly that used ‘REPE SCASB’ instructions, and that code was precisely of this find_if_not form. The c++ equivalent of the assembly was roughly of the following form:

int scan3( const std::string & s, char v )
{
   auto p = s.begin() ;
   for ( ; p != s.end() ; p++ )
   {
      if ( *p != v )
      {
         break ; 
      }
   }

   if ( p == s.end() )
   {
      return 0 ;
   }
   else
   {
      std::cout << "diff: " << p - s.begin() << '\n' ;

      return ( v > *p ) ? 1 : -1 ;
   }
}

Range for can also be used for this loop, but it is only slightly clearer:

int scan2( const std::string & s, char v )
{
   auto p = s.begin() ;
   for ( auto c : s )
   {
      if ( c != v )
      {
         break ;
      }

      p++ ;
   }

   if ( p == s.end() )
   { 
      return 0 ;
   }
   else
   { 
      std::cout << "diff: " << p - s.begin() << '\n' ;

      return ( v > *p ) ? 1 : -1 ;
   }
}

An STL version of this loop that uses a lambda predicate is

int scan( const std::string & s, char v )
{
   auto i = find_if_not( s.begin(),
                         s.end(),
                         [ v ]( char c ){ return c == v ; }
                       ) ;

   if ( i == s.end() )
   { 
      return 0 ;
   }
   else
   { 
      std::cout << "diff: " << i - s.begin() << '\n' ;

      return ( v > *i ) ? 1 : -1 ;
   }
}

I don’t really think that this is any more clear than explicit for loop versions. All give the same results when tried:

int main()
{
   std::vector< std::function< int( const std::string &, char ) > > v { scan, scan2, scan3 } ;

   for ( auto f : v )
   { 
      int r0 = f( "nnnnn", 'n' ) ;
      int rp = f( "nnnnnmmm", 'n' ) ;
      int rn = f( "nnnnnpnn", 'n' ) ;

      std::cout << r0 << '\n' ;
      std::cout << rp << '\n' ;
      std::cout << rn << '\n' ;
   }

   return 0 ;
}

The compiler does almost the same for all three implementations. With the cout’s removed, and compiling with optimization, the respective code sizes in bytes (end address minus start address) are:

(gdb) p 0xee3-0xe70
$1 = 115
(gdb) p 0xf4c-0xef0
$2 = 92
(gdb) p 0xfc3-0xf50
$3 = 115

The listings for the STL and the C style for loop are almost the same. The Apple xcode 7 compiler seems to produce slightly more compact code for the range-for version of this function for reasons that are not obvious to me.

c++11 virtual function language changes.

July 1, 2016 C/C++ development and debugging.

Chapter 20 of Stroustrup’s book covers a few more new (to me) c++11 features:

  1. override
  2. final
  3. use of using statements for access control.
  4. pointer to member (for data and member functions)

override

The override keyword is really just to make it clear when you are providing a virtual function override. Because the use of virtual at an override point is redundant, people have used it to show explicitly that the intent is to override a base class function. However, if you have erroneously made the interface different in the second specification, the use of virtual there means that you are defining a new virtual function. Here’s a made up example, where the integer type of a virtual function was changed “accidentally” when “overriding” a base class virtual function:

#include <stdio.h>

struct x
{
   virtual void foo( int v ) ;
} ;

struct y : public x
{
   virtual void foo( long v ) ;
} ;

void x::foo( int v ) { printf( "x::foo:%d\n", v ) ; }
void y::foo( long v ) { printf( "y::foo:%ld\n", v ) ; }

Now in c++11 you can be explicit that your intention is to override a base class virtual. Replace the use of the redundant virtual with the override keyword, and the compiler can now tell you if you get things mixed up:

struct x
{
   virtual void foo( int v ) ;
} ;

struct y : public x
{
   void foo( long v ) override ;
} ;

void x::foo( int v )
{
   printf( "x::foo:%d\n", v ) ;
}

void y::foo( long v )
{
   printf( "y::foo:%ld\n", v ) ;
}

This gives a nice compiler message informing you about the error:

$ c++ -std=c++11 -O2 -MMD   -c -o d.o d.cc
d.cc:10:23: error: non-virtual member function marked 'override' hides virtual member function
   void foo( long v ) override ;
                      ^
d.cc:5:17: note: hidden overloaded virtual function 'x::foo' declared here: type mismatch at 1st parameter ('int' vs 'long')
   virtual void foo( int v ) ;
                ^
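The run-time consequence of the accidental version is easy to miss, so here is a sketch that makes it observable. The struct names base, accidental and intended are made up, and the functions return strings instead of printing so that the result of each call is easy to compare.

```cpp
#include <string>

struct base
{
   virtual ~base() = default ;
   virtual std::string foo( int ) const { return "base::foo(int)" ; }
} ;

// Parameter type differs: this declares a new virtual function that hides,
// rather than overrides, base::foo.
struct accidental : public base
{
   virtual std::string foo( long ) const { return "accidental::foo(long)" ; }
} ;

// Signature matches, and override asks the compiler to verify that.
struct intended : public base
{
   std::string foo( int ) const override { return "intended::foo(int)" ; }
} ;

std::string call_through_base( const base & b )
{
   return b.foo( 1 ) ;
}
```

Calling call_through_base with an accidental object silently runs the base class function, whereas an intended object dispatches to the derived one.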

final

This is a second virtual function modifier designed to cut the performance cost of using virtual functions in some situations. My experimentation with this feature shows the compilers still have more work to do optimizing away the vtable calls. I introduced a square-matrix class that had a single virtual range checking function:

   void throwRangeError( const indexType i, const indexType j ) const
   { 
      throw rangeError{ i, j, size } ;
   }

   /**
      Introduce a virtual function that allows user selection of optional range error checking.
    */
   virtual void handleRangeError( const indexType i, const indexType j ) const
   { 
      throwRangeError( i, j ) ;
   }

   bool areIndexesOutOfRange( const indexType i, const indexType j ) const
   { 
      if ( (0 == i) or (0 == j) or (i > size) or (j > size) )
      { 
         return true ;
      }

      return false ;
   }

My intent was that a derived class could provide a no-op specialization of handleRangeError:

/**
   Explicitly unchecked matrix element access
 */
class uncheckedMatrix : public matrix
{
public:
   // inherit constructors:
   using matrix::matrix ;

   void handleRangeError( const indexType i, const indexType j ) const final
   {
   }
} ;

This derived class no longer has any virtual functions that can be overridden further. Also note that it uses ‘using’ statements to explicitly inherit the base class constructors, which is not a default action (and recommended by Stroustrup only for classes like this that do not add any data members).

The compiler didn’t do too well with this specialization, as calls to the element access operator still took a vtable hit. Here’s some code that when passed a 3×3 matrix object includes out of range accesses:

void outofbounds( const matrix & m, const char * s )
{
   printf( "%s: %g\n", s, m(4,2) ) ;
}

void outofbounds( const checkedMatrix & m, const char * s )
{
   printf( "%s: %g\n", s, m(4,2) ) ;
}

void outofbounds( const uncheckedMatrix & m, const char * s ) noexcept
{
   printf( "%s: %g\n", s, m(4,2) ) ;
}

Here’s the code for the first (base class) matrix class that has virtual functions, but no final overrides:

0000000000000000 <outofbounds(matrix const&, char const*)>:
   0: push   %rbp
   1: mov    %rsp,%rbp
   4: push   %r14
   6: push   %rbx
   7: mov    %rsi,%r14
   a: mov    %rdi,%rbx
   d: mov    0x20(%rbx),%rax
  11: cmp    $0x3,%rax
  15: ja     2d <outofbounds(matrix const&, char const*)+0x2d>
  17: mov    (%rbx),%rax
  1a: mov    $0x4,%esi
  1f: mov    $0x2,%edx
  24: mov    %rbx,%rdi
  27: callq  *(%rax)
  29: mov    0x20(%rbx),%rax
  2d: lea    (%rax,%rax,2),%rax
  31: mov    0x8(%rbx),%rcx
  35: movsd  0x8(%rcx,%rax,8),%xmm0
  3b: lea    0x149(%rip),%rdi        # 18b <__clang_call_terminate+0xb>
         3e: DISP32  .cstring-0x18b
  42: mov    $0x1,%al
  44: mov    %r14,%rsi
  47: pop    %rbx
  48: pop    %r14
  4a: pop    %rbp
  4b: jmpq   50 <outofbounds(checkedMatrix const&, char const*)>
         4c: BRANCH32   printf

The callq instruction is the vtable call. Because this function is called with a reference to the base class, which could refer to a derived class object, such a call is required. Now look at the code for the uncheckedMatrix class where the handleRangeError() had a no-op final override:

00000000000000a0 <outofbounds(uncheckedMatrix const&, char const*)>:
  a0: push   %rbp
  a1: mov    %rsp,%rbp
  a4: push   %r14
  a6: push   %rbx
  a7: mov    %rsi,%r14
  aa: mov    %rdi,%rbx
  ad: mov    0x20(%rbx),%rax
  b1: cmp    $0x3,%rax
  b5: ja     d0 <outofbounds(uncheckedMatrix const&, char const*)+0x30>
  b7: mov    (%rbx),%rax
  ba: mov    (%rax),%rax
  bd: mov    $0x4,%esi
  c2: mov    $0x2,%edx
  c7: mov    %rbx,%rdi
  ca: callq  *%rax
  cc: mov    0x20(%rbx),%rax
  d0: lea    (%rax,%rax,2),%rax
...

We still have an unnecessary vtable call. This must be a call to handleRangeError(), but that has a final override, and could conceivably be inlined. Some experimentation shows that it is possible to get the desired behaviour (Apple LLVM version 7.3.0 (clang-703.0.31)), but only when the final call is a leaf function. Explicit override of the base class element access operator to omit the check-and-throw logic

/**
   Explicitly unchecked matrix element access
 */
class uncheckedMatrix2 : public matrix
{
public:
   // inherit constructors:
   using matrix::matrix ;

   T operator()( const indexType i, const indexType j ) const
   { 
      return access( i, j ) ;
   }
} ;

has much less horrible code

0000000000000100 <outofbounds(uncheckedMatrix2 const&, char const*)>:
 100: push   %rbp
 101: mov    %rsp,%rbp
 104: mov    0x8(%rdi),%rax
 108: mov    0x20(%rdi),%rcx
 10c: lea    (%rcx,%rcx,2),%rcx
 110: movsd  0x8(%rax,%rcx,8),%xmm0
 116: lea    0x6e(%rip),%rdi        # 18b <__clang_call_terminate+0xb>
         119: DISP32 .cstring-0x18b
 11d: mov    $0x1,%al
 11f: pop    %rbp
 120: jmpq   125 <outofbounds(uncheckedMatrix2 const&, char const*)+0x25>
         121: BRANCH32  printf
 125: data16 nopw %cs:0x0(%rax,%rax,1)

Now we don’t have any of the vtable related epilog and prologue code, nor the indirection required to make such a call. This code isn’t pretty, but isn’t actually that much worse than raw pointer or plain vector access:

void outofbounds( const std::vector<double> m, const char * s ) noexcept
{
   printf( "%s: %g\n", s, m[ 4*3+2-1 ] ) ;
}

void outofbounds( const double * m, const char * s ) noexcept
{
   printf( "%s: %g\n", s, m[ 4*3+2-1 ] ) ;
}

The first generates code like the following:

0000000000000130 <outofbounds(std::__1::vector<double, std::__1::allocator<double> >, char const*)>:
 130: push   %rbp
 131: mov    %rsp,%rbp 
 134: mov    (%rdi),%rax  
 137: movsd  0x68(%rax),%xmm0
 13c: lea    0x48(%rip),%rdi        # 18b <__clang_call_terminate+0xb>
         13f: DISP32 .cstring-0x18b
 143: mov    $0x1,%al
 145: pop    %rbp
 146: jmpq   14b <outofbounds(std::__1::vector<double, std::__1::allocator<double> >, char const*)+0x1b>
         147: BRANCH32  printf
 14b: nopl   0x0(%rax,%rax,1)

Using vector instead of raw array access imposes only a single instruction dereference penalty:

0000000000000150 <outofbounds(double const*, char const*)>:
 150: push   %rbp
 151: mov    %rsp,%rbp
 154: movsd  0x68(%rdi),%xmm0
 159: lea    0x2b(%rip),%rdi        # 18b <__clang_call_terminate+0xb>
         15c: DISP32 .cstring-0x18b
 160: mov    $0x1,%al
 162: pop    %rbp
 163: jmpq   168 <GCC_except_table2>
         164: BRANCH32  printf

With the final override in a leaf function, or a similar explicit hiding of the base class function, we add one additional instruction overhead (one additional load).
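For reference, here is a much reduced, self-contained sketch of the same pattern. The class names vec and uncheckedVec are made up and this is not the matrix code above. Unlike the matrix code, the accessor returns 0.0 when the handler returns, so that the sketch never reads out of bounds. Marking the derived class itself final is one more hint that could be tried; whether a given compiler then removes the indirect call is implementation dependent.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, much reduced stand-in for the matrix classes above.
class vec
{
   std::vector<double> e ;

public:
   explicit vec( std::size_t n ) : e( n ) {}
   virtual ~vec() = default ;

   virtual void handleRangeError( std::size_t i ) const
   {
      throw std::out_of_range( "vec index " + std::to_string( i ) ) ;
   }

   double get( std::size_t i ) const
   {
      if ( i >= e.size() )
      {
         handleRangeError( i ) ;   // virtual dispatch, unless the compiler can prove the type
         return 0.0 ;              // only reached if the handler returns
      }
      return e[i] ;
   }
} ;

// A final class cannot be derived from, so no further overrides can exist.
class uncheckedVec final : public vec
{
public:
   using vec::vec ;

   void handleRangeError( std::size_t ) const override {}
} ;
```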

pointer to member

This is a somewhat obscure feature. I don’t think that it is new to c++11, but I’ve never seen it used in 20 years. The only thing interesting about it is that the pointer to member objects apparently are entirely offset based, so could be used in shared memory interprocess configurations (where virtual functions cannot!)
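For reference, a minimal sketch of the syntax, with made up struct and function names. A pointer to data member selects a field, a pointer to member function selects an operation, and either one is applied to an object with the .* operator.

```cpp
struct S
{
   int a ;
   int b ;

   int sum() const { return a + b ; }
   int diff() const { return a - b ; }
} ;

// Pointer to data member: selects which field to read.
int read_field( const S & s, int S::* field )
{
   return s.*field ;
}

// Pointer to member function: selects which operation to run.
int apply( const S & s, int (S::* op)() const )
{
   return (s.*op)() ;
}
```

Usage looks like read_field( s, &S::b ) or apply( s, &S::sum ), where the second argument names a member of the type, not of any particular object.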