shared memory

c++11 virtual function language changes.

July 1, 2016 C/C++ development and debugging. , , , , , , ,

Chapter 20 of Stroustrup’s book covers a few more new (to me) c++11 features:

  1. override
  2. final
  3. use of using statements for access control.
  4. pointer to member (for data and member functions)

override

The override keyword is really just to make it clear when you are providing a virtual function override.  Because the use of virtual at an override point is redundant, people have used that to explicitly show that the intent is to show the function overrides a base class function. However, if the have the interface erroneously different in the second specification, the use of virtual there means that you are defining a new virtual function.  Here’s a made up example, where the integer type of a virtual function was changed “accidentally” when “overriding” a base class virtual function:

#include <stdio.h>

struct x
{
   virtual void foo( int v ) ;
} ;

struct y : public x
{
   virtual void foo( long v ) ;
} ;

void x::foo( int v ) { printf( "x::foo:%d\n", v ) ; }
void y::foo( long v ) { printf( "y::foo:%ld\n", v ) ; }

Now in c++11 you can be explicit that you intention is to override a base class virtual. Replace the use of the redundant virtual with the override keyword, and the compiler can now tell you if you get things mixed up:

struct x
{
   virtual void foo( int v ) ;
} ;

struct y : public x
{
   void foo( long v ) override ;
} ;

void x::foo( int v )
{
   printf( "x::foo:%d\n", v ) ;
}

void y::foo( long v )
{
   printf( "y::foo:%ld\n", v ) ;
}

This gives a nice compiler message informing you about the error:

$ c++ -std=c++11 -O2 -MMD   -c -o d.o d.cc
d.cc:10:23: error: non-virtual member function marked 'override' hides virtual member function
   void foo( long v ) override ;
                      ^
d.cc:5:17: note: hidden overloaded virtual function 'x::foo' declared here: type mismatch at 1st parameter ('int' vs 'long')
   virtual void foo( int v ) ;
                ^

final

This is a second virtual function modifier designed to cut the performance cost of using virtual functions in some situations. My experimentation with this feature shows the compilers still have more work to do optimizing away the vtable calls. I introduced a square-matrix class that had a single range virtual range checking function:

   void throwRangeError( const indexType i, const indexType j ) const
   { 
      throw rangeError{ i, j, size } ;
   }

   /**
      Introduce a virtual function that allows user selection of optional range error checking.
    */
   virtual void handleRangeError( const indexType i, const indexType j ) const
   { 
      throwRangeError( i, j ) ;
   }

   bool areIndexesOutOfRange( const indexType i, const indexType j ) const
   { 
      if ( (0 == i) or (0 == j) or (i > size) or (j > size) )
      { 
         return true ;
      }

      return false ;
   }

My intent was that a derived class could provide a no-op specialization of handleRangeError:

/**
   Explicitly unchecked matrix element access
 */
class uncheckedMatrix : public matrix
{
public:
   // inherit constructors:
   using matrix::matrix ;

   void handleRangeError( const indexType i, const indexType j ) const final
   {
   }
} ;

This derived class no longer has any virtual functions. Also note that it uses ‘using’ statements to explicitly inherit the base class constructors, which is not a default action (and recommended by Stroustrup only for classes like this that do not add any data members).

The compiler didn’t do too well with this specialization, as calls to the element access operator still took a vtable hit. Here’s some code that when passed a 3×3 matrix object includes out of range accesses:

void outofbounds( const matrix & m, const char * s )
{
   printf( "%s: %g\n", s, m(4,2) ) ;
}

void outofbounds( const checkedMatrix & m, const char * s )
{
   printf( "%s: %g\n", s, m(4,2) ) ;
}

void outofbounds( const uncheckedMatrix & m, const char * s ) noexcept
{
   printf( "%s: %g\n", s, m(4,2) ) ;
}

Here’s the code for the first (base class) matrix class that has virtual functions, but no final overrides:

0000000000000000 <outofbounds(matrix const&, char const*)>:
   0: push   %rbp
   1: mov    %rsp,%rbp
   4: push   %r14
   6: push   %rbx
   7: mov    %rsi,%r14
   a: mov    %rdi,%rbx
   d: mov    0x20(%rbx),%rax
  11: cmp    $0x3,%rax
  15: ja     2d <outofbounds(matrix const&, char const*)+0x2d>
  17: mov    (%rbx),%rax
  1a: mov    $0x4,%esi
  1f: mov    $0x2,%edx
  24: mov    %rbx,%rdi
  27: callq  *(%rax)
  29: mov    0x20(%rbx),%rax
  2d: lea    (%rax,%rax,2),%rax
  31: mov    0x8(%rbx),%rcx
  35: movsd  0x8(%rcx,%rax,8),%xmm0
  3b: lea    0x149(%rip),%rdi        # 18b <__clang_call_terminate+0xb>
         3e: DISP32  .cstring-0x18b
  42: mov    $0x1,%al
  44: mov    %r14,%rsi
  47: pop    %rbx
  48: pop    %r14
  4a: pop    %rbp
  4b: jmpq   50 <outofbounds(checkedMatrix const&, char const*)>
         4c: BRANCH32   printf

The callq instruction is the vtable call. Because this function called through the base class object, and could represent a derived class object, such a call is required. Now look at the code for the uncheckedMatrix class where the handleRangeError() had a no-op final override:

00000000000000a0 <outofbounds(uncheckedMatrix const&, char const*)>:
  a0: push   %rbp
  a1: mov    %rsp,%rbp
  a4: push   %r14
  a6: push   %rbx
  a7: mov    %rsi,%r14
  aa: mov    %rdi,%rbx
  ad: mov    0x20(%rbx),%rax
  b1: cmp    $0x3,%rax
  b5: ja     d0 <outofbounds(uncheckedMatrix const&, char const*)+0x30>
  b7: mov    (%rbx),%rax
  ba: mov    (%rax),%rax
  bd: mov    $0x4,%esi
  c2: mov    $0x2,%edx
  c7: mov    %rbx,%rdi
  ca: callq  *%rax
  cc: mov    0x20(%rbx),%rax
  d0: lea    (%rax,%rax,2),%rax
...

We still have an unnecessary vtable call. This must be a call to handleRangeError(), but that has a final override, and could conceivably be inlined. Some experimentation shows that it is possible to get the desired behaviour (Apple LLVM version 7.3.0 (clang-703.0.31)), but only when the final call is a leaf function. Explicit override of the base class element access operator to omit the check-and-throw logic

/**
   Explicitly unchecked matrix element access
 */
class uncheckedMatrix2 : public matrix
{
public:
   // inherit constructors:
   using matrix::matrix ;

   T operator()( const indexType i, const indexType j ) const
   { 
      return access( i, j ) ;
   }
} ;

has much less horrible code

0000000000000100 <outofbounds(uncheckedMatrix2 const&, char const*)>:
 100: push   %rbp
 101: mov    %rsp,%rbp
 104: mov    0x8(%rdi),%rax
 108: mov    0x20(%rdi),%rcx
 10c: lea    (%rcx,%rcx,2),%rcx
 110: movsd  0x8(%rax,%rcx,8),%xmm0
 116: lea    0x6e(%rip),%rdi        # 18b <__clang_call_terminate+0xb>
         119: DISP32 .cstring-0x18b
 11d: mov    $0x1,%al
 11f: pop    %rbp
 120: jmpq   125 <outofbounds(uncheckedMatrix2 const&, char const*)+0x25>
         121: BRANCH32  printf
 125: data16 nopw %cs:0x0(%rax,%rax,1)

Now we don’t have any of the vtable related epilog and prologue code, nor the indirection required to make such a call. This code isn’t pretty, but isn’t actually that much worse than raw pointer or plain vector access:

void outofbounds( const std::vector<double> m, const char * s ) noexcept
{
   printf( "%s: %g\n", s, m[ 4*3+2-1 ] ) ;
}

void outofbounds( const double * m, const char * s ) noexcept
{
   printf( "%s: %g\n", s, m[ 4*3+2-1 ] ) ;
}

The first generates code like the following:

0000000000000130 <outofbounds(std::__1::vector<double, std::__1::allocator<double> >, char const*)>:
 130: push   %rbp
 131: mov    %rsp,%rbp 
 134: mov    (%rdi),%rax  
 137: movsd  0x68(%rax),%xmm0
 13c: lea    0x48(%rip),%rdi        # 18b <__clang_call_terminate+0xb>
         13f: DISP32 .cstring-0x18b
 143: mov    $0x1,%al
 145: pop    %rbp
 146: jmpq   14b <outofbounds(std::__1::vector<double, std::__1::allocator<double> >, char const*)+0x1b>
         147: BRANCH32  printf
 14b: nopl   0x0(%rax,%rax,1)

Using vector instead of raw array access imposes only a single instruction dereference penalty:

0000000000000150 <outofbounds(double const*, char const*)>:
 150: push   %rbp
 151: mov    %rsp,%rbp
 154: movsd  0x68(%rdi),%xmm0
 159: lea    0x2b(%rip),%rdi        # 18b <__clang_call_terminate+0xb>
         15c: DISP32 .cstring-0x18b
 160: mov    $0x1,%al
 162: pop    %rbp
 163: jmpq   168 <GCC_except_table2>
         164: BRANCH32  printf

With the final override in a leaf function, or a similar explicit hiding of the base class function, we add one additional instruction overhead (one additional load).

pointer to member

This is a somewhat obscure feature. I don’t think that it is new to c++11, but I’ve never seen it used in 20 years. The only thing interesting about it is that the pointer to member objects apparently are entirely offset based, so could be used in shared memory interprocess configurations (where virtual functions cannot!)

Differences between Locks, Mutexes, and Semaphores

October 21, 2014 C/C++ development and debugging. , , , , , , , ,

 

James writes to me:

I’m having trouble finding concrete definitions for Locks, Mutexes, and Semaphores

What I do know is:

  • Semaphores keep track of how many processes have gained access (“P”) to a resource and block (suspend and send to priority queue) processes once it’s maxed out. They have no concept of ownership and anyone can call unlock (“V”)
  • Binary semaphores are the same as above except their max is 1
  • Mutexes are the same as a binary semaphore except they have a concept of ownership
  • Spinlocks are the same as mutexes except they do not perform blocking and instead are polled until unlocked

What I don’t know:

Where does the term “lock” fit in here?

  • …Is it a broad category including all of these?
  • …Or is it the equivalent of a mutex but to refer to threads within the same process, as opposed to processes within the same system?
  • …Or is it something else?

What the hell is a lock? I’ve seen the above things referred to as one on some websites but then I’ve seen it listed as a separate thing.

These are all very good definitions.  Before getting to the nebulous term lock, I’d refine the definitions of mutex and spinlocks slightly.  I view these both as constructs that allow exclusive concurrent access by multiple threads or processes to shared memory.

A mutex has a mechanism to ensure that only one “holder” of the mutex resource exists at any point of time, but to be useful, must also include a mechanism to order access of memory protected by that mutex with the memory of the mutex resource itself.  An implementation of a mutex (or spinlock) will have an acquire operation that does

  • an exclusive atomic operation
  • “acquire” (or stronger) memory barrier
  • if the atomic operation fails, there may be code to queue up on an operating system resource (such as a semaphore, or futex)

with the release operation doing:

  • if there are waiters queued up, there may be code to post one or more of these waiters before continuing
  • a “release” memory barrier instruction
  • a store (or atomic) instruction to allow subsequent acquisition.

The memory barriers here are key.   In some cases, appropriate memory barriers may be built into the atomic operations.  For example the intel “LOCK XCHG” on a memory address is atomic and has barrier semantics suitable for an acquire operation.  However, on powerpc, in addition to an atomic operation used to acquire, an instruction like isync will be required to ensure that subsequent memory operations aren’t initiated before the acquire atomic operation has completed.  Similarly, on powerpc, you will need an instruction like “lwsync” on mutex release to ensure that memory operations initiated while the mutex was “held” complete before the store instruction that releases the mutex.

If you did just the following thinking that you have implemented your own mutex:

if ( pSharedMem->mutex.atomicword->atomic_bit_or(1) == successful )
{
   int x = pSharedMem->someInteger ;

   pSharedMem->someValue = v ;

   pSharedMem->mutex.atomicword = 0 ;
}

you would be very disappointed should the underlying hardware execute this as one of

pSharedMem->someValue = v ;

if ( pSharedMem->mutex.atomicword->atomic_bit_or(1) == successful )
{
   int x = pSharedMem->someInteger ;

   pSharedMem->mutex.atomicword = 0 ;
}

or

int x = pSharedMem->someInteger ;

if ( pSharedMem->mutex.atomicword->atomic_bit_or(1) == successful )
{

   pSharedMem->someValue = v ;

   pSharedMem->mutex.atomicword = 0 ;
}

Both of these can occur on platforms that allow out of order execution of memory operations (powerpc, spark, ia64, …) if appropriate memory barrier instructions are not used as part of the mutex implementation.

You would be similarly disappointed with a mutex release operation if it allowed one of the following permutations:

if ( pSharedMem->mutex.atomicword->atomic_bit_or(1) == successful )
{
   int x ;

   pSharedMem->someValue = v ;

   pSharedMem->mutex.atomicword = 0 ;

   x = pSharedMem->someInteger ;
}

or

if ( pSharedMem->mutex.atomicword->atomic_bit_or(1) == successful )
{
   int x = pSharedMem->someInteger ;

   pSharedMem->mutex.atomicword = 0 ;

   pSharedMem->someValue = v ;

}

Both of these permutations can occur on out of order platforms if no appropriate memory barrier instructions are used. This surprises many people that attempt to use atomic operations to avoid contention bottlenecks imposed by a mutex that has significant traffic. One also has to be careful to make sure that the compiler does not move the instructions out of the “mutual exclusion region”. Generally atomic operations (or the plain store above, which could also be implemented with an atomic bitand if required) can be implemented in a way that signals to the compiler that it should not move loads and stores around the atomic.

Now, how about this term “lock”.  The wikipedia definition of lock is fairly consistent with most uses of mutex.  However, it has been my experience that this is defined in whatever way suits the writer.  I have not done an extensive survey, but may also be biased since db2 internals uses “lock” to refer to a serialization mechanism that has a rich set of acquisition modes, as well as concepts of fairness that we don’t have in any of our mutex implementations.

It is common to see lock used in combination.  Examples are spin-locks and reader-writer locks.  I prefer reader writer mutex.  Adding to the confusion, the internal db2 implementation of a reader-writer mutex is referred to as a shared-latch (and we call an exclusive mutex a plain “latch”).  Perhaps this was to distinguish our latch from the already used “lock” term, but before mutex became popular for the same.

My nomenclature preferences are:

  • use mutex for the common sorts of serialization that are required in shared memory multi-thread or multi-procees systems (when talking to db2 developers about generic serialization mechanisms)
  • use spinlock when referring to a mutex that does not use any sort of queuing mechanism for resolving conflict.
  • use semaphore for operating system counting mechanisms (like the unix semop).
  • use lock or latch when talking to db2 developers where these have well defined meaning.
  • use shared latch when talking to db2 developers when I mean reader-writer mutex.

I’d be curious what names other products internal mutex implementations use.