C/C++ development and debugging.

Thoughts on the “new” C++ style cast operators.

December 24, 2014  C/C++ development and debugging.

I happen to maintain the DB2 coding standards, which are mostly concerned with portability, and not style. I’ve joked that I was given that job since I had “broken the build” more than anybody else, so was most qualified to let others know how not to do so.

In our coding standards we have a prohibition against the use of exceptions. This is a historical restriction because we’ve built with compilation flags like -qnoeh (no exception handling) on some platforms to get a bit of additional performance. These days the compilers do much better at not degrading performance when exception handling is allowed and not used, but since our performance folks will sell their kids for a 1% improvement, we’ve kept using flags like this and the associated restriction. Components that must (or want to) use exception handling have to “firewall” any exceptions, not letting them get thrown to external code (and also explicitly enable exceptions for their code).

We had a note in the coding standards not to use RTTI (Run Time Type Identification), because RTTI requires exception handling support. That was a confusing and incomplete statement to include in our standards. It was interpreted by one developer as meaning that none of:

dynamic_cast<>()
reinterpret_cast<>()
static_cast<>()
const_cast<>()

were allowed. However, only the dynamic_cast is an RTTI operation, and only a dynamic_cast (to a reference type) will throw an exception when the cast doesn’t match the underlying type; a failed dynamic_cast to a pointer type just returns NULL.

I’ve now fixed up our coding standards. It now references dynamic_cast instead of RTTI.

The DB2 code is still very C’ish, compiled with a C++ compiler. I’d say the bulk of the casts in our code are old style C casts, and that most of our developers (including myself) don’t even know when to use the “new” cast operations. Here’s some thoughts on these:

  • I’ve seen a fair amount of const_cast use in our code, and I know personally how this can be very useful. An example:


volatile int x ;
void foo( int * y ) ;

foo( &x ) ;                      // compilation error
foo( const_cast<int *>(&x) ) ;   // allowed.  Just strips the volatile (or const) off the pointer type.
  • A second nice use of const_cast<>() is to enforce type checking in macros. If the macro parameter is supposed to be a pointer to type T, you can enforce that by using a const_cast<T*>. This assumes that you don’t actually care about the const-ness of the pointer, and will force a compile error if the macro is used with any other type.
  • In general I don’t think it’s a bad idea to use the new cast operators, since each of them is weaker (and thus safer) than a C style cast. You can also search for cast operations by name in a given file, which is much harder to do with a C style cast.
  • I’m not sure how much use of static_cast<> and reinterpret_cast<> we have in the code.

Declaring my own stupidity: without looking them up, I didn’t personally know how to use the “new style” static_cast and reinterpret_cast operations correctly. I use a C cast by habit unless I’m trying to strip const or volatile attributes (or enforce a type in a macro).

If DB2 coders start using these, it will likely confuse old guys like me for a while, but I figure I can learn about these.

As a step in this direction, I see some helpful looking info in the C++ reference page on explicit_cast.

My take on that reference page is roughly: if there is a place where you would use a C cast, then you can use:

  1. const_cast if it compiles. If not you can use:
  2. static_cast. If that doesn’t compile you can use:
  3. reinterpret_cast. If that doesn’t compile you can use:
  4. dynamic_cast (in code where exceptions are allowed). If that doesn’t work or is not allowed you can use:
  5. A C style cast.

resolving merge conflicts due to automated C to C++ comment changes

November 17, 2014  C/C++ development and debugging.

I was faced with hundreds of merge conflicts that had the following diff3 -m conflict structure:


<<<<<<< file.C.mine
   /* Allocate memory for ProcNamePattern and memset to blank */
   /* ProcNamePattern is used as an intermediate upper case string to capture the procedure name*/
   /* Allocate space for 128 byte schema, 128 byte procedure name */
   rc = BAR(0,
            len+SCHEMA_IDENT+1,
            MEM_DEFAULT,
            &ProcNamePattern);
||||||| file.C.orig
   /* Allocate memory for ProcNamePattern and memset to blank */
   /* ProcNamePattern is used as an intermediate upper case string to capture the procedure name*/
   /* Allocate space for 128 byte schema, 128 byte procedure name */
   rc = FOO(0,
            len+SCHEMA_IDENT+1,
            (void **) &ProcNamePattern);
=======
   // Allocate memory for ProcNamePattern and memset to blank
   // ProcNamePattern is used as an intermediate upper case string to capture the procedure name
   // Allocate space for 128 byte schema, 128 byte procedure name
   rc = FOO(0,
            len+SCHEMA_IDENT+1,
            (void **) &ProcNamePattern);
>>>>>>> file.C.new
   if (rc  )
   {


I’d run a clang based source editing tool that changed FOO to BAR, added a parameter, and removed a cast. Other maintainers of the code had run a tool, or perhaps an editor macro, that changed most (but not all) of the C style /* … */ comments into C++ single line // … comments.

Those pairs of changes were unfortunately close enough to generate a diff3 -m conflict.

I can run my clang editing tool again (and will), but need to get the source compilable first, so was faced with either tossing and regenerating my changes, or resolving the conflicts. Basically I needed to filter these comments in the same fashion, and then accept all of my changes, provided there were no other changes in the .orig -> .new stream.

Here’s a little perl filter I wrote for this task:

#!/usr/bin/perl -n

# a script to replace single line /* */ comments with C++ // comments in a restricted fashion.
#
# - doesn't touch comments of the form:
#                                         /* ... */ ... /* ...
#                                         ^^            ^^
# - doesn't touch comments with leading non-whitespace or trailing non-whitespace
#
# This is used to filter new/old/mine triplets in merges to deal with automated replacements of this sort.

chomp ;

if ( ! m,/\*.*/\*, )
{
   s,
^(\s*)   # restrict replacement to comments that start only after beginning of line and whitespace
/\*      # start of comment
\s*      # opt spaces
(.*)     # payload
\s*      # opt spaces
\*/      # end of comment
\s*$     # opt spaces and end of line
,$1// $2,x ;
}

#s,/\* *(.*)(?=\*/ *$)\*/ *$,// $1, ;

print "$_\n" ;

This consumes stdin, and spits out stdout, making the automated change that had been applied to the code. I didn’t want it to do anything with comments of any of the forms:

  • [non-whitespace] /* … */
  • /* … */ … non-whitespace
  • /* … */ … /* … */

That’s because the comment filtering that’s now in the current version of the files didn’t handle those forms, and I didn’t want to introduce more conflicts due to spacing changes.

With this filter run on all the .mine, .orig, .new versions I was able to rerun

diff3 -m file.mine file.orig file.new

and have only a few actual conflicts to deal with. Most of those were also due to whitespace changes, and in some cases comment removal.

A lot of this trouble stems from the fact that our product has no coding standards for layout (or very few, or ones that are component specific). I maintain our coding standards for correctness, but when I was given these standards as a fairly green developer I didn’t have the guts to take on the role of coding standards dictator, and impose style guidelines on developers very much my senior.

Without style guidelines, a lot of these sorts of merge conflicts could be avoided or minimized significantly if we would only make automated changes with tools that everybody could (or must) run. That would allow conflict filtering of this sort to be done automatically, without having to write ad-hoc tools to “re-play” the automated change in a subset of the merge contributors.

My use of a clang rewriter flies in the face of this ideal conflict avoidance strategy, since our build environment is not currently tooled up for all of our developers to do the same. However, in this case, being able to do robust automated maintenance will hopefully be worth the conflicts that this itself will inject.

decoding some powerpc rotate-mask instructions

November 4, 2014  C/C++ development and debugging.

I’m looking at what might be an issue in optimized code, where we are seeing some signs that expected byteswapping is not occurring when optimization is enabled. Trying to find exactly where the code that does this swapping lives has been a bit challenging, so I decided to look at a simple 2-byte swap sequence in a standalone program first. For such code I see the following in the listing file (xlC compiler):

  182| 00002C rlwinm   5405C23E   1     SRL4      gr5=gr0,8
  182| 000030 rlwinm   5400063E   1     RN4       gr0=gr0,0,0xFF
  182| 000034 rldimi   7805402C   1     RI8       gr5=gr0,8,gr5,0xFFFFFF00

The SRL4, RN4, and RI8 “mnemonics” are internal compiler codes meant to be “intuitive”, but I didn’t find them intuitive until I figured out what the instructions actually did. Here are a couple of reverse engineering notes from that task. Tackling the first rlwinm instruction, note that the raw instruction corresponds to:

0x5405C23E == rlwinm r5,r0,24,8,31

Here’s what the Power ISA says about rlwinm:

Rotate Left Word Immediate then AND with Mask 

rlwinm RA,RS,SH,MB,ME

n  <= SH
r  <= ROTL_32 ((RS)_32:63 , n)
m  <= MASK(MB+32, ME+32)
RA <= r & m

The contents of register RS are rotated 32 left SH bits. A
mask is generated having 1-bits from bit MB+32
through bit ME+32 and 0-bits elsewhere. The rotated
data are ANDed with the generated mask and the
result is placed into register RA.

To interpret this we have to look up the meanings of the MASK and ROTL operations. On my first attempt at that I got the meaning of MASK() wrong, since I was counting bits from the wrong end. I resorted to the following to figure out the instruction, using the gcc inline asm immediate operand constraint “i” to build up the instruction whose effects I wanted to examine:

#include <stdio.h>

#define rlwinm( output, input, sh, mb, me ) \
   __asm__( "rlwinm %0, %1, %2, %3, %4" \
          : "=r"(output) \
          : "r"(input), "i"(sh), "i"(mb), "i"(me) \
          : )

int main()
{
   long x = 0x1122334455667788L ;
   long y ;

   rlwinm( y, x, 24, 8, 31 ) ;

   printf("0x%016lX -> 0x%016lX\n", x, y ) ;

   return 0 ;
}

This generates an rlwinm instruction with the SH,MB,ME=24,8,31 triplet that I’d found in the listing. This code produces:

0x1122334455667788 -> 0x0000000000556677

Observing the effects of the instruction in this concrete example makes it easier to interpret the effects of the instruction. That seems to be:

   long y = ((int)x << 24) | (char)(x >> 8) ;
   y |= (y << 32) ; /* the ROTL_32 operation in the ISA appears to have this effect, but it is killed in the mask application */
   y &= 0xFFFFFF ;

Now the internal mnemonic “SRL4 …,8” has a specific meaning. It looks like it means Shift-Right-Lower-4-byte-word 8 bits. That’s intuitive once you know that the L here means Lower. I didn’t guess that, and wondered what the hell Shift-Right-Left meant.

What does RN4 mean? That instruction was:

0x5400063E == rlwinm r0,r0,0,24,31

This has no shift, but applies a mask, and that mask has 16 fewer one bits in it. This appears to be an AND with 0xFF. A little test program using “rlwinm( y, x, 0, 24, 31 )” this time confirms this, as it produces:

0x1122334455667788 -> 0x0000000000000088

What could the R and N have meant? Knowing what the instruction does, I’d now guess RotateNone(andMask).

Finally, how about the RI8 operation? This time we have

0x7805402C == rldimi r5,r0,8,32

The PowerISA says of this:

Rotate Left Doubleword Immediate then Mask Insert  

rldimi RA,RS,SH,MB 

n  <= sh_5 || sh_0:4
r  <= ROTL_64 ((RS), n)
b  <= mb_5 || mb_0:4
m  <= MASK(b, ¬n)
RA <= r&m | (RA) & ¬m

The contents of register RS are rotated 64 left SH bits. A
mask is generated having 1-bits from bit MB through bit
63-SH and 0-bits elsewhere. The rotated data are
inserted into register RA under control of the generated
mask.

Let’s also see if an example makes this easier to understand. This time a read/write modifier + is required on the output operand:

#include <stdio.h>

#define rldimi( inout, input, sh, mb ) \
   __asm__( "rldimi %0, %1, %2, %3" \
          : "+r"(inout) \
          : "r"(input), "i"(sh), "i"(mb) \
          : )

int main()
{
   long x = 0x1122334455667788L ;
   long y = 0x99aabbccddeeff12L ;
   long yo = y ;

   rldimi( y, x, 8, 32 ) ;

   printf("0x%016lX,0x%016lX -> 0x%016lX\n", x, yo, y ) ;

   return 0 ;
}

This produces:

0x1122334455667788,0x99AABBCCDDEEFF12 -> 0x99AABBCC66778812

It appears that the effect is:

y = (y & ~0xFFFFFF00L) | ((x << 8) & 0xFFFFFF00L) ;

I find it tricky to understand this from the PowerISA description, so if I encountered different values of SH,MB I’d probably run them through this little reverse engineering program. That said, at least the meaning of RI8 in -qlist output is now clear.

Differences between Locks, Mutexes, and Semaphores

October 21, 2014  C/C++ development and debugging.

James writes to me:

I’m having trouble finding concrete definitions for Locks, Mutexes, and Semaphores

What I do know is:

  • Semaphores keep track of how many processes have gained access (“P”) to a resource and block (suspend and send to priority queue) processes once it’s maxed out. They have no concept of ownership and anyone can call unlock (“V”)
  • Binary semaphores are the same as above except their max is 1
  • Mutexes are the same as a binary semaphore except they have a concept of ownership
  • Spinlocks are the same as mutexes except they do not perform blocking and instead are polled until unlocked

What I don’t know:

Where does the term “lock” fit in here?

  • …Is it a broad category including all of these?
  • …Or is it the equivalent of a mutex but to refer to threads within the same process, as opposed to processes within the same system?
  • …Or is it something else?

What the hell is a lock? I’ve seen the above things referred to as one on some websites but then I’ve seen it listed as a separate thing.

These are all very good definitions.  Before getting to the nebulous term lock, I’d refine the definitions of mutex and spinlocks slightly.  I view these both as constructs that allow exclusive concurrent access by multiple threads or processes to shared memory.

A mutex has a mechanism to ensure that only one “holder” of the mutex resource exists at any point of time, but to be useful, must also include a mechanism to order access of memory protected by that mutex with the memory of the mutex resource itself.  An implementation of a mutex (or spinlock) will have an acquire operation that does

  • an exclusive atomic operation
  • “acquire” (or stronger) memory barrier
  • if the atomic operation fails, there may be code to queue up on an operating system resource (such as a semaphore, or futex)

with the release operation doing:

  • if there are waiters queued up, there may be code to post one or more of these waiters before continuing
  • a “release” memory barrier instruction
  • a store (or atomic) instruction to allow subsequent acquisition.

The memory barriers here are key.   In some cases, appropriate memory barriers may be built into the atomic operations.  For example the Intel “LOCK XCHG” on a memory address is atomic and has barrier semantics suitable for an acquire operation.  However, on powerpc, in addition to an atomic operation used to acquire, an instruction like isync will be required to ensure that subsequent memory operations aren’t initiated before the acquire atomic operation has completed.  Similarly, on powerpc, you will need an instruction like “lwsync” on mutex release to ensure that memory operations initiated while the mutex was “held” complete before the store instruction that releases the mutex.

If you did just the following thinking that you have implemented your own mutex:

if ( pSharedMem->mutex.atomicword->atomic_bit_or(1) == successful )
{
   int x = pSharedMem->someInteger ;

   pSharedMem->someValue = v ;

   pSharedMem->mutex.atomicword = 0 ;
}

you would be very disappointed should the underlying hardware execute this as one of

pSharedMem->someValue = v ;

if ( pSharedMem->mutex.atomicword->atomic_bit_or(1) == successful )
{
   int x = pSharedMem->someInteger ;

   pSharedMem->mutex.atomicword = 0 ;
}

or

int x = pSharedMem->someInteger ;

if ( pSharedMem->mutex.atomicword->atomic_bit_or(1) == successful )
{

   pSharedMem->someValue = v ;

   pSharedMem->mutex.atomicword = 0 ;
}

Both of these can occur on platforms that allow out of order execution of memory operations (powerpc, sparc, ia64, …) if appropriate memory barrier instructions are not used as part of the mutex implementation.

You would be similarly disappointed with a mutex release operation if it allowed one of the following permutations:

if ( pSharedMem->mutex.atomicword->atomic_bit_or(1) == successful )
{
   int x ;

   pSharedMem->someValue = v ;

   pSharedMem->mutex.atomicword = 0 ;

   x = pSharedMem->someInteger ;
}

or

if ( pSharedMem->mutex.atomicword->atomic_bit_or(1) == successful )
{
   int x = pSharedMem->someInteger ;

   pSharedMem->mutex.atomicword = 0 ;

   pSharedMem->someValue = v ;

}

Both of these permutations can occur on out of order platforms if appropriate memory barrier instructions are not used. This surprises many people who attempt to use atomic operations to avoid the contention bottlenecks imposed by a mutex that has significant traffic. One also has to be careful to make sure that the compiler does not move the instructions out of the “mutual exclusion region”. Generally atomic operations (or the plain store above, which could also be implemented with an atomic bit-and if required) can be implemented in a way that signals to the compiler that it should not move loads and stores around the atomic.

Now, how about this term “lock”.  The wikipedia definition of lock is fairly consistent with most uses of mutex.  However, it has been my experience that this is defined in whatever way suits the writer.  I have not done an extensive survey, but may also be biased since db2 internals uses “lock” to refer to a serialization mechanism that has a rich set of acquisition modes, as well as concepts of fairness that we don’t have in any of our mutex implementations.

It is common to see lock used in combination.  Examples are spin-locks and reader-writer locks.  I prefer reader writer mutex.  Adding to the confusion, the internal db2 implementation of a reader-writer mutex is referred to as a shared-latch (and we call an exclusive mutex a plain “latch”).  Perhaps this was to distinguish our latch from the already used “lock” term, but before mutex became popular for the same.

My nomenclature preferences are:

  • use mutex for the common sorts of serialization that are required in shared memory multi-thread or multi-process systems (when talking to db2 developers about generic serialization mechanisms)
  • use spinlock when referring to a mutex that does not use any sort of queuing mechanism for resolving conflict.
  • use semaphore for operating system counting mechanisms (like the unix semop).
  • use lock or latch when talking to db2 developers where these have well defined meaning.
  • use shared latch when talking to db2 developers when I mean reader-writer mutex.

I’d be curious what names other products internal mutex implementations use.

A debugger walk through an out of module call on AIX

August 29, 2014  C/C++ development and debugging.

We were seeing the following powerpc instruction sequence mess up, ending up
with the CTR (counter) register containing zero. The CTR register, which can
be used for computed gotos and other stuff, is one of the only registers that I believe can be
both loaded and branched to easily on powerpc, and is one of the volatile
registers in the ABI (one that doesn’t have to be spilled to and from the stack
before another call).

0x090000000A9E8B78 : E8040000 ld r0,0(r4)
0x090000000A9E8B7C : 7C0903A6 mtctr r0
0x090000000A9E8B80 : E8440008 ld r2,8(r4)
0x090000000A9E8B84 : 4E800421 bctrl # 20,bit0

I had a recollection that this sequence was a call through a function pointer,
but thought it may also be what we get for a plain old out of module call
(calling something in a shared library from the main text segment or a call to some other shared
library function from a shared library function). Let’s see what an out of module call looks like, for a simple call like:

#include <stdio.h>

int main(int argc, char ** argv)
{
   printf( "out of module call\n" ) ;

   return 0;
}

I set an instruction breakpoint at the bl (branch and link) instruction for
printf, and then step into that

(dbx) stopi at 0x100000788
[1] stopi at 0x100000788 (main+0x28)
(dbx) c
[1] stopped in main at 0x100000788
0x100000788 (main+0x28) 48000059 bl 0x1000007e0 (printf)
(dbx) stepi
stopped in glink64.printf at 0x1000007e0
0x1000007e0 (printf) e98200d8 ld r12,0xd8(r2)
(dbx)

Observe that the debugger is letting us know that we aren’t actually in printf
yet, but are in the glue code for printf. Looking at the instruction sequence
for this glue code, we see that it matches the type of code we saw in our NULL
CTR trap sequence above:

stopped in glink64.printf at 0x1000007e4
0x1000007e4 (printf+0x4) f8410028 std r2,0x28(r1)
(dbx)

stopped in glink64.printf at 0x1000007e8
0x1000007e8 (printf+0x8) e80c0000 ld r0,0x0(r12)
(dbx)

stopped in glink64.printf at 0x1000007ec
0x1000007ec (printf+0xc) e84c0008 ld r2,0x8(r12)
(dbx)

stopped in glink64.printf at 0x1000007f0
0x1000007f0 (printf+0x10) 7c0903a6 mtctr r0

We save our TOC register (GR2, the table of contents register) to the stack,
load a new value into GR0 to copy to the CTR register, and load the TOC
register (GR2) for the module that we are calling.

Now if we look at what just got loaded into GR0 (the value copied into the CTR
register), we see that it’s the address where we find the actual code for printf:

(dbx) p $r0
0x0900000000004f80
(dbx) listi 0x0900000000004f80
0x900000000004f80 (printf) fbe1fff8 std r31,-8(r1)
0x900000000004f84 (printf+0x4) fbc1fff0 std r30,-16(r1)
0x900000000004f88 (printf+0x8) 7c0802a6 mflr r0
0x900000000004f8c (printf+0xc) fba1ffe8 std r29,-24(r1)
0x900000000004f90 (printf+0x10) ebe20dd0 ld r31,0xdd0(r2)
0x900000000004f94 (printf+0x14) f8010010 std r0,0x10(r1)
0x900000000004f98 (printf+0x18) 8002000c lwz r0,0xc(r2)
0x900000000004f9c (printf+0x1c) f821ff71 stdu r1,-144(r1)
0x900000000004fa0 (printf+0x20) 60000000 ori r0,r0,0x0
0x900000000004fa4 (printf+0x24) 2c000000 cmpi cr0,0x0,r0,0x0

The glue code, a branch table for out of module calls, gets us there, but we
have to pay a penalty of several extra instructions for this call, in addition
to the normal function call overhead.

What does this mean for the trap scenario? One implication is that this isn’t
necessarily as simple as a NULL function pointer. That instruction sequence
is probably different (but I don’t recall exactly how at the moment). Perhaps
this means that the jump table for the currently executing shared library got
corrupted? It is probably writable since the run time loader must be able to
modify it. I’d guess that it remains writable throughout execution to support
lazy runtime loader address fixups. This is likely not going to be an easy
problem to solve.