
A peek into C++ std::atomic. Extracting the meaning of std::memory_order_seq_cst.

May 19, 2025 C/C++ development and debugging.

When I worked on DB2-LUW, we had to write our own atomic implementation.  Ours worked on many different types of hardware, operating systems, and compilers, so it was quite representative, and it matches the capabilities of the now “new” C++ interfaces.  Some of that correspondence is no surprise, since we provided our implementation to the xlC compiler team, who were advocating against sequentially-consistent being the only ordering option, a choice that would have penalized PowerPC significantly.  The compiler guys were on the standards sub-committee for std::atomic, and were able to reference our cross platform implementation to advocate for the ordering variations that they wanted.

It was not terribly easy to implement the DB2 atomics.  We used compiler builtins when available, inline assembly in some other cases, and plain old assembly in the rest.  Even in cases where the hardware was the same, we were often using different compilers, and even when the compilers were the same (i.e.: the intel compiler on both Linux and Windows), there were differences that had to be accounted for.  On top of that, we also supported both 32-bit and 64-bit targets in those days (thankfully having ditched our 16-bit client support by that point in time.)  Having 32-bit targets meant that we had to use a mutex based implementation for 64-bit atomics on some platforms.  You’d have instructions like cmpxchg8b on intel, the option of using 64-bit instructions for sparcv9 32-bit targets, and csg instructions for 32-bit zLinux, but even if you could use them, additional care was required, because 8-byte alignment was needed for mutexes and 32-bit allocators didn’t always provide that.
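For illustration, here’s a minimal sketch (my own, not the DB2 code, and using the modern std::mutex for brevity) of what such a mutex based fallback looks like; a real implementation would more likely hash the variable’s address into a table of locks than embed a mutex in every counter:

#include <cstdint>
#include <mutex>

// Hypothetical sketch of a 64-bit atomic emulated with a mutex, for
// 32-bit targets without native 8-byte atomic instructions.
class mutexAtomic64
{
    std::mutex m_;
    uint64_t v_{ 0 };

public:
    uint64_t fetchAdd( uint64_t delta )
    {
        std::lock_guard<std::mutex> guard( m_ );
        uint64_t old = v_;
        v_ += delta;
        return old;
    }

    uint64_t load()
    {
        std::lock_guard<std::mutex> guard( m_ );
        return v_;
    }
};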

I’d assume that DB2 now only supports 64-bit targets, which cuts half of the platform variants out of the picture.

With atomic support now built into C++, the complexity of building a cross platform solution is reduced considerably.  One still has to be aware of memory ordering, and of all the complexities of building lock free algorithms, but at least the basic infrastructure is now trivial.

Since my history with atomic types predates std::atomic, I thought it would be interesting to look at the acq/rel/cst variations of some basic std::atomic operations and see what the generated assembly looks like. I only have ARM and X64 machines available to me at the moment, which means that I can’t see what a PowerPC, IA64, sparc, or SGI compiler generates.

My test code included a couple of load variations:

#include <atomic>

int loadv( int & i32 )
{
    return i32;
}

int load( std::atomic<int> & i32 )
{
    return i32.load();
}

int load_acq( std::atomic<int> & i32 )
{
    return i32.load( std::memory_order_acquire );
}

int load_cst( std::atomic<int> & i32 )
{
    return i32.load( std::memory_order_seq_cst );
}

Initially I had made the atomic object global, but the generated assembly on ARM for a global variable was ugly and distracting, so switching to a plain old parameter made things easier to read.

On intel, this is what we get (trimming out post ‘ret’ garbage from the objdump of each function):

0000000000000000 <loadv(int&)>:
   0:   mov    (%rdi),%eax
   2:   ret

0000000000000010 <load(std::atomic<int>&)>:
  10:   mov    (%rdi),%eax
  12:   ret

0000000000000020 <load_acq(std::atomic<int>&)>:
  20:   mov    (%rdi),%eax
  22:   ret

0000000000000030 <load_cst(std::atomic<int>&)>:
  30:   mov    (%rdi),%eax
  32:   ret

We already have .acq semantics for plain intel loads, as intel is mostly ordered already (my recollection is that a store followed by a load is the only pair that can be reordered on intel.) It is interesting that std::memory_order_seq_cst doesn’t require any MFENCE or LFENCE instructions here. This lack of additional fencing on the load side effectively gives us an implicit definition of std::memory_order_seq_cst, but let’s come back to that afterwards.
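As an aside, the classic store-buffering litmus test is where that store-load reordering shows up. Here’s a minimal sketch of it (the code and names are mine, purely for illustration):

#include <atomic>
#include <cstdio>
#include <thread>

// With the default (seq_cst) operations the outcome r1 == 0 && r2 == 0 is
// forbidden.  With plain MOV stores it would be possible, because intel can
// reorder a store with a later load to a different location; the fencing on
// the seq_cst store side is what rules that out, letting seq_cst loads stay
// plain MOVs.
std::atomic<int> x{ 0 }, y{ 0 };
int r1, r2;

int main()
{
    std::thread t1( []{ x.store( 1 ); r1 = y.load(); } );
    std::thread t2( []{ y.store( 1 ); r2 = x.load(); } );
    t1.join();
    t2.join();
    printf( "r1=%d r2=%d\n", r1, r2 ); // never both zero
    return 0;
}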

On ARM we have:

0000000000000000 <loadv(int&)>:
   0:   ldr     w0, [x0]
   4:   ret

0000000000000010 <load(std::atomic<int>&)>:
  10:   ldar    w0, [x0]
  14:   ret

0000000000000020 <load_acq(std::atomic<int>&)>:
  20:   ldar    w0, [x0]
  24:   ret

0000000000000030 <load_cst(std::atomic<int>&)>:
  30:   ldar    w0, [x0]
  34:   ret

We have a plain old load instruction for the non-atomic load, so it seems reasonable to guess that ldar means a load with acquire semantics. Before looking that up, let’s look at the assembly for some store variations, using the following code:

void storev( int & i32, int v )
{
    i32 = v;
}

void store( std::atomic<int> & i32, int v )
{
    i32.store( v );
}

void store_rel( std::atomic<int> & i32, int v )
{
    i32.store( v, std::memory_order_release );
}

void store_cst( std::atomic<int> & i32, int v )
{
    i32.store( v, std::memory_order_seq_cst );
}

On intel, we get:

0000000000000040 <storev(int&, int)>:
  40:   mov    %esi,(%rdi)
  42:   ret

0000000000000050 <store(std::atomic<int>&, int)>:
  50:   xchg   %esi,(%rdi)
  52:   ret

0000000000000060 <store_rel(std::atomic<int>&, int)>:
  60:   mov    %esi,(%rdi)
  62:   ret

0000000000000070 <store_cst(std::atomic<int>&, int)>:
  70:   xchg   %esi,(%rdi)
  72:   ret

So, a plain old MOV instruction is equivalent to a store with .rel semantics, whereas a std::memory_order_seq_cst store (the default ordering for an atomic store) uses XCHG, which carries an implicit LOCK prefix (compilers can alternatively emit a MOV followed by an MFENCE.)  How about ARM:

0000000000000040 <storev(int&, int)>:
  40:   str     w1, [x0]
  44:   ret

0000000000000050 <store(std::atomic<int>&, int)>:
  50:   stlr    w1, [x0]
  54:   ret

0000000000000060 <store_rel(std::atomic<int>&, int)>:
  60:   stlr    w1, [x0]
  64:   ret

0000000000000070 <store_cst(std::atomic<int>&, int)>:
  70:   stlr    w1, [x0]
  74:   ret

We see that ARM has dedicated load-with-acquire (ldar) and store-with-release (stlr) instructions, and that these same instructions also provide the cst semantics for loads and stores.

The ARM reference manual describes these memory ordered load/store instructions (LDAR and STLR) and their acquire/release semantics.  This matches what we’d expect from a modern instruction set, very much like the ld.acq and st.rel instructions that we had on IA64.

Now let’s look at one more atomic operation, a fetch-and-add:

int fetch_add( std::atomic<int> & i32, int v )
{
    return i32.fetch_add( v );
}

int fetch_add_rel( std::atomic<int> & i32, int v )
{
    return i32.fetch_add( v, std::memory_order_release );
}

int fetch_add_acq( std::atomic<int> & i32, int v )
{
    return i32.fetch_add( v, std::memory_order_acquire );
}

int fetch_add_cst( std::atomic<int> & i32, int v )
{
    return i32.fetch_add( v, std::memory_order_seq_cst );
}

On intel we have:

0000000000000080 <fetch_add(std::atomic<int>&, int)>:
  80:   mov    %esi,%eax
  82:   lock xadd %eax,(%rdi)
  86:   ret

0000000000000090 <fetch_add_rel(std::atomic<int>&, int)>:
  90:   mov    %esi,%eax
  92:   lock xadd %eax,(%rdi)
  96:   ret

00000000000000a0 <fetch_add_acq(std::atomic<int>&, int)>:
  a0:   mov    %esi,%eax
  a2:   lock xadd %eax,(%rdi)
  a6:   ret

00000000000000b0 <fetch_add_cst(std::atomic<int>&, int)>:
  b0:   mov    %esi,%eax
  b2:   lock xadd %eax,(%rdi)
  b6:   ret

There’s no implicit LOCK prefix for xadd, so we see it used explicitly now. On ARM, unfortunately, looking at assembly listings is no longer sufficient:

0000000000000080 <fetch_add(std::atomic<int>&, int)>:
  80:   mov     x2, x0
  84:   stp     x29, x30, [sp, #-16]!
  88:   mov     w0, w1
  8c:   mov     x29, sp
  90:   mov     x1, x2
  94:   bl      0 <__aarch64_ldadd4_acq_rel>
  98:   ldp     x29, x30, [sp], #16
  9c:   ret

00000000000000a0 <fetch_add_rel(std::atomic<int>&, int)>:
  a0:   mov     x2, x0
  a4:   stp     x29, x30, [sp, #-16]!
  a8:   mov     w0, w1
  ac:   mov     x29, sp
  b0:   mov     x1, x2
  b4:   bl      0 <__aarch64_ldadd4_rel>
  b8:   ldp     x29, x30, [sp], #16
  bc:   ret

00000000000000c0 <fetch_add_acq(std::atomic<int>&, int)>:
  c0:   mov     x2, x0
  c4:   stp     x29, x30, [sp, #-16]!
  c8:   mov     w0, w1
  cc:   mov     x29, sp
  d0:   mov     x1, x2
  d4:   bl      0 <__aarch64_ldadd4_acq>
  d8:   ldp     x29, x30, [sp], #16
  dc:   ret

00000000000000e0 <fetch_add_cst(std::atomic<int>&, int)>:
  e0:   mov     x2, x0
  e4:   stp     x29, x30, [sp, #-16]!
  e8:   mov     w0, w1
  ec:   mov     x29, sp
  f0:   mov     x1, x2
  f4:   bl      0 <__aarch64_ldadd4_acq_rel>
  f8:   ldp     x29, x30, [sp], #16
  fc:   ret

We do see that the std::atomic default fetch_add behaviour matches std::memory_order_seq_cst (in this case, an operation with both acquire and release semantics, as was the case on intel due to the LOCK prefix.) We can’t see exactly what happens under the covers for these operations, as there’s just a branch-and-link to some system provided helper functions.

Here’s what gdb says these functions do:

(gdb) disassemble __aarch64_ldadd4_acq_rel
Dump of assembler code for function __aarch64_ldadd4_acq_rel:
   0x00000000004008f0 <+0>:	bti	c
=> 0x00000000004008f4 <+4>:	adrp	x16, 0x420000 <__libc_start_main@got.plt>
   0x00000000004008f8 <+8>:	ldrb	w16, [x16, #45]
   0x00000000004008fc <+12>:	cbz	w16, 0x400908 <__aarch64_ldadd4_acq_rel+24>
   0x0000000000400900 <+16>:	ldaddal	w0, w0, [x1]
   0x0000000000400904 <+20>:	ret
   0x0000000000400908 <+24>:	mov	w16, w0
   0x000000000040090c <+28>:	ldaxr	w0, [x1]
   0x0000000000400910 <+32>:	add	w17, w0, w16
   0x0000000000400914 <+36>:	stlxr	w15, w17, [x1]
   0x0000000000400918 <+40>:	cbnz	w15, 0x40090c <__aarch64_ldadd4_acq_rel+28>
   0x000000000040091c <+44>:	ret
End of assembler dump.
(gdb) disassemble __aarch64_ldadd4_rel
Dump of assembler code for function __aarch64_ldadd4_rel:
   0x00000000004008c0 <+0>:	bti	c
   0x00000000004008c4 <+4>:	adrp	x16, 0x420000 <__libc_start_main@got.plt>
   0x00000000004008c8 <+8>:	ldrb	w16, [x16, #45]
   0x00000000004008cc <+12>:	cbz	w16, 0x4008d8 <__aarch64_ldadd4_rel+24>
   0x00000000004008d0 <+16>:	ldaddl	w0, w0, [x1]
   0x00000000004008d4 <+20>:	ret
   0x00000000004008d8 <+24>:	mov	w16, w0
   0x00000000004008dc <+28>:	ldxr	w0, [x1]
   0x00000000004008e0 <+32>:	add	w17, w0, w16
   0x00000000004008e4 <+36>:	stlxr	w15, w17, [x1]
   0x00000000004008e8 <+40>:	cbnz	w15, 0x4008dc <__aarch64_ldadd4_rel+28>
   0x00000000004008ec <+44>:	ret
End of assembler dump.
(gdb) disassemble __aarch64_ldadd4_acq
Dump of assembler code for function __aarch64_ldadd4_acq:
   0x0000000000400890 <+0>:	bti	c
   0x0000000000400894 <+4>:	adrp	x16, 0x420000 <__libc_start_main@got.plt>
   0x0000000000400898 <+8>:	ldrb	w16, [x16, #45]
   0x000000000040089c <+12>:	cbz	w16, 0x4008a8 <__aarch64_ldadd4_acq+24>
   0x00000000004008a0 <+16>:	ldadda	w0, w0, [x1]
   0x00000000004008a4 <+20>:	ret
   0x00000000004008a8 <+24>:	mov	w16, w0
   0x00000000004008ac <+28>:	ldaxr	w0, [x1]
   0x00000000004008b0 <+32>:	add	w17, w0, w16
   0x00000000004008b4 <+36>:	stxr	w15, w17, [x1]
   0x00000000004008b8 <+40>:	cbnz	w15, 0x4008ac <__aarch64_ldadd4_acq+28>
   0x00000000004008bc <+44>:	ret
End of assembler dump.

There’s some sort of runtime-determined switching here, based on a program global variable, probably initialized during pre-main setup.  If that variable is set, then we use one of the following “Atomic add on word or doubleword in memory” instructions (the cbz branches to the fallback sequence when it is zero):

  • LDADDA and LDADDAL load from memory with acquire semantics.
  • LDADDL and LDADDAL store to memory with release semantics.
  • (LDADD has neither acquire nor release semantics.)

It looks like these combined fetch-and-add instructions are optional, provided only on ARMv8.1-A or later.  The fallback looks very familiar to a PowerPC programmer: load-with-reservation/store-conditional style instructions (i.e.: like lwarx/stwcx.), but with built-in memory ordering semantics, which is nice (so we don’t need any supplemental isync/lwsync to get the desired memory ordering.)
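Expressed portably, that fallback loop does roughly what a compare-exchange based fetch-and-add would do.  Here’s a sketch (my own code, not what the __aarch64_ldadd4 helpers are literally built from); compare_exchange_weak is allowed to fail spuriously, exactly like a store-conditional, so it maps naturally onto LL/SC hardware:

#include <atomic>

int fetch_add_llsc( std::atomic<int> & i32, int v )
{
    int old = i32.load( std::memory_order_relaxed );

    // on LL/SC hardware this loop becomes a load-reserve/store-conditional
    // retry loop, much like the ldaxr/stlxr sequence above
    while ( !i32.compare_exchange_weak( old, old + v,
                                        std::memory_order_acq_rel,
                                        std::memory_order_relaxed ) )
    {
        // old was reloaded with the current value; retry
    }

    return old;
}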

Some final thoughts.

We saw that store with release semantics was a plain old intel store instruction, whereas store with cst included a full memory barrier (the XCHG, where LOCK is implied when the destination is a memory operand).  My recollection is that store/load was the only reorderable pair of memory operations on intel, which is why we were able to construct a spinlock acquire using XCHG, but could use a plain old store for the spinlock release.

We can visualize the spinlock acquire/release like a cage that is porous only from the outside:

loads/stores (dumb tourists)

ATOMIC W/ ACQUIRE (cage wall)

Loads and stores that can’t get out of the cage (lions and tigers)

RELEASE (cage wall)

loads/stores (dumb tourists)

You want to ensure that the loads and stores (the lions and tigers) that are in the cage can’t get out, but if somebody is foolish enough to go into the cage from the outside, you are willing to let them get eaten.  On intel the release-store is enough to ensure that preceding loads and stores are complete before anybody can see the side effect of the lock release operation.  On PPC we needed isync and lwsync to help guard the cage walls.  It looks like we need explicit acquire/release semantics on ARM to guard those walls too, but we have nice instruction variants to do exactly that.
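Here’s a sketch of such a spinlock using std::atomic (my illustration, not the DB2 implementation): an exchange with acquire semantics to take the lock (the implicitly locked XCHG on intel), and a plain release store to drop it (a plain MOV on intel, an stlr on ARM):

#include <atomic>

class spinlock
{
    std::atomic<int> locked_{ 0 };

public:
    void lock()
    {
        // acquire: loads/stores after this can't float up out of the cage
        while ( locked_.exchange( 1, std::memory_order_acquire ) )
        {
            // spin until the holder stores 0
        }
    }

    void unlock()
    {
        // release: earlier loads/stores complete before the 0 becomes visible
        locked_.store( 0, std::memory_order_release );
    }
};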

We’ve seen enough to give meaning to std::memory_order_seq_cst.  It would have made more sense to look it up first, but where is the fun in that.  Instead, we’ve seen that:

  • loads with std::memory_order_seq_cst have acquire semantics,
  • stores with std::memory_order_seq_cst have both acquire and release semantics, and
  • operations that have both load and store aspects (like fetch-and-add) likewise have both acquire and release semantics with std::memory_order_seq_cst.

On Intel we saw that std::memory_order_seq_cst for hybrid (load/store) operations was achieved with an explicit LOCK prefix (a bidirectional fence), and on ARM with instructions like LDADDAL that have both acquire and release semantics by construction.

decoding some powerpc rotate-mask instructions

November 4, 2014 C/C++ development and debugging.

I’m looking at what might be an issue in optimized code, where we are seeing signs that expected byteswapping is not occurring when optimization is enabled. Finding exactly where the code that does this swapping lives has been a bit challenging, so I decided to look at a simple 2-byte swap sequence in a standalone program first. For such code I see the following in the listing file (xlC compiler):

  182| 00002C rlwinm   5405C23E   1     SRL4      gr5=gr0,8
  182| 000030 rlwinm   5400063E   1     RN4       gr0=gr0,0,0xFF
  182| 000034 rldimi   7805402C   1     RI8       gr5=gr0,8,gr5,0xFFFFFF00

The SRL4, RN4, and RI8 “mnemonics” are internal compiler codes meant to be “intuitive”, but I didn’t find them intuitive until I figured out what the instructions actually did. Here are a couple of reverse engineering notes for that task. Tackling the first rlwinm instruction, note that the raw instruction corresponds to:

0x5405C23E == rlwinm r5,r0,24,8,31
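As a quick sanity check of that decoding, the instruction fields can be picked apart programmatically. This is a throwaway program of my own, using the Power ISA’s big-endian bit numbering, where bit 0 is the most significant bit of the word:

#include <stdio.h>
#include <stdint.h>

/* extract instruction bits hi..lo (inclusive, big-endian bit numbering) */
static unsigned field( uint32_t insn, unsigned hi, unsigned lo )
{
   return ( insn >> ( 31 - lo ) ) & ( ( 1u << ( lo - hi + 1 ) ) - 1 ) ;
}

int main()
{
   uint32_t insn = 0x5405C23E ;

   /* rlwinm layout: opcode(0:5) RS(6:10) RA(11:15) SH(16:20) MB(21:25) ME(26:30) Rc(31) */
   printf( "opcode=%u RS=r%u RA=r%u SH=%u MB=%u ME=%u Rc=%u\n",
           field( insn, 0, 5 ), field( insn, 6, 10 ), field( insn, 11, 15 ),
           field( insn, 16, 20 ), field( insn, 21, 25 ), field( insn, 26, 30 ),
           field( insn, 31, 31 ) ) ;

   return 0 ;
}

This prints “opcode=21 RS=r0 RA=r5 SH=24 MB=8 ME=31 Rc=0”, matching the disassembly above (opcode 21 is rlwinm.)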

Here’s what the Power ISA says about rlwinm:

Rotate Left Word Immediate then AND with Mask 

rlwinm RA,RS,SH,MB,ME

n  <= SH
r  <= ROTL_32 ((RS)_32:63 , n)
m  <= MASK(MB+32, ME+32)
RA <= r & m

The contents of register RS are rotated_32 left SH bits. A
mask is generated having 1-bits from bit MB+32
through bit ME+32 and 0-bits elsewhere. The rotated
data are ANDed with the generated mask and the
result is placed into register RA.

To interpret this we have to look up the meanings of the MASK and ROTL operations. On my first attempt I got the meaning of MASK() wrong, since I was counting bits from the wrong end. I resorted to the following to figure out the instruction, using the gcc inline asm immediate operand constraint “i” to build up the instruction whose effects I wanted to examine:

#include <stdio.h>

#define rlwinm( output, input, sh, mb, me ) \
   __asm__( "rlwinm %0, %1, %2, %3, %4" \
          : "=r"(output) \
          : "r"(input), "i"(sh), "i"(mb), "i"(me) \
          : )

int main()
{
   long x = 0x1122334455667788L ;
   long y ;

   rlwinm( y, x, 24, 8, 31 ) ;

   printf("0x%016lX -> 0x%016lX\n", x, y ) ;

   return 0 ;
}

This generates an rlwinm instruction with the SH,MB,ME=24,8,31 triplet that I’d found in the listing. This code produces:

0x1122334455667788 -> 0x0000000000556677

Observing the effects of the instruction in this concrete example makes them easier to interpret. The effect appears to be:

   long y = ((unsigned int)x << 24) | ((unsigned int)x >> 8) ; /* ROTL_32 by 24 */
   y |= (y << 32) ; /* the ISA's ROTL_32 replicates into the upper word, but the mask kills that */
   y &= 0xFFFFFF ; /* MASK(8+32, 31+32) */

Now the internal mnemonic “SRL4 …,8” has a specific meaning. It looks like it means Shift-Right-Lower-4-byte-word by 8 bits. It’s intuitive once you know that the L here means Lower. I didn’t guess that, and wondered what the hell “shift Right-Left” was supposed to mean.

What does RN4 mean? That instruction was:

0x5400063E == rlwinm r0,r0,0,24,31

This has no shift, but applies a mask, and that mask has 16 fewer one-bits in it. This appears to be an AND with 0xFF. A little test program, this time using “rlwinm( y, x, 0, 24, 31 )”, confirms this, as it produces:

0x1122334455667788 -> 0x0000000000000088

What could the R and N have meant? Knowing what the instruction does, I’d now guess RotateNone(andMask).

Finally, how about the RI8 operation? This time we have

0x7805402C == rldimi r5,r0,8,32

The PowerISA says of this:

Rotate Left Doubleword Immediate then Mask Insert  

rldimi RA,RS,SH,MB 

n  <= sh_5 || sh_0:4
r  <= ROTL_64 ((RS), n)
b  <= mb_5 || mb_0:4
m  <= MASK(b, ¬n)
RA <= (r & m) | ((RA) & ¬m)

The contents of register RS are rotated_64 left SH bits. A
mask is generated having 1-bits from bit MB through bit
63-SH and 0-bits elsewhere. The rotated data are
inserted into register RA under control of the generated
mask.

Let’s also see if an example makes this easier to understand. This time a read/write modifier (“+”) is required on the output operand:

#include <stdio.h>

#define rldimi( inout, input, sh, mb ) \
   __asm__( "rldimi %0, %1, %2, %3" \
          : "+r"(inout) \
          : "r"(input), "i"(sh), "i"(mb) \
          : )

int main()
{
   long x = 0x1122334455667788L ;
   long y = 0x99aabbccddeeff12L ;
   long yo = y ;

   rldimi( y, x, 8, 32 ) ;

   printf("0x%016lX,0x%016lX -> 0x%016lX\n", x, yo, y ) ;

   return 0 ;
}

This produces:

0x1122334455667788,0x99AABBCCDDEEFF12 -> 0x99AABBCC66778812

It appears that the effect is:

y = (y & ~0xFFFFFF00L) | ((x << 8) & 0xFFFFFF00L) ;

I find it tricky to understand this from the PowerISA description, so if I encountered different values of SH,MB I’d probably run them through this little reverse engineering program. That said, at least the meaning of RI8 in the -qlist output is now clear.
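For anyone without a PowerPC machine handy, here’s a small emulator I sketched from the ISA descriptions above. It reproduces all three example results, with the caveat that mask64 assumes MB <= ME (the wrap-around form of the rlwinm mask is not handled):

#include <stdio.h>
#include <stdint.h>

/* MASK(mb, me): 1-bits from bit mb through bit me, big-endian numbering
   (bit 0 is the most significant of 64).  Assumes mb <= me. */
static uint64_t mask64( unsigned mb, unsigned me )
{
   return ( ~0ULL >> mb ) & ( ~0ULL << ( 63 - me ) ) ;
}

/* rlwinm: rotate the low word left sh bits, replicate it into the upper
   word (the ISA's ROTL_32), then AND with MASK(mb+32, me+32). */
static uint64_t emul_rlwinm( uint64_t rs, unsigned sh, unsigned mb, unsigned me )
{
   uint32_t v = (uint32_t)rs ;
   uint32_t r = ( v << sh ) | ( v >> ( ( 32 - sh ) & 31 ) ) ;

   return ( ( (uint64_t)r << 32 ) | r ) & mask64( mb + 32, me + 32 ) ;
}

/* rldimi: rotate rs left sh bits, then insert into ra under MASK(mb, 63-sh). */
static uint64_t emul_rldimi( uint64_t ra, uint64_t rs, unsigned sh, unsigned mb )
{
   uint64_t r = ( rs << sh ) | ( sh ? ( rs >> ( 64 - sh ) ) : 0 ) ;
   uint64_t m = mask64( mb, 63 - sh ) ;

   return ( r & m ) | ( ra & ~m ) ;
}

int main()
{
   printf( "0x%016lX\n", emul_rlwinm( 0x1122334455667788UL, 24, 8, 31 ) ) ; /* 0x0000000000556677 */
   printf( "0x%016lX\n", emul_rlwinm( 0x1122334455667788UL, 0, 24, 31 ) ) ; /* 0x0000000000000088 */
   printf( "0x%016lX\n", emul_rldimi( 0x99aabbccddeeff12UL, 0x1122334455667788UL, 8, 32 ) ) ; /* 0x99AABBCC66778812 */

   return 0 ;
}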

A debugger walk through an out of module call on AIX

August 29, 2014 C/C++ development and debugging.

We were seeing the following powerpc instruction sequence mess up, ending up with the CTR (counter) register containing zero.  The CTR register, which can be used for computed gotos and other stuff, is one of the only registers that I believe can be both loaded and branched to easily on powerpc, and is one of the volatile registers in the ABI (one that doesn’t have to be preserved across calls).

0x090000000A9E8B78 : E8040000 ld r0,0(r4)
0x090000000A9E8B7C : 7C0903A6 mtctr r0
0x090000000A9E8B80 : E8440008 ld r2,8(r4)
0x090000000A9E8B84 : 4E800421 bctrl # 20,bit0

I had a recollection that this sequence was a call through a function pointer, but thought it might also be what we get for a plain old out-of-module call (calling something in a shared library from the main text segment, or calling some other shared library’s function from a shared library function).  Let’s see what an out-of-module call looks like, for a simple call like:

#include <stdio.h>

int main(int argc, char ** argv)
{
   printf( "out of module call\n" ) ;

   return 0;
}

I set an instruction breakpoint at the bl (branch and link) instruction for printf, and then step into that:

(dbx) stopi at 0x100000788
[1] stopi at 0x100000788 (main+0x28)
(dbx) c
[1] stopped in main at 0x100000788
0x100000788 (main+0x28) 48000059 bl 0x1000007e0 (printf)
(dbx) stepi
stopped in glink64.printf at 0x1000007e0
0x1000007e0 (printf) e98200d8 ld r12,0xd8(r2)
(dbx)

Observe that the debugger is letting us know that we aren’t actually in printf yet, but are in the glue code for printf.  Looking at the instruction sequence for this glue code, we see that it matches the type of code we saw in our NULL CTR trap sequence above:

stopped in glink64.printf at 0x1000007e4
0x1000007e4 (printf+0x4) f8410028 std r2,0x28(r1)
(dbx)

stopped in glink64.printf at 0x1000007e8
0x1000007e8 (printf+0x8) e80c0000 ld r0,0x0(r12)
(dbx)

stopped in glink64.printf at 0x1000007ec
0x1000007ec (printf+0xc) e84c0008 ld r2,0x8(r12)
(dbx)

stopped in glink64.printf at 0x1000007f0
0x1000007f0 (printf+0x10) 7c0903a6 mtctr r0

We save our TOC register (GR2, the table of contents register) to the stack,
load a new value into GR0 to copy to the CTR register, and load the TOC
register (GR2) for the module that we are calling.
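This pattern makes more sense knowing that on 64-bit AIX a “function pointer” actually points at a function descriptor rather than at the code itself. A sketch (the field names here are mine):

/* sketch of an AIX 64-bit function descriptor */
struct func_descriptor
{
    void * entry ; /* address of the first instruction:  ld r0,0x0(r12) */
    void * toc ;   /* TOC (GR2) value for the target:    ld r2,0x8(r12) */
    void * env ;   /* environment pointer, for languages that need one  */
} ;

The ld r12,0xd8(r2) at the top of the glue code fetched the address of such a descriptor from our own TOC.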

Now if we look at what just got loaded into GR0 (the value headed for the CTR register), we see that it’s the address of the actual code for printf:

(dbx) p $r0
0x0900000000004f80
(dbx) listi 0x0900000000004f80
0x900000000004f80 (printf) fbe1fff8 std r31,-8(r1)
0x900000000004f84 (printf+0x4) fbc1fff0 std r30,-16(r1)
0x900000000004f88 (printf+0x8) 7c0802a6 mflr r0
0x900000000004f8c (printf+0xc) fba1ffe8 std r29,-24(r1)
0x900000000004f90 (printf+0x10) ebe20dd0 ld r31,0xdd0(r2)
0x900000000004f94 (printf+0x14) f8010010 std r0,0x10(r1)
0x900000000004f98 (printf+0x18) 8002000c lwz r0,0xc(r2)
0x900000000004f9c (printf+0x1c) f821ff71 stdu r1,-144(r1)
0x900000000004fa0 (printf+0x20) 60000000 ori r0,r0,0x0
0x900000000004fa4 (printf+0x24) 2c000000 cmpi cr0,0x0,r0,0x0

The glue code, a branch table for out-of-module calls, gets us there, but we pay a penalty of a number of extra instructions for this call, on top of the normal function call overhead.

What does this mean for the trap scenario?  One implication is that this isn’t necessarily as simple as a NULL function pointer.  That instruction sequence is probably different (but I don’t recall exactly how at the moment).  Perhaps this means that the jump table for the currently executing shared library got corrupted?  It is probably writable, since the run time loader must be able to modify it.  I’d guess that it remains writable throughout execution to support lazy runtime loader address fixups.  This is likely not going to be an easy problem to solve.