We were seeing the following powerpc instruction sequence go wrong, ending up
with the CTR (counter) register containing zero. The CTR register, which can
be used for computed gotos among other things, is, I believe, one of the few
registers on powerpc that can be both loaded and branched to easily. It is
also one of the volatile registers in the ABI (one that a function is free to
clobber without saving and restoring it on the stack).

0x090000000A9E8B78 : E8040000 ld r0,0(r4)
0x090000000A9E8B7C : 7C0903A6 mtctr r0
0x090000000A9E8B80 : E8440008 ld r2,8(r4)
0x090000000A9E8B84 : 4E800421 bctrl # 20,bit0

I had a recollection that this sequence was a call through a function pointer,
but thought it may also be what we get for a plain old out of module call
(calling something in a shared library from the main text segment, or calling
a function in one shared library from a function in another). Let’s see what
an out of module call looks like, for a simple call like

#include <stdio.h>

int main(int argc, char ** argv)
{
   printf( "out of module call\n" ) ;

   return 0;
}

I set an instruction breakpoint at the bl (branch and link) instruction for
printf, and then step into that

(dbx) stopi at 0x100000788
[1] stopi at 0x100000788 (main+0x28)
(dbx) c
[1] stopped in main at 0x100000788
0x100000788 (main+0x28) 48000059 bl 0x1000007e0 (printf)
(dbx) stepi
stopped in glink64.printf at 0x1000007e0
0x1000007e0 (printf) e98200d8 ld r12,0xd8(r2)
(dbx)

Observe that the debugger is letting us know that we aren’t actually in printf
yet, but are in the glue code for printf. Looking at the instruction sequence
for this glue code we see that it matches the type of code we saw in our NULL
CTR trap sequence above

stopped in glink64.printf at 0x1000007e4
0x1000007e4 (printf+0x4) f8410028 std r2,0x28(r1)
(dbx)

stopped in glink64.printf at 0x1000007e8
0x1000007e8 (printf+0x8) e80c0000 ld r0,0x0(r12)
(dbx)

stopped in glink64.printf at 0x1000007ec
0x1000007ec (printf+0xc) e84c0008 ld r2,0x8(r12)
(dbx)

stopped in glink64.printf at 0x1000007f0
0x1000007f0 (printf+0x10) 7c0903a6 mtctr r0

We save our TOC register (GR2, the table of contents register) to the stack,
load a new value into GR0 to copy to the CTR register, and load the TOC
register (GR2) for the module that we are calling.
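The glue steps traced above can be sketched in C as a small register-level simulation. This is only an illustration under assumptions: the struct func_desc layout (code address at offset 0, TOC at offset 8) is inferred from the 0(r12) and 8(r12) loads in the trace, and the names func_desc, regs, and glink_sim are hypothetical, not anything dbx or the loader provides.

```c
#include <stdint.h>

/* Hypothetical model of the two doublewords that the glue code loads.
 * The layout is an assumption read off the 0(r12) and 8(r12) offsets. */
struct func_desc {
    uint64_t entry; /* offset 0: code address, destined for CTR */
    uint64_t toc;   /* offset 8: callee's TOC, destined for r2  */
};

/* Hypothetical model of the registers the glue code touches. */
struct regs {
    uint64_t r0, r2, r12, ctr;
    uint64_t saved_toc; /* stands in for the stack slot at 0x28(r1) */
};

/* Mirror the traced glue instructions one by one.
 * Returns 0 if the branch target is non-zero, -1 if the final branch
 * would go to address zero (the trap scenario). */
int glink_sim(struct regs *r, const struct func_desc *fd)
{
    r->r12 = (uint64_t)(uintptr_t)fd; /* ld    r12,0xd8(r2) */
    r->saved_toc = r->r2;             /* std   r2,0x28(r1)  */
    r->r0 = fd->entry;                /* ld    r0,0x0(r12)  */
    r->r2 = fd->toc;                  /* ld    r2,0x8(r12)  */
    r->ctr = r->r0;                   /* mtctr r0           */
    return r->ctr != 0 ? 0 : -1;      /* branch via CTR     */
}
```

Feeding glink_sim a descriptor whose first doubleword is zero leaves CTR at zero, which is the state we saw at the trap. It also shows why a bad descriptor would not fault until the final branch.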

Now if we look at what just got loaded into GR0 (the value about to be copied
to the CTR register), we see that it’s the address at which we find the actual
code for printf

(dbx) p $r0
0x0900000000004f80
(dbx) listi 0x0900000000004f80
0x900000000004f80 (printf) fbe1fff8 std r31,-8(r1)
0x900000000004f84 (printf+0x4) fbc1fff0 std r30,-16(r1)
0x900000000004f88 (printf+0x8) 7c0802a6 mflr r0
0x900000000004f8c (printf+0xc) fba1ffe8 std r29,-24(r1)
0x900000000004f90 (printf+0x10) ebe20dd0 ld r31,0xdd0(r2)
0x900000000004f94 (printf+0x14) f8010010 std r0,0x10(r1)
0x900000000004f98 (printf+0x18) 8002000c lwz r0,0xc(r2)
0x900000000004f9c (printf+0x1c) f821ff71 stdu r1,-144(r1)
0x900000000004fa0 (printf+0x20) 60000000 ori r0,r0,0x0
0x900000000004fa4 (printf+0x24) 2c000000 cmpi cr0,0x0,r0,0x0

The glue code, a branch table for out of module calls, gets us there, but
we pay a penalty of several extra instructions for this call, in addition to
the normal function call overhead.

What does this mean for the trap scenario? One implication is that this isn’t
necessarily as simple as a NULL function pointer. That instruction sequence
is probably different (but I don’t recall exactly how at the moment). Perhaps
this means that the jump table for the currently executing shared library got
corrupted? It is probably writable, since the run time loader must be able to
modify it. I’d guess that it remains writable throughout execution to support
lazy runtime loader address fixups. This is likely not going to be an easy
problem to solve.
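To settle how the function pointer case differs, one option is to disassemble a small test like the following and compare it against the glue sequence. This is a minimal sketch: the names callback_t, twice, call_through, and call_twice are made up for illustration, and it would probably need to be compiled without optimization so that the indirect call isn't inlined away.

```c
#include <stddef.h>

/* Hypothetical callback type for the experiment. */
typedef int (*callback_t)(int);

/* A trivial target for the indirect call. */
static int twice(int x)
{
    return 2 * x;
}

/* The call of interest: the target is only known at run time, so the
 * compiler has to emit an indirect branch here. */
int call_through(callback_t fn, int arg)
{
    if (fn == NULL)
        return -1; /* avoid branching through a NULL pointer */
    return fn(arg);
}

/* Wrapper that gives the indirect call a real target. */
int call_twice(int arg)
{
    return call_through(twice, arg);
}
```

Setting a stopi breakpoint inside call_through and stepping with stepi, as was done for printf above, should show whether the inline sequence matches the trapping one.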