Rough timings are in, and it's not going to be as easy to fix as I'd hoped... I'm just using libogc + wiiload and PMC3 to count cycles (messy code at https://gist.github.com/hthh/ad17a78d95abd5cdac21954b3b4486f4 )
Dolphin (deterministic):
size 180: cycles = 256, i = 47
Wii:
size 180: cycles = 949, i = 4
size 180: cycles = 498, i = 29
size 180: cycles = 494, i = 28
size 180: cycles = 503, i = 28
size 180: cycles = 517, i = 29
size 180: cycles = 500, i = 29
size 180: cycles = 497, i = 29
size 180: cycles = 497, i = 29
size 180: cycles = 497, i = 29
size 180: cycles = 497, i = 29
size 180: cycles = 497, i = 29
size 180: cycles = 497, i = 29
size 180: cycles = 494, i = 28
size 20: cycles = 889, i = 3
size 20: cycles = 511, i = 3
size 40: cycles = 500, i = 29
size 40: cycles = 509, i = 28
size 60: cycles = 509, i = 25
size 60: cycles = 500, i = 30
size 80: cycles = 497, i = 29
size 80: cycles = 494, i = 28
size a0: cycles = 497, i = 30
size a0: cycles = 497, i = 30
size c0: cycles = 497, i = 30
size c0: cycles = 497, i = 30
size e0: cycles = 494, i = 28
size e0: cycles = 497, i = 30
size 100: cycles = 497, i = 30
size 100: cycles = 494, i = 28
size 120: cycles = 494, i = 28
size 120: cycles = 497, i = 30
size 140: cycles = 494, i = 28
size 140: cycles = 497, i = 30
size 160: cycles = 494, i = 28
size 160: cycles = 494, i = 28
size 180: cycles = 494, i = 28
size 180: cycles = 497, i = 30
size 1a0: cycles = 497, i = 30
size 1a0: cycles = 497, i = 30
size 1c0: cycles = 497, i = 30
size 1c0: cycles = 494, i = 28
size 1e0: cycles = 494, i = 28
size 1e0: cycles = 494, i = 28
The high first result in each run is presumably a cold-cache effect? My methodology isn't very precise, but I'd guess we could raise the tick count to 400 and be strictly more accurate. Even when I accidentally included too much libogc overhead in the timing, the largest cold-cache result I saw was under 2000, so I think increasing the delay to anything like 8000 is the wrong answer.
The "i" number is produced by a loop of "while (!done) i++;" that waits for the callback to be invoked. It shows that real hardware executes fewer iterations despite taking around twice as many cycles, which is a bit surprising.
I'm a fair bit more confused than when I began... If we imagine the 400-cycle figure is exact, we're exactly in the middle of the range of values that work (logarithmically). Make the CPU 20x slower and we'd run before the callback is registered, or make the CPU 20x faster and it'd initialise the global before we run. The latter seems impossible, and the former merely improbable.
Could it be cache? Looking at the 32 instructions in the window, the only likely miss would be the call to AXRegisterCallback. Everything else should have been in cache recently (although we do cross a cache line during linear execution in AXStartDMA).
Could it be unexpectedly slow instructions? The final mfmsr in OSDisableInterrupts might be a bit slower than we account for? There are three BLRs and two BLs, I guess?
Maybe these 32 instructions run just slowly enough that another interrupt always lands in that window, allowing the 400 cycles to pass before the interrupt we care about?
Maybe some other DSP related configuration I've overlooked changes the speed of the interrupt?
Maybe there's a mistake in my test and 400 cycles is completely wrong?
Anyway, sorry about the ramble - I really hoped this'd be easy after hardware testing.