Emulator Issues #12565
closed(WiiVC Majora's Mask) [NARE01] Dolphin does not emulate unaligned uncached stores the same as real Wii hardware.
Added by Rylie over 3 years ago. Updated over 3 years ago.
Description
Game Name?
The Legend of Zelda: Majora's Mask
Game ID? (right click the game in the game list, Properties, Info tab)
NARE01 (000100014e415245)
MD5 Hash? (right click the game in the game list, Properties, Verify tab, Verify Integrity button)
ac4cdf326371cf619744241ed693feb3
What's the problem? Describe what went wrong.
This is a highly technical problem that pretty much only affects speedrunners, requires hefty explanation and game knowledge, and isn't documented in the Broadway CPU manual.
To begin, let me say that The Legend of Zelda: Majora's Mask is an extremely broken videogame. In particular, it has a glitch known as SRM (Stale Reference Manipulation) that allows the player to overwrite nearly anything within the region of memory known as the "actor heap" [[https://www.zeldaspeedruns.com/mm/srm/srm-primer]]. A recently found application of SRM called "LightNode SRM" enables us to write an arbitrary (32-bit) word to an arbitrary address. This is the holy grail of glitches for any videogame, and is typically referred to as "Arbitrary RAM Write".
LightNode SRM is actually pretty weak on a real N64, because, as a MIPS machine, it doesn't support misaligned loads and stores. But the PowerPC architecture does support misaligned loads and stores, and this ability carries through to the Wii Virtual Console version of Majora's Mask (likely the GCN and Wii U VC versions as well, but those are untested since the speedrunning community does not have practice tools for those versions). The catch is that the Broadway CPU manual says very little about how misaligned loads and stores are actually implemented, merely that they incur a performance penalty.
Through testing using our practice tool, called "kz", on real Wiis, our community has discovered some idiosyncrasies about how misaligned stores work (at least within the environment of the WiiVC version of Majora's Mask). The behavior of misaligned stores on real hardware depends on the memory domain to which the word is being stored. In particular, we have 4 memory domains that work on NARE (that we've discovered so far).
- 80xxxxxx: Emulated N64 Cached RDRAM / Wii Cached MEM1 1T-SRAM
- A0xxxxxx: Emulated N64 Uncached RDRAM
- C0xxxxxx: Wii Uncached MEM1 1T-SRAM
- E0xxxxxx: Not entirely sure. Possibly N64 TLB as used in Paper Mario, which released on WiiVC before Majora's Mask did.
When the misaligned store is to the 80/A0 memory domain, it works basically as expected, and Dolphin works the same as hardware. Namely, exactly 4 consecutive Bytes will be written at the specified address.
When the misaligned store is to the C0/E0 memory domain, things get interesting. Instead of writing only 4 Bytes (1 word), we see either 8 Bytes (2 words) or 16 Bytes (4 words) depending on the address being written to. If the address is such that at least one Byte will be written on either side of a doubleword boundary (so xxxxxxx0 or xxxxxxx8) then a full 4 words is written (we call this a QuadWord Write or QWW). If the crossing point is only a singleword boundary (so xxxxxxx4 or xxxxxxxC), then only 2 words are written (we call this a DoubleWord Write or DWW). Dolphin, however, treats this case the same as the previous. Only a single (misaligned) word is written, not two or four.
As you can probably imagine, this behavior that real hardware has is extremely desirable. If being able to write a singleword to an arbitrary address was powerful, being able to write a quadword is even more so. In particular, the latest Any% route for WiiVC hard-requires the QWW behavior, which means it does not work on Dolphin.
Using the Any% route as a concrete example:
We are writing the word E02D0043 to the address C01C5557. The memory in the destination region looks like this in Dolphin after the write:
C01C5550: ????????
C01C5554: ??????E0
------------------ doubleword boundary
C01C5558: 2D0043??
C01C555C: ????????
On real Wii hardware, however, we get the following:
C01C5550: 2D0043E0
C01C5554: 2D0043E0
------------------ doubleword boundary
C01C5558: 2D0043E0
C01C555C: 2D0043E0
My conjecture about what is happening is this:
The Broadway CPU implements unaligned stores via a multistep process. First, it bitshifts the word to be written. Next, it writes that shifted word to both word-aligned addresses involved. In the case of crossing a doubleword boundary, it does the same with writing a bitshifted doubleword to both doubleword-aligned addresses involved. Note that this is a complete guess with no evidence, merely a gut feeling based on previous studies of assembly language and electronics. As to why this only affects certain memory domains, I haven't the foggiest. The memory domain-specific aspects might actually be due to the N64 emulator packaged in the WiiVC release, but I honestly believe the misaligned store duplication behavior itself is CPU hardware-level.
What steps will reproduce the problem?
Follow the steps in the video linked at the bottom (it's far too much to explain in a text format since it requires knowledge of game-specific glitches and mechanics).
Alternatively, simply load my savestate (not attached, but can be provided), target-walk into the tunnel, drop the invisible pot, and play Song of Time.
Is the issue present in the latest development version? For future reference, please also write down the version number of the latest development version.
Yes, 5.0-14519
Is the issue present in the latest stable version?
Yes, 5.0
If the issue isn't present in the latest stable version, which is the first broken version? (You can find the first broken version by bisecting. Windows users can use the tool https://forums.dolphin-emu.org/Thread-green-notice-development-thread-unofficial-dolphin-bisection-tool-for-finding-broken-builds and anyone who is building Dolphin on their own can use git bisect.)
[First broken version number here (if applicable)]
If your issue is a graphical issue, please attach screenshots and record a three frame fifolog of the issue if possible. Screenshots showing what it is supposed to look like from either console or older builds of Dolphin will help too. For more information on how to use the fifoplayer, please check here: https://wiki.dolphin-emu.org/index.php?title=FifoPlayer
[Attach any fifologs if possible, write a description of fifologs and screenshots here to assist people unfamiliar with the game.]
What are your PC specifications? (CPU, GPU, Operating System, more)
Intel 4570R CPU, Intel Iris Pro 5200 Integrated GPU, 16GB 1600 MHz DDR3 RAM, Windows 10 Pro 21H1
Is there anything else that can help developers narrow down the issue? (e.g. logs, screenshots, configuration files, savefiles, savestates)
I do have savestates but they exceed the maximum file upload size, even when 7z-compressed. If you would like them, just let me know where you'd like me to upload them.
I also made this video demonstrating the behavior, teaching how to reproduce it, and explaining exactly what is going on in the game: [[https://youtu.be/g8IOuVOL-oU]]
Files
WiiAddressResolution.PNG (70.3 KB) | Rylie, 07/06/2021 03:52 PM
gc-unaligned.zip (456 KB) | Standalone hwtest gamecube | phire, 07/24/2021 12:42 AM
gc-unaligned-passed.png (1.68 KB) | Results of running on a real gamecube | phire, 07/24/2021 12:43 AM
DolphinCreditsWarp.PNG (353 KB) | Rylie, 07/27/2021 03:37 PM
Updated by degasus over 3 years ago
- Subject changed from (WiiVC Majora's Mask) [NARE01] Dolphin does not emulate unaligned stores the same as real Wii hardware. to (WiiVC Majora's Mask) [NARE01] Dolphin does not emulate unaligned uncached stores the same as real Wii hardware.
Updated by JosJuice over 3 years ago
Based on a hardware test, unaligned writes to the 0x80000000 area behave as usual, and unaligned writes to the 0xC0000000 area seem to cause exceptions. This matches IBM's PowerPC documentation. Perhaps the funky "duplication" behavior is caused by the game's alignment exception handler?
I will try to implement this exception correctly and create a build you can test. (I only have NARP personally and not NARE, so I wouldn't be able to load your savestate.)
Updated by Rylie over 3 years ago
Oh interesting! I hadn't considered that possibility before. I wonder if the store is being implemented by the internal N64-to-Wii dynarec as a stwcx. That would definitely generate an alignment exception, and would certainly make sense, since for non-80/A0 address space, it simply adds the N64 address (C01C5557 in this case) to a fixed offset relative to the start of MEM1 where emulated RDRAM is stored (0x00F22FC0 for the JP version of our practice tool, "kz", and approximately that value for other versions). stwcx would be an efficient way to sum the N64 address and MEM1 offset and do a store in a single instruction.
Admittedly I have not researched the PPC recompiled assembly nearly as much as the MIPS original, so that's why a lot of this is conjecture for me. I can try to look into it more if needed, I'm just not as comfortable with Dolphin's debugging tools as PJ64's, but I'm sure I can figure it out with enough time.
In any case, I would be happy to test such a build for you! Unfortunately, the speedrunning community doesn't tend to make setups for PAL/NARP, since it's presumed slower by default due to 50 Hz instead of 60 Hz.
Updated by JosJuice over 3 years ago
stwcx is documented as generating an alignment exception on misaligned addresses regardless of whether the memory mapping is cache-inhibited, but as you've shown, the behavior here is different for 0xC0000000 (which is cache-inhibited). So the instruction involved is probably one of the more "normal" store instructions, like stwx.
And yes, the PAL version is pretty slow. Even in casual play it feels sluggish unless I put on the Bunny Hood :)
Updated by JosJuice over 3 years ago
Here's the pull request containing the changes I would like you to test: https://github.com/dolphin-emu/dolphin/pull/9865
Windows build: https://dl.dolphin-emu.org/prs/66/1f/pr-9865-dolphin-latest-x64.7z
Testing instructions:
- Start Dolphin, then close it again (just to add the AlignmentExceptions line to Config/Dolphin.ini)
- Open Config/Dolphin.ini and change AlignmentExceptions = False to AlignmentExceptions = True
- Start Dolphin again and test if Majora's Mask behaves differently
Loading a savestate from a previous version of Dolphin is probably fine (as long as you know that the misaligned store didn't happen before making the savestate, of course), but if you can't manage to get the intended behavior when using an old savestate, please also try without using any old savestates.
Updated by Rylie over 3 years ago
Unfortunately, that build crashes when I do the unaligned write (even with making new savestates), complete with the characteristic Wii buzzing sound :) Is there any crash log or anything I can provide?
Something that may be worth mentioning... LightNode SRM actually does 2 writes. So while writing E02D0043 to the address C01C5557 is the goal, it comes with a spurious write of C01C5553 to the address E02D004B (which ends up as a DWW on real hardware, since it only crosses the singleword boundary of E02D004C). Obviously E0 is not a valid memory domain for the Wii (as far as I know), which is why I'm assuming it's something built into the internal N64-Wii emulator, likely to handle Paper Mario which uses the N64 E0xxxxxx domain heavily for its TLB. Also note, all of the above are N64 addresses, so add 0x00F1D0E0 to convert to Wii addresses.
It would be better if I could read the recompiled PPC asm for how the internal dynarec handles writes to N64 address space, but I can't seem to figure out breakpoints in Dolphin. When I set a write breakpoint on the address C01C5557, I expect it to stop execution when that address is written to for the first time, but that doesn't happen for some reason.
Updated by JosJuice over 3 years ago
- Assignee deleted (JosJuice)
I think I'll unassign myself from this issue since looking into what the game actually is doing is a bit too much for me. But if someone else is able to figure out what's wrong, I'm up for updating the pull request and trying to get it merged.
Updated by JMC4789 over 3 years ago
Breakpoints in Dolphin only work in MEM1, which is why that breakpoint doesn't work.
Updated by Rylie over 3 years ago
- File WiiAddressResolution.PNG added
I figured out how to get breakpoints to work. Turns out I was breaking on cached (8) address space when the write was actually happening to uncached (C) address space.
Here's a rough mental decompilation for how the Wii address is being resolved for this write.
uint32_t WiiAddress = (N64Address & 0xDFFFFFFF) + 0x00F1D0E0;
Since the first nybble is being &ed with 0xD, that means that 0x8 and 0xA N64 addresses both resolve to 0x8 Wii addresses (cached MEM1), and 0xC and 0xE N64 addresses both resolve to 0xC Wii addresses (uncached MEM1). So this bug report is definitely only about uncached unaligned writes. It's a good thing degasus updated the title 4 days ago.
Lastly, the store that's happening is just a regular stw instruction. So somehow, unaligned stw to uncached MEM1 is what's causing the write duplication behavior. I hope that helps. I'll keep messing around with Dolphin debugging tools now that I know how they work, and see if I can come up with any more information for you.
Updated by JMC4789 over 3 years ago
- Status changed from New to Accepted
I don't know about the other developers, but I'd love to emulate this correctly because it's a super cool weird edge-case.
Updated by Rylie over 3 years ago
The OoT community has finally started playing around with this hardware quirk for their new Any% route. Usually once they get their hands on something, it gets figured out a lot faster since the OoT community is much bigger. Someone from that community suggested to me that this is likely a 60x bus behavior regarding memory transactions, not a CPU behavior regarding loads and stores. They linked this reference manual of the 60x bus: https://www.nxp.com/docs/en/reference-manual/MPC60XBUSRM.pdf
I've taken a look at it and while it does confirm that crossing a doubleword boundary with a memory access requires two separate accesses (one for each doubleword), it does not mention that each access will fill the entire doubleword if cache-inhibited. Maybe someone who understands the GC/Wii architecture better than me will get more out of this.
Updated by JosJuice over 3 years ago
"Cache-inhibited" is a concept that exists in the CPU, not on the bus. However, whether an access is cache-inhibited does end up affecting the bus access patterns of the CPU. When caching is used, the CPU reads and writes entire cache lines at once (or at least I believe so?), and when caching isn't used, the CPU just reads and writes exactly what the program requested. So when caching is not inhibited, none of the CPU's accesses are misaligned as far as the bus can tell, even if the program is doing misaligned accesses from its own perspective.
So, yes, it's possible that this behavior has to do with how the bus works. I don't know if the fact that I managed to get what looks like a crash when doing misaligned writes to a cache-inhibited view of memory is a red herring or not...
Updated by leo60228 over 3 years ago
While he doesn't go into much detail on the exact behavior, this thread from marcan seems closely related.
Updated by delroth over 3 years ago
Has someone managed to reproduce this on real hardware with a minimal test case? e.g. a single "stw" to the uncached 1T-SRAM mapping which triggers the 128b overwrite. This would be a good starting point.
#3 mentions trying to reproduce but I suspect it wasn't done right -- JosJuice probably knows that now after having written the pull request, but "stw" shouldn't trigger an unaligned exception, even when writing to M=0 mappings. Instead the PPC should convert that to two single-beat 60x writes, with the right write byte enables to mask writes to the irrelevant areas within each 64b dword.
I've been brainstorming this with segher, the current most plausible hypothesis we have is that something in Hollywood is mishandling these write masks on the 60x transactions. The 1T-SRAM is probably not directly on the 60x bus, there's likely some crossbar in Hollywood itself and some 1T interface, and it's possible that at one of these levels the write masks get converted to some other system (e.g. access sizes). If this conversion logic pattern matches "known patterns" that only supports aligned accesses mask patterns, and defaults to full 64b write otherwise, then that might explain this situation.
It's known that Hollywood does weird stuff with these write masks in some situations: for example, they get ignored on some MMIO accesses. This clearly indicates that something somewhere drops that information. But clearly it's not entirely dropped, because otherwise no single-beat 60x memory transaction (aka. any cache-inhibited write) would ever behave properly.
Updated by Rylie over 3 years ago
Has someone managed to reproduce this on real hardware with a minimal test case? e.g. a single "stw" to the uncached 1T-SRAM mapping which triggers the 128b overwrite. This would be a good starting point.
I don't really have the Wii development skills necessary to write a program like this, but I'm reasonably sure what is happening in OoT/MM VC doesn't have any funny business beyond the single unaligned, uncached stw we know it does (see screenshot of the disassembly in my earlier post).
I've been brainstorming this with segher, the current most plausible hypothesis we have is that something in Hollywood is mishandling these write masks on the 60x transactions. The 1T-SRAM is probably not directly on the 60x bus, there's likely some crossbar in Hollywood itself and some 1T interface, and it's possible that at one of these levels the write masks get converted to some other system (e.g. access sizes). If this conversion logic pattern matches "known patterns" that only supports aligned accesses mask patterns, and defaults to full 64b write otherwise, then that might explain this situation.
This sounds very likely based on marcan's thread.
It's known that Hollywood does weird stuff with these write masks in some situations: for example, they get ignored on some MMIO accesses. This clearly indicates that something somewhere drops that information. But clearly it's not entirely dropped, because otherwise no single-beat 60x memory transaction (aka. any cache-inhibited write) would ever behave properly.
My guess is the catch here is that it's not any cache-inhibited write, it's cache-inhibited unaligned writes (perhaps also, cache-inhibited writes of less than 1 word). Each of these by itself is rare, both at once is probably something that almost never happened in retail software. When you're using a cache, even "unaligned" writes present themselves to the 1T-SRAM as an aligned 1+ word write.
Updated by delroth over 3 years ago
Basic attempts at a repro failed, so it seems like there must be some kind of funny business involved. Here are the cases I've tested, for completeness. ubuffer here is a 16B aligned uncached MEM1 address:
printf("aligned 32b to +4\n");
reset_buffer();
ubuffer[1] = 0x12345678;
hexdump(ubuffer);
printf("aligned 16b write to +6\n");
reset_buffer();
((u16*)ubuffer)[3] = 0x1234;
hexdump(ubuffer);
printf("aligned 8b write to +7\n");
reset_buffer();
((u8*)ubuffer)[7] = 0x12;
hexdump(ubuffer);
printf("unaligned 32b to +5\n");
reset_buffer();
*(u32*)((u8*)ubuffer + 5) = 0x12345678;
hexdump(ubuffer);
printf("unaligned 32b to +6\n");
reset_buffer();
*(u32*)((u8*)ubuffer + 6) = 0x12345678;
hexdump(ubuffer);
printf("unaligned 32b to +7\n");
reset_buffer();
*(u32*)((u8*)ubuffer + 7) = 0x12345678;
hexdump(ubuffer);
printf("unaligned 16b to +7\n");
reset_buffer();
*(u16*)((u8*)ubuffer + 7) = 0x1234;
hexdump(ubuffer);
printf("unaligned 32b to +3\n");
reset_buffer();
*(u32*)((u8*)ubuffer + 3) = 0x12345678;
hexdump(ubuffer);
printf("unaligned 16b to +3\n");
reset_buffer();
*(u16*)((u8*)ubuffer + 3) = 0x1234;
hexdump(ubuffer);
All of these show the "expected" writes, nothing being overwritten that shouldn't be overwritten (https://i.imgur.com/VaRohkD.png). I've also tried matching the HID0/HID2/HID4 settings from OOT VC, in case some weird bit got enabled there, but that didn't change anything. DBAT mappings also don't seem unusual for the 0xC range in OOT VC. Running out of ideas right now, will come back to it tomorrow morning maybe :)
Updated by delroth over 3 years ago
Good news: turns out my repro was broken because I accidentally dropped a volatile at a critical place and the compiler decided it should just make all the writes aligned...
Bad news: now the repro looks pretty much as simple as it could be ("stw rA, 7(rB)"). But when doing any kind of unaligned u32 write, or even aligned u16 writes to +2, we get a Broadway hang which doesn't even look like it triggers any exception (like, it's not a DSI/ISI/Alignment/MCE, it just freezes as far as we can tell). I'm entirely confused by this, because this is not a result that anyone else has been getting. I suspect I'm doing something stupid, but it's very unclear what, and aligned 32b writes do work with that same test harness...
Updated by Rylie over 3 years ago
Very interesting. At least we are finally getting some sort of abnormal behavior, though it's not exactly what we were expecting.
This is a long shot, but is there any chance that the behavior is somehow reliant on the b (branch) instruction immediately after the stw? I know on N64, we have to be particularly careful around branches and jumps due to the delay slot. Pretty sure PPC doesn't have one of those, but it's just an idea I had, since I'm still finding it hard to envision any shenanigans within the game itself, since the code is so simple (literally just the four instructions: and, add, stw, b)
Updated by delroth over 3 years ago
It definitely seems like there's an interaction with either the instruction pipeline, the load/store execution units on the CPU, or something else altogether. So far we've found many ways in which it freezes, one very careful way where it doesn't freeze with just the right sequence of instructions... but then it still doesn't repro the behavior you've been seeing with 32b->128b, instead the unaligned 32b write just gets executed normally. eigenform is working on trying to refactor and minimize the working unaligned write instruction sequence, then we want to try it on MEM2 to see if the behavior differs there (marcan claims it should?).
This is a rabbit hole of wtf, and with 4 experienced Wii reverse engineers / emulator developers / PPC experts brainstorming this for now almost 6h we've yet to even figure out a theory as to what's going on.
Updated by meta over 3 years ago
Alright, I think GCC threw delroth, phire and me for a loop: it was coalescing stores to avoid the misaligned one. :^(
Avoiding all compiler magic, misaligned accesses to the uncacheable MEM1 mapping seem to consistently crash in my tests built with libogc.
Not clear what's happening yet. Have tests and notes at https://github.com/eigenform/broadway-misaligned-accs
Updated by phire over 3 years ago
My current unproven theory is the following:
- Broadway supports uncached, unaligned writes perfectly fine (documented, but not proven).
- Hollywood's implementation of the 60x bus is incomplete and doesn't support unaligned writes.
- On receiving the two-beat, two-part unaligned bus transfer, Hollywood's Processor Interface interprets it as a regular four-beat 32-byte cacheline transfer.
- Hollywood is stuck waiting for the next two beats, which never come.
- Broadway is stuck waiting for the write-complete acknowledgement, which never comes.
- Deadlock. Broadway is hung.
And in our homebrew testing, this is where we are stuck: any unaligned write ends up in a Broadway hang.
But Zelda VC has a bunch of other activity happening within hollywood. 3d rendering, efb write-backs, audio rendering, IOS commands etc. My theory is that one of those (probably a write to MEM1) is causing PI and Broadway to exit their deadlock and continue execution.
I'm currently setting up some tooling so I can examine this broadway hang from starlet, which I hope will tell us more about the nature of the hang.
Edit: Wild theory disproved below.
Updated by JosJuice over 3 years ago
Bad news: now the repro looks pretty much as simple as it could be ("stw rA, 7(rB)"). But when doing any kind of unaligned u32 write, or even aligned u16 writes to +2, we get a Broadway hang which doesn't even look like it triggers any exception (like, it's not a DSI/ISI/Alignment/MCE, it just freezes as far as we can tell). I'm entirely confused by this, because this is not a result that anyone else has been getting. I suspect I'm doing something stupid, but it's very unclear what, and aligned 32b writes do work with that same test harness...
It is possible that I in fact was getting a hang. I was just assuming that it was an exception that wasn't getting handled properly, since getting an actual hang of the CPU is quite strange. However I was only getting hangs when doing unaligned writes to the 0xC0 area (with libogc's default mappings), not the 0x80 area.
Updated by Rylie over 3 years ago
But Zelda VC has a bunch of other activity happening within hollywood. 3d rendering, efb write-backs, audio rendering, IOS commands etc. My theory is that one of those (probably a write to MEM1) is causing PI and Broadway to exit their deadlock and continue execution.
I just asked our resident expert on the N64 VC emulator, and he said that the Gfx fifo and likely the frame buffers are both in MEM1. He doesn't think that the GPU would write to the Gfx fifo, but is positive it writes to the frame buffers. So the GPU writing to MEM1 in Zelda 64 VC is highly plausible. I'm under the impression (perhaps incorrectly) that MEM2 is usually used for the GPU in most games, but in N64 VC it's basically only used to store the N64 ROM, meaning that MEM1 is the general purpose RAM for both CPU and GPU.
Updated by phire over 3 years ago
Ok, turns out the reason why meta's, josjuice's and delroth's reproduction attempts were failing was stupid.
libogc enables the ERROR interrupt from PI for some reason (likely bad copy/paste). It then never handles or clears the interrupt flag, resulting in the interrupt continually triggering.
Thanks to Extrems for pointing this out.
In my mini/ppcskel environment, I've replicated both the original behaviour of this issue (both DWW and QWW) as described, along with Broadway hanging when you enable (but don't handle) ERROR interrupts.
There are no special conditions for replicating the bug, it should be super-easy for someone to actually implement. It was just libogc's faulty exception handling throwing everyone off.
Updated by Rylie over 3 years ago
Oh awesome! Glad it was just a compiler issue, and it really is as simple as it sounds (unaligned uncached stw = DWW/QWW with no VC emu funny business). So is the behavior likely due to faulty write-masking between the MEM1 and the 60x bus as delroth suggested in the quote below?
I've been brainstorming this with segher, the current most plausible hypothesis we have is that something in Hollywood is mishandling these write masks on the 60x transactions. The 1T-SRAM is probably not directly on the 60x bus, there's likely some crossbar in Hollywood itself and some 1T interface, and it's possible that at one of these levels the write masks get converted to some other system (e.g. access sizes). If this conversion logic pattern matches "known patterns" that only supports aligned accesses mask patterns, and defaults to full 64b write otherwise, then that might explain this situation.
Updated by Rylie over 3 years ago
Also, do we think this behavior would happen on GCN as well? Like I mentioned previously, the MM community thinks it probably does, but it's difficult for us to test since we don't have practice tools for the GCN version of the game. Would anyone be able to run a basic test like this on an actual GCN (without using upstream libogc due to the bug)?
Updated by phire over 3 years ago
Actually, I can point directly to the patent which shows the issue (at least for the gamecube)
https://patents.google.com/patent/US8098255
If you check out figure 11, it shows the exact internal bus layout between the Processor Interface (which terminates the 60x bus) and Memory Controller (which connects to the 1T-SRAM). There are only two bits of mask information, labelled as pi_mem_msk(1:0), which allow masking just the top 32 bits or bottom 32 bits of data in this internal 64-bit transfer.
The text of the patent confirms this. The QWW behaviour comes from the fact that the write crosses the 8-byte boundary, so two of these 64-bit transfers are needed.
Also, if you check out figure 6A/6B, the Memory controller is connected to four Memory Access Controllers, each controlling 32bits of memory. It's likely that the memory controller can only mask by disabling a whole Memory Access Controller.
It looks like the Wii has inherited this behaviour (Hollywood is a faster flipper with a few extra things bolted on). I did some testing and looks like the behaviour is the same on both MEM1 and MEM2.
Updated by phire over 3 years ago
Yes, I'm 99.9% sure this will apply to the gamecube too.
BTW, I also did some quick tests with byte and halfword writes. It seems they simply ignore the size and write the full 32bits of the register, and then the same DWW and QWW behaviour happens.
I'm curious what happens when you do unaligned writes to registers within flipper. I don't think they are going over this internal PI <--> MI bus and the behaviour could be different.
Updated by JMC4789 over 3 years ago
If you do get this working in Dolphin, I request some footage of the glitch in action on Dolphin for the progress report :)
Updated by phire over 3 years ago
I'm not 100% sure who's doing the word replication behaviour.
I suspect it's actually Broadway, which just fills out the rest of the bus with duplicated bits. From the point of view of the 60x bus, those bits are undefined and it might have been easier internally to just replicate.
The other option is that PI is replicating the words. Would seem a bit weird, since PI clearly knows that Broadway has messed up and done an unaligned write. It triggers an interrupt to let the code know it messed up, and even has address and cause registers with details of the error (Note to self, Batman: Vengeance is a debug build that contains code to print out these error registers)
But despite notifying Broadway of the error, PI then lets the write pass through to Memory interface. I guess in their quest for performance, ArtX/ATI decided to not add hardware to actually block the request.
Updated by JosJuice over 3 years ago
- Assignee set to JosJuice
Now that we know what's going on, I'd like to continue to try to implement this.
Updated by JMC4789 over 3 years ago
I'm excited to see this in action in Dolphin.
Updated by JosJuice over 3 years ago
I've posted a working hardware test here: https://github.com/dolphin-emu/hwtests/pull/42
Unfortunately I don't understand the pattern behind when PI ERROR interrupts are raised and when they aren't. I was wondering if it was due to the interrupt not happening immediately, but (back when I was running the tests for 32-bit stores before 16-bit and 8-bit stores) I always got an interrupt after the very first misaligned write, and adding an eieio did not appear to change the results whatsoever (though I am not certain if this is sufficient for avoiding race conditions). Because of this, I think I will focus on implementing the QWW/DWW behavior in Dolphin without implementing PI ERROR interrupts.
Updated by phire over 3 years ago
- File gc-unaligned.zip gc-unaligned.zip added
- File gc-unaligned-passed.png gc-unaligned-passed.png added
I modified JosJuice's hardware test to be standalone for gamecube.
Confirmed that the DWW/QWW hardware behaviour is identical on gamecube. Though that doesn't guarantee it's actually exploitable, as the gamecube version of the N64 emulator might be different enough to not trigger an unaligned uncached write.
Updated by Rylie over 3 years ago
Wow this is fascinating. So it really is basically just like how I guessed it was in my initial video on the subject. The Broadway just bitshifts the word to be written by 1 to 3 bytes, then tries to copy paste that a couple times, expecting the irrelevant parts will be masked, however, the MI only has the capability to mask entire words (this is the part I couldn't possibly foresee). Without any masking, the full doubleword is written, and in the case of crossing a doubleword boundary, it does these same shenanigans twice.
If you do get this working in Dolphin, I request some footage of the glitch in action on Dolphin for the progress report :)
Consider it done. I would be thrilled to represent yet another N64 community that discovered bizarre hardware behavior via our broken-ass VC games not working the same in Dolphin as on real Wiis (the SM64 floating platform was legendary, amazing work on that).
I'm excited to see this in action in Dolphin.
Me three :)
Updated by Rylie over 3 years ago
phire wrote:
I modified JosJuice's hardware test to be standalone for gamecube.
Confirmed that the DWW/QWW hardware behaviour is identical on gamecube. Though that doesn't guarantee it's actually exploitable, as the gamecube version of the N64 emulator might be different enough to not trigger an unaligned uncached write.
That's wonderful to know! The GameCube version of the N64 emulator was written by the same guy as VC was, so I expect it very likely would trigger that same write (barring any issues with the code being dynarec'd more frequently due to the GCN's smaller available memory) [Bounds checking is only done during dynarec phase, but if the code is already dynarec'd it will just let you write wherever you want.]
Updated by phire over 3 years ago
So it really is basically just like how I guessed it was in my initial video on the subject. The Broadway just bitshifts the word to be written by 1 to 3 bytes, then tries to copy paste that a couple times, expecting the irrelevant parts will be masked, however, the MI only has the capability to mask entire words (this is the part I couldn't possibly foresee). Without any masking, the full doubleword is written, and in the case of crossing a doubleword boundary, it does these same shenanigans twice.
The main other things I would personally add to the explanation:
- Processor Interface totally knows about unaligned, uncached accesses and will actually raise an error exception
- But release builds of the official gamecube/wii SDKs mask these exceptions, ignoring them.
- Debug builds will print out an error message with the error type and address.
- The doubleword boundary crossing behaviour is a fully documented and defined feature of Broadway.
- Despite erroring, PI will totally pass on an undefined write. It seems whenever there is any kind of error, it just throws open both mask bits.
That's any kind of error. Even a byte or half write on an aligned word boundary will cause PI to throw open both mask bits.
It's possible that opening both mask bits is a hardware bug, and that ArtX/ATI actually intended to close both mask bits to suppress the write. Or MI doesn't understand a masked write and flips back to writing the entire dword.
It's also possible that ArtX/ATI decided to do this for performance reasons: the error exception is intended to crash the program immediately, and there is no expectation that the program could or should continue. Supporting this argument, the Memory Controller supports "memory protection exceptions" which are documented to act in a similar way, throwing an exception but allowing the read/write to continue anyway.
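To make the model above concrete, here is a minimal Python sketch of the store behaviour as described in this thread. It is an illustration under assumptions, not Dolphin's code or a confirmed description of the silicon: `uncached_store` and `rotate_right32` are hypothetical names, memory is a plain big-endian `bytearray`, and the rotation amount is simply chosen so that the requested bytes end up at the requested address once the rotated word is copied into both halves of each doubleword touched.

```python
def rotate_right32(value, bits):
    """Rotate a 32-bit value right by `bits` (taken modulo 32)."""
    bits %= 32
    value &= 0xFFFFFFFF
    return ((value >> bits) | (value << (32 - bits))) & 0xFFFFFFFF


def uncached_store(mem, addr, data, size):
    """Toy model of a 1-, 2- or 4-byte uncached store into big-endian `mem`.

    Assumption: only an aligned 32-bit store is written as-is. Anything else
    is modelled as "both mask bits open", so every doubleword the store
    touches is overwritten with two copies of the rotated word.
    """
    data &= (1 << (size * 8)) - 1
    if size == 4 and addr % 4 == 0:
        mem[addr:addr + 4] = data.to_bytes(4, "big")
        return

    # Rotate so the stored bytes line up with their target address.
    rotated = rotate_right32(data, ((addr % 4) + size) * 8)
    word = rotated.to_bytes(4, "big")

    # One doubleword (DWW) normally, two (QWW) when the store crosses
    # an 8-byte boundary.
    first_dword = addr & ~7
    for dword in range(first_dword, addr + size, 8):
        mem[dword:dword + 4] = word
        mem[dword + 4:dword + 8] = word
```

For example, storing 0x11223344 at offset 1 of a zeroed buffer leaves `44 11 22 33 44 11 22 33` in the first doubleword under this model: the four intended bytes are at offsets 1 to 4, and the other four bytes are clobbered.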
Updated by marcan over 3 years ago
FWIW, the error logic being distinct from the actual interface logic is not uncommon. On the Apple M1, if you use <32bit writes to the UART peripheral, it throws an SError, but it works anyway (in particular the TX/RX registers are only 8 bits wide so it's fine for those).
At the point where you're throwing around bad memory ops, blocking them doesn't really matter much. The error is supposed to tell you that something is definitely wrong so you can fix it and make it not happen. That said, throwing open both mask bits seems like a poor default in this particular case.
Updated by segher over 3 years ago
phire wrote:
I'm not 100% sure who's doing the word replication behaviour.
I suspect it's actually Broadway, which just fills out the rest of the bus with duplicated bits. From the point of view of the x60 bus, those bits are undefined and it might have been easier internally to just replicate.
("60x bus"... it means 601, 603, 604).
The other option is that PI is replicating the words. Would seem a bit weird, since PI clearly knows that Broadway has messed up and done an unaligned write. It triggers an interrupt to let the code know it messed up, and even has address and cause registers with details of the error (Note to self, Batman: Vengeance is a debug build that contains code to print out these error registers)
Yes, it almost certainly is the 750, because it has to drive something
there, and just the rotated data is the simplest thing to do.
Btw, unaligned accesses on the 60x bus are perfectly fine... it is the
flipper / hwood that does not support this.
But despite notifying Broadway of the error, PI then lets the write pass through to the Memory Interface. I guess in their quest for performance, ArtX/ATI decided to not add hardware to actually block the request.
This is perfectly fine if this is meant as a debug feature (instead of as
something that will be used during normal operation).
Nice work everyone :-)
Updated by Rylie over 3 years ago
The other LNSRM researcher for the MM Community, Türkenheimer, just uploaded a video showing off some strange behavior he has encountered from uncached loads: https://youtu.be/3mIyHJ47IvA
Our community doesn't understand this behavior nearly as well, but it appears as though uncached loads are causing some sort of zero-fill behavior (the place he keeps warping to is Scene 0, Spawn 0, with Params 0, aka a 32-bit word of all zeros). Türk claims that alignment doesn't matter in the case of the uncached loads.
Admittedly, I understand the loads behavior far less than the stores behavior because I have personally worked on routes that encounter the stores behavior, but he is really the only one who has worked on a route that encounters the loads behavior. More investigation is obviously needed. I'm not even sure at this point whether Dolphin does or does not emulate the loads correctly. I just wanted to throw it out there though, because it likely is related.
In the meantime though, I would be perfectly satisfied if only the unaligned uncached store behavior were implemented in Dolphin, as that's the behavior that is actually useful for speedrunning. The uncached loads behavior Türk describes is actually a mild inconvenience.
Updated by JMC4789 over 3 years ago
I might have distracted JosJuice from fixing this earlier by tricking them into fixing a different issue first ;)
It's really interesting seeing all of these uncached things. It's nice having a REAL case to look at that uses them, rather than just hardware tests.
Updated by JosJuice over 3 years ago
If this odd behavior with loads happens even with aligned 32-bit loads, and the behavior does not happen in Dolphin, it's likely that it's the old problem of the CPU's data cache containing stale data. Dolphin does not emulate the CPU's data cache correctly because it would make the performance much worse.
In case you're not familiar with the problem: Let's say that you have some piece of memory at 0x80001000/0xC0001000 which starts out having the value 0. At some point, the CPU reads from 0x80001000 and retrieves the value 0. This has a side effect of placing an entry into the CPU's data cache that says that the memory at 0x80001000 contains the value 0 (and also further entries for some additional memory adjacent to 0x80001000, in case the program will want to access that later). Then at some later point, you write the value 1 to 0xC0001000. This causes the value in memory to change to 1, but if that entry in the data cache is still there, that entry does not get updated, because the write was not to 0x80001000. If you then try to read from 0x80001000 and the entry is still in the data cache, the CPU will find the entry in the data cache and return the value 0, skipping accessing main memory to save time, even though the value in main memory is now actually 1.
The intended way for game developers to avoid this kind of issue is to use the instruction dcbi to get rid of the entry from the data cache at some point. This is most likely not feasible in a speedrun setting unless you have ACE. The less reliable alternative would be to make the CPU access so many other addresses in between the first 0x80001000 read and the second 0x80001000 read that the data cache fills up and evicts the entry for 0x80001000. Though, I suppose it's likely that you don't have much control of this either in a speedrun setting.
Dolphin basically acts as if the data cache contains nothing, but while giving the emulated game the same performance as if the data cache contained every part of memory.
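As a rough illustration of the stale-entry scenario described above, here is a toy Python model. Everything in it is a simplifying assumption: `ToyMachine` and its method names are hypothetical, the cache holds single values rather than whole cache lines, and the two address mirrors are folded onto one physical address by masking off the top two bits.

```python
def physical(addr):
    """Fold the 0x8... (cached) and 0xC... (uncached) mirrors together."""
    return addr & 0x3FFFFFFF


class ToyMachine:
    def __init__(self):
        self.ram = {}     # physical address -> value in main memory
        self.dcache = {}  # physical address -> value held by the data cache

    def cached_read(self, addr):
        p = physical(addr)
        if p not in self.dcache:
            # Cache miss: fetch from RAM and remember the value.
            self.dcache[p] = self.ram.get(p, 0)
        return self.dcache[p]

    def uncached_write(self, addr, value):
        # Goes straight to RAM; an existing cache entry is NOT updated.
        self.ram[physical(addr)] = value

    def invalidate(self, addr):
        # Stand-in for dropping the cache entry, as dcbi would.
        self.dcache.pop(physical(addr), None)


m = ToyMachine()
first = m.cached_read(0x80001000)   # caches the initial value 0
m.uncached_write(0xC0001000, 1)     # RAM now holds 1
stale = m.cached_read(0x80001000)   # still served from the cache
m.invalidate(0x80001000)
fresh = m.cached_read(0x80001000)   # refetched from RAM
```

In this model `stale` is still 0 and `fresh` is 1, which is the mismatch described above.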
Updated by JosJuice over 3 years ago
Sorry, for some reason when coming up with the example scenario above I was thinking of an uncached store instead of an uncached load even though you had specified otherwise. There is an equivalent problem with uncached loads, though.
Let's once again have 0x80001000/0xC0001000 in memory contain the value 0 to begin with. Now the CPU writes the value 1 to 0x80001000. This does not immediately cause the value in memory to update – rather, an entry gets added to the data cache that says that the value 1 is at 0x80001000. If the CPU now were to read from 0xC0001000, it would get the value 0 back because the value written to the cache has not been written back to main memory yet.
This too can be avoided by either using dcbi or ensuring that enough other addresses have entered the data cache in between the write and the read.
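The same kind of toy model can show this second scenario, where the new value sits in the cache and main memory is the stale side. Again this is only a sketch under assumptions: `WriteBackToy` is a hypothetical name, the cache tracks single values instead of lines, and `write_back` stands in for anything that gets the cached value out to memory, whether an explicit cache operation or an eviction.

```python
def physical(addr):
    """Fold the cached and uncached mirrors onto one physical address."""
    return addr & 0x3FFFFFFF


class WriteBackToy:
    def __init__(self):
        self.ram = {}     # physical address -> value in main memory
        self.dcache = {}  # physical address -> value not yet written back

    def cached_write(self, addr, value):
        # The value lands in the cache only; RAM is left untouched for now.
        self.dcache[physical(addr)] = value

    def uncached_read(self, addr):
        # Bypasses the cache and sees whatever main memory holds.
        return self.ram.get(physical(addr), 0)

    def write_back(self, addr):
        # Push the cached value out to RAM, if there is one.
        p = physical(addr)
        if p in self.dcache:
            self.ram[p] = self.dcache[p]


m = WriteBackToy()
m.cached_write(0x80001000, 1)
before = m.uncached_read(0xC0001000)  # RAM has not been updated yet
m.write_back(0x80001000)
after = m.uncached_read(0xC0001000)   # now the write is visible
```

Here `before` is 0 and `after` is 1.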
Updated by Rylie over 3 years ago
I got a second opinion on this, and we think the uncached loads behavior is due to the VC emulator. We think it probably is just that C0/E0 is not a valid memory domain on the N64, so if the code isn't already dynarec'd, it will bounds check that load and realize it's invalid and replace with zeros. If it is already dynarec'd, it should work normally, since it will bypass the bounds check.
So the unaligned uncached stores behavior is likely separate. That's a hardware "feature", whereas the loads behavior is a VC emu "feature".
Updated by Rylie over 3 years ago
Though it does seem to always not be dynarec'd in Dolphin (uncached loads always invalid -> zero filled), whereas on a real Wii it's variable. That could be due to any number of things though. No cache emulation, using JIT64, who knows. Definitely not worth chasing down. We'll just avoid those uncached loads since they only cause us grief.
Updated by JosJuice over 3 years ago
I've figured out why the interrupts weren't behaving as expected in my hardware test. Before each test, I was writing zeroes to my buffer to get rid of what the previous test wrote. However, I was doing so using std::fill with a volatile u8* pointer, and this was triggering the DWW behavior. Changing the pointer type to volatile u32* fixed the problem, and now I'm getting a very consistent pattern: There is an interrupt after every write to uncached memory, even ones that are 32-bit and aligned. But in my original iteration of the hardware test (before we figured out what was going on with the interrupts), I can't recall 32-bit aligned writes ever causing hangs...
Updated by JosJuice over 3 years ago
- Status changed from Accepted to Fix pending
Here is a pull request that implements the DWW/QWW behavior: https://github.com/dolphin-emu/dolphin/pull/9964
A Windows build is available at https://dl.dolphin-emu.org/prs/e1/1e/pr-9964-dolphin-latest-x64.7z
Rylie: I would like you to test this build like you tested the previous build. There's just one difference in the testing instructions: The option in Dolphin.ini is now called AlignmentQuirks instead of AlignmentExceptions (since it turns out that alignment exceptions weren't involved after all).
Updated by phire over 3 years ago
JosJuice wrote:
and now I'm getting a very consistent pattern: There is an interrupt after every write to uncached memory, even ones that are 32-bit and aligned.
I can't replicate this on my setup. No interrupts on 32bit aligned, uncached writes. Interrupts for everything else.
Updated by Rylie over 3 years ago
Just ran through the entire credits warp setup (QWW), start to finish, and yep, it works on the new build. I tested DWW as well and that seems to work as expected too. I also got someone else to confirm DWW/QWW is working for them.
Updated by JMC4789 over 3 years ago
Would it be possible for us to get a video in a public place that we could link to in the progress report following this getting merged?
Updated by Rylie over 3 years ago
How much performance would it cost to enable this feature by default?
If the cost is low (less than 1% in most games) it might be worth enabling it by default. Do games even do that many uncached writes by default?
Like phire wrote in the PR, I too am curious what the performance cost really is. I highly doubt many games at all do unaligned uncached writes, because if they did, they'd be overwriting so much memory they didn't intend to due to DWW/QWW. That sounds like a recipe for crashing or glitchy behavior.
On the flip side, maybe there were a small handful of games that were aware of this behavior and exploited it to their advantage, and so some previous glitches caused by running in Dolphin could be fixed? (complete speculation)
Updated by Rylie over 3 years ago
JMC4789 wrote:
Would it be possible for us to get a video in a public place that we could link to in the progress report following this getting merged?
Yes, absolutely. I just wanted to do an offline test first. I'm gonna have to think how I want to make that video though, because doing such a precise setup on a controller on Dolphin without savestates (which I made heavy use of in my offline test) is a bit difficult. Maybe I'll use keyboard for the first part.
Updated by JMC4789 over 3 years ago
You could use movie recording to essentially "TAS" it.
Updated by Rylie over 3 years ago
Oh I just realized that Dolphin has a "virtual notches" feature. That's probably gonna fix the issue I was having doing the setup entirely on controller with no savestates. So I'll probably just do it live in that case :)
Updated by phire over 3 years ago
Rylie wrote:
I highly doubt many games at all do unaligned uncached writes, because if they did, they'd be overwriting so much memory they didn't intend to due to DWW/QWW. That sounds like a recipe for crashing or glitchy behavior.
The problem is we take the performance hit on both unaligned aligned uncached writes, we can't tell them apart ahead of time.
Though I do suspect the number of games doing many aligned uncached writes is low. It's a large performance hit on real hardware to not use cached writes (or something else, like the write gather pipe or locked L1 cache DMA)
On the flip side, maybe there were a small handful of games that were aware of this behavior and exploited it to their advantage, and so some previous glitches caused by running in Dolphin could be fixed?
That's my main motivation to wanting it on by default if the performance hit is low.
Updated by Rylie over 3 years ago
- File DolphinCreditsWarp.PNG DolphinCreditsWarp.PNG added
Finished my video demonstration, so while we're waiting for that to come out of the oven, here's a teaser photo.
The problem is we take the performance hit on both unaligned aligned uncached writes, we can't tell them apart ahead of time.
You can't do a % 4 once you see it's uncached? I mean sure that would hurt performance for aligned uncached writes, but less than slowmem would I assume? (I don't really understand the difference between fastmem and slowmem).
Updated by delroth over 3 years ago
The problem is we take the performance hit on both unaligned aligned uncached writes, we can't tell them apart ahead of time.
Can we not slowmem the uncached memory space?
Updated by Rylie over 3 years ago
Here's the video :)
Updated by JosJuice over 3 years ago
Rylie wrote:
You can't do a % 4 once you see it's uncached? I mean sure that would hurt performance for aligned uncached writes, but less than slowmem would I assume? (I don't really understand the difference between fastmem and slowmem).
At the point we see it's uncached, we are already in the slowmem handler and have taken a page fault. (Unless we introduce an explicit check in the fastmem handler for whether memory is uncached, but then we would be hurting performance for all stores, so that's a no-go.) I suppose it isn't impossible to add a % 4 in the slowmem handler and then based on the result either do a simple store or branch to Write_U*, but now we have the problem that this simple store inside the slowmem handler in itself may trigger a page fault and have to go through slowmem again, which... Well, it's probably possible to solve it somehow, but it's a ton of complexity for very little gain. So I don't think it's a workable idea.
delroth wrote:
The problem is we take the performance hit on both unaligned aligned uncached writes, we can't tell them apart ahead of time.
Can we not slowmem the uncached memory space?
That is what my pull request does. I think phire's comment was missing an "and" – what he meant is that all uncached writes take a performance hit, regardless of whether they are aligned. And that is precisely because we're not allowing fastmem to be used with the uncached memory space.
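As a rough picture of that trade-off, here is a small Python sketch of the decision order. It is not the JIT code from the pull request: `store_path` and `is_uncached` are hypothetical names, and the 0xC0000000 to 0xDFFFFFFF range is only an assumed stand-in for "the uncached memory space".

```python
def is_uncached(addr):
    # Assumed address range, for illustration only.
    return 0xC0000000 <= addr < 0xE0000000


def store_path(addr, size):
    """Return which path a store of `size` bytes to `addr` takes in this model."""
    if not is_uncached(addr):
        # No alignment or cacheability check is paid on the fast path.
        return "fastmem"
    # Every uncached store ends up here, aligned or not; only at this
    # point is it cheap to look at the alignment.
    if size < 4 or addr % 4 != 0:
        return "slowmem: duplicated doubleword write"
    return "slowmem: plain store"
```

So `store_path(0xC0001000, 4)` still reports the slow path even though the store is aligned, which is the cost being discussed, while `store_path(0x80001001, 4)` stays on the fast path.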
Updated by JosJuice over 3 years ago
- Status changed from Fix pending to Fixed
- Fixed in set to 5.0-14829