Emulator Issues #4144
closedOpenCL texture decoding is global memory access bound
0%
Description
The OpenCL code for texture decoding spends more time doing global reads and writes than actual math, at least according to AMD KernelAnalyzer.
This is due to the vload/vstore functions. The spec, and Nvidia/ATI documentation appear to suggest that it requires alignment as this does, but, it generates 4 times as many fetch/store operations per loop as the equivalent vectorized code.
Changing the file to use vectors in the pointers, and correcting any load/store/pointer math to account for that, leads to a ~8x speedup in functions like DecodeI4 through DecodeRGBA8_RGBA, and a ~17-25x speedup in decodeCMPRBlock_RGBA.
This alone wouldn't be that great, given that the functions already appear to be fast, but, it eliminates the GLobal Memory bottleneck.
On Evergreen type cards, the writes are guaranteed to be 96 cycles minimum (creeping up to >280 on DecodeCMPR_RGBA).
This essentially hardcaps possible performance on any 5000 series or greater card, and makes them take considerably more time than any 4000 series card at the same logical tasks.
The number of ALU operations, fetches, and GPRs used are also significantly reduced. I've personally tested it it live, but haven't dumped textures and made sure of exact matches yet.
But, for instance, on DecodeCMPR_RGBA, the estimated throughput is just 22M on 5670, caps at ~100M for 5870, 6870, 6970, etc with the existing OpenCL file. The Radeon 4870 on the same function 188M.
With the modified file, 83M on the 5670, 725M on the 5870, and 737M on the 6970, and the 4870 remains unchanged (but still does 20% less work overall for the same results).
Intel, AMD, and Nvidia OpenCL performance/tuning guides all appear to strongly recommend this pattern (as well as constifying the correct things, avoiding implicit conversions, etc).
Basically all this does is enforce vectorization of the loads and stores, since the function calls are not. No actual functionality, logic, or math difference should result, but feel free to double check my math.
I've attached the full file, and a diff. If anyone wanted to run it through a code beautifier, no skin off my nose, I just put it together in a text editor on Windows. I don't have my normal tools with me. It gave some issues with tabs, apparently, but I can't see them.
Updated by MofoMan2000 almost 14 years ago
Neato. As I recall the developers were considering killing OpenCL because of the relatively minor performance increase it offered. This could add to the small list of reasons to keep it around.
Personally, I think they should keep it around but remove the option of disabling it, since it doesn't seem to cause any problems. I believe OpenCL requires an external, but it's 743 KiB and it isn't that hard to maintain. Looks like you already did most of the work here anyway.
</>
Updated by zephiris almost 14 years ago
I can't really say that this would actually increase performance in the real world; it would probably be less hard on weaker cards with weaker CPUs. The low-mid range card (5670) goes from a speedbump into performing about as well as the top end cards -used- to, while the low-low end card (5450) barely changes at all because it seems to have trouble dealing with OpenCL kernels at all. Texture decoding IIRC was a far larger and more expensive issue when this code first came about, as far as a casual observer like me remembers.
I don't think removing the option to disable would be a good idea. Checking vs. the reference implementation is very useful, at a minimum, and occasionally bugs crop up in the drivers or SDKs where the only solution for certain hardware/OS combinations is to disable OpenCL until it's resolved. The drivers and SDKs have mostly improved over the last year or so, though.
I don't know if there's an "obvious" reason why vstore and company don't meaningfully vectorize anything. It also appears to generate an average of 150 extra control flow instructions (serialization, no doubt?). It would apparently be nearly as easy to parallelize the groups, convert to OpenCL images (much faster), but given that it's not bound anymore, should scale well on anything that doesn't mind the looping to begin with, and apparently isn't called frequently...apparently that'd be better left to more performance critical work.
File: slightly updated version, safely avoids more implicit conversions and makes local counter/address types consistent according to spec. Saves an additional 3 operations, 1 cycle per function average.
Updated by skidau almost 14 years ago
Enabling OpenCL disables the SSE decoders, so the option has to stay.
Maybe we should continue to optimise the OpenCL decoders for the machines without SSE?
Updated by MofoMan2000 almost 14 years ago
Is OpenCL simply not compatible with SSE in this implementation? And what version of SSE is diabled, and does it really make a big difference?
Updated by johnchabot13 almost 14 years ago
Personally an improvement like this would be great, I know personally that gpu wise dolphin with AA which is pretty much broken on all backends anyways, isn't stressful, but alot of game even on very fast cpus (mine 3.8 ghz i7 920) still has trouble keeping full speed, so anything to free up more cpu cycles would be great
Updated by skidau almost 14 years ago
There are three texture decoders in Dolphin - OpenCL, SSE and plain C++. They work kinda like plugins and are swapped in and out depending on the capabilities of the user's machine. Using the OpenCL decoder does not disable the other SSE optimisations in the rest of the emulator.
Updated by danialhorton almost 14 years ago
"Enabling OpenCL disables the SSE decoders, so the option has to stay."
not on PC platforms.
Updated by Florent.Castelli almost 14 years ago
If the patch works with Nvidia drivers, I think it's ok to commit it. Nice work on the profiling !
If I get the time, I'll test it at home on my 580 during next week.
Updated by hrydgard almost 14 years ago
zephiris, you've been added to committers.
Updated by zephiris almost 14 years ago
Less than fun week. I was hoping to try and get more testing in on the linux version, ran into unrelated system problems. The Windows version started misbehaving; entirely unrelated to tinkering or this.
I wasn't figuring on commit access, sorry for the delay; should I just go ahead and commit what I know at least works/compiles/runs well locally, or wait for the noted results from Nvidia cards? I don't have an nvidia card anymore, and don't know what the "new" toolkit can do, as far as inspecting things like a hunk of OpenCL code.
Updated by glennricster almost 14 years ago
I just tested this on linux with an nvidia video card. There are many corrupt textures with this texture decoder. It compiles and runs though.
Updated by zephiris almost 14 years ago
Is there any option of running the Nvidia CUDA debug stuff on linux, then? I'm not seeing any generic/AMD tools. Unless I messed up the math in a very specific way...only thing I could think of would be that it's allocating the memory unaligned?
Nvidia's own documentation says that clCreateBuffer and such guarantees alignment of at least 256 bytes when allocating device memory (and vstore/vload use the same alignment requirements).
Shouldn't matter, but 32-bit or 64-bit?
Updated by jroseff almost 14 years ago
@MofoMan: "I think they should keep it around but remove the option of disabling it, since it doesn't seem to cause any problems"
Perhaps on Windows. However, OpenCL causes fairly severe texture corruption on Mac OS X (at the very least, seems to do the same for Linux too).
Updated by danialhorton almost 14 years ago
"I just tested this on linux with an nvidia video card. There are many corrupt textures with this texture decoder. It compiles and runs though."
Nvidia fakes opencl with a poorly implemented opencl to cuda wrapper. alot of problems arise and opencl on nvidia is slow as hell because of it.
Updated by johnchabot13 almost 14 years ago
has any further work been made on this
Updated by zephiris almost 14 years ago
People said there were problems, but didn't expand on them, or verify that it was because of the patch.
I asked if anybody wanted to double-check the math to ensure it does precisely the same thing. It appears to equate to the same things in every case, and I've had no texture corruption since the initial submission.
It works perfectly locally on an AMD card on Windows and Linux, but follows AMD, Nvidia, and Intel guides to the letter. The primary change is that it takes the 'function' load/store and replaces it with equivalent in-line code. The function call-based code takes many branches, and does very 'shallow' fetches repeatedly, which is (in comparison) extremely slow.
It's possible to've caused something 'off by one', but, any previous instances caused obvious texture corruption whenever OpenCL was used, and so was corrected before submitting.
If there's something I can do to 'fix' it, let me know. Otherwise it's still waiting on people to test it appropriately. If there's a bug on Nvidia that's not actually caused by the code, or shouldn't be, then it should be reported to them to get it resolved. Nvidia and the spec guarantee same alignments; they're required to use vstore/vload in the first place. So functionally identical 'rolled out' code should not be causing any kind of difference except on performance.
Updated by skidau almost 14 years ago
If that is your conclusion, commit it into trunk after Dolphin 3.0 is released so the code gets wider testing.