Emulator Issues #4144

closed

OpenCL texture decoding is global memory access bound

Added by zephiris about 13 years ago.

Status:
Won't fix
Priority:
Normal
Assignee:
-
Category:
GFX
% Done:

0%

Operating system:
N/A
Issue type:
Other
Milestone:
Regression:
No
Relates to usability:
No
Relates to performance:
Yes
Easy:
No
Relates to maintainability:
No
Regression start:
Fixed in:

Description

The OpenCL code for texture decoding spends more time doing global reads and writes than actual math, at least according to AMD KernelAnalyzer.

This is due to the vload/vstore functions. The spec and the Nvidia/ATI documentation appear to suggest that direct vector access requires alignment (which this code already has), but vload/vstore generate four times as many fetch/store operations per loop as the equivalent vectorized code.

Changing the file to use vector types in the pointers, and correcting the load/store/pointer math to account for that, leads to an ~8x speedup in functions like DecodeI4 through DecodeRGBA8_RGBA, and a ~17-25x speedup in decodeCMPRBlock_RGBA.
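The change can be illustrated with a minimal sketch. This is a hypothetical copy kernel, not the actual Dolphin decoder; the names `src` and `dst` are invented for illustration, and the exact instruction counts depend on the compiler:

```c
// Pattern being replaced: scalar pointers plus vload/vstore.
// On the AMD compiler discussed here, each vload4/vstore4 can be
// lowered into four scalar fetches/stores per loop iteration.
__kernel void copy_vload(__global const uchar *src, __global uchar *dst)
{
    int i = get_global_id(0);
    uchar4 v = vload4(i, src);   // may become 4 scalar fetches
    vstore4(v, i, dst);          // may become 4 scalar stores
}

// Vectorized pattern: put the vector type in the pointer itself and
// index in units of uchar4, so the compiler emits single vector
// loads/stores. This requires 4-byte alignment of src and dst, which
// the texture buffers already satisfy.
__kernel void copy_vector(__global const uchar4 *src, __global uchar4 *dst)
{
    int i = get_global_id(0);
    dst[i] = src[i];             // one vector fetch, one vector store
}
```

Note that every index that previously counted single uchars has to be divided by 4 once the pointer becomes `uchar4 *`; that is the "correcting the load/store/pointer math" part of the change.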

This alone wouldn't be that great, given that the functions already appear to be fast, but it eliminates the global memory bottleneck.

On Evergreen type cards, the writes are guaranteed to be 96 cycles minimum (creeping up to >280 on DecodeCMPR_RGBA).

This essentially hardcaps possible performance on any 5000 series or greater card, and makes them take considerably more time than any 4000 series card at the same logical tasks.

The number of ALU operations, fetches, and GPRs used is also significantly reduced. I've personally tested it live, but haven't dumped textures and verified exact matches yet.

For instance, on DecodeCMPR_RGBA with the existing OpenCL file, the estimated throughput is just 22M on the 5670, and caps at ~100M for the 5870, 6870, 6970, etc.; the Radeon 4870 manages 188M on the same function.

With the modified file, it's 83M on the 5670, 725M on the 5870, and 737M on the 6970; the 4870 remains unchanged (but still does 20% less work overall for the same results).

Intel, AMD, and Nvidia OpenCL performance/tuning guides all appear to strongly recommend this pattern (as well as constifying the correct things, avoiding implicit conversions, etc).

Basically all this does is enforce vectorization of the loads and stores, since the vload/vstore calls are not vectorized. No actual functionality, logic, or math difference should result, but feel free to double-check my math.

I've attached the full file, and a diff. If anyone wanted to run it through a code beautifier, no skin off my nose, I just put it together in a text editor on Windows. I don't have my normal tools with me. It gave some issues with tabs, apparently, but I can't see them.
