Emulator Issues #6880
closedLLVM JIT in the VertexLoader
Added by nodchip almost 11 years ago.
0%
Description
Purpose of code changes on this branch:
The objective of this patch is a proof of concept to introduce LLVM into Dolphin.
When reviewing my code changes, please focus on:
Could you test if this patch works in environments other than Windows 64-bit platform?
After the review, I'll merge this branch into:
I don't have any ideas about this. Could you suggest a right branch?
About three years ago, I and other developers tried to implement a new CPU emulator with LLVM but we gave up. After that, I tried to optimize VertexLoader with LLVM because VertexLoader was one of performance bottle necks. Unfortunatelly, this patch decreases the fps (40 fps -> 33 fps in the first scene of "The Legend of Zelda: Twilight Princess"). But I think that we can optimize the code and improve the performance.
How to enable LLVM JIT in Vertexloader:
- Download and install CMake. (CMake - Cross Platform Make http://www.cmake.org/)
- Download LLVM source code. (The LLVM Compiler Infrastructure Project http://llvm.org/)
- Build LLVM in the Release mode with "Visual Studio 12 Win64" mode.
- Add the "include" directories both in the LLVM src folder and the LLVM build folder to the additional include folders of the VideoCommon.
- Add the "lib\Release" in the LLVM build folder to the additional library folder of the VideoCommon.
- Add the LLVM*.lib files in the "lib\Release" folder to the additional libraries of the VideoCommon.
- Apply this patch.
- Comment out "#define USE_JIT" in VertexLoader.cpp.
- Uncomment out "#define USE_LLVM_JIT" in VertexLoader.cpp.
About files:
- VertexLoader.cpp: The USE_LLVM_JIT enables the LLVM JIT in Vertexloader. USE_LLVM_JIT_DEBUG enables to call the entry point of the JIT directly. This is useful to debug VertexLoader_LLVMEntryPoint.cpp.
- VertexLoader_LLVMEntryPoint.cpp: This file contains almost all the code to convert the GC/Wii vertex data format into the DirectX/OpenGL vertex data format. We need to compile this source code into LLVM assembler by clang with -emit-llvm option, include the LLVM assembler into VertexLoader.cpp, partially evaluated with parameters which are passed to VertexLoader, optimize with Function/ModulePassManager and compile into x86.
- VertexLoader_LLVMEntryPoint_s.cpp: This file contains a part of code to generate a string object which represents the LLVM assembler above. This file is included from VertexLoader.cpp.
- to_cpp_string.py: This python script converts the LLVM assembler file into VertexLoader_LLVMEntryPoint_s.cpp.
- llvm_compile.sh: This shell script compiles VertexLoader_LLVMEntryPoint.cpp and generates VertexLoader_LLVMEntryPoint_s.cpp. Please use this shell script once you update VertexLoader_LLVMEntryPoint.cpp.
Updated by delroth almost 11 years ago
Yet another huge an mostly unreviewable patch. Yay.
First problem I have with this: LLVM is not a magic bullet, and its optimization algorithms might be good for the general case of "I'm a compiler and I want to compile this IR that roughly matches natural source code" but I've yet to see anyone using LLVM successfully for dynamic recompilation and code generation. Especially for vertex loaders, it sounds like a case where the input problem space is small enough that we should be able to do some smarter algorithms than LLVM.
Second problem I have with this: it's likely to be a huge dependency that adds bloat to the binary. Care to share the size of Dolphin.exe (+ added DLLs if you don't build statically) with your patch? I would assume it's over 10MB of overhead (uncompressed), aka. doubling the size of our binaries.
Third problem I have with this: "I think we can optimize the code" but no ideas given as to how. Could you share the ideas you have to optimize this code and why this has not be done prior to submitting this patch? To me it sounds like you weren't able to find a way to make LLVM as performant as our hand-tuned JIT for this case, and it makes perfect sense to me (see 1.).
Fourth problem: compilation time with LLVM. Could you share metrics of time spent compiling vertex loaders with LLVM vs. Jit64? If compilation time is larger than 10ms for some vertex loaders, how are you planning to avoid stuttering when playing and having to build a new vertex loader?
This is a -0.5 from me in its current form, but could definitely go up to a +0 if you actually have considered these 4 problems and have potential solutions.
Updated by degasus almost 11 years ago
- Priority changed from High to Normal
As I've suggested the llvm usage here, I will review it.
delroth 1st: We could optimize this by hand, but atm we don't do it. For VertexLoader, we should be able to optimize between different attributes (load tex coords + vertex position at once) and between different vertices. So it's mostly a parallelize (eg SSE) work. Also register usage would be needed as some attributes are 24bit width. So it's mostly common use for compilers ;)
2nd: true
3rd: "hand-tuned JIT" - our "jit" generates just a list of calls. Each of this calls just store one attribute of one vertex.
4th: We only compile up to 100 vertex loaders per game. There is no Jit64 vertex loader, did you mixed this up with ppc? But the stuttering issue is valid. But let's think about this if there is a big speedup which legitimate 2nd (eg sw + threaded compiler).
nodchip: I'll look at your patch in the next days. I bet we'll find the bug for the performance issue.
Updated by delroth almost 11 years ago
What do you mean by "There is no Jit64 vertex loader"? VertexLoader::CompileVertexTranslator sure looks like something that uses our Jit64 infrastructure to me.
Updated by NeoBrainX almost 11 years ago
I'm curious what the motivation behind using LLVM was. Particularly, are there any reasons for developing a new JIT from scratch instead of improving the current JIT or is it more of a "because we can" thing?
Also, what's the point of this PR? Do you expect us to merge an unfinished JIT without any advantage (from a user-perspective) over the current JIT into Dolphin or is this just a request-for-comments/testing?
I personally currently see no reason to merge this into Dolphin since it has no practical advantage over the currently available options and it's not clear if any further work is going to happen once it's merged. Additionally, delroth explained thoroughly that chances are our current JIT could be optimized much better than anything LLVM is likely to get at anytime soon.
Updated by nodchip almost 11 years ago
In summary, I will drop this patch if I can not improve the performance.
@delroth
Thank you for your comments.
First problem I have with this: LLVM is not a magic bullet, and its optimization algorithms might be good for the general case of "I'm a compiler and I want to compile this IR that roughly matches natural source code" but I've yet to see anyone using LLVM successfully for dynamic recompilation and code generation. Especially for vertex loaders, it sounds like a case where the input problem space is small enough that we should be able to do some smarter algorithms than LLVM.
As wickmarkus86 said, it is like a compiler work to optimize VertexLoader.
Second problem I have with this: it's likely to be a huge dependency that adds bloat to the binary. Care to share the size of Dolphin.exe (+ added DLLs if you don't build statically) with your patch? I would assume it's over 10MB of overhead (uncompressed), aka. doubling the size of our binaries.
The binary sizes of Dolphin.exe were:
- Original JIT: 12,343,296 Bytes
- LLVM JIT: 19,687,424 Bytes
The binary size was increased by 60%. I agree that this is a big problem and don't have any solutions. We would be better to drop this patch if there are no performance improvements or other merits.
Third problem I have with this: "I think we can optimize the code" but no ideas given as to how. Could you share the ideas you have to optimize this code and why this has not be done prior to submitting this patch? To me it sounds like you weren't able to find a way to make LLVM as performant as our hand-tuned JIT for this case, and it makes perfect sense to me (see 1.).
As wickmarkus86 said, the current VertexLoader JIT generates the list of calls. My optimization ideas are:
- Split LLVM modules for each VertexLoader to avoid recompilations. In the current patch, all the JIT code are recompiled when a new VertexLoader is instanciated.
- Cache the function pointer for the compiled code once a VertexLoader is compiled. The current implementation does not cache because the function pointer changes each time all the JIT code are recompiled.
- Introduce SSE intrinsics to parallelize vector operations. LLVM can vectorize some code but hand-write may help it.
- Simplify the code so that LLVM can automatically vectorize operations.
I will implement some of these ideas in a week and measure the performance.
Fourth problem: compilation time with LLVM. Could you share metrics of time spent compiling vertex loaders with LLVM vs. Jit64? If compilation time is larger than 10ms for some vertex loaders, how are you planning to avoid stuttering when playing and having to build a new vertex loader?
The compilation time in the current implementation is very poor. The game completely stops each time a new VertexLoader is initiated. This is because all the JIT code are contained in an LLVM module and whole the module is recompiled each time a new VertexLoader is initiated. We need to spilit the module for each VertedLoader as I described above.
@wickmarkus86
Thank you for comments and supporting the patch. I will brush up this patch, measure the performance and get them back to this thread.
@NeoBrainX
Thank you for your comments.
I'm curious what the motivation behind using LLVM was. Particularly, are there any reasons for developing a new JIT from scratch instead of improving the current JIT or is it more of a "because we can" thing?
The original motivation was to improve the performance of VertexLoader. Below is the performance profile of the current Vertexloader JIT by CodeXL (A performance profile provided by AMD). The performance was profiled using the opening scene of "The Legend of Zelda: Twilight Princess".
Module, Samples, % of Hotspot Samples
unknown module pid (7532), 45935, 59.42739487
Dolphin.exe, 22964, 29.70916939
msvcr120.dll, 8330, 10.77675438
msvcp120.dll, 52, 0.06727386
SDL2.dll, 5, 0.00646864
Function, Module, Timer
Pos_ReadIndex_Float_SSSE3<unsigned short,1>(void), Dolphin.exe, 727
`anonymous namespace'::Normal_Index<unsigned short,short,1>::function(void), Dolphin.exe, 694
TexCoord_ReadIndex_Short2_SSE4(void), Dolphin.exe, 609
Color_ReadIndex16_32b_8888(void), Dolphin.exe, 201
This shows that about 77234 samples were picked by the profile and about 2231 samples are related to the VertexLoader during the opening scene. About 3% of the CPU were used for VertexLoader. If we can improve the performance of the VertexLoader, we might be able to increase the fps by 1. This might not worth to challenge from the view point of the time-cost efficiency. But this is not a paid work so I don't mind it. Though I will drop this patch if it does not worth to the increase of the binary size.
I wrote this patch as the first experiment and measured the performance. As I wrote in the first comment, the performance was decreased. I will brush up this patch and would like to say "because we can increase the performance". If I can not increase the performance, I will drop this patch.
Also, what's the point of this PR? Do you expect us to merge an unfinished JIT without any advantage (from a user-perspective) over the current JIT into Dolphin or is this just a request-for-comments/testing?
There were two points. The first point was a request-for-testing in variouse environemnt because my development environemnt is only Windows 64-bit. The second was a request-for-comments.
I personally currently see no reason to merge this into Dolphin since it has no practical advantage over the currently available options and it's not clear if any further work is going to happen once it's merged. Additionally, delroth explained thoroughly that chances are our current JIT could be optimized much better than anything LLVM is likely to get at anytime soon.
I agree with your opinion. As I described above, I will drop this patch if I can not improve the performance than the current JIT.
Thanks,
Updated by degasus almost 11 years ago
iirc "The Legend of Zelda: Twilight Princess" is bottlenecked by GP and lots of BP writes, so you won't see a big speedup. My favor game to show vertexloader performance issues is Super Mario Galaxy (both 1&2): http://markus.members.selfnet.de/dolphin/smg2-perf.png
About your patch:
- "Cache the function pointer" this one is required as we often call the vertexloader with 4 vertices. So you would generate more overhead than the vertexloader itself
- I don't like that you've forked the vertex loader in VertexLoader_LLVMEntryPoint.cpp. I see that you don't want to write llvm code by hand, but why didn't you modify the current vertex loader to meet your requirements? (eg still create a wrapper with the current api, so it could be used easyly)
- for performance, you should try to experiment with the keyword restrict. As g_pVideoData and VertexManager::s_pCurBufferPointer won't overlapp, it will allow better optimizations.
Updated by nodchip almost 11 years ago
iirc "The Legend of Zelda: Twilight Princess" is bottlenecked by GP and lots of BP writes, so you won't see a big speedup. My favor game to show vertexloader performance issues is Super Mario Galaxy (both 1&2): http://markus.members.selfnet.de/dolphin/smg2-perf.png
I see. I will use Super Mario Galaxy for the further performance measurements.
- I don't like that you've forked the vertex loader in VertexLoader_LLVMEntryPoint.cpp. I see that you don't want to write llvm code by hand, but why didn't you modify the current vertex loader to meet your requirements? (eg still create a wrapper with the current api, so it could be used easyly)
Because the current vertex loader accesses to global variables in the host memory space. There is a way to access global variables in the host memory space from LLVM JITted code. But it is needed to write llvm code by hand using IRBuilder (http://d.hatena.ne.jp/nodchip/20100811/1281495581 Japanese page). I didn't choose this way because I didn't want to write llvm code directly. It's an option to copy global variables in the host memory space to global variables in the LLVM memory space at the beginning of the JITted code and copy back at the end. But it adds an overhead and I'm not sure whether it is worth to do. If there is a better way to access global variables, I would like to try it.
- for performance, you should try to experiment with the keyword restrict. As g_pVideoData and VertexManager::s_pCurBufferPointer won't overlapp, it will allow better optimizations.
This is a great tip. I will try it during the performance tuning.
Updated by delroth almost 11 years ago
Re: performance measurements, it might be interesting to have a few
homebrew programs that stress-test vertex loaders in different ways
(very high throughput, high number of vertex formats, etc.). That
might be more useful during development than benchmarks based on real
games.
Re: LLVM and globals, I'm 90% sure there is a good solution to that
but my LLVM skills are kind of rusty :) I'll check your code and some
of my LLVM experiments tonight and get back to you if I don't forget.
Updated by degasus almost 11 years ago
Profiled with a current build (VSH, nvidia threaded opts, disabled efb): http://markus.members.selfnet.de/dolphin/smg2-perf2.png - 40% in the to replaced vertex loading loop.
nodchip: Do you think hand written llvm code would be that much work? As all loaders supports different input formats /su|float/, they may be merged easyly.
Updated by nodchip almost 11 years ago
Profiled with a current build (VSH, nvidia threaded opts, disabled efb): http://markus.members.selfnet.de/dolphin/smg2-perf2.png - 40% in the to replaced vertex loading loop.
This is a great investigation. It will encourage my development. Thank you so much!
nodchip: Do you think hand written llvm code would be that much work? As all loaders supports different input formats /su|float/, they may be merged easyly.
I think that it is a heavy task to write llvm code by hand. Because llvm code is basically an assemble code and we would need to write thousand lines of assembler code. VertexLoader_LLVMEntryPoint_s.cpp contains 3.9k lines and we would need to write almost the same amount code. I think that it is not a good idea to write llvm code by hand.
Updated by nodchip almost 11 years ago
@delroth
Sorry, I missed your comment.
Re: performance measurements, it might be interesting to have a few homebrew programs that stress-test vertex loaders in different ways (very high throughput, high number of vertex formats, etc.). That might be more useful during development than benchmarks based on real games.
It is a good idea. I don't have enough knowledge to create homebrew programs so I hope someone develops them.
Re: LLVM and globals, I'm 90% sure there is a good solution to that but my LLVM skills are kind of rusty :) I'll check your code and some of my LLVM experiments tonight and get back to you if I don't forget.
Thank you so much!
Updated by degasus almost 11 years ago
btw globals: I'd suggest to hard code all of this scale factors into the vertex loader directly, so we won't have to copy them before.
I also don't expect ~3.9k lines af code as all vertex loader functions do almost the same (but with different parameters)
Updated by nodchip almost 11 years ago
wickmarkus86: It is hard at least for me to write llvm code by hand. I would like to begin with the current way and switch to hand-written llvm code if I have time.
Updated by delroth almost 11 years ago
Re: LLVM and globals: there is a way to attach a global variable binding to the ExecutionEngine. See this: https://bitbucket.org/delroth/dolphin-llvm-dsp/src/87f0c317887a1b1b23efa50b5e1c5a3744373d95/Source/Core/Core/Src/DSP/DSPLLVMEmitter.cpp?at=dsp-llvm#cl-496
Also, random feedback on your code, please do not use std::auto_ptr (BAD). Use C++11 std::unique_ptr instead.
Updated by nodchip almost 11 years ago
Also, random feedback on your code, please do not use std::auto_ptr (BAD). Use C++11 std::unique_ptr instead.
I see. I will fix it.
Updated by nodchip almost 11 years ago
This is a progress report. I attached a new patch file with the following changes.
- Split LLVM modules for each VertexLoader to avoid recompilations.
- Changed to cache the function pointer for each compiled code once a VertexLoader is compiled.
- Introduce SSE intrinsics to parallelize vector operations.
- Changed std::auto_ptr to std::unique_ptr.
As the result, the fps increased by 1-2 in the first scene of "The Legend of Zelda: Twilight Princess". But there were some garbages only in USE_LLVM_JIT mode and the fps decreased in "Super Mario Galaxy" by 2-3. I'm investigating these issues.
I also tried the following things but didn't included in the attached patch.
- Add "restrict" keyword.
- Simplify the code so that LLVM can automatically vectorize operations.
The reason for the first one is that clang does not support restrict keyword. The reason for the second is that LLVM 3.3 emits bswap intrinsic for byte swap operations but it is not vectorized by the SLP Vectorizer.
Updated by degasus almost 11 years ago
One of the bug for the broken rendering is because we strip the frac from the vertex loader uid: https://code.google.com/p/dolphin-emu/source/browse/Source/Core/VideoCommon/Src/VertexLoader.h#35
Did you changed your VertexLoader_LLVMEntryPoint.cpp between your patches? At least you haven't attached it this time.
Updated by nodchip almost 11 years ago
One of the bug for the broken rendering is because we strip the frac from the vertex loader uid: https://code.google.com/p/dolphin-emu/source/browse/Source/Core/VideoCommon/Src/VertexLoader.h#35
Thank you so much. I will check it.
Did you changed your VertexLoader_LLVMEntryPoint.cpp between your patches? At least you haven't attached it this time.
I forgot to add it. I attached in this comment.
Updated by degasus almost 11 years ago
And here (gcc), your USE_LLVM_JIT_DEBUG is about 10% slower than USE_LLVM_JIT - are you sure this function is optimized for the current use?
btw, USE_LLVM_JIT_DEBUG itself is like 20% slower than USE_JIT. So in fact, our current jit is almost as slow as the generic one? wtf...
Updated by nodchip almost 11 years ago
Could you show the fps for each mode? The speed for each mode in my environment was below.
The Legend of Zelda: Twilight Princess - The first scene
USE_JIT: 41 fps
USE_LLVM_JIT: 43 fps
USE_LLVM_JIT_DEBUG: 37 fps
Super Mario Galaxy - Tutorial
USE_JIT: 60 fps
USE_LLVM_JIT: 61 fps
USE_LLVM_JIT_DEBUG: 56 fps
USE_LLVM_JIT was fastest and USE_JIT and USE_LLVM_JIT_DEBUG followed. USE_LLVM_JIT_DEBUG should be slower than other modes. Because it directly calls the LLVM entry point from the host without LLVM execution engine. It would be slow because parameters are not partially evaluated and not folded. Could you use USE_LLVM_JIT_DEBUG only to debug the entry point with debuggers?
OS: Windows 7 64-bit
CPU: Core i7 2.3 GHz
Memory: 8GB
GPU: Intel(R) HD Graphics 4000
Compiler: Visual Studio 12 (Visual Studio Express 2013 for Windows Desktop)
Updated by nodchip almost 11 years ago
This is a progress report. I'm still investigating the issue that only the USE_LLVM_JIT mode shows garbages on the screen. I check the difference of the output between the USE_JIT mode and the USE_LLVM_JIT mode. But there are no differences. The output binaries completely matched.
In the case of "The Legend of Zelda: Twilight Princess", the texture of the link's face and hair looks bad. I guess the scale of the texture coordinates is wrong. Please see the screenshot attached in this comment (2013-12-20-15-54.jpg).
In the case of "Super Mario Galaxy", the stars are not shown on their actual coordinates and they are shown in the background of the scene. I guess the scale of the polygon is wrong. Please see the screenshot attached in this comment (2013-12-20-15-55.jpg).
Do you have any hints for the fix?
Updated by degasus almost 11 years ago
Have you disabled the frac bits: http://pastie.org/8564953
Updated by nodchip almost 11 years ago
wickmarkus86: It fixed the issue! Thank you so much! I did't understand your comment in #18 and had not been disabled the frac bits. I will cleanup the patch and attach here in a few days.
Updated by nodchip almost 11 years ago
I cleaned up the code and attached a new patch in this comment. The current performance are:
The Legend of Zelda: Twilight Princess - The first scene
USE_JIT: 42 fps
USE_LLVM_JIT: 45 fps
USE_LLVM_JIT_DEBUG: 39 fps
Super Mario Galaxy - Tutorial
USE_JIT: 68 fps
USE_LLVM_JIT: 72 fps
USE_LLVM_JIT_DEBUG: 56 fps
The compilation speed is slow and games pause often. This can be resolved by background jit compile. But I have not implemented yet. Could you take another look and make a decision whether this patch can be merged to the main branch?
Updated by degasus almost 11 years ago
I don't know what I do wrong, but I still get broken rendering here on linux. I've compiled it with this patch: http://pastie.org/8567791
Updated by NeoBrainX almost 11 years ago
"Could you take another look and make a decision whether this patch can be merged to the main branch?"
Regardless of the quality of your patch I still don't think we should merge it just for the heck of having a LLVM backend (particularly having in mind delroth's point about basically doubling application size due to the LLVM runtime/libs).
Now, if there was an actual considerable speedup it would be somewhat different...
Updated by delroth almost 11 years ago
I'd say considerable speedup and no stuttering, which should be even more difficult to attain with LLVM.
Still interested in seeing the impact in some non-synthetic benchmarks though.
Updated by nodchip almost 11 years ago
wickmarkus86:
Thank you for testing in linux. I'm sorry but I don't have the developing environment for linux and can't debug the issue.
NeoBrainX, delroth:
Do you think 42fps -> 45fps is a considerable speedup? If not, I will drop this patch because it is hard for me to improve the speed with this strategy.
Updated by NeoBrainX almost 11 years ago
Nope, I wouldn't say 7% is a considerable speedup.
Updated by nodchip almost 11 years ago
- Status changed from New to Won't fix
I see. I will drop this patch and close this thread. Thank you so much for reviewing my patch.
Updated by NeoBrainX almost 11 years ago
And thank you for the understanding :)