Emulator Issues #10904
Closed - Mesa/radeonsi hangs during glClientWaitSync()
Description
Flashy clickbait title to call your attention!
So... I installed the latest Dolphin on my computer. It always freezes within the first 60 seconds of a game.
First of all, since this looks like a race condition, I should post my HW & SW setup: it seems I'm unlucky enough to have the combination that triggers this bug, because I literally could not find anything else related to this problem.
Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
16GB RAM (1 stick)
GPU: Radeon RX 560 Series (POLARIS11 / DRM 3.19.0 / 4.14.11, LLVM 6.0.0)
Mesa 18.1.0-devel (git-183ce5e629)
Xubuntu 17.10
Kernel 4.14.11
I downloaded the source and recompiled Dolphin in full debug mode to get a better sense of what's going on.
Dolphin freezes inside the GL driver. Callstack:
1 syscall syscall.S 38 0x7f5548e36a49
2 sys_futex futex.h 38 0x7f550450a57c
3 futex_wait futex.h 50 0x7f550450a57c
4 do_futex_fence_wait u_queue.c 114 0x7f550450a57c
5 _util_queue_fence_wait u_queue.c 129 0x7f550450aa79
6 util_queue_fence_wait u_queue.h 160 0x7f55046e637d
7 si_fence_finish si_fence.c 195 0x7f55046e637d
8 st_client_wait_sync st_cb_syncobj.c 115 0x7f550441dbfe
9 client_wait_sync syncobj.c 355 0x7f550438cfe3
10 OGL::StreamBuffer::AllocMemory StreamBuffer.cpp 153 0xca8fd6
11 OGL::BufferStorage::Map StreamBuffer.cpp 277 0xcab35f
12 OGL::StreamBuffer::Map StreamBuffer.h 40 0xc962c2
13 OGL::VertexManager::ResetBuffer VertexManager.cpp 98 0xcb2e58
14 VertexManagerBase::PrepareForAdditionalData VertexManagerBase.cpp 136 0xc5b5d8
15 VertexLoaderManager::RunVertices VertexLoaderManager.cpp 280 0xc4723c
16 OpcodeDecoder::Run<false> OpcodeDecoding.cpp 224 0xc1e103
17 OpcodeDecoder::InterpretDisplayList OpcodeDecoding.cpp 56 0xc1cd81
18 OpcodeDecoder::Run<false> OpcodeDecoding.cpp 180 0xc1deb9
19 Fifo::RunGpuLoop()::$_0::operator()() const Fifo.cpp 351 0xc0caa2
20 Common::BlockingLoop::Run<Fifo::RunGpuLoop()::$_0>(Fifo::RunGpuLoop()::$_0, long) BlockingLoop.h 135 0xc0bc18
It's stalled at:
// if buffer is full
if (m_iterator + size >= m_size)
{
  // ...
  // wait for space at the start
  for (int i = 0; i <= Slot(m_iterator + size); i++)
  {
    glClientWaitSync(m_fences[i], GL_SYNC_FLUSH_COMMANDS_BIT, GL_TIMEOUT_IGNORED); // ---> hangs here
    glDeleteSync(m_fences[i]);
  }
}
So I began researching. My suspects were:
- CPU bug. I disabled HyperThreading. Bug still remained.
- AMD_pinned_memory. I disabled it (which makes Dolphin use BufferStorage instead).
- Mesa driver bug. I updated to latest. This commit sounds horribly suspicious https://github.com/mesa3d/mesa/commit/50b06cbc10dbca1dfee89b529ba9b564cc4ea6f6 It's possible there's still some dependency going on.
- Dolphin corrupting memory. I could not test this theory.
- glXSwapBuffers behavior. I've seen Mesa act funny when it comes to glXSwapBuffers, and I saw the behavior change depending on whether it was rendered in the wxWidgets main window or its own window, so I toggled that option.
- glClientWaitSync being called from multiple threads. Doesn't appear to be the case but I'm not 100% certain.
- glClientWaitSync being used on a dangling fence. I tested this theory; that's not happening.
- StreamBuffer leaking too many fences. I tested this theory. IT IS LEAKING. But I worked around the issue and the hangs still remain.
- GPU hardware bug. I could not test this theory.
- Some bizarre behavior in StreamBuffer. It is hard to know what's going on (yes, I know it's a ring buffer, but it's complex).
So, I'm attaching a patch to test my findings, which is by no means meant to be included in the repo (it's just debug code, doesn't comply with coding conventions, etc).
By applying the patch I provided, we can see that:
void StreamBuffer::AllocMemory(u32 size)
{
  // insert waiting slots for used memory
  for (int i = Slot(m_used_iterator); i < Slot(m_iterator); i++)
  {
    assert(!m_fences_check[i]); // ---> this assert triggers
This means the code is leaking fences. The constructor populates the array with fences. After mapping and unmapping enough memory, Slot(m_iterator) becomes 1 for the first time while Slot(m_used_iterator) is still 0, which makes this snippet create another fence for slot 0 without deleting the one already stored there. This is a leak.
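A minimal sketch of what I mean, using the member names from the snippet above (this is not the actual patch): delete whatever fence is already stored in the slot before overwriting it.

for (int i = Slot(m_used_iterator); i < Slot(m_iterator); i++)
{
  if (m_fences[i] != nullptr)
    glDeleteSync(m_fences[i]);  // release the fence the constructor (or a previous wrap) left here
  m_fences[i] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
}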
We can also check that after playing for a while, hitting the Stop button will trigger yet another one:
for (int i = 0; i < SYNC_POINTS; i++)
  assert(!m_fences_check[i]);
This means StreamBuffer::DeleteFences is not deleting all the fences. This is a leak.
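For reference, a leak-free DeleteFences could look roughly like this (a sketch, assuming unused slots hold nullptr; this is not the code in my patch):

void StreamBuffer::DeleteFences()
{
  for (int i = 0; i < SYNC_POINTS; i++)
  {
    if (m_fences[i] != nullptr)
    {
      glDeleteSync(m_fences[i]);
      m_fences[i] = nullptr;
    }
  }
}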
But this does not explain my hangs. I worked around the leaks (the fix is included in the patch) so that it no longer leaks (unless I missed something), but it still hangs.
A few things I noticed:
- glClientWaitSync ALWAYS hangs on m_fences[7].
- The snippet before the waits (the one that creates more fences when the buffer is full) has (m_used_iterator >> 21u) == 8. Always.
- Printing the number of times we hit the potentially problematic code doesn't show a consistent value: sometimes it hangs at 49, another time at 52, sometimes at 99.
So I figured this is probably a driver SW or HW timing bug.
Adding this snippet between the creation of the new fences and the waits fixed my hangs:

glFlush();
It fixes the problem. Or at least, I no longer get hangs 60 seconds into the game. Maybe now I'll get one hang per hour? I haven't played long enough to find out.
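For clarity, this is roughly where the workaround sits in AllocMemory (a sketch based on the snippet quoted earlier, not the exact patched code):

if (m_iterator + size >= m_size)
{
  // ... insert fences for the memory used so far ...

  glFlush();  // workaround: push the queued commands to the GPU before waiting on the fences

  // wait for space at the start
  for (int i = 0; i <= Slot(m_iterator + size); i++)
  {
    glClientWaitSync(m_fences[i], GL_SYNC_FLUSH_COMMANDS_BIT, GL_TIMEOUT_IGNORED);
    glDeleteSync(m_fences[i]);
  }
}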
While this reeks of a driver bug, the behavior of StreamBuffer is sketchy, to say the least: it leaks fences and calls glClientWaitSync too frequently (if the wait on fence n+1 returned, then we can assume fence n is already signaled!). Honestly, I would rewrite the whole class and see if the hangs are gone.
It wouldn't hurt to put some padding in the buffers to check whether they have been overrun (by pre-filling the boundaries with magic words).
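Something like this is what I have in mind (a hypothetical sketch; GUARD_WORD, GUARD_COUNT and the helpers are made up for illustration, u32 is Dolphin's usual typedef, and it needs <cassert>):

static constexpr u32 GUARD_WORD = 0xDEADBEEF;
static constexpr u32 GUARD_COUNT = 4;

// pre-fill a boundary region when the buffer range is mapped
static void WriteGuard(u32* guard)
{
  for (u32 i = 0; i < GUARD_COUNT; i++)
    guard[i] = GUARD_WORD;
}

// verify the boundary region when the buffer range is unmapped
static void CheckGuard(const u32* guard)
{
  for (u32 i = 0; i < GUARD_COUNT; i++)
    assert(guard[i] == GUARD_WORD && "stream buffer overrun detected");
}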
I'm a heavy user of glClientWaitSync and I've never experienced hangs on my machine, which is what strikes me as odd. I also tried writing a stress test to see if I could trigger a freeze by quickly creating fences and consuming them, sequentially and out of order, but no luck :(
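The stress test was roughly of this shape (a sketch, not the exact code I ran; it assumes a current GL 3.2+ context and that the GL declarations come from whatever loader you use):

#include <cassert>
#include <vector>

void FenceStressTest(int iterations, int fences_per_batch)
{
  std::vector<GLsync> fences(fences_per_batch, nullptr);
  for (int iter = 0; iter < iterations; iter++)
  {
    // create a batch of fences with a little dummy work in between
    for (int i = 0; i < fences_per_batch; i++)
    {
      glClear(GL_COLOR_BUFFER_BIT);
      fences[i] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    }
    // consume them out of order: newest first, then the rest
    for (int i = fences_per_batch - 1; i >= 0; i--)
    {
      const GLenum result = glClientWaitSync(fences[i], GL_SYNC_FLUSH_COMMANDS_BIT, GL_TIMEOUT_IGNORED);
      assert(result != GL_WAIT_FAILED);
      glDeleteSync(fences[i]);
      fences[i] = nullptr;
    }
  }
}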
For the moment, my workaround fixes the leaks and hangs, and I can play.
But you may understand now why I chose the title I wrote.
I'm using Dolphin built from source at fca9c28f38e46b78bf65c3b427a21560e5fb42b7
Updated by Stenzek over 6 years ago
- Subject changed from StreamBuffer is a disaster! to Mesa/radeonsi hangs during glClientWaitSync()
- Status changed from New to Accepted
- Operating system Linux added
- Operating system deleted (N/A)
Thank you for the detailed investigation and write-up.
I can investigate the fence leaking situation when I get a chance; however, I think the underlying issue here is a bug in the radeonsi driver. I also found Dolphin hanging in client_wait_sync when reading back staging textures, in https://github.com/dolphin-emu/dolphin/pull/6367
Funnily enough, doing an explicit glFlush() here fixes the problem too. But obviously this has a performance cost, so it's not something we want to be doing all the time, particularly during sub-buffer allocations. Looking into the driver source, for glClientWaitSync() the GL_SYNC_FLUSH_COMMANDS_BIT flag is completely ignored; the driver flushes unconditionally there. But there appears to be a difference between the flush that happens in glClientWaitSync() and the one from glFlush(). I didn't end up investigating it any further, as the early flush can potentially yield performance benefits when the access does not immediately follow the flush.
All that said, radv works quite well with Dolphin, so that's always an option in the meantime.
Updated by dark_sylinc over 6 years ago
Yes! Your glFlush in the staging textures is what prompted me to do the same. My thoughts were: "well... that doesn't make sense. The flush bit should do the same as glFlush, so why is Dolphin doing this here... unless they found the same bug and I'm looking at a workaround".
Regarding your performance concerns:
glFlush isn't veeeeeeeeeery expensive because it just tells the driver to send everything it has queued to the GPU. Flushing excessively can cause threading synchronization overhead, prevent the driver from optimizing a batch by analyzing all the commands (I seriously doubt Mesa does this sort of thing, and other drivers seem to be moving away from doing it), and maybe cause some excessive HW synchronization too (e.g. PCIe sync packet overhead, kernel transitions, etc.).
But it's not thaaaaat expensive. Furthermore:
- glClientWaitSync is supposed to flush already (in fact Mesa's source code says they ignore the flush flag and always assume it's set, because many apps forget to set it).
- The glFlush happens only when the buffer is full
- At least with Tales of Symphonia DotNW, when I was printf debugging, I saw that the buffer gets full like 3 times per second (time measure is eyeballed)
- If you're that seriously concerned, you'll gain a lot more performance by only calling glClientWaitSync on the last fence you need to delete. Right now you're calling glClientWaitSync on each one of them, sequentially. A sketch of what I mean follows below.
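Roughly what I mean, sketched against the AllocMemory snippet from my report (not actual Dolphin code):

// waiting on the last fence implies all the earlier ones are signaled,
// because they were submitted in order on the same context
const int last = Slot(m_iterator + size);
glClientWaitSync(m_fences[last], GL_SYNC_FLUSH_COMMANDS_BIT, GL_TIMEOUT_IGNORED);
for (int i = 0; i <= last; i++)
  glDeleteSync(m_fences[i]);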
Stenzek wrote:
All that said, radv works quite well with Dolphin, so that's always an option in the meantime.
Thanks! Last time I checked, 6 months ago, radv would freeze my system fairly quickly. But 6 months is eons ago for that driver.
Anyway, I'm already playing with my custom build. About 1 hour of gameplay and no hangs yet; unless having to disable the bloody screensaver because I'm not touching the keyboard or mouse counts as a bug :)
Updated by Stenzek over 6 years ago
glFlush isn't veeeeeeeeeery expensive because it just tells the driver to send everything it has queued to the GPU. Flushing excessively can cause threading synchronization overhead, prevent the driver from optimizing a batch by analyzing all the commands (I seriously doubt Mesa does this sort of thing, and other drivers seem to be moving away from doing it), and maybe cause some excessive HW synchronization too (e.g. PCIe sync packet overhead, kernel transitions, etc.).
Yep. That's the only reason I even considered it as a workaround for staging textures, as the CPU readback is a slow path anyway.
glClientWaitSync is supposed to flush already (in fact Mesa's source code says they ignore the flush flag and always assume it's set, because many apps forget to set it).
That was my conclusion as well. But something does not appear to be working correctly here. I found the hang was deterministic, in that it happened at the same point after booting each time, but I didn't narrow down the exact state in which it happens.
At least with Tales of Symphonia DotNW, when I was printf debugging, I saw that the buffer gets full like 3 times per second (time measure is eyeballed)
Three times per second is fine when we're pushing 30/60 frames per second. It only becomes an issue if we have to flush and sync on the same frame that the sub-buffer is used, as otherwise the implicit flush on swap takes care of it.
However, our vertex streaming code currently has some disadvantages. We require the full "maximum batch size" to be reserved at the beginning of every batch, even though almost all of the time not all of this storage is committed. IIRC we set the buffer to 2*maxsize, so as soon as we use half the buffer we can no longer reserve maxsize, therefore wrapping around early and syncing. I've been wanting to address this for a while now, but have been busy with other things.
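To illustrate with made-up numbers (the real sizes live in the StreamBuffer source; treat this as a sketch of the behaviour, not our actual code):

constexpr u32 MAX_BATCH_SIZE = 4 * 1024 * 1024;   // hypothetical maximum batch size
constexpr u32 BUFFER_SIZE = 2 * MAX_BATCH_SIZE;   // the "2*maxsize" mentioned above

// every batch reserves the full maximum up front, so any iterator past the
// halfway point can no longer fit a reservation and forces an early wrap + sync
bool MustWrapAndSync(u32 iterator)
{
  return iterator + MAX_BATCH_SIZE >= BUFFER_SIZE;
}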
If you're that seriously concerned, you'll gain a lot more performance by only calling glClientWaitSync on the last fence you need to delete. Right now you're calling glClientWaitSync on each one of them, sequentially.
Yep, this is a potential optimization. However, it depends on how the driver implements syncs; in the best case, the intermediate waits would simply return immediately, as those sync objects have already been signaled.
Updated by Lithium64 over 6 years ago
I can reproduce the issue, but it only happens with 18.1.0-devel here.
My specs are:
i3-4150 @ 3.50GHz
DDR3 12GB RAM
AMD R7 260X 2GB VRAM
Ubuntu 17.10
Kernel 4.15.4
Mesa 18.1.0-devel
Updated by dark_sylinc over 6 years ago
I've been able to reproduce this bug in one of my samples (completely unrelated to Dolphin), which confirms this is a Mesa bug.
I've reported the bug to Mesa https://bugs.freedesktop.org/show_bug.cgi?id=105339 and sent them a repro sample. While it's not a trivial sample, it's far simpler than Dolphin.
Updated by Billiard26 about 4 years ago
- Status changed from Accepted to Fixed
The above linked Mesa issue is marked as fixed.
Will re-open upon request.