Project

General

Profile

Actions

Emulator Issues #10904

closed

Mesa/radeonsi hangs during glClientWaitSync()

Added by dark_sylinc almost 7 years ago. Updated over 4 years ago.

Status:
Fixed
Priority:
Normal
Assignee:
-
% Done:

0%

Operating system:
Linux
Issue type:
Bug
Milestone:
Regression:
No
Relates to usability:
No
Relates to performance:
No
Easy:
No
Relates to maintainability:
No
Regression start:
Fixed in:

Description

Flashy click bait title to call your attention!

So... I installed latest Dolphin into my computer. It always freezes in the first 60 seconds into the game.

First of all, this is a race condition so I should post my HW & SW settings because it looks like I'm unlucky to have the combination that triggers this bug, because I literally could not find anything related to this problem.

Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
16GB RAM (1 stick)
GPU: Radeon RX 560 Series (POLARIS11 / DRM 3.19.0 / 4.14.11, LLVM 6.0.0)
Mesa 18.1.0-devel (git-183ce5e629)
Xubuntu 17.10
Kernel 4.14.11

I downloaded from source and recompiled Dolphin in full debug to get a better sense of what's going on.

Dolphin freezes inside the GL driver. Callstack:

1   syscall                                                                           syscall.S               38  0x7f5548e36a49 
2   sys_futex                                                                         futex.h                 38  0x7f550450a57c 
3   futex_wait                                                                        futex.h                 50  0x7f550450a57c 
4   do_futex_fence_wait                                                               u_queue.c               114 0x7f550450a57c 
5   _util_queue_fence_wait                                                            u_queue.c               129 0x7f550450aa79 
6   util_queue_fence_wait                                                             u_queue.h               160 0x7f55046e637d 
7   si_fence_finish                                                                   si_fence.c              195 0x7f55046e637d 
8   st_client_wait_sync                                                               st_cb_syncobj.c         115 0x7f550441dbfe 
9   client_wait_sync                                                                  syncobj.c               355 0x7f550438cfe3 
10  OGL::StreamBuffer::AllocMemory                                                    StreamBuffer.cpp        153 0xca8fd6       
11  OGL::BufferStorage::Map                                                           StreamBuffer.cpp        277 0xcab35f       
12  OGL::StreamBuffer::Map                                                            StreamBuffer.h          40  0xc962c2       
13  OGL::VertexManager::ResetBuffer                                                   VertexManager.cpp       98  0xcb2e58       
14  VertexManagerBase::PrepareForAdditionalData                                       VertexManagerBase.cpp   136 0xc5b5d8       
15  VertexLoaderManager::RunVertices                                                  VertexLoaderManager.cpp 280 0xc4723c       
16  OpcodeDecoder::Run<false>                                                         OpcodeDecoding.cpp      224 0xc1e103       
17  OpcodeDecoder::InterpretDisplayList                                               OpcodeDecoding.cpp      56  0xc1cd81       
18  OpcodeDecoder::Run<false>                                                         OpcodeDecoding.cpp      180 0xc1deb9       
19  Fifo::RunGpuLoop()::$_0::operator()() const                                       Fifo.cpp                351 0xc0caa2       
20  Common::BlockingLoop::Run<Fifo::RunGpuLoop()::$_0>(Fifo::RunGpuLoop()::$_0, long) BlockingLoop.h          135 0xc0bc18       

It's stalled at:

// if buffer is full
if (m_iterator + size >= m_size)
{
  //...
  // wait for space at the start
  for (int i = 0; i <= Slot(m_iterator + size); i++)
  {
    glClientWaitSync(m_fences[i], GL_SYNC_FLUSH_COMMANDS_BIT, GL_TIMEOUT_IGNORED); //---> here
    glDeleteSync(m_fences[i]);
  }
}

So began researching. My suspects were:

  1. CPU bug. I disabled HyperThreading. Bug still remained.
  2. AMD_pinned_memory. I disabled it (uses BufferStorage instead).
  3. Mesa driver bug. I updated to latest. This commit sounds horribly suspicious https://github.com/mesa3d/mesa/commit/50b06cbc10dbca1dfee89b529ba9b564cc4ea6f6 It's possible there's still some dependency going on.
  4. Dolphin corrupting memory. I could not test this theory.
  5. glxSwapBuffer behavior. I've seen Mesa act funny when it comes to glxSwapBuffer, and I saw the behavior changed on whether it was rendered in wxWidgets or its own window, so I toggled the option.
  6. glClientWaitSync being called from multiple threads. Doesn't appear to be the case but I'm not 100% certain.
  7. glClientWaitSync being used on a dangling fence. I tested this theory, there's no bug about this.
  8. StreamBuffer leaking too many fences. I tested this theory. IT IS LEAKING. But I've workarounded the issue and still the hangs remain.
  9. GPU hardware bug. I could not test this theory.
  10. Some bizarre behavior in StreamBuffer. It is hard to know what's going on (yes, I know it's a ring buffer, but it's complex).

So, I'm attaching a patch to test my findings, which is by no means meant to be included in the repo (it's just debug code, doesn't comply with coding conventions, etc).

By applying the patch I provided, we can see that:

void StreamBuffer::AllocMemory(u32 size)
{
  // insert waiting slots for used memory
  for (int i = Slot(m_used_iterator); i < Slot(m_iterator); i++)
  {
      assert( !m_fences_check[i] ); //---> this assert triggers

This means that the code is leaking memory. The constructor populates the array with fences. After mapping->unmapping enough memory, Slot(m_iterator) becomes 1 for the first time, while Slot(m_used_iterator) is still 0. This will cause the snippet of code to create another fence, leaking the old one. This is a leak.

We can also check that after playing for a while, hitting the button stop will trigger yet another one:
for (int i = 0; i < SYNC_POINTS; i++)
    assert( !m_fences_check[i] );

This means StreamBuffer::DeleteFences is not deleting all the fences. This is a leak.

But this does not explain my hangs. I workarounded the code (included in patch) so that it doesn't leak (unless I missed something), but it still hangs.

A few things I noticed:

  1. glClientWaitSync ALWAYS hangs on m_fences[7].
  2. The snippet before the waits (the one that creats more fences when the buffer is full) has (m_used_iterator >> 21u) == 8. Always.
  3. Printing the number of times we hit the potentially problematic code doesn't show an exact value. Sometimes it hangs at 49, another time at 52, sometimes at 99.

So I figgured this is probably a driver SW or HW timing bug.
Adding this snippet:

glFlush();

Between the creation of more fences and the waits fixed my hangs.

It fixes the problem. Or at least, I no longer get hangs 60 seconds into the game. Maybe now I'll get one hang per hour? I haven't played long enough to find out.

While this reeks of driver bug, the behavior of StreamBuffer is sketchy, to say the least. Leaks fences, calls glClientWaitSync too frequent (if glClientWaitSync[n+1] returned, then we can assume glClientWaitSync[n] is already signaled!). Honestly I would rewrite the whole class and see if the hangs are gone.
It wouldn't hurt to put some padding in the buffers to check if the buffers have been overflown (by prefilling the boundaries with magic words).

I'm a heavy user glClientWaitSync and I've never experienced hangs in my machine, which is what strickens me odd. I also tried making a stress test to see if I could find a freeze by quickly creating fences and consuming them, sequentially and out of order, but no luck :(

For the moment, my workaround fixes the leaks and hangs, and I can play.

But you may understand now why I chose the title I wrote.

I'm using dolphin from source fca9c28f38e46b78bf65c3b427a21560e5fb42b7


Files

mypatch.diff (4.49 KB) mypatch.diff Patch to flesh out the bugs dark_sylinc, 02/22/2018 03:18 AM
Actions #1

Updated by Stenzek almost 7 years ago

  • Subject changed from StreamBuffer is a disaster! to Mesa/radeonsi hangs during glClientWaitSync()
  • Status changed from New to Accepted
  • Operating system Linux added
  • Operating system deleted (N/A)

Thank you for the detailed investigation and write-up.

I can investigate the fence leaking situation when I get a chance, however, I think the underlying issue here is a bug in the radeonsi driver. I also found Dolphin was hanging in client_wait_sync when reading back staging textures in https://github.com/dolphin-emu/dolphin/pull/6367

Funnily enough, doing an explicit glFlush() here fixes the problem too. But obviously this will have a performance cost, so it's not something we want to be doing all the time, particularly during sub-buffer allocations. Looking into the driver source, for glClientWaitSync(), the GL_SYNC_FLUSH_COMMANDS_BIT flag is completely ignored, the driver flushes unconditionally here. But there appears to be a difference between the flush which happens in glClientWaitSync() vs glFlush(). I didn't end up investigating it any further, as the early flush can potentially hand us performance benefits when the access does not immediately follow the flush.

All that said, radv works quite well with Dolphin, so that's always an option in the meantime.

Actions #2

Updated by dark_sylinc almost 7 years ago

Yes! Your glFlush in the staging textures is what prompted me to do the same. My thoughts were "well... that doesn't make sense. The flush bit should do the same as glFlush so why is Dolphin here.... unless they found the same bug and I'm looking at a workaround".

Regarding your performance concerns:

glFlush isn't veeeeeeeeeery expensive because it just tells the driver to send everything it has queued to the GPU. Flushing excessively can cause threading synchronization overhead, prevent the driver from optimizing a batch by analyzing all the commands (I seriously doubt Mesa does this sort of thing, and other drivers seem to be moving away from doing this) and maybe some HW excessive synchronization too (e.g. PCIE sync packet overhead, kernel transitions etc).

But it's not thaaaaat expensive. Furthermore:

  1. glClientSync is supposed to flush already (in fact Mesa's source code says they ignore the flush flag and always assume it's set because many apps forget to set it).
  2. The glFlush is only when the buffer is full
  3. At least with Tales of Symphonia DotNW, when I was printf debugging, I saw that the buffer gets full like 3 times per second (time measure is eyeballed)
  4. If you're that seriously concerned, you'll gain a lot more performance by only calling glClientWaitSync on the last fence you need to delete. Right now you're calling glClientWaitSync on each one of them, sequentially.

Stenzek wrote:

All that said, radv works quite well with Dolphin, so that's always an option in the meantime.

Thanks! Last time I checked 6 months ago, radv would freeze my system quickly enough. But 6 months is eons ago for that driver.

Anyway, I'm already playing with my custom build. About 1 hour of gameplay and no hangs yet; unless having to disable the bloody screensaver because I'm not touching the keyboard or mouse counts as a bug :)

Actions #3

Updated by Stenzek almost 7 years ago

glFlush isn't veeeeeeeeeery expensive because it just tells the driver to send everything it has queued to the GPU. Flushing excessively can cause threading synchronization overhead, prevent the driver from optimizing a batch by analyzing all the commands (I seriously doubt Mesa does this sort of thing, and other drivers seem to be moving away from doing this) and maybe some HW excessive synchronization too (e.g. PCIE sync packet overhead, kernel transitions etc).

Yep. That's the only reason I even considered it as a workaround for staging textures, as the CPU readback is a slow path anyway.

glClientSync is supposed to flush already (in fact Mesa's source code says they ignore the flush flag and always assume it's set because many apps forget to set it).
That was my conclusion as well. But something does not appear to be working correctly here. I found the hang was deterministic in that it happened at the same point after booting each time, but didn't narrow down the exact state where it happens.

At least with Tales of Symphonia DotNW, when I was printf debugging, I saw that the buffer gets full like 3 times per second (time measure is eyeballed)

Three times per second is fine when we're pushing 30/60 frames per second. It only becomes an issue if we have to flush and sync on the same frame that the sub-buffer is used, as otherwise the implicit flush on swap takes care of it.

However, our vertex streaming code currently has some disadvantages. We require the full "maximum batch size" to be reserved at the beginning of every batch, even though almost all of the time, not all this storage is committed. IIRC we set the buffer to 2*maxsize. So as soon as we use half the buffer, we can no longer reserve maxsize, therefore wrapping around early and syncing. I've been wanting to address this for a while now, but been busy with other things.

If you're that seriously concerned, you'll gain a lot more performance by only calling glClientWaitSync on the last fence you need to delete. Right now you're calling glClientWaitSync on each one of them, sequentially.

Yep, this is a potential optimization. However, it depends on how the driver implements syncs, a "best case" here would simply return immediately as the intermediate sync objects have already been signaled.

Actions #4

Updated by Lithium64 almost 7 years ago

I can reproduce the issue, but it only happens with 18.1.0-devel here.

My specs are:

i3 4150 @ 3.50ghz
DDR3 12GB RAM
AMD R7 260X 2GB VRAM
Ubuntu 17.10
Kernel 4.15.4
Mesa 18.1.0-devel

Actions #5

Updated by dark_sylinc almost 7 years ago

I've been able to reproduce this bug in one of my samples (completely unrelated to Dolphin) which confirms this is a Mesa bug.

I've reported the bug to Mesa https://bugs.freedesktop.org/show_bug.cgi?id=105339 and sent them a repro sample. While it's not a trivial sample, it's far simpler than Dolphin.

Actions #6

Updated by Billiard26 over 4 years ago

  • Status changed from Accepted to Fixed

The above linked Mesa issue is marked as fixed.
Will re-open upon request.

Actions

Also available in: Atom PDF