Emulator Issues #10904

closed

Mesa/radeonsi hangs during glClientWaitSync()

Added by dark_sylinc about 6 years ago. Updated over 3 years ago.

Status:
Fixed
Priority:
Normal
Assignee:
-
% Done:

0%

Operating system:
Linux
Issue type:
Bug
Milestone:
Regression:
No
Relates to usability:
No
Relates to performance:
No
Easy:
No
Relates to maintainability:
No
Regression start:
Fixed in:

Description

Flashy clickbait title to catch your attention!

So... I installed the latest Dolphin on my computer. It always freezes within the first 60 seconds of the game.

First of all, this is a race condition, so I should post my HW & SW settings: it looks like I'm unlucky enough to have the exact combination that triggers this bug, because I literally could not find anything related to this problem.

Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
16GB RAM (1 stick)
GPU: Radeon RX 560 Series (POLARIS11 / DRM 3.19.0 / 4.14.11, LLVM 6.0.0)
Mesa 18.1.0-devel (git-183ce5e629)
Xubuntu 17.10
Kernel 4.14.11

I downloaded the source and recompiled Dolphin in full debug mode to get a better sense of what's going on.

Dolphin freezes inside the GL driver. Callstack:

1   syscall                                                                           syscall.S               38  0x7f5548e36a49 
2   sys_futex                                                                         futex.h                 38  0x7f550450a57c 
3   futex_wait                                                                        futex.h                 50  0x7f550450a57c 
4   do_futex_fence_wait                                                               u_queue.c               114 0x7f550450a57c 
5   _util_queue_fence_wait                                                            u_queue.c               129 0x7f550450aa79 
6   util_queue_fence_wait                                                             u_queue.h               160 0x7f55046e637d 
7   si_fence_finish                                                                   si_fence.c              195 0x7f55046e637d 
8   st_client_wait_sync                                                               st_cb_syncobj.c         115 0x7f550441dbfe 
9   client_wait_sync                                                                  syncobj.c               355 0x7f550438cfe3 
10  OGL::StreamBuffer::AllocMemory                                                    StreamBuffer.cpp        153 0xca8fd6       
11  OGL::BufferStorage::Map                                                           StreamBuffer.cpp        277 0xcab35f       
12  OGL::StreamBuffer::Map                                                            StreamBuffer.h          40  0xc962c2       
13  OGL::VertexManager::ResetBuffer                                                   VertexManager.cpp       98  0xcb2e58       
14  VertexManagerBase::PrepareForAdditionalData                                       VertexManagerBase.cpp   136 0xc5b5d8       
15  VertexLoaderManager::RunVertices                                                  VertexLoaderManager.cpp 280 0xc4723c       
16  OpcodeDecoder::Run<false>                                                         OpcodeDecoding.cpp      224 0xc1e103       
17  OpcodeDecoder::InterpretDisplayList                                               OpcodeDecoding.cpp      56  0xc1cd81       
18  OpcodeDecoder::Run<false>                                                         OpcodeDecoding.cpp      180 0xc1deb9       
19  Fifo::RunGpuLoop()::$_0::operator()() const                                       Fifo.cpp                351 0xc0caa2       
20  Common::BlockingLoop::Run<Fifo::RunGpuLoop()::$_0>(Fifo::RunGpuLoop()::$_0, long) BlockingLoop.h          135 0xc0bc18       

It's stalled at:

// if buffer is full
if (m_iterator + size >= m_size)
{
  //...
  // wait for space at the start
  for (int i = 0; i <= Slot(m_iterator + size); i++)
  {
    glClientWaitSync(m_fences[i], GL_SYNC_FLUSH_COMMANDS_BIT, GL_TIMEOUT_IGNORED); //---> here
    glDeleteSync(m_fences[i]);
  }
}

So I began researching. My suspects were:

  1. CPU bug. I disabled HyperThreading. The bug remained.
  2. AMD_pinned_memory. I disabled it (BufferStorage is used instead).
  3. Mesa driver bug. I updated to the latest. This commit sounds horribly suspicious: https://github.com/mesa3d/mesa/commit/50b06cbc10dbca1dfee89b529ba9b564cc4ea6f6 It's possible there's still some dependency issue going on.
  4. Dolphin corrupting memory. I could not test this theory.
  5. glXSwapBuffers behavior. I've seen Mesa act funny when it comes to glXSwapBuffers, and the behavior changed depending on whether Dolphin rendered into the wxWidgets window or its own window, so I toggled that option.
  6. glClientWaitSync being called from multiple threads. This doesn't appear to be the case, but I'm not 100% certain.
  7. glClientWaitSync being used on a dangling fence. I tested this theory; there's no bug here.
  8. StreamBuffer leaking too many fences. I tested this theory. IT IS LEAKING. But I worked around the issue and the hangs still remain.
  9. GPU hardware bug. I could not test this theory.
  10. Some bizarre behavior in StreamBuffer. It is hard to know what's going on (yes, I know it's a ring buffer, but it's complex).

So, I'm attaching a patch to test my findings. It is by no means meant to be included in the repo (it's just debug code and doesn't comply with the coding conventions, etc).

By applying the patch I provided, we can see that:

void StreamBuffer::AllocMemory(u32 size)
{
  // insert waiting slots for used memory
  for (int i = Slot(m_used_iterator); i < Slot(m_iterator); i++)
  {
      assert( !m_fences_check[i] ); //---> this assert triggers

This means the code is leaking sync objects. The constructor populates the array with fences, and after mapping and unmapping enough memory, Slot(m_iterator) becomes 1 for the first time while Slot(m_used_iterator) is still 0. This snippet then creates another fence for slot 0 without deleting the old one. This is a leak.

We can also check that, after playing for a while, hitting the Stop button triggers yet another one:
for (int i = 0; i < SYNC_POINTS; i++)
    assert( !m_fences_check[i] );

This means StreamBuffer::DeleteFences is not deleting all the fences. This is a leak.

But this does not explain my hangs. I worked around the code (included in the patch) so that it doesn't leak (unless I missed something), but it still hangs.

A few things I noticed:

  1. glClientWaitSync ALWAYS hangs on m_fences[7].
  2. The snippet before the waits (the one that creates more fences when the buffer is full) always has (m_used_iterator >> 21u) == 8.
  3. Printing the number of times we hit the potentially problematic code doesn't show a consistent value: sometimes it hangs at 49, other times at 52 or 99.

So I figured this is probably a driver SW or HW timing bug.
Adding this snippet:

glFlush();

between the creation of the new fences and the waits fixed my hangs.

It fixes the problem. Or at least, I no longer get hangs 60 seconds into the game. Maybe now I'll get one hang per hour? I haven't played long enough to find out.

While this reeks of a driver bug, the behavior of StreamBuffer is sketchy, to say the least. It leaks fences and calls glClientWaitSync too frequently (if glClientWaitSync on fence n+1 returned, we can assume fence n is already signaled!). Honestly, I would rewrite the whole class and see if the hangs are gone.
It wouldn't hurt to add some padding to the buffers to check whether they have been overflowed (by prefilling the boundaries with magic words).

I'm a heavy user of glClientWaitSync and I've never experienced hangs on my machine, which strikes me as odd. I also tried writing a stress test to see if I could trigger a freeze by quickly creating fences and consuming them, sequentially and out of order, but no luck :(

For the moment, my workaround fixes the leaks and hangs, and I can play.

But you may understand now why I chose the title I did.

I'm using Dolphin built from source at commit fca9c28f38e46b78bf65c3b427a21560e5fb42b7.


Files

mypatch.diff (4.49 KB) mypatch.diff Patch to flesh out the bugs dark_sylinc, 02/22/2018 03:18 AM
