[PR #3404] video_core: Readback optimizations #3441

Open
opened 2026-02-27 22:03:42 +03:00 by kerem · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/shadps4-emu/shadPS4/pull/3404
Author: @raphaelthegreat
Created: 8/8/2025
Status: 🔄 Open

Base: main ← Head: readback-opts


📝 Commits (10+)

  • 5f8fcc8 Basic CPU fence detection
  • a37def0 liverpool: Patch self modifying DispatchDirect to DispatchIndirect
  • 7c7b610 Remove some rewind flushes
  • a16a681 buffer_cache: Preemptive downloads of frequently flushed pages
  • 43ba88b vk_scheduler: Remove pending op pop in wait
  • 9fb27a5 amdgpu: Split fence detection code to header
  • 5ab3d8e pm4_cmds: Use bit_cast
  • 34892a2 buffer_cache: Attempt to fix readback off
  • 8ed2c07 video_core: Add fence detection setting
  • f86dc6c Avoid crash on fence detection

📊 Changes

17 files changed (+481 additions, -139 deletions)

View changed files

📝 CMakeLists.txt (+1 -0)
📝 src/common/config.cpp (+9 -0)
📝 src/common/config.h (+13 -2)
📝 src/emulator.cpp (+3 -1)
➕ src/video_core/amdgpu/fence_detector.h (+148 -0)
📝 src/video_core/amdgpu/liverpool.cpp (+65 -10)
📝 src/video_core/amdgpu/pm4_cmds.h (+13 -13)
📝 src/video_core/buffer_cache/buffer_cache.cpp (+132 -67)
📝 src/video_core/buffer_cache/buffer_cache.h (+36 -19)
📝 src/video_core/buffer_cache/memory_tracker.h (+23 -2)
📝 src/video_core/buffer_cache/range_set.h (+10 -3)
📝 src/video_core/buffer_cache/region_definitions.h (+1 -0)
📝 src/video_core/buffer_cache/region_manager.h (+13 -2)
📝 src/video_core/page_manager.h (+1 -0)
📝 src/video_core/renderer_vulkan/vk_rasterizer.cpp (+12 -11)
📝 src/video_core/renderer_vulkan/vk_rasterizer.h (+1 -0)
📝 src/video_core/renderer_vulkan/vk_scheduler.cpp (+0 -9)

📄 Description

Note: there is still some cleanup to do; this is not final code. It may also cause freezes/bugs (hopefully not, though).

General idea

If you were to write a Vulkan program that generates some data on the GPU and later want to access that data on the host, you need a sync operation to ensure the GPU has finished its work. That sync operation is called a fence, because it makes the CPU wait for the GPU.

In a similar fashion, the guest also has to use fences before reading GPU data on the host. Because of its unified memory, it doesn't have to copy that data to host-visible memory, but it still must sync with a fence operation before accessing it. The emulator can rely on the promise that the guest will neither overwrite nor read any GPU-generated data before a fence operation has given it the opportunity to sync with the GPU.

So the main idea of the PR is to attempt to detect these fence operations in the PM4 command stream and defer read-protecting GPU-modified pages until right before them. If a page read/write happens before a fence, it passes through without a flush, because the emulator can be sure the guest cannot access the data yet.
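The deferral can be sketched as follows. This is a minimal illustration of the idea, not the PR's actual API; the struct and method names are invented for the example:

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Hypothetical sketch: GPU writes only mark pages dirty; the actual
// read protection is applied in one batch right before a fence packet.
struct DeferredProtector {
    std::set<std::size_t> dirty_pages;     // pages modified by the GPU
    std::set<std::size_t> protected_pages; // pages actually read-protected

    void OnGpuWrite(std::size_t page) {
        // Do NOT protect yet; the guest cannot legally read this page
        // until a fence gives it the chance to sync.
        dirty_pages.insert(page);
    }

    void OnFence() {
        // Right before the fence, read-protect every pending dirty page
        // in one batch instead of protecting on every GPU write.
        protected_pages.insert(dirty_pages.begin(), dirty_pages.end());
        dirty_pages.clear();
    }
};
```

Any guest access between the GPU write and the fence hits an unprotected page and therefore triggers no flush.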

The aforementioned detection is not trivial though, because there is little indication to the emulator about what sync operations are used for. AMD GCN uses labels (4- or 8-byte memory addresses) which "signal" packets write to and "wait" packets can wait on, or which the host can poll in the case of a fence. All that means it's close to impossible to detect the actual wait on a fence. Instead, this PR implements a prepass that scans the input command lists and tries to "guess" which packets act as fences and which do not.

The packets that can write labels are EventWriteEos, EventWriteEop, WriteData (GFX) and ReleaseMem (ACB). There is a simple heuristic: if the label of a signal packet is waited on by the GPU with a WaitRegMem packet, it is considered a GPU->GPU sync (something akin to a pipeline barrier). It is in fact possible for a label to act as both a fence and a pipeline barrier, so the heuristic can fail, but that is very unlikely.
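The heuristic amounts to a two-pass scan over the stream. A minimal sketch, assuming simplified packet structs (the opcode names follow the PM4 packets mentioned above, but the layout and function names here are illustrative):

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Opcodes relevant to the fence-guessing prepass.
enum class Op { EventWriteEop, EventWriteEos, WriteData, ReleaseMem, WaitRegMem };

struct Packet {
    Op op;
    std::uint64_t label_addr; // address the packet writes to or waits on
};

// A label that is signaled but never waited on by a WaitRegMem packet in the
// same stream is guessed to be a CPU fence the host will poll.
std::set<std::uint64_t> GuessCpuFenceLabels(const std::vector<Packet>& stream) {
    std::set<std::uint64_t> signaled;
    std::set<std::uint64_t> gpu_waited;
    for (const Packet& p : stream) {
        if (p.op == Op::WaitRegMem) {
            gpu_waited.insert(p.label_addr); // GPU->GPU sync, like a barrier
        } else {
            signaled.insert(p.label_addr); // candidate fence label
        }
    }
    std::set<std::uint64_t> fences;
    for (std::uint64_t addr : signaled) {
        if (!gpu_waited.count(addr)) {
            fences.insert(addr); // no GPU wait found: assume the host polls it
        }
    }
    return fences;
}
```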

Deferring read protections allows for some powerful optimizations, two of which are implemented here.

Rewind indirect patch

The rewind packet has a misleading name, as it implies execution going back somewhere; what it actually does is tell the CP to drop all prefetched packets and reload them from memory. It is almost exclusively used for command list self-modification (from a compute shader, for example), which Driveclub does for a dozen dispatches at the start of the frame. It uses a compute shader to patch the dimensions of a DispatchDirect PM4 packet before executing it. Why it didn't use an indirect dispatch, I'm not sure.

Before readbacks, this led to launching a dispatch with garbage (often huge) dimensions, freezing the GPU. Readbacks, on the other hand, fixed it by read-protecting the memory and flushing the modified data. That works but is very expensive: around a dozen flushes, one per patched dispatch.

Deferring read protections allows the emulator to reach the rewind packet before a flush. Then the emulator can scan the pending GPU ranges inside the current command list, check that they are dispatch dimension patches, and convert the direct dispatch into an indirect dispatch. The latter reads its dimensions from GPU buffers, avoiding the need to flush memory.
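The conversion itself is conceptually simple. A hedged sketch with a simplified packet representation (real PM4 packets are encoded register streams, and the names here are invented for illustration):

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for a dispatch packet: either direct, with inline
// dimensions, or indirect, fetching dimensions from a GPU address.
struct DispatchPacket {
    bool indirect = false;
    std::uint32_t dim_x = 0, dim_y = 0, dim_z = 0; // used when direct
    std::uint64_t args_gpu_addr = 0;               // used when indirect
};

// gpu_dims_addr: the GPU buffer where the compute shader wrote the patched
// dimensions. After this, the GPU reads x/y/z from that buffer at execute
// time, so the CPU never needs to flush the patched memory back to the guest.
void PatchToIndirect(DispatchPacket& pkt, std::uint64_t gpu_dims_addr) {
    pkt.indirect = true;
    pkt.args_gpu_addr = gpu_dims_addr;
    // The stale direct dimensions are simply ignored once indirect is set.
}
```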

Preemptive buffer downloads

This optimization is a lot more general than the above one and should affect all games that rely on readbacks. It is possible to implement it without CPU fence detection, but fence detection makes the implementation more efficient because the emulator can batch the preemptive download copies upon reaching the fence.

The idea is to track how many times a page has been flushed and, if that number exceeds a threshold, copy any future GPU data inside it to the host asynchronously. If a flush is triggered, the GPU thread simply has to wait for the GPU to finish and copy the data to guest memory. The advantage here is the reduction or (in certain cases) elimination of the wait time, as the GPU has likely had time to catch up to the host. In addition, once the wait has been done, the rest of the preemptive downloads become "free" and need no further stalls.
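The tracking side can be sketched as a per-page flush counter. This is an illustration of the heuristic only; the class name and threshold value are assumptions, not taken from the PR:

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>

// Hypothetical sketch: count flushes per page; once a page crosses the
// threshold, future GPU writes to it are downloaded to host preemptively.
class FlushTracker {
public:
    static constexpr int kThreshold = 3; // illustrative value, not the PR's

    void OnFlush(std::size_t page) {
        ++flush_count_[page];
    }

    bool ShouldPreemptDownload(std::size_t page) const {
        auto it = flush_count_.find(page);
        return it != flush_count_.end() && it->second > kThreshold;
    }

private:
    std::unordered_map<std::size_t, int> flush_count_;
};
```

The second TODO item below (revoking preempt status) would correspond to decaying or resetting these counters when a page stops being flushed.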

  • [x] Add config option to control aggressiveness of fence detection
  • [ ] Revoke preempt status from pages that stop being flushed

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
