mirror of
https://github.com/shadps4-emu/shadPS4.git
synced 2026-04-24 23:36:00 +03:00
[PR #3404] video_core: Readback optimizations #3441
📋 Pull Request Information
Original PR: https://github.com/shadps4-emu/shadPS4/pull/3404
Author: @raphaelthegreat
Created: 8/8/2025
Status: 🔄 Open
Base: main ← Head: readback-opts

📝 Commits (10+)
- 5f8fcc8 Basic CPU fence detection
- a37def0 liverpool: Patch self modifying DispatchDirect to DispatchIndirect
- 7c7b610 Remove some rewind flushes
- a16a681 buffer_cache: Preemptive downloads of frequently flushed pages
- 43ba88b vk_scheduler: Remove pending op pop in wait
- 9fb27a5 amdgpu: Split fence detection code to header
- 5ab3d8e pm4_cmds: Use bit_cast
- 34892a2 buffer_cache: Attempt to fix readback off
- 8ed2c07 video_core: Add fence detection setting
- f86dc6c Avoid crash on fence detection

📊 Changes
17 files changed (+481 additions, -139 deletions)
- 📝 CMakeLists.txt (+1 -0)
- 📝 src/common/config.cpp (+9 -0)
- 📝 src/common/config.h (+13 -2)
- 📝 src/emulator.cpp (+3 -1)
- ➕ src/video_core/amdgpu/fence_detector.h (+148 -0)
- 📝 src/video_core/amdgpu/liverpool.cpp (+65 -10)
- 📝 src/video_core/amdgpu/pm4_cmds.h (+13 -13)
- 📝 src/video_core/buffer_cache/buffer_cache.cpp (+132 -67)
- 📝 src/video_core/buffer_cache/buffer_cache.h (+36 -19)
- 📝 src/video_core/buffer_cache/memory_tracker.h (+23 -2)
- 📝 src/video_core/buffer_cache/range_set.h (+10 -3)
- 📝 src/video_core/buffer_cache/region_definitions.h (+1 -0)
- 📝 src/video_core/buffer_cache/region_manager.h (+13 -2)
- 📝 src/video_core/page_manager.h (+1 -0)
- 📝 src/video_core/renderer_vulkan/vk_rasterizer.cpp (+12 -11)
- 📝 src/video_core/renderer_vulkan/vk_rasterizer.h (+1 -0)
- 📝 src/video_core/renderer_vulkan/vk_scheduler.cpp (+0 -9)

📄 Description
Note: there is still some cleanup to do, this is not final code. It can also cause freezes/bugs (hope not, though).
General idea
If you were to write a Vulkan program that generates some data on the GPU and later wanted to access that data on the host, you would need a sync operation to ensure the GPU has finished its work. Said sync operation is called a fence, because it makes the CPU wait for the GPU.
In a similar fashion, the guest also has to use fences before reading GPU data on the host. Because of its unified memory, it doesn't have to copy said data to host-visible memory, but it still must sync with a fence operation before accessing it. The emulator can rely on that promise: the guest will not overwrite nor read any GPU-generated data before a fence operation has given it the opportunity to sync with the GPU.
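To make that guest-side contract concrete, here is a minimal, purely illustrative C++ sketch of the signal/poll pattern: the GPU writes a value to a label address when its work is done, and the CPU polls that label before touching GPU-produced data. `FenceLabel`, `gpu_signal` and `cpu_wait` are hypothetical names for illustration, not emulator code.

```cpp
#include <atomic>
#include <cstdint>

// A fence label: a memory location the GPU writes and the CPU polls.
struct FenceLabel {
    std::atomic<uint64_t> value{0};
};

// GPU side (conceptually an end-of-pipe signal packet): write the fence
// value after all prior GPU work has completed.
void gpu_signal(FenceLabel& label, uint64_t fence_value) {
    label.value.store(fence_value, std::memory_order_release);
}

// CPU side: poll until the label reaches the expected value; only then is
// it safe to read GPU-generated data.
void cpu_wait(const FenceLabel& label, uint64_t fence_value) {
    while (label.value.load(std::memory_order_acquire) < fence_value) {
        // A real guest would busy-wait here or block on a kernel event.
    }
}
```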
So the main idea of the PR is to attempt to detect these fence operations in the PM4 command stream and defer read-protecting GPU-modified pages until right before them. If a page read/write then happens before a fence, it will pass through without a flush, because the emulator can be sure the guest cannot access the data yet.
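A minimal sketch of what the deferral could look like, assuming a simple page-set model: GPU writes only queue pages, and protection is applied in one batch when a detected fence is reached. `DeferredProtector` and its methods are invented for illustration and are not the PR's actual buffer cache API.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

class DeferredProtector {
public:
    // Called when the GPU writes a page: instead of read-protecting it
    // immediately, just remember it.
    void OnGpuWrite(uint64_t page) {
        pending_.insert(page);
    }

    // Called right before a detected fence packet: protect everything
    // queued, so any later guest access faults and triggers a flush.
    // Returns how many pages were protected in this batch.
    size_t OnFence() {
        for (uint64_t page : pending_) {
            protected_.insert(page); // stand-in for an mprotect-style call
        }
        size_t count = pending_.size();
        pending_.clear();
        return count;
    }

    bool IsProtected(uint64_t page) const {
        return protected_.count(page) != 0;
    }

private:
    std::set<uint64_t> pending_;   // GPU-modified, not yet protected
    std::set<uint64_t> protected_; // read-protected pages
};
```

Accesses between the GPU write and the fence see an unprotected page and pass through without a flush, which is exactly the fast path the PR is after.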
The aforementioned detection is not trivial though, because there is little indication to the emulator about what sync operations are used for. AMD GCN uses labels, 4- or 8-byte memory addresses, which "signal" packets write to and "wait" packets can wait on, or the host can poll in the case of a fence. All that means it is close to impossible to detect the actual wait on a fence. Instead, this PR implements a prepass which scans input command lists and tries to "guess" which packets act as fences and which do not.
Possible packets that can write labels are EventWriteEos, EventWriteEop, WriteData (GFX) and ReleaseMem (ACB). There is a simple heuristic: if the label of a signal packet is waited on by the GPU with a WaitRegMem packet, it is considered a GPU->GPU sync (something akin to a pipeline barrier). It is in fact possible for a label to act as both a fence and a pipeline barrier, so the heuristic can fail, but that is very unlikely.
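The heuristic could be sketched roughly like this; the packet model and `DetectCpuFences` are assumptions made for illustration, not the contents of the PR's `fence_detector.h`:

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

enum class PacketType { EventWriteEop, EventWriteEos, WriteData, ReleaseMem,
                        WaitRegMem, Other };

struct Packet {
    PacketType type;
    uint64_t label_addr; // address written (signal) or polled (WaitRegMem)
};

// Prepass over a command list: any signal packet whose label the GPU itself
// waits on (via WaitRegMem) is treated as a GPU->GPU sync; every other
// signal packet is guessed to be a CPU-visible fence. Returns the indices
// of the guessed fences.
std::vector<size_t> DetectCpuFences(const std::vector<Packet>& stream) {
    std::unordered_set<uint64_t> gpu_waited;
    for (const Packet& p : stream) {
        if (p.type == PacketType::WaitRegMem) {
            gpu_waited.insert(p.label_addr);
        }
    }
    std::vector<size_t> fences;
    for (size_t i = 0; i < stream.size(); ++i) {
        const Packet& p = stream[i];
        bool is_signal = p.type == PacketType::EventWriteEop ||
                         p.type == PacketType::EventWriteEos ||
                         p.type == PacketType::WriteData ||
                         p.type == PacketType::ReleaseMem;
        if (is_signal && gpu_waited.count(p.label_addr) == 0) {
            fences.push_back(i);
        }
    }
    return fences;
}
```

Note the two passes: wait addresses must be collected over the whole list first, since the WaitRegMem that consumes a label may appear after the packet that signals it.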
Deferring read protections allows for some powerful optimizations, two of which are implemented here.
Rewind indirect patch
The rewind packet has a misleading name, as it implies execution going back somewhere, but what it actually does is tell the CP to drop all prefetched packets and reload them from memory. It is almost exclusively used for command list self-modification (from a compute shader, for example), which Driveclub does for a dozen dispatches at the start of the frame. It uses a compute shader to patch the dimensions of the DispatchDirect PM4 packet before executing it. Why it didn't use an indirect dispatch, I'm not sure.
Before readbacks, this led to launching a dispatch with garbage (often huge) dimensions, freezing the GPU. Readbacks, on the other hand, fixed it by read-protecting the memory and flushing the modified data. That works but is very expensive: around a dozen flushes, one per patched dispatch.
Deferring read protections allows the emulator to reach the rewind packet before a flush. Then the emulator can scan the pending GPU ranges inside the current command list, check that they are dispatch dimension patches and convert the direct dispatch into an indirect dispatch. The latter reads dimensions from GPU buffers, avoiding the need for flushing memory.
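As a rough sketch of the conversion, assuming the pending GPU write covers exactly the three 32-bit dimension words of the dispatch (all types and names here are hypothetical, not the PR's liverpool code):

```cpp
#include <cstdint>

struct Dispatch {
    bool indirect = false;
    uint64_t args_addr = 0;                   // GPU address of dims when indirect
    uint32_t dim_x = 0, dim_y = 0, dim_z = 0; // used when direct
};

// A pending GPU-modified range inside the current command list.
struct PendingWrite {
    uint64_t addr;
    uint64_t size;
};

// If the pending write lands exactly on the dispatch's three dimension
// words, convert the direct dispatch into an indirect one that reads its
// dimensions from the GPU buffer, so no flush back to the CPU is needed.
bool TryPatchToIndirect(Dispatch& dispatch, uint64_t dims_addr,
                        const PendingWrite& write) {
    constexpr uint64_t kDimWordsSize = 3 * sizeof(uint32_t);
    if (write.addr != dims_addr || write.size != kDimWordsSize) {
        return false; // not a dimension patch; a flush would still be required
    }
    dispatch.indirect = true;
    dispatch.args_addr = dims_addr;
    return true;
}
```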
Preemptive buffer downloads
This optimization is a lot more general than the above one and should affect all games that rely on readbacks. It is possible to implement without CPU fence detection, but fence detection makes the implementation more efficient, because the emulator can batch preemptive download copies upon reaching the fence.
The idea is to track how many times a page has been flushed, and if that number exceeds a threshold, any future GPU data inside it will be copied to the host asynchronously. If a flush is triggered, the GPU thread simply has to wait for the GPU to finish and copy the data to guest memory. The advantage here is the reduction or (in certain cases) elimination of the wait time, as the GPU has likely had time to catch up to the host. In addition, once the wait has been done, the rest of the preemptive downloads become "free" and don't need further stalls.
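The flush-count heuristic might look roughly like this; the class, method names and threshold value are assumptions for illustration, not the PR's actual implementation:

```cpp
#include <cstdint>
#include <unordered_map>

class PreemptiveDownloader {
public:
    static constexpr int kThreshold = 3; // assumed value, not the PR's

    // Record a flush of this page; returns true once the page has been
    // flushed often enough to be considered "hot".
    bool OnFlush(uint64_t page) {
        return ++flush_count_[page] > kThreshold;
    }

    // Decide whether new GPU data in this page should be copied to the
    // host asynchronously (batched at the next detected fence) instead of
    // waiting for a page fault to trigger a synchronous flush.
    bool ShouldPreemptivelyDownload(uint64_t page) const {
        auto it = flush_count_.find(page);
        return it != flush_count_.end() && it->second > kThreshold;
    }

private:
    std::unordered_map<uint64_t, int> flush_count_;
};
```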
🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.