[PR #3184] [MERGED] vector_alu: Improve handling of mbcnt append/consume patterns #3293

Closed
opened 2026-02-27 22:03:10 +03:00 by kerem · 0 comments

📋 Pull Request Information

Original PR: https://github.com/shadps4-emu/shadPS4/pull/3184
Author: @raphaelthegreat
Created: 7/2/2025
Status: Merged
Merged: 7/3/2025
Merged by: @georgemoralis

Base: main ← Head: mbcnt


📝 Commits (2)

  • 28b9189 vector_alu: Improve handling of mbcnt append/consume patterns
  • 753ca25 vk_rasterizer: Always sync DMA buffers

📊 Changes

2 files changed (+16 additions, -12 deletions)


📝 src/shader_recompiler/frontend/translate/vector_alu.cpp (+15 -11)
📝 src/video_core/renderer_vulkan/vk_rasterizer.cpp (+1 -1)

📄 Description

This fixes missing grass in Driveclub (CUSA00093).
The existing implementation was written to handle a single pattern, in which the mbcnt pair comes before the DS_APPEND instruction:

v_mbcnt_hi_u32_b32 vX, exec_hi, 0
v_mbcnt_lo_u32_b32 vX, exec_lo, vX
ds_append       vY offset:4 gds
v_add_i32       vX, vcc, vY, vX

In this case, however, the DS_APPEND comes before the mbcnt pair (but is functionally the same as above):

ds_append       vX gds
v_mbcnt_hi_u32_b32 vY, exec_hi, vX
v_mbcnt_lo_u32_b32 vZ, exec_lo, vY

The mbcnt instructions always come in hi/lo pairs and are in general quite flexible, but they assume a subgroup size of 64, so they are not recompiled literally. Together with DS_APPEND they are used to derive a unique per-thread index into a buffer (this differs from using thread_id, as the order could be random). DS_APPEND works at the subgroup level: it adds the number of active threads in the subgroup to the GDS counter, essentially giving all of its threads a shared base index (a multiple of 64 when every subgroup is fully active). Each thread then executes the mbcnt pair, which returns the number of active threads with an id lower than its own, and adds it to the base.
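
To make the index derivation above concrete, here is a minimal software model of a single 64-lane subgroup. It is only a sketch: the names `PopCount`, `MbcntLo`, `MbcntHi`, `DsAppend` and `UniqueIndex` are illustrative and do not come from the shadPS4 code base, and the GDS counter is modelled as a plain integer.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative model only; none of these names exist in shadPS4.

// Number of set bits in a 64-bit mask.
inline uint32_t PopCount(uint64_t v) {
    uint32_t n = 0;
    for (; v != 0; v &= v - 1) {
        ++n;
    }
    return n;
}

// Models v_mbcnt_lo_u32_b32: counts the bits of mask_lo that belong to
// lanes below `lane` (only lanes 0..31 are covered) and adds `src`.
inline uint32_t MbcntLo(uint32_t mask_lo, uint32_t src, uint32_t lane) {
    assert(lane < 64);
    const uint32_t below = lane >= 32 ? 0xFFFFFFFFu : ((1u << lane) - 1u);
    return PopCount(mask_lo & below) + src;
}

// Models v_mbcnt_hi_u32_b32: same idea for lanes 32..63.
inline uint32_t MbcntHi(uint32_t mask_hi, uint32_t src, uint32_t lane) {
    assert(lane < 64);
    const uint32_t below = lane <= 32 ? 0u : ((1u << (lane - 32)) - 1u);
    return PopCount(mask_hi & below) + src;
}

// Models ds_append for a whole subgroup: every lane receives the old
// counter value, and the counter advances by the number of active lanes.
inline uint32_t DsAppend(uint32_t& gds_counter, uint64_t exec) {
    const uint32_t base = gds_counter;
    gds_counter += PopCount(exec);
    return base;
}

// The "DS_APPEND first" pattern: the base is fed through the hi/lo pair,
// which adds the number of active lanes below `lane`.
inline uint32_t UniqueIndex(uint32_t base, uint64_t exec, uint32_t lane) {
    const uint32_t exec_lo = static_cast<uint32_t>(exec);
    const uint32_t exec_hi = static_cast<uint32_t>(exec >> 32);
    const uint32_t vy = MbcntHi(exec_hi, base, lane);
    return MbcntLo(exec_lo, vy, lane);
}
```

Calling `DsAppend` once per subgroup and then `UniqueIndex` for each active lane yields consecutive, non-overlapping indices, regardless of which lanes are masked off in `exec`.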

The recompiler translates DS_APPEND into an atomic increment of a storage buffer counter, which already gives the desired unique index, so the mbcnt pattern is a no-op. On main, its result was set to zero, as per the first pattern, to avoid altering the DS_APPEND result. The new handling instead passes through the initial value of the pattern, which has the same effect but works in either case.
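
The difference between the two handlings can be illustrated with a small host-side sketch. This is not the recompiler's actual code: `TranslatedDsAppend`, `MbcntOld`, `MbcntNew` and the two pattern helpers are hypothetical stand-ins, and a `std::atomic` plays the role of the storage buffer counter, under the assumption that each thread takes exactly one slot.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for the translated DS_APPEND: every thread
// atomically takes one slot, so the returned value is already unique.
inline uint32_t TranslatedDsAppend(std::atomic<uint32_t>& counter) {
    return counter.fetch_add(1u, std::memory_order_relaxed);
}

// Old handling (assumed shape): the mbcnt result is forced to zero.
inline uint32_t MbcntOld(uint32_t /*mask*/, uint32_t /*src*/) {
    return 0u;
}

// New handling (assumed shape): the accumulator operand is passed through.
inline uint32_t MbcntNew(uint32_t /*mask*/, uint32_t src) {
    return src;
}

// First pattern: mbcnt pair, then DS_APPEND, then v_add_i32.
template <typename Mbcnt>
uint32_t PatternMbcntFirst(std::atomic<uint32_t>& counter, Mbcnt mbcnt) {
    uint32_t vx = mbcnt(0xFFFFFFFFu, 0u);  // v_mbcnt_hi vX, exec_hi, 0
    vx = mbcnt(0xFFFFFFFFu, vx);           // v_mbcnt_lo vX, exec_lo, vX
    const uint32_t vy = TranslatedDsAppend(counter);
    return vy + vx;                        // v_add_i32 vX, vcc, vY, vX
}

// Second pattern: DS_APPEND first, its result fed through the mbcnt pair.
template <typename Mbcnt>
uint32_t PatternAppendFirst(std::atomic<uint32_t>& counter, Mbcnt mbcnt) {
    const uint32_t vx = TranslatedDsAppend(counter);
    const uint32_t vy = mbcnt(0xFFFFFFFFu, vx);  // v_mbcnt_hi vY, exec_hi, vX
    return mbcnt(0xFFFFFFFFu, vy);               // v_mbcnt_lo vZ, exec_lo, vY
}
```

With `MbcntOld`, the first pattern still returns the counter value, but the second pattern always returns zero because the DS_APPEND result is discarded. With `MbcntNew`, both patterns return the value handed out by the atomic increment.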


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
