[GH-ISSUE #2990] Ridge Racers (ULJS00001) - CPU autodrive algorithm is buggy #1227

Open
opened 2026-03-17 20:53:38 +03:00 by kerem · 57 comments
Owner

Originally created by @triglav1024 on GitHub (Jul 30, 2013).
Original GitHub issue: https://github.com/hrydgard/ppsspp/issues/2990

Options -> AV player mode
https://www.youtube.com/watch?v=hRrVBM2-OWc
https://www.youtube.com/watch?v=XJkM729PeeE
https://www.youtube.com/watch?v=3eQ7BlocmUo

Ridge Racers - [JP 1.01](http://report.ppsspp.org/game/ULJS00001_1.01) / [USA 1.00](http://report.ppsspp.org/game/ULUS10001_1.00) / [EUR 1.00](http://report.ppsspp.org/game/UCES00002_1.00) / [HK 1.00](http://report.ppsspp.org/game/UCKS45002_1.00) / [Asia 1.00](http://report.ppsspp.org/game/UCAS40015_1.00)

@thedax commented on GitHub (Jul 30, 2013):

This is probably some sort of CPU or VFPU bug, I'd guess. A similar behaviour occurs in Dolphin with replays in Mario Kart Wii.

One question though: does the bug occur immediately in every replay (say you start the game and then launch a replay), or does it take 10-20 minutes to appear? If it takes a while, try the Unlock CPU Speed option/hack as a workaround, since some games get buggy if the emulated PSP CPU speed is changed often. I'd check the debug log to see whether it's calling scePowerSetClockFrequency often (https://github.com/hrydgard/ppsspp/issues/2104).


@triglav1024 commented on GitHub (Jul 30, 2013):

I had not thought of the CPU hack. I had set the clock to 333 MHz.
However, the result was the same with the default clock.

This behavior occurs right after startup. Every time, it develops about 40 seconds after the start.
In addition, there is no randomness: the same car is always selected, and it always drives erratically.


@unknownbrackets commented on GitHub (Dec 7, 2013):

Has this improved at all, or does it still do this? There were some timing fixes not that long ago.

-[Unknown]


@ppmeis commented on GitHub (May 15, 2014):

I just tested this issue. All replays made by the CPU are buggy: the car does strange things during the race (like constantly hitting the wall), but personal replays work fine.

Tested with latest build 0.9.8-676


@unknownbrackets commented on GitHub (Jun 16, 2014):

Could this have possibly improved with the vrot fix?

Does having jit off affect it?

-[Unknown]


@ppmeis commented on GitHub (Jul 22, 2014):

Tested with latest build. CPU replays still buggy:
![image](https://cloud.githubusercontent.com/assets/4381277/3658935/1f020548-11a9-11e4-8a06-ace77713bc94.png)

Jit off does not help:
![image](https://cloud.githubusercontent.com/assets/4381277/3658966/6cb6a5d2-11a9-11e4-9108-fccda16cca93.png)

@unknownbrackets commented on GitHub (Aug 25, 2014):

Does this still happen in the latest git build?

Make sure you don't have that GEB save compat thing changed from the default.

-[Unknown]


@ppmeis commented on GitHub (Aug 25, 2014):

Tested with latest build, bug is still present:
![image](https://cloud.githubusercontent.com/assets/4381277/4029977/4eca67ac-2c57-11e4-9a5d-55ad6a74498a.png)

@ppmeis commented on GitHub (Feb 1, 2015):

Tested with latest build. Same status:
![image](https://cloud.githubusercontent.com/assets/4381277/5994438/a42738f2-aa73-11e4-994c-e41730fef800.png)

@unknownbrackets commented on GitHub (Mar 2, 2015):

I have the US version of this game, but have not really played it much.

What's the easiest and fastest way to reproduce this issue from scratch (e.g. no savedata / blank slate)? I want to try to see if I can at least cause the autodrive to be wrong in different ways.

Edit: hmm, I think I can repro without savedata actually, n/m.

-[Unknown]


@unknownbrackets commented on GitHub (Mar 2, 2015):

Excluding ALU and LSU type instructions like lv/sv/lwc/swc/mt*/mf*, here's a list of the ones this game executes during the AV thing. The value in parens is the number of times each was hit; I've moved all the super unlikely ones to the bottom.

mul.s     (54993288)  // Small error = major driving glitches.
add.s     (28356670)  // Small error = major driving glitches.
c.le      (13832384)  // Change to lt = driving glitches happen differently.
sub.s     (13597462)  // Small error = major driving glitches.
vdot      (12944555)  // MAYBE: Introducing a small error makes glitches happen quicker.
vadd      (9333422)   // Small error = major driving glitches.
vscl      (8506915)   // Small error = major driving glitches.
vsub      (5637652)   // Small error = driving glitches happen differently.
trunc.w.s (5548058)   // Small error = driving glitches happen differently.
cvt.s.w   (4223431)   // Small error = major driving glitches.
vsqrt     (3891304)   // MAYBE: Introducing a small error makes glitches MUCH worse.
div.s     (3243544)   // Small error = major driving glitches.
v(h)tfm4  (2253749)   // MAYBE: Introducing a small error makes glitches MUCH worse.
vpfxt     (2184294)   // Ignore = no driving change, but major gfx glitches.  Might still be wrong prefix handling.
vrsq      (971862)    // MAYBE: Introducing a small error makes glitches MUCH worse.
v(h)tfm3  (783310)    // Small error = major driving glitches.
vdiv      (572562)    // Small error = driving glitches happen differently.
sqrt.s    (533778)    // Small error = driving glitches happen differently.
vcrsp.t   (100890)    // MAYBE: Introducing a small error makes glitches happen quicker.

c.lt      (29268876)  // Any change = breaks everything, but unlikely.
mov.s     (15284356)
vone      (9659339)
neg.s     (3638275)
c.eq      (1284924)   // Any change = breaks everything, but unlikely.
abs.s     (562281)    // Small error = crash, unplayable... unlikely.
vmov      (452786)
vneg      (276140)
vmidt     (180679)
vmmov     (62622)
vzero     (14834)

vmul      (4225115)   // Not so small error = no difference.
vi2f      (1135038)   // Makes no difference.
vi2uc     (568000)    // Makes no difference.
vabs      (567519)    // Not so small error = no difference.
cvt.w.s   (379639)    // Not so small error = no difference.
vrot      (284752)    // Not so small error = no difference.
vmmul     (164305)    // Not so small error = no difference.
vqmul.q   (100890)    // Not so small error = no difference.
vcos      (21269)     // Not so small error = no difference.
vrcp      (15796)     // Makes no difference.
vsin      (8350)      // Not so small error = no difference.
vf2iz     (481)       // Makes no difference, even if hardcoded (but graphical glitches yes.)
vrndf1    (245)       // Makes no difference.

AFAICT, it does not change the rounding mode ever from the default.

If it's not a cpu instruction, then maybe it's timing somehow. But man, almost every instruction I try has a major impact on driving, so it could be anything...

-[Unknown]
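For anyone wanting to repeat this kind of experiment, here is a minimal sketch of what "introducing a small error" into an instruction's result could look like. This is an assumption about the method, not the code that was actually used: `NudgeOneUlp` is a hypothetical helper that bumps a finite, normal float by one unit in the last place.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from PPSSPP): moves a normal float one ULP away
// from zero by incrementing its raw bit pattern. Zero, subnormals, Inf, NaN
// and the largest finite value are returned unchanged.
float NudgeOneUlp(float value) {
    uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));

    const uint32_t exponent = (bits >> 23) & 0xFF;
    if (exponent == 0 || exponent == 0xFF) {
        return value;  // zero/subnormal or Inf/NaN: leave alone
    }
    if ((bits & 0x7FFFFFFF) == 0x7F7FFFFF) {
        return value;  // largest finite value: incrementing would give Inf
    }

    bits += 1;  // same sign, magnitude grows by exactly one ULP
    float result;
    std::memcpy(&result, &bits, sizeof(result));
    return result;
}
```

Wrapping the result of a single interpreted instruction (say, `mul.s`) in a helper like this would be one way to see how sensitive the autodrive is to that instruction.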


@ppmeis commented on GitHub (Mar 2, 2015):

@unknownbrackets it's as simple as navigating to Settings > AV Player and selecting Accept; autodrive will then start.

Tested with latest build. Same status.


@hrydgard commented on GitHub (Mar 2, 2015):

Hm, vrndf1 seems like a suspicious candidate - IIRC we don't reseed the random number generator when a game would write directly to the random context registers of the VFPU. But if it doesn't make a difference if you modify it, then unless the game depends on a particular sequence (that we can't repro anyway as we don't know how the PSP's rndgen works) it's probably not it...


@unknownbrackets commented on GitHub (Mar 3, 2015):

I thought so too, but no matter what result I make it generate (I tried statically generating 0, 0.5, and I think one other number), it is the same exact incorrect driving, so it seems like it can't be that one...

-[Unknown]


@unknownbrackets commented on GitHub (Apr 10, 2015):

Okay, well, I've eliminated as many instructions as I could:
https://github.com/hrydgard/ppsspp/issues/2990#issuecomment-76659348

Still not guaranteed to be a cpu bug...

-[Unknown]


@ppmeis commented on GitHub (Jul 25, 2015):

Tested with latest build. Same status:
![image](https://cloud.githubusercontent.com/assets/4381277/8891078/a7331a00-3319-11e5-96c5-344161ab324f.png)

@unknownbrackets commented on GitHub (Jan 14, 2018):

Some stats (not sure if useful) showing float usage of various instructions from game start until after the game has clearly gone wrong.

Leftmost number is total floats processed. Then Infinity, NaN, negative zero, and subnormals/denormals.

Since it really goes off a cliff at one point, I was thinking it's possible this is subnormal related... it doesn't ever set the flush to zero flag.

mul.s:      128779215, INF:0     NAN:0     NZ:2239473 SUB:11966
neg.s:      5245302,   INF:0     NAN:0     NZ:79209   SUB:130  
mov.s:      22309050,  INF:0     NAN:5392  NZ:260968  SUB:528    NAN:7fffff-7fffff
vcos:       51246,     INF:0     NAN:0     NZ:0       SUB:0    
vi2f:       3481408,   INF:0     NAN:0     NZ:0       SUB:0    
vadd:       55346715,  INF:0     NAN:0     NZ:413464  SUB:12001
cvt.s.w:    3348734,   INF:0     NAN:0     NZ:0       SUB:0    
div.s:      7354260,   INF:0     NAN:0     NZ:7887    SUB:0    
c.le:       20980612,  INF:0     NAN:0     NZ:40030   SUB:2494 
add.s:      67100736,  INF:0     NAN:0     NZ:1448224 SUB:3376 
trunc.w.s:  4127106,   INF:0     NAN:0     NZ:0       SUB:2448 
sub.s:      30248208,  INF:0     NAN:0     NZ:349366  SUB:5221 
cvt.w.s:    268353,    INF:0     NAN:0     NZ:0       SUB:0    
vf2in:      1740704,   INF:0     NAN:0     NZ:0       SUB:0    
c.eq:       1906462,   INF:0     NAN:0     NZ:425     SUB:40   
abs.s:      874098,    INF:0     NAN:0     NZ:1214    SUB:0    
c.lt:       44685816,  INF:0     NAN:0     NZ:175307  SUB:315  
vdot:       77938675,  INF:0     NAN:0     NZ:26024   SUB:0    
vneg:       759288,    INF:0     NAN:0     NZ:23272   SUB:0    
vrsq:       2117025,   INF:0     NAN:0     NZ:0       SUB:0    
vsat0:      3216,      INF:0     NAN:0     NZ:0       SUB:0    
vscl:       42447592,  INF:0     NAN:0     NZ:124420  SUB:0    
vsub:       33862734,  INF:0     NAN:0     NZ:1483129 SUB:1050 
vsqrt:      7829373,   INF:0     NAN:0     NZ:0       SUB:0    
sqrt.s:     791694,    INF:0     NAN:0     NZ:0       SUB:0    
vcrsp/vqmu: 685962,    INF:0     NAN:0     NZ:11169   SUB:0    
v(h)tfm3:   8347410,   INF:0     NAN:0     NZ:37392   SUB:80   
vrot:       876736,    INF:0     NAN:0     NZ:66106   SUB:0    
vmmul:      5170227,   INF:0     NAN:0     NZ:82346   SUB:0    
vmov:       4018263,   INF:0     NAN:0     NZ:0       SUB:199248
vmmov:      811746,    INF:0     NAN:0     NZ:0       SUB:0    
v(h)tfm4:   40416408,  INF:0     NAN:0     NZ:11657   SUB:0    
vmul:       8497689,   INF:0     NAN:0     NZ:0       SUB:0    
vabs:       2611056,   INF:0     NAN:0     NZ:0       SUB:9    
vdiv:       1316103,   INF:0     NAN:0     NZ:0       SUB:0    
vrcp:       38124,     INF:0     NAN:0     NZ:0       SUB:0    
vsin:       17454,     INF:0     NAN:0     NZ:0       SUB:0    
vrndf1:     735,       INF:0     NAN:0     NZ:0       SUB:0    
vf2iz:      1072,      INF:0     NAN:0     NZ:0       SUB:0    

-[Unknown]
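The categories in these stats follow directly from the single-precision bit layout (1 sign bit, 8 exponent bits, 23 mantissa bits). As a rough sketch of how each operand could be bucketed, assuming the counters work on raw 32-bit values (the `FloatKind` enum and `Classify` function are illustrative names, not PPSSPP code):

```cpp
#include <cstdint>

// Illustrative categories matching the INF / NAN / NZ / SUB columns above.
enum class FloatKind { Normal, Zero, NegativeZero, Subnormal, Infinity, NaN };

// Classifies a raw IEEE 754 binary32 bit pattern.
FloatKind Classify(uint32_t bits) {
    const uint32_t exponent = (bits >> 23) & 0xFF;
    const uint32_t mantissa = bits & 0x007FFFFF;
    const bool negative = (bits >> 31) != 0;

    if (exponent == 0xFF) {
        // All exponent bits set: Inf if the mantissa is empty, NaN otherwise.
        return mantissa != 0 ? FloatKind::NaN : FloatKind::Infinity;
    }
    if (exponent == 0) {
        if (mantissa != 0) {
            return FloatKind::Subnormal;  // nonzero value with no implicit 1
        }
        return negative ? FloatKind::NegativeZero : FloatKind::Zero;
    }
    return FloatKind::Normal;
}
```

With this layout, the `NAN:7fffff-7fffff` note on `mov.s` would read as a mantissa range, i.e. every NaN seen there had all mantissa bits set.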


@unknownbrackets commented on GitHub (Jun 8, 2019):

I had tried some things before, but just wanted to note that I've tried forcing subnormal results to 0 (as always seems to happen with many vfpu ops) for vmul/vadd/vsub/vtfm3/vhtfm3/etc., as well as forcing nan to 0x7f800001. There was no change in the failure.

I do think there's a good chance it's related to multiply accuracy.

-[Unknown]
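A minimal sketch of the two result fix-ups described above, applied to raw bit patterns. The function names are hypothetical, and keeping the sign on a flushed zero is an assumption; only the 0x7f800001 NaN pattern comes from the comment itself.

```cpp
#include <cstdint>

// Flush-to-zero: any subnormal result becomes zero.
// Assumption: the sign bit is preserved, so a negative subnormal becomes -0.
uint32_t FlushSubnormalToZero(uint32_t bits) {
    const uint32_t exponent = (bits >> 23) & 0xFF;
    if (exponent == 0) {
        return bits & 0x80000000u;  // drop the mantissa, keep the sign
    }
    return bits;
}

// Replaces every NaN with the single pattern 0x7F800001.
uint32_t CanonicalizeNaN(uint32_t bits) {
    const uint32_t exponent = (bits >> 23) & 0xFF;
    const uint32_t mantissa = bits & 0x007FFFFF;
    if (exponent == 0xFF && mantissa != 0) {
        return 0x7F800001u;
    }
    return bits;
}
```

Both helpers leave normal values, zeros and infinities untouched, so they can be chained on the output of an op without changing the common case.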


@unknownbrackets commented on GitHub (Jun 9, 2019):

Update: as a very rough measure, I tried & 0xFFFFFFFE for all the results of vtfm, vadd, vsub, vdiv, and vmul.

Normally, things go wrong right before the second tunnel. With this change, things go wrong before the third tunnel, and it looks right for longer. So this is promising.

Trying to dig into which instruction gets tougher, though. Just disabling the masking for one op at once:

  • vtfm without mask: still lasts longer, but goes wrong slightly earlier than all masked.
  • vdiv without mask: goes wrong even earlier than normal.
  • vmul without mask: better than no masking, but breaks within the second tunnel.
  • vsub without mask: very similar to vmul disabled.
  • vadd without mask: goes wrong much earlier than with all masked.

A few other instructions didn't seem to matter, like vmmul or vdot. That said, obviously this doesn't implicate any of the above instructions - it could be that rounding at vsub masks a problem that is really in vmul, or even in vdot.

The important bit here is that rounding/precision is almost definitely at issue here.

For clarity, changing the rounding mode doesn't help things, so it's more complex than that.

-[Unknown]
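For reference, the `& 0xFFFFFFFE` masking amounts to clearing the lowest mantissa bit of each result, which truncates the value toward zero by at most one ULP. A small sketch, with `MaskLowMantissaBit` as a hypothetical name and no claim about where exactly the mask was applied in the interpreter:

```cpp
#include <cstdint>
#include <cstring>

// Clears the least significant mantissa bit of a float result.
// The magnitude can only shrink, by at most one ULP.
float MaskLowMantissaBit(float value) {
    uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));
    bits &= 0xFFFFFFFEu;
    float result;
    std::memcpy(&result, &bits, sizeof(result));
    return result;
}

// Convenience helper so results can be compared as bit patterns.
uint32_t MaskedBits(uint32_t bits) {
    float value;
    std::memcpy(&value, &bits, sizeof(value));
    const float masked = MaskLowMantissaBit(value);
    uint32_t out;
    std::memcpy(&out, &masked, sizeof(out));
    return out;
}
```

Since this throws away one bit unconditionally, it is a blunt tool: it hides some last-bit differences but also creates new ones, which fits the mixed results listed above.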


@hrydgard commented on GitHub (Jun 9, 2019):

I think that indeed confirms that precision/rounding is the culprit. Masking like that is not likely to accurately simulate the issues though, of course.

I believe in the FTZ thing plus probably a slightly lower-precision dot product implemented in the VFPU hardware (in addition to approximations in vrot and similar). VTFM is very likely to use that hardware dot product.

I think the dot product precision issues could be shown by trying things like dotting a=(1.0, 1.0, 1.0, 1.0) and b = (0.000001, 0.000001, 0.000001, 1.0), and the reverse of b with 1.0 first. The 0.000001 constant should be adjusted so that the sum of three of them just breaks into the precision that's still available when the exponent is set to be able to represent 1.0. That way, if the dot product summing uses collective mantissa alignment and then sums up the mantissas, we'd get the same results whether the 1.0 was first or last or wherever, whereas if it's computed like we do, by simply summing up the products from left to right, we should get different results.
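To make the order dependence concrete, here is a sketch of a plain left-to-right dot product on the host, using 2^-24 (bit pattern 0x33800000) as the small constant. It assumes IEEE 754 round-to-nearest-even and no fast-math style reassociation; the function name is illustrative.

```cpp
#include <cstdint>
#include <cstring>

static float BitsToFloat(uint32_t bits) {
    float value;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}

static uint32_t FloatToBits(float value) {
    uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));
    return bits;
}

// Left-to-right dot product of two 4-vectors given as raw bit patterns.
// 'volatile' forces every partial sum to be rounded to single precision,
// so wider host registers cannot hide the effect.
uint32_t DotLeftToRightBits(const uint32_t a[4], const uint32_t b[4]) {
    volatile float sum = 0.0f;
    for (int i = 0; i < 4; ++i) {
        sum = sum + BitsToFloat(a[i]) * BitsToFloat(b[i]);
    }
    return FloatToBits(sum);
}
```

With a = (1, 1, 1, 1), putting the three small values first gives 0x3f800002 (their sum, 3 * 2^-24, lands on a tie and rounds to even), while putting 1.0 first gives 0x3f800000 (each 2^-24 is rounded away individually). A summation that aligns all mantissas first would give the same answer for both orders.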


@unknownbrackets commented on GitHub (Jun 9, 2019):

For posterity:

{ 0x3F800000, 0x33800000, 0x33800000, 0x33800000 }
{ 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000 }
= 0x3f800001

{ 0x33800000, 0x33800000, 0x33800000, 0x3F800000 }
{ 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000 }
= 0x3f800001

{ 0x3F800000, 0x34000000, 0x00000000, 0x00000000 }
{ 0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000 }
= 0x3f800001

{ 0x100BF8FE, 0x581F4DA5, 0x00000000, 0x00000000 }
{ 0x3F800000, 0x0207F3ED, 0x00000000, 0x00000000 }
= 0x1aa9337c

Since order doesn't matter, potentially it's aligning the exponents first and then summing. It'll be interesting to find if vhdp, vfad, vavg, or other ops have similar behavior.

For clarity on anyone reading this, the first two above sums are (base 2):

1.000000000000000000000000 * 1 +
0.000000000000000000000001 * 1 +
0.000000000000000000000001 * 1 +
0.000000000000000000000001 * 1 =
--------------------------
1.000000000000000000000011 = 0x3f800001

Which becomes 1.00000000000000000000001 because of limited mantissa, therefore 0x3f800001. I also tried:

1.000000000000000000000000 * 1 +
0.000000000000000000000010 * 1 +
0.000000000000000000000010 * 1 +
0.000000000000000000000011 * 1 =
--------------------------
1.000000000000000000000111 = 0x3f800003

1.000000000000000000000000 * 1 +
0.000000000000000000000010 * 1 +
0.000000000000000000000010 * 1 +
0.000000000000000000000001 * 1 =
--------------------------
1.000000000000000000000101 = 0x3f800002

1.000000000000000000000000 * 1 +
0.000000000000000000000010 * 1 +
0.000000000000000000000010 * 1 +
0.000000000000000000000010 * 1 =
--------------------------
1.000000000000000000000110 = 0x3f800003

Which all truncated as expected (was trying to verify any rounding behavior.)

Also confirmed the behavior is identical (just with a flipped sign) if I flip the sign of the first vector (meaning it doesn't truncate differently for negative.)

-[Unknown]
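One way to model the behavior above in software is sketched below: align all four terms to the largest exponent, add the mantissas as integers, then normalize and truncate. This is only a guess at a model that reproduces the simple cases listed here. The number of extra low bits kept during alignment (`kExtraBits`) is an arbitrary assumption, Inf/NaN inputs are not handled, and subnormals are treated as zero.

```cpp
#include <cstdint>

// Assumption: how many bits below the mantissa survive alignment.
// The real hardware width is not known from this thread.
constexpr int kExtraBits = 8;

// Sums four single-precision terms (raw bit patterns) by aligning them to
// the largest exponent, adding as integers, and truncating the result.
uint32_t AlignedTruncatingSum4(const uint32_t terms[4]) {
    int maxExp = 0;
    for (int i = 0; i < 4; ++i) {
        const int exp = static_cast<int>((terms[i] >> 23) & 0xFF);
        if (exp > maxExp) maxExp = exp;
    }

    int64_t sum = 0;
    for (int i = 0; i < 4; ++i) {
        const int exp = static_cast<int>((terms[i] >> 23) & 0xFF);
        if (exp == 0) continue;  // zero or subnormal: contributes nothing
        // 24-bit mantissa with the implicit 1, widened by kExtraBits.
        int64_t mant = static_cast<int64_t>((terms[i] & 0x7FFFFF) | 0x800000)
                       << kExtraBits;
        const int shift = maxExp - exp;
        mant = shift < 32 ? (mant >> shift) : 0;  // bits shifted out are dropped
        sum += (terms[i] >> 31) ? -mant : mant;
    }
    if (sum == 0) return 0;

    const uint32_t sign = sum < 0 ? 0x80000000u : 0u;
    uint64_t mag = static_cast<uint64_t>(sum < 0 ? -sum : sum);

    // Normalize so the leading 1 sits at bit (23 + kExtraBits).
    const uint64_t top = 1ULL << (23 + kExtraBits);
    int exp = maxExp;
    while (mag >= (top << 1)) { mag >>= 1; ++exp; }
    while (mag < top) { mag <<= 1; --exp; }

    if (exp <= 0) return sign;                  // underflow: flush to zero
    if (exp >= 255) return sign | 0x7F800000u;  // overflow: infinity
    mag >>= kExtraBits;                         // truncate, no rounding
    return sign | (static_cast<uint32_t>(exp) << 23) |
           (static_cast<uint32_t>(mag) & 0x7FFFFFu);
}
```

Because every second vector above is all 1.0, the products equal the first vector, so its values can be passed in directly as `terms`. The result is independent of term order, which matches the first two hardware results, and the truncation matches the 0x3f800002 / 0x3f800003 cases.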


@unknownbrackets commented on GitHub (Jun 9, 2019):

Okay, using this:
https://gist.github.com/unknownbrackets/e5bdd06cd8d85712fc51bd7b7707cfd1

Which gets pretty good results (note: multiplying to a temporary float[4] first):

  FMA error: CORRECT 1aa9337c / 0.000000
  1.0*1.0 + 1.0*1.0^-23: CORRECT 3f800001 / 1.000000
  1.0*1.0 + 1.0*1.0^-24 + 1.0*1.0^-24 + 1.0*1.0^-24: CORRECT 3f800001 / 1.000000
  1.0*1.0^-24 + 1.0*1.0^-24 + 1.0*1.0^-24 + 1.0*1.0: CORRECT 3f800001 / 1.000000
  1.0*1.0 + 1.0*1.0^-23 + 1.0*1.0^-23 + 1.0*1.0^-24: CORRECT 3f800002 / 1.000000
  1.0*1.0 + 1.0*1.0^-23 + 1.0*1.0^-23 + 1.0*1.0^-23: CORRECT 3f800003 / 1.000000
  1.0*1.0 + 1.0*1.0^-23 + 1.0*1.0^-23 + 1.1*1.0^-23: CORRECT 3f800003 / 1.000000
  1.0*-1.0 + 1.0*-1.0^-23 + 1.0*-1.0^-23 + 1.1*-1.0^-23: CORRECT bf800003 / -1.000000
  Simulate case 1: CORRECT c75864aa / -55396.664062
  Simulate case 2: CORRECT c7fb200f / -128576.117188
  Simulate case 3: CORRECT c5972dcb / -4837.724121
  Simulate case 4: CORRECT 42222309 / 40.534214
  Simulate case 5: WRONG 3d84e134 / 0.064883  vs  3d84e130 / 0.064883
  Simulate case 5 DEBUG: beb4194f + bdbb66eb + 3f0215ab + 00000000
  Simulate case 5 DEBUG: -0.351756 + -0.091505 + 0.508143 + 0.000000
  Simulate case 6: CORRECT 4136c004 / 11.421879

FWIW case 5 is (I sampled the most different results from Ridge Racer, and used them to debug the software float add):

	ScePspIVector4 dotsim5a = { 0x3f2dc5cb, 0x3e71855a, 0x3f3206af, 0x00000000 };
	ScePspIVector4 dotsim5b = { 0xbf04a8ed, 0xbec6a2ff, 0x3f3b0f83, 0x00000000 };
	testDot("  Simulate case 5", dotsim5a, dotsim5b);

This changes the results. It now goes wrong differently right before the second tunnel, but doesn't recover from there. Pretty sure we're barking up the right tree, because everything up to the point where it goes crazy was right and the same, and the point where it goes crazy behaved differently.

-[Unknown]


@hrydgard commented on GitHub (Jun 10, 2019):

Cool. It's possible though that this sequence is so sensitive that it won't work all the way through until we've fixed both the FTZ issue and gotten this even more accurate...

Please as always feel free to push even very rough code to a branch or PR, would be interesting to try this on Tekken 6.

Also by the way the BSR instruction (CLZ on ARM) will let us get rid of those annoying while loops in the software add.

Additionally, floating point multiplication in software is actually even easier than addition since there's no realignment needed, just multiply the mantissas, shift down by a fixed amount, and add the exponents (with a bias to account for the 127 base).
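A rough sketch of that multiply, for illustration only: `SoftMulTrunc` is a made-up name, it assumes finite inputs, treats zero and subnormal inputs as zero, and truncates the low product bits, which is an assumption rather than confirmed hardware behavior.

```c++
#include <cstdint>

// Hypothetical truncating single-precision multiply on raw bit patterns.
// Assumes finite inputs; zero/subnormal inputs yield a signed zero.
uint32_t SoftMulTrunc(uint32_t a, uint32_t b) {
    uint32_t sign = (a ^ b) & 0x80000000;
    int ea = (a >> 23) & 0xFF;
    int eb = (b >> 23) & 0xFF;
    if (ea == 0 || eb == 0) return sign;

    // 24-bit mantissas with the implicit leading 1 restored.
    uint64_t ma = (a & 0x007FFFFF) | 0x00800000;
    uint64_t mb = (b & 0x007FFFFF) | 0x00800000;
    uint64_t prod = ma * mb;  // value in [2^46, 2^48)

    // Add the exponents, removing one copy of the 127 bias.
    int exp = ea + eb - 127;

    // The product of two values in [1, 2) lies in [1, 4), so at most one
    // extra shift is needed to renormalize.
    if (prod & (1ULL << 47)) {
        prod >>= 24;
        ++exp;
    } else {
        prod >>= 23;
    }

    if (exp <= 0) return sign;                 // underflow: flush to zero
    if (exp >= 255) return sign | 0x7F800000;  // overflow: infinity
    return sign | ((uint32_t)exp << 23) | ((uint32_t)prod & 0x007FFFFF);
}
```

For example, `SoftMulTrunc(0x3FC00000, 0x3FC00000)` (1.5 * 1.5) takes the extra-shift path and returns `0x40100000` (2.25).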

Also it's very likely that vhdp, vfad, vavg have similar issues since they almost certainly are reusing the vdot hardware, kind of like the prefix hack ops.


@unknownbrackets commented on GitHub (Jun 11, 2019):

Here's the branch so far:
https://github.com/hrydgard/ppsspp/compare/master...unknownbrackets:vfpu-dot

-[Unknown]


@hrydgard commented on GitHub (Jun 12, 2019):

@unknownbrackets Thanks, I'll try it on Tekken tonight.

For now, I think it might be a good idea to add an 'n' argument to vdot so there's no requirement to make sure that unused elements are zeroed on shorter dot products - feels like there could be a couple of bugs around that, although maybe ApplySwizzle takes care of it. (Also, in case vdot somehow would mishandle zero).


@hrydgard commented on GitHub (Jun 12, 2019):

Does EXTRA_BITS seem to be 2? It's also possible that we should apply some rounding to them before shifting them out at the end.


@hrydgard commented on GitHub (Jun 12, 2019):

@unknownbrackets Tekken is unfortunately very broken with this, just "disabling" VTFM (allowing interpreter fallback) screws up the graphics entirely. Hm...


@unknownbrackets commented on GitHub (Jun 13, 2019):

Sorry, I cleaned up some debug code after testing and didn't actually test it again, made a really dumb mistake. Pushed the right version.

Zeros should work fine. Note that I'm applying this to interp, which mostly has to do dots across all four to handle prefixes correctly.

Also, this is an interesting one:

  +/- INF: WRONG 7f800001 / nan  vs  00000000 / 0.000000
  +/- INF DEBUG: 7f800000 + ff800000 + 00000000 + 00000000
  +/- INF DEBUG: inf + -inf + 0.000000 + 0.000000

The correct result is 7f800001 here (which makes sense mathematically...)

-[Unknown]


@hrydgard commented on GitHub (Jun 13, 2019):

Ah! Well then, I'm happy to report that this seems to fix leg shaking in Tekken 6 completely!

Not quite sure I understand your debug output there: are we or the PSP computing 7f800001? (And that's the dot product of (7f800000, ff800000, 00000000, 00000000) dot (inf, -inf, 0.000000, 0.000000), despite the plus signs?)


@unknownbrackets commented on GitHub (Jun 13, 2019):

The format is WRONG %08x[correct.u] %f[correct.f] vs %08x[simulate.u] %f[simulate.f], though I already changed it to handle that correctly.

The debug output is premultiplied, so it's just the sum of (inf, -inf, 0, 0), or in other words inf - inf. It's output twice, once in hex and then in float. In this case, the other vector is just (1, 1, 1, 1) for simplicity.

We're still getting some cases wrong, but it improves the results of cpu/vfpu/vector too. It might be in the multiply as you suggested.

-[Unknown]


@hrydgard commented on GitHub (Jun 13, 2019):

Ah, of course. Yeah, vfpu_dot still has some edge cases left, and yup, then there's the multiply... might want to try different rounding modes enabled during simulation to check if one happens to match?

If we do the multiplies in software too, at least they won't be affected by the current local rounding mode...
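To make the host-side effect concrete, here is a small sketch using the standard `<cfenv>` calls. `MulWithRounding` is a hypothetical helper, not something from PPSSPP, and it assumes the host supports the requested rounding mode; it only shows that a hardware float multiply depends on whatever mode is currently set, which a software multiply would avoid.

```c++
#include <cfenv>

// Multiplies a * b with the host FPU temporarily set to `mode`
// (FE_TONEAREST, FE_TOWARDZERO, FE_UPWARD or FE_DOWNWARD), then restores
// the previous mode. The volatile variables keep the multiply at runtime
// and between the two fesetround calls.
float MulWithRounding(float a, float b, int mode) {
    const int saved = std::fegetround();
    std::fesetround(mode);
    volatile float va = a;
    volatile float vb = b;
    volatile float product = va * vb;
    std::fesetround(saved);
    return product;
}
```

Any product that is not exactly representable, such as `1.1f * 1.1f`, comes out one ulp apart under `FE_UPWARD` and `FE_DOWNWARD`, while exact products like `2.0f * 3.0f` are unaffected.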


@unknownbrackets commented on GitHub (Jun 15, 2019):

With an integer multiply (branch updated), it gets much farther before going crazy. I probably have a mistake hiding in there somewhere, though. After it got much farther it actually eventually hit an invalid memory read (though maybe this is after it was "supposed" to have finished the track?)

It also does NOT match all the accuracy tests, so it's definitely not right still. But it does seem closer.

-[Unknown]


@hrydgard commented on GitHub (Jun 15, 2019):

Very cool. Of course, it going off track can also be caused by other instructions but seems this indeed has a big influence. I see you switched to clz, nice.


@unknownbrackets commented on GitHub (Jun 15, 2019):

We're currently using 2 extra bits of precision - I wonder if it still uses a sticky bit (seems annoying to emulate) prior to normalization, or if multiply doesn't actually truncate...
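For anyone unfamiliar with the term, a sticky bit is usually implemented as a right shift that ORs every discarded bit into bit 0, so later rounding can still tell "exactly representable" from "slightly more". A minimal sketch follows; `ShiftRightSticky` is a made-up helper, and nothing here says the VFPU actually does this.

```c++
#include <cstdint>

// Shifts `value` right by `shift` bits. If any 1 bits are shifted out,
// bit 0 of the result is forced to 1 (the "sticky" bit).
uint64_t ShiftRightSticky(uint64_t value, int shift) {
    if (shift <= 0) return value;
    if (shift >= 64) return value != 0 ? 1 : 0;
    uint64_t lost = value & ((1ULL << shift) - 1);
    return (value >> shift) | (lost != 0 ? 1 : 0);
}
```

With a plain shift, `0x10` and `0x11` both become `2` when shifted by 3; with the sticky version the second becomes `3`, which preserves the fact that something nonzero was dropped.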

Also for division, this is interesting, though probably not how it actually calculates it:
https://www.pvk.ca/Blog/LowLevel/software-reciprocal.html

-[Unknown]


@hrydgard commented on GitHub (Jun 15, 2019):

I would expect them to use some standard blocks of gates implementing stuff like that, so it's very possible that the sticky bit is there. But of course it's also possible that they designed a very minimal implementation just to make dot products as cheap as possible .. who knows...

Yeah I highly doubt it's done that way..


@unknownbrackets commented on GitHub (Jul 8, 2019):

Okay, the software dot now matches all our tests and other cherry picked values:
https://github.com/hrydgard/ppsspp/compare/master...unknownbrackets:vfpu-dot?expand=1

Turned out to use relatively simple rounding, but I ended up running exhaustive searches on the PSP for test values (by just checking the software implementation directly on the PSP, since it calculated the same there.)

The bad news is that this new implementation, despite matching fairly well, makes Ridge Racer go crazy even earlier than it does on master (before the turn it starts doing weird stuff.) It's definitely caused by the rounding.

As far as I could tell, changing the rounding mode has no effect on the vdot results.

I guess it must be other instructions. I'm replicating the list of most used instructions above here, removing ones that have no effect or use vdot internally:

vadd      (9333422)   // Small error = major driving glitches.
vscl      (8506915)   // Small error = major driving glitches.
vsub      (5637652)   // Small error = driving glitches happen differently.
vsqrt     (3891304)   // MAYBE: Introducing a small error makes glitches MUCH worse.
vrsq      (971862)    // MAYBE: Introducing a small error makes glitches MUCH worse.
vdiv      (572562)    // Small error = driving glitches happen differently.
vcrsp.t   (100890)    // MAYBE: Introducing a small error makes glitches happen quicker.

mul.s     (54993288)  // Small error = major driving glitches.
add.s     (28356670)  // Small error = major driving glitches.
c.le      (13832384)  // Change to lt = driving glitches happen differently.
sub.s     (13597462)  // Small error = major driving glitches.
trunc.w.s (5548058)   // Small error = driving glitches happen differently.
cvt.s.w   (4223431)   // Small error = major driving glitches.
div.s     (3243544)   // Small error = major driving glitches.
vpfxt     (2184294)   // Ignore = no driving change, but major gfx glitches.  Might still be wrong prefix handling.
sqrt.s    (533778)    // Small error = driving glitches happen differently.

c.lt      (29268876)  // Any change = breaks everything, but unlikely.
vone      (9659339)
neg.s     (3638275)
c.eq      (1284924)   // Any change = breaks everything, but unlikely.
abs.s     (562281)    // Small error = crash, unplayable... unlikely.
vneg      (276140)
vmidt     (180679)
vmmov     (62622)
vzero     (14834)

Hmm maybe vcrsp.t...

-[Unknown]


@unknownbrackets commented on GitHub (Jul 8, 2019):

It definitely is more accurate applying the same dot operation in vcrsp, though there's something odd happening with inf there. It affected Ridge Racer in probably a good way, but it still goes crazy a bit earlier than before.

-[Unknown]


@unknownbrackets commented on GitHub (Jul 9, 2019):

So, it's probably not sqrt.

I wrote a software sqrt, which matches vsqrt much better (sqrtf = exact match 3% of the time, vfpu_sqrt = exact match 84% of the time.) There was no change or improvement to the driving, though.

It could be hiding in the remaining 16% (seems to be a rounding issue, but I can't figure out the right logic for it), but I'd have expected some improvement if the accuracy mattered.

-[Unknown]


@unknownbrackets commented on GitHub (Jul 9, 2019):

Oops, had a stupid mistake disabling the sqrt. It does improve things. But it also mysteriously makes the game crash (well, it was before if it ran far enough without winning, but now it does it earlier...)

-[Unknown]


@unknownbrackets commented on GitHub (Jul 9, 2019):

Okay, sorry for the many comments. Found the bug (max_exp == 0 vs max_exp <= 0) causing the crash, so now this is the version that gets the farthest:

https://github.com/hrydgard/ppsspp/compare/master...unknownbrackets:vfpu-dot?expand=1

It still goes crazy eventually. Maybe it's the remaining 16% of sqrt - any ideas what might be wrong there? I tried rounding up or rounding even instead of masking, but maybe wrong...

-[Unknown]


@hrydgard commented on GitHub (Jul 9, 2019):

Cool. But I don't think Ridge Racer is going to suddenly be fixed 100% after a single instruction is fixed - it's clear that its "physics" simulation uses a lot of different instructions and any of them can introduce a tiny error, which will get amplified over time and cause the simulation to fall out of sync with the replay data. It's not even certain that a single precision fix will cause the simulation results to be closer to the real thing (although as we fix more things, that does get more likely). And we still don't force FTZ on for VFPU instructions, which we really should if we don't just software emulate them all.

Anyway, this is very good progress already even if Ridge Racer isn't fixed. Who knows what other games might be helped. Unfortunately this stuff is not easy to enable globally, for fear of slowdowns...


@unknownbrackets commented on GitHub (Jul 9, 2019):

Sure, of course. But there aren't that many instructions left unless it's FPU too. See the list. It's not like it uses sin/cos/etc. I assume Dissidia replays are affected by the same problem, but iirc they use a lot more VFPU instructions.

Also, there's some masking already applying FTZ in that branch. But if you look above, Ridge Racer isn't really sending any subnormals through most of these instructions anyway.

-[Unknown]


@hrydgard commented on GitHub (Jul 9, 2019):

Well there's vrot, vrsq and vdiv, and vsin and vcos are actually in the list you posted above? (actually never mind about the latter, I see you posted a revised list further down)


@ghost commented on GitHub (Jun 15, 2020):

The same thing happens in Ridge Racer 7 when played on RPCS3... The autodrive is also buggy, and I also found another bug: my saved replays are starting to bug out as well...


@hrydgard commented on GitHub (Jun 15, 2020):

Yeah, tiny, tiny math inaccuracies can result in this kind of thing, no surprise it happens on RPCS3 as well.


@ghost commented on GitHub (Jun 15, 2020):

I noticed that when I use a cheat that alters the car's performance in Ridge Racer, the AV Player CPU car's performance also changes. So if someone makes a cheat code that alters the cars' performance, we would probably have no algorithm bug...


@hrydgard commented on GitHub (Jun 15, 2020):

Nah, you can't conclude that. Your cheat will just be another input that will throw the algorithm off even more, while it's already definitely broken in other ways....


@ghost commented on GitHub (Jun 15, 2020):

I tried replicating the replays and I broke my fingers halfway on SR765...


@ghost commented on GitHub (Jun 21, 2020):

Actually, I managed to replicate half of the Seaside Route 765 CPU replay where you drive a Blue Raggio while racing the Angelus... I actually screwed up halfway, when I'm supposed to trigger the 2nd NOS... The Raggio drifted on a turn where I'm not supposed to drift, then the Angelus passed me... And since the Raggio is a Dynamic car, I can't control it properly. Also, when replicating the replays, you've got to be precise on the turns or the A.I. opponents will mess up your rhythm... Anyways, here are the 6 tracks with no CPU bugs whatsoever:

Seaside Route 765: https://www.youtube.com/watch?v=kQyHEo4S4wg
Sunset Drive: https://www.youtube.com/watch?v=LsFrQ9JJ9T4
Union Hill District: https://www.youtube.com/watch?v=CgpGzMnA_54
Crimsonrock Pass: https://www.youtube.com/watch?v=RURjK13Odgk
Midtown Expressway: https://www.youtube.com/watch?v=_iOCyYokMco
Greenpeak Highlands: https://www.youtube.com/watch?v=kydwDBr9MoA&t


@ghost commented on GitHub (Jun 25, 2020):

I tried to run Ridge Racer 6 on the Xenia emulator to test the AV Player, while hoping that it won't crash.. But despite Ridge Racer 6 just being an upgraded version of Ridge Racer PSP, I was surprised Ridge Racer 6 AV Player never bugged whatsoever... The course I played was called "Surfside Resort"...


@ghost commented on GitHub (Jul 7, 2020):

When I played this on Android, I noticed that JIT, IR Interpreter, and Interpreter execute the CPU autodrive differently, so the car behaves differently on each... Try listing the differences between those 3 CPU cores, and you might find that one mathematical error...


@ghost commented on GitHub (Jul 23, 2020):

The fact that it desyncs makes me sad because I kept watching those replays on my real PSP when the Wi-Fi died, during a brownout, or when I got bored after finishing the game... The desyncing kind of represents how this game series is being forgotten, since Ridge Racer 8 was never released, and the fact that the bug is still here represents that the game got left unfinished and forgotten....


@unknownbrackets commented on GitHub (Sep 27, 2020):

The Xbox 360 and PSP have different CPUs, which is why they have different problems.

The heart of this problem is math. Games use what are called "vector" or "SIMD" instructions to calculate math in speed-critical situations and 3D formulas. If you look here, the Xbox 360 CPU had special modifications to do dot products on the CPU faster:

https://en.wikipedia.org/wiki/Xbox_360_technical_specifications

To help you understand, let's say I was adding up these two numbers:

6628451234
984726456

Google says the result is 7613177690, which is probably accurate. But what if someone did it by hand, and got it wrong? What if they thought it was 7613177609? It's a small difference, but the small differences add up - like a "hyperspace jump" in slightly the wrong direction.

Some (but not all) of the PSP CPU's calculations were wrong - bad math. Crucially, unless we get the math wrong and get it wrong in exactly the same way - these replays won't play correctly.
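As a concrete illustration using one of the test vectors from earlier in this thread (1.0 plus three values of 2^-24, for which the PSP returned `3f800001` in either order), here is what ordinary sequential addition does on a host. `HostSumBits` is a hypothetical helper for this example, and the results quoted below assume the host uses default IEEE 754 round-to-nearest-even single precision.

```c++
#include <cstdint>
#include <cstring>

// Adds four floats left to right with ordinary host arithmetic and
// returns the bit pattern of the result.
uint32_t HostSumBits(const uint32_t bits[4]) {
    volatile float acc = 0.0f;  // volatile: round to float after every add
    for (int i = 0; i < 4; ++i) {
        float f;
        std::memcpy(&f, &bits[i], sizeof(f));
        acc = acc + f;
    }
    float result = acc;
    uint32_t out;
    std::memcpy(&out, &result, sizeof(out));
    return out;
}
```

With `{ 0x3F800000, 0x33800000, 0x33800000, 0x33800000 }` this gives `0x3F800000`, and with the 1.0 moved to the end it gives `0x3F800002`. Neither matches the `3f800001` the PSP produced for both orders, and a one-bit difference like that is all it takes for a replay to drift.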

Xenia probably doesn't have this problem because the Xbox 360 got high marks on its maths. Just like a modern PC or a phone, it can add, multiply, divide, and subtract correctly. So there's no need to simulate the errors.

Notably, it's probably the same reason again for RPCS3. The 7 SPEs also use inaccurate maths.

The reason these calculations were wrong? Most likely speed, power, or cost. Doing math correctly might've required more silicon, more battery juice, or might've made games run slower. These errors are at the hardware level and we don't fully understand them. We don't know exactly how the PSP calculates square roots, or what shortcuts it uses to get a close, but wrong, value.
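As a rough illustration of what a "close, but wrong" square root can look like, the sketch below stops Newton's method after a fixed number of steps. This is purely a hypothetical shortcut chosen for the example: as noted above, the PSP's real method is not known, and `rough_sqrt` is not meant to model it.

```python
import math

def rough_sqrt(value, iterations=2):
    """Approximate sqrt with a fixed number of Newton steps (hypothetical shortcut)."""
    if value <= 0.0:
        return 0.0
    # Crude starting guess: a power of two with half the binary exponent.
    _, exponent = math.frexp(value)
    guess = math.ldexp(1.0, exponent // 2)
    for _ in range(iterations):
        guess = 0.5 * (guess + value / guess)  # one Newton step
    return guess

for v in (2.0, 10.0, 0.3):
    print(v, rough_sqrt(v), math.sqrt(v))
```

The approximation is cheap and lands within a fraction of a percent of the true value, which is plenty for drawing a frame. For a replay, though, an emulator that returns the fully accurate result is just as "wrong" as one that returns a different approximation: only the exact same bits reproduce the recorded race.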

It's not that anyone doesn't care or wants to see the series dwindle by any means. Several people have spent hours debugging, working on, and trying to fix this very issue.

-[Unknown]


@ghost commented on GitHub (Mar 29, 2021):

PPSSPP 1.11.3. Issue still persists.


@ghost commented on GitHub (Sep 22, 2022):

I discovered something odd about the desyncing replays. It doesn't just break the pre-recorded replays, it can also break your own. If you save a replay and then update or downgrade PPSSPP to another version, that replay may bug out and desync like the pre-recorded ones. I have some replays saved on an old PPSSPP version, and on the current version the car just desyncs. Going back to an older version fixes the bug on some replays, and some of them somehow get fixed.


@unknownbrackets commented on GitHub (Sep 23, 2022):

This is because we've made some updates to improve accuracy in some CPU instructions. It hasn't been enough to make the pre-recorded missions play correctly, but it means that recordings from previous versions no longer play the way they used to.

This issue basically relies on specific and very accurate mathematical results, matching the same mathematical errors that the PSP CPU makes. Or at least, so we think.
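A small sketch of why "accurate" is not the same as "matching": the snippet below emulates single-precision arithmetic in Python and computes the same dot product with two different accumulation orders. The helper names `f32` and `dot_f32` are made up for this example, and it says nothing about which order the PSP or PPSSPP actually uses.

```python
import struct

def f32(value):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

def dot_f32(a, b, order):
    """Single-precision dot product, accumulating lanes in the given order."""
    total = 0.0
    for i in order:
        product = f32(f32(a[i]) * f32(b[i]))
        total = f32(total + product)  # round after every addition
    return total

a = [1e8, 1.0, -1e8]
b = [1.0, 1.0, 1.0]
left_to_right = dot_f32(a, b, order=[0, 1, 2])
big_terms_first = dot_f32(a, b, order=[0, 2, 1])
print(left_to_right, big_terms_first)  # prints "0.0 1.0"
```

Both results are legitimate single-precision answers to the same formula, yet they differ. If a replay was recorded with one behaviour, playing it back with the other changes the outcome, which is consistent with recordings from older PPSSPP versions breaking after accuracy improvements.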

-[Unknown]


@unknownbrackets commented on GitHub (Sep 27, 2023):

Just an idea that I thought of just now but have not pursued:

It could be that it isn't just accuracy, but that there's some actual bug in the math equation, and things work out as long as the replay replicates it because it's small. Specifically, I don't think anyone has ever checked whether there are any suspicious vector overlap cases. There's been evidence to suggest that unlike PPSSPP's code, the actual VFPU doesn't guarantee overlap safety in all cases (and when it does, it seems to do so by performing operations in reverse order.)

Probably not likely, but I have already tried flushing everything to zero, adjusting rounding modes, forcing things to the decently accurate vdot, etc.
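To show what an overlap case means in general terms, here is a toy sketch using a plain Python list as a register file. It is an illustration of the concept only, under the assumption that an operation is carried out lane by lane; the function names are hypothetical and nothing here reflects measured VFPU behaviour.

```python
def scale_lanes(regs, src, dst, count, order):
    """Write regs[dst + k] = 2 * regs[src + k], one lane at a time.

    regs is a toy register file; src and dst are start indices that may
    overlap. order is "forward" or "reverse".
    """
    lanes = range(count) if order == "forward" else reversed(range(count))
    for k in lanes:
        regs[dst + k] = 2 * regs[src + k]
    return regs

def scale_lanes_safe(regs, src, dst, count):
    """Overlap-safe reference: read every source lane before writing any result."""
    results = [2 * regs[src + k] for k in range(count)]
    regs[dst:dst + count] = results
    return regs

initial = [1, 2, 3, 4, 0]
safe = scale_lanes_safe(list(initial), src=0, dst=1, count=4)
forward = scale_lanes(list(initial), src=0, dst=1, count=4, order="forward")
reverse = scale_lanes(list(initial), src=0, dst=1, count=4, order="reverse")
print(safe, forward, reverse)
```

Here the destination sits one lane above the source. Going forward, each write clobbers a lane that is read on the next step, giving `[1, 2, 4, 8, 16]`, while going in reverse matches the safe result `[1, 2, 4, 6, 8]`. Reverse order only helps for this direction of overlap, so an emulator that always behaves like the safe version could differ from hardware that is only sometimes safe.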

-[Unknown]
