mirror of
https://github.com/devgianlu/go-librespot.git
synced 2026-04-26 05:15:49 +03:00
[GH-ISSUE #88] Shannon connection is not re-established on failure #62
Originally created by @aykevl on GitHub (Sep 10, 2024).
Original GitHub issue: https://github.com/devgianlu/go-librespot/issues/88
Sometimes the Shannon connection (whatever that is) breaks, and go-librespot becomes unusable:
Here is another error:
(this time without the "did not receive last pong from dealer" messages, though it did start to spew those messages when I tried, and failed, to connect to the go-librespot instance).
I inserted some debugging code, and the stack trace where this error is generated in both cases looks like this:
Looking at the code (AccessPoint.recvLoop) it looks like it is intended to re-establish the connection when an error occurs, but clearly it doesn't for me.
@devgianlu commented on GitHub (Sep 10, 2024):
I have observed this myself too, it was working fine before, I think it's a regression introduced by https://github.com/devgianlu/go-librespot/pull/69. I did not debug this yet because I am not able to reproduce it consistently, it just happens randomly on my device.
@aykevl commented on GitHub (Sep 10, 2024):
Can you describe how this is supposed to work? Then I might be able to develop a fix (I normally have a Spotifyd/librespot instance running 24/7 so I hit this frequently).
@devgianlu commented on GitHub (Sep 10, 2024):
There are two components called accesspoint (AP) and dealer that are persistent connections (TCP and Websocket). They both rely on a ping/pong (120 seconds for AP, 30 for dealer) mechanism to keep the server/client connected.
The idea behind that code is that if no ping/pong is received then the connection is closed causing the recv loop to fail and start reconnecting.
I think this is somewhat related to #48.
@aykevl commented on GitHub (Sep 19, 2024):
Investigated the issue a bit more.
This time the issue started exactly when I tried to connect to it from the Android app. It showed up in the list, I clicked it, and then it disappeared, and after a bit (probably 60s) it started logging those "did not receive last pong from dealer" error messages.
Here the error message is generated, along with the code that's supposed to close the connection:
github.com/devgianlu/go-librespot@b95e6acdcc/dealer/dealer.go (L134-L139)
The recvLoop meanwhile is stuck here:
github.com/devgianlu/go-librespot@b95e6acdcc/dealer/recv.go (L236-L237)
So apparently once the connection breaks it gets stuck in this channel receive operation? I'll add some debugging code to try to figure out why it gets stuck there.
@devgianlu commented on GitHub (Sep 21, 2024):
That's interesting, I've been looking at the wrong place for some time. I wonder where AppPlayer.Run is stuck then, since it should always reply to the dealer requests.
@aykevl Are you using dlv attach to debug the client when it breaks?
@aykevl commented on GitHub (Sep 21, 2024):
No, I simply sent the process a SIGABRT when it got stuck and pulled out the stack trace of the right goroutine.
(SIGABRT will print the stack trace of every goroutine that's active).
@aykevl commented on GitHub (Sep 21, 2024):
This is the AppPlayer.Run stack trace (that I still had in my terminal history):
So apparently it's stuck here:
github.com/devgianlu/go-librespot@9e41b6cb7d/audio/provider.go (L110)
@aykevl commented on GitHub (Sep 21, 2024):
Going further down the rabbit hole, here is the stack trace for the KeyProvider:
So it's just waiting in the select. Looks like the response there was just never sent from Spotify, so the request never got something on the response channel?
I'm very much out of my depth here, but I hope this gives some ideas how to fix the issue.
@devgianlu commented on GitHub (Sep 21, 2024):
Surely there is potential for a bug in KeyProvider, because if the connection breaks while making the request, no response is ever sent back and it will just wait forever. I have added a timeout of 15 seconds with 117b929858, let's see if that helps with this too.
@aykevl commented on GitHub (Sep 22, 2024):
Thanks! I'll be trying this patch, see if it improves anything. I should know whether it helps in a day or so.
@aykevl commented on GitHub (Sep 22, 2024):
The fix appears to be working! When I connected after a few hours I got "authenticated Login5 as (...)" which I guess means the connection was re-established. I'll monitor the daemon some more and will let you know if there's any issue.
I normally run a Spotify daemon (go-librespot or otherwise) 24/7 and use it frequently, so I should easily find issues that only happen after running for a while.
@devgianlu commented on GitHub (Sep 23, 2024):
While the issue with the "deadlock" is fixed, I seem to be still having issues:
All seems good, but the device disappears from Spotify Connect.
@aykevl commented on GitHub (Sep 24, 2024):
Yeah, I'm also still hitting problems. Just slightly different ones.
Here is one:
Here is another:
I'm not exactly sure what happened the second time, but it seems to have triggered something on Spotify's side. When it automatically restarted the process, it could not authenticate anymore:
I had to reset my password to use Spotify again. Maybe unrelated, but it seems like a weird coincidence.
@aykevl commented on GitHub (Sep 25, 2024):
Somewhat unrelated, but I got this error in my log at some point and found it interesting:
go-librespot was already kinda stuck due to previous errors so I don't know whether it would normally have recovered from this.
@devgianlu commented on GitHub (Sep 26, 2024):
I know one person that had to do it, but in 5 years of messing with Spotify stuff I don't think it ever happened to me. Strange.
Anyway, the daemon is a lot more stable for me. Do you mind if we close this and create new issues dedicated to each specific problem? @aykevl
@aykevl commented on GitHub (Sep 26, 2024):
🤷‍♀️ As long as it doesn't keep happening.
It doesn't seem fixed to me? Or did I miss something?
github.com/devgianlu/go-librespot@117b929858 changes things, but doesn't seem to fix the issue. The error message is different but the Spotify Connect connection still breaks. I suspect the bug isn't fully fixed yet.
I also made a quick patch where I retried sending the message on timeout, but the second attempt also timed out. So whatever the problem is, it doesn't fix itself after the timeout.
@devgianlu commented on GitHub (Sep 27, 2024):
Ah ok, didn't think there were still problems with AP specifically. Perhaps it's better to track AP and Dealer bugs separately.
@aykevl commented on GitHub (Sep 28, 2024):
Maybe when the dealer connection breaks, it also needs to disconnect and reconnect the AP?
I'm going to test this theory now, see if it improves anything.
@devgianlu commented on GitHub (Sep 28, 2024):
I think I found the culprit for the double AP timeout: when reconnecting, the AP would throw away all the registered receivers, making the KeyProvider listen on a one-ended channel.
Should be fixed with 516de6899c.
@aykevl commented on GitHub (Sep 29, 2024):
Yes! I was really close to figuring it out, you found it slightly earlier :)
I noticed that an AES response packet was received, but somehow the timeout still happened. So I was suspecting the AP somehow lost the channel to the KeyProvider, or used the wrong channel or something. The bug you found is certainly consistent with that.
I'll test the patch and let you know whether it fixes the issue for me!
@aykevl commented on GitHub (Sep 29, 2024):
So far holding steady (it would usually have errored out by now).
@aykevl commented on GitHub (Sep 30, 2024):
Still running after 38 hours 💪 even surviving the Spotify outage yesterday (though of course it wasn't usable during the outage). I'd say this bug is fixed.
Notably, the timeout in github.com/devgianlu/go-librespot@117b929858 was never triggered during this period. Apparently it managed to reconnect within 15 seconds each time.
@linknum23 commented on GitHub (Nov 20, 2024):
One of AmpliPi's customers appears to be having this issue with go-librespot_linux_armv6_rpi.tar.gz running on a Pi3 compute module with a similar log
@aykevl commented on GitHub (Nov 21, 2024):
What I did was send the process a SIGABRT, which causes it to print the stack trace of every goroutine, and look for ones that seem to be blocked, especially those in a "channel recv" state or similar (where that isn't the intention).
Perhaps it's better to create a new issue? Because this issue was already solved a while ago.
@devgianlu commented on GitHub (Nov 21, 2024):
Agree, I thought we eradicated this bug once and for all (sigh).
@linknum23 Can you open a new issue? Including the stacktraces would be a real bonus.