mirror of
https://github.com/librespot-org/librespot.git
synced 2026-04-27 00:05:55 +03:00
[GH-ISSUE #1490] librespot crashes when one of defined DNS servers is not working #674
Originally created by @povserok on GitHub (Apr 20, 2025).
Original GitHub issue: https://github.com/librespot-org/librespot/issues/1490
Description
Spotify Connect did not connect. In the logs I saw that the watchdog had detected a Spotify crash and restarted it. I enabled librespot logging and got the attached log when connecting to Spotify Connect. It was strange because some of the connections in the log worked and some did not. I then checked resolv.conf and noticed that the first DNS server provided by DHCP was incorrect (192.168.10.10, currently not working), while the second one was fine (192.168.10.1). I removed the first DNS server, and Spotify Connect worked.
Version
librespot 0.6.0 383a6f6 (Built on 2024-11-02, Build ID: E4uuQIK1, Profile: release)
Moodeaudio 9.3.2
How to reproduce
Set incorrect primary and correct secondary DNS, like:
Log
Host (what you are running librespot on):
Moodeaudio 9.3.2
RPI 3B+
Additional context
@kingosticks commented on GitHub (Apr 20, 2025):
We'd need to see an actual crash in the log to take any action here.
@povserok commented on GitHub (Apr 20, 2025):
@kingosticks commented on GitHub (Apr 22, 2025):
Right ok, so there's no crash shown in the first log because there's no actual "crash" occurring at all. This is just librespot exiting with error code 1 because the system is badly configured.
This seems reasonable behaviour to me. What would you prefer librespot to do when the network is broken and it cannot reliably connect?
@povserok commented on GitHub (Apr 22, 2025):
I assumed it was a crash because of the word 'crash' in this line:
20250420 113527 watchdog: Started Spotify Connect after crash detected
I don't think the network is "broken." There are almost always at least two DNS servers configured for redundancy. If the primary DNS server goes down, the system should simply fall back to the secondary one. I've had this misconfigured setup running for several months on multiple PCs, Macs, and phones, and all networking has worked fine; otherwise I'd have noticed sooner.
So this may not be a librespot issue at all, but rather a problem in the underlying connectivity library or the operating system itself.
@kingosticks commented on GitHub (Apr 22, 2025):
Perhaps it's related to the different ways we connect to HTTP servers and AP (access point) servers. Your log shows it takes ~3 seconds for each of the first two HTTP requests to work; they use Hyper's HTTP client. There must be a 3 second timeout somewhere before it tries the secondary DNS server. Our Spotify AP connections are different: we use Tokio sockets directly and enforce a specific 5 second timeout before giving up and trying the next AP. Before we implemented this timeout we would get stuck on broken servers. A 5 second timeout seemed reasonable (to me) but it was pretty arbitrary.
I agree with you: the AP connections should have used the secondary DNS server, and it appears they didn't. I'd have assumed the DNS timeout and retry were done by the OS, and the timeout looks to be 3 seconds, so our 5 seconds should have been enough. Maybe that's a bad assumption.
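The AP strategy described in this comment (a fixed timeout per attempt, then move on to the next server) can be sketched roughly as follows. This is a Python asyncio illustration, not librespot's Rust/Tokio code; `connect_first`, its `connect` parameter and any host names passed to it are hypothetical. One detail worth noting: `asyncio.open_connection` resolves the host name inside the timed call, so in this sketch a slow DNS lookup counts against the same budget as the TCP handshake.

```python
import asyncio


async def connect_first(aps, timeout=5.0, connect=asyncio.open_connection):
    """Try each (host, port) in order, abandoning an attempt after `timeout` seconds."""
    errors = []
    for host, port in aps:
        try:
            # Name resolution and the TCP handshake share this single budget.
            conn = await asyncio.wait_for(connect(host, port), timeout)
            return (host, port), conn
        except (asyncio.TimeoutError, OSError) as exc:
            # Remember why this AP failed, then fall through to the next one.
            errors.append(((host, port), exc))
    raise ConnectionError(f"no access point reachable: {errors}")
```

With a per-attempt timeout that is shorter than the time the resolver needs to fall back to a working nameserver, every attempt in the loop fails the same way, which would match the behaviour reported here.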
@povserok commented on GitHub (Apr 22, 2025):
I’m not sure.
What's odd is that, for example, the pre-configured radio stations streamed just fine (they're saved by domain name, not by IP), while Spotify Connect didn't work at all, as though it were using a different DNS resolver. For example, AFAIK Firefox doesn't use the system-level DNS resolver.
Note that the incorrect DNS server was listed first in resolv.conf. As far as I know, Linux nameservers are used in the order they’re defined.
@kingosticks commented on GitHub (Apr 23, 2025):
I'm not sure I understand, is this something outside of librespot? Some other moode feature?
librespot just uses the normal libc DNS stuff. The DNS servers will be used in the order they are specified. There are some options in resolv.conf to change the timeout and number of retries. I think the default timeout is 5 seconds, not 3, so that's a bit odd. Maybe my theory is wrong. I think we need to employ Wireshark and trace exactly what's going on; I'll give it a go at the weekend. Maybe there's something we can improve here around the AP connections.
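For reference, the resolv.conf settings in question are the `timeout:n` and `attempts:n` entries on an `options` line; the resolv.conf(5) man page gives defaults of 5 seconds and 2 attempts. A small Python sketch for reading them from a file's text (the `read_resolver_settings` helper is hypothetical and only handles these fields):

```python
def read_resolver_settings(text):
    """Extract nameservers and the timeout/attempts options from resolv.conf text."""
    settings = {"nameservers": [], "timeout": 5, "attempts": 2}  # documented defaults
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0][0] in "#;":
            continue  # blank line or comment
        if fields[0] == "nameserver" and len(fields) > 1:
            # Order matters: servers are tried in the order they are listed.
            settings["nameservers"].append(fields[1])
        elif fields[0] == "options":
            for opt in fields[1:]:
                name, _, value = opt.partition(":")
                if name in ("timeout", "attempts") and value.isdigit():
                    settings[name] = int(value)
    return settings
```

Calling it on the contents of `/etc/resolv.conf` shows which timeout actually applies on a given machine, which is the number the 5 second AP timeout has to be compared against.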
@sierra-alpha commented on GitHub (May 3, 2025):
Just adding another data point of this issue. I originally thought it was this https://github.com/librespot-org/librespot/issues/1481#issuecomment-2781062014 but I never got it to reconnect.
Similar to povserok, I had a primary DNS server in /etc/resolv.conf down, and network functionality has been fine for everything except librespot. For what it's worth, I'm using it through snapserver (snapcast).
I've also had this experience:
This was also the case for me:
Removing the temporarily down DNS server from resolv.conf is an acceptable workaround for me, and I'll also investigate to see if I can come up with a timeout value that works with the
otherwise maybe I'll investigate upping the timeout in librespot to see if it can just work™ by default on my system (Debian 12, AMD64)
@sierra-alpha commented on GitHub (May 3, 2025):
Ahh, it seems Debian defaults to 5 seconds for the timeout.
From the docs (emphasis mine):
@sierra-alpha commented on GitHub (May 3, 2025):
In my resolv.conf, setting the following did indeed work (proving @kingosticks' assumption that 5 seconds would be enough if the DNS timeout was 3s and only the first DNS server was not working; with more DNS servers not working, it would take longer to loop through each server and time out on them):
I wonder if we could increase librespot's arbitrary timeout to maybe 12s (allowing for 2 default timeouts, which is also pretty arbitrary). I have no idea what the downside of making this value too large is, apart from this comment:
I'm happy to have a crack at contributing/submitting a PR if increasing the timeout is something the project thinks is worth doing?
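The arithmetic in this comment can be sketched with a deliberately simplified model: assume the resolver waits one full DNS timeout on each unresponsive nameserver before trying the next, and that the connection attempt only succeeds if that delay fits inside the per-AP timeout. Real resolvers add retries and other details, and the function name is hypothetical.

```python
def lookup_fits(dead_servers_first, dns_timeout, connect_timeout):
    """True if DNS fallback completes before the connection attempt is abandoned."""
    # Simplified: each dead nameserver listed first costs one full DNS timeout.
    dns_delay = dead_servers_first * dns_timeout
    return dns_delay < connect_timeout


# One dead server, 5 s DNS timeout, 5 s AP timeout: no time left to connect.
assert lookup_fits(1, 5, 5) is False
# Lowering the DNS timeout to 3 s leaves room within the 5 s AP timeout.
assert lookup_fits(1, 3, 5) is True
# A second dead server uses up the budget again.
assert lookup_fits(2, 3, 5) is False
# A 12 s AP timeout covers two default 5 s DNS timeouts.
assert lookup_fits(2, 5, 12) is True
```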
@sierra-alpha commented on GitHub (May 3, 2025):
Or perhaps it is actually a good idea to (from the original PR):
@kingosticks commented on GitHub (May 3, 2025):
Personally I think adding extra code/config to work around broken systems isn't worthwhile, but that's just my opinion.
@roderickvd commented on GitHub (May 3, 2025):
I agree with @kingosticks. I've been down that road some time ago with pleezer and it just wasn't worth it. The moment you think you've worked around one DNS configuration quirk, you run into another.
@sierra-alpha commented on GitHub (May 4, 2025):
TLDR:
librespot to support broken systems is undesirable; however, this problem can occur for many reasons, not just a librespot server's config, but also transient upstream DNS server issues.
librespot's AP arbitrary timeout too large?
I 100% agree with this statement:
But I'm thinking beyond broken systems. I'm thinking about generalising this issue, robustness for librespot, and avoiding the librespot project getting unneeded issues like this raised in the future. I think it was a mistake for me to quote the section that used the term mis-configured for the resolv.conf in my original message on this thread, as DNS timeouts can happen for reasons beyond the librespot server's DNS config and for other issues that need a retry.
DNS is a service that can have timeouts for a myriad of reasons. In my case, on the server running librespot there was nothing wrong with my resolv.conf config; there was an issue on the primary DNS server it was pointing to, but I happen to know that because I control that server. It is a feature to be able to provide multiple nameservers in DNS so that you still have connectivity when a name server is down, among other reasons such as specifying the closest DNS server. For example, on my gateway router I specify the ISP's DNS server as the primary because it gives the closest/fastest response, and then Google's 8.8.8.8 as a secondary for the times when my ISP's DNS server is having issues. This makes me more robust to DNS server issues.
The internet is a scary place and routers drop packets all the time. DNS timeouts can occur in transient ways for reasons beyond a system's resolv.conf settings.
I think it would be reasonable to at least increase librespot's timeout to a value above the default DNS timeout on Linux systems. I've tried to find any system that has a default DNS timeout of 3 seconds and can't; I'm happy to be pointed to documentation though.
From this comment by @kingosticks it seems that retrying at least once and an algorithm of default DNS timeout plus 2 seconds is desired. In this case that would mean 7s. I'm also suggesting that perhaps 2 timeouts plus 2 seconds may be desirable, hence the 12s I suggested in an earlier message. Even better would be to let the OS handle the timeout and use its own defaults (if that is even possible) so this would work on any system it was running on (but maybe that's a bigger change, so I'm not suggesting that as a solution for now).
If the default timeout is found to be different on other systems, then that is a case for making librespot's timeout configurable by the end user.
I'm also interested in what the impacts of having this arbitrary AP timeout too large are. Unfortunately I'd have to read a lot of code to understand that, so I'm relying on somebody to make me aware of what the problem with having librespot's AP timeout too large would be.
Failing any of the above, there should at least be some tips around this in a FAQ or elsewhere in the docs.
I'm happy to pull together a PR for increasing librespot's AP timeout, making it customisable by the end user, and/or adding a FAQ/Troubleshooting entry to the docs.
Also happy to jump on a call to discuss any of this, as sometimes these convos can come across the wrong way via text. I'm really thankful for this project and all the work that goes into it and would like to help if I can.
@kingosticks commented on GitHub (May 4, 2025):
In the original post above I was speculating about a 3 second timeout based on the log we got. I said the 3 seconds in the log didn't match the documented Linux default of 5 seconds. We never worked out why there was a discrepancy. I am still curious why the earlier HTTP requests seem to work differently; I didn't understand that and I didn't look further.
A long timeout on each AP connection attempt can make starting a Spotify session slow. With a long-lived session created at startup that's likely OK but librespot is a library and not all use-cases are like that. I primarily use librespot within gst-plugin-spotify where the architecture requires a session per song and slow session connection means playback starts slowly and it's noticeable. I also use it behind a corporate firewall where some ports are blocked, causing APs using those ports to get stuck during their connection. So I have been relying on the short timeout, but I appreciate this use-case is niche. Maybe a mechanism to better control/filter APs would be nicer but that's another issue.
In general, applications should be able to avoid thinking too much about this stuff and let lower levels handle this correctly and consistently. Our case here is a little bit different, we've been given a list of unreliable AP servers to try, they can even have perfect connectivity but still be unusable as an AP (returning a Spotify error code). Failing fast is the right thing to do in this context but you're right, we also need to go slow enough for IP connections to (appear to) be reliable. So I guess the balance here isn't quite right. However, just fiddling the numbers a bit to avoid this particular OS default just moves the problem.
So yeh, we could again revisit making the timeout an optional parameter, keeping a short default, and using a longer value in the librespot binary. Or the opposite, I guess (and make it a breaking change). Just keep in mind the algorithm we currently have wasn't focused at all on supporting user DNS problems, it was to find a working Spotify AP from the list assuming a working network. Most of the time the same thing will work for both but be careful on doing anything too specific for the secondary case.
I'd still prefer not to expose an extra config option to librespot binary users. Just having it configurable at the library level is far simpler than spending our cycles trying to come up with a single algorithm that works for all cases.
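To put the trade-off from this comment in numbers: if every stuck AP costs one full timeout before the next is tried, the worst-case wait before a session starts grows linearly with the timeout. The Python sketch below models a library-level setting with a short default and a longer value chosen by the binary. `SessionConfig` and `ap_connect_timeout` are hypothetical names, not librespot's API, and the 12 second figure is only the value floated earlier in this thread.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SessionConfig:
    # Short default, suited to library users who open sessions frequently.
    ap_connect_timeout: float = 5.0

    def worst_case_wait(self, stuck_aps: int) -> float:
        """Seconds spent on stuck APs before a working one is even tried."""
        return stuck_aps * self.ap_connect_timeout


LIBRARY_DEFAULT = SessionConfig()
# A long-lived binary could opt in to a more DNS-tolerant value.
BINARY_DEFAULT = replace(LIBRARY_DEFAULT, ap_connect_timeout=12.0)
```

With three stuck APs, the short default costs up to 15 seconds and the longer one up to 36 seconds, which is the cost a session-per-song use case would pay on every track.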
@sierra-alpha commented on GitHub (May 4, 2025):
Ahh sorry I missed that about the log above showing the 3s timeout.
Okay that makes sense to me, so we have a situation where some library use cases need the timeout to be as quick as possible and others as robust/retryable as possible. And also where the failure modes of the AP aren't always obvious in a programmatic sense. I'll have a think a bit more about any other ideas but at this stage I'm thinking adding something to the docs to deal with DNS issues like mine might be the best scenario (or even this ticket if people search for it may be enough).
This is a good hint:
At a guess it's regular TCP connections versus websocket connections?
@kingosticks commented on GitHub (May 4, 2025):
They should be regular http connections, not even websocket. But they do use a library and maybe that library has some different handling somewhere. I'm not sure it is that interesting unless you're really interested!