mirror of
https://github.com/librespot-org/librespot.git
synced 2026-04-27 08:15:50 +03:00
[GH-ISSUE #524] Feature Request: More Spotify like volume normalization #334
Originally created by @JasonLG1979 on GitHub (Sep 20, 2020).
Original GitHub issue: https://github.com/librespot-org/librespot/issues/524
Librespot already has volume normalization, which I would assume (hopefully) follows the ReplayGain spec, since that's what Spotify uses. But unlike Spotify, it seems to use gain reduction as its clipping-prevention method, whereas Spotify uses limiting. There also seems to be nothing in the librespot docs about how to approximate Spotify's 3 different volume normalisation options.
From what I can tell, to approximate the 3 Spotify volume options the args are:

Loud:
--enable-volume-normalisation --normalisation-pregain 6

Normal (Default):
--enable-volume-normalisation --normalisation-pregain 3

Quiet:
--enable-volume-normalisation --normalisation-pregain -5

The problem is that with gain reduction as the clipping-prevention method, setting a positive pregain value basically breaks volume normalization. A drop in the pregain of a track that would clip can make for a huge drop in perceived volume compared to other tracks.
What I would like to see is a choice of clipping-prevention methods: one being a limiter like what Spotify uses (threshold -1 dB, attack 5 ms, release 100 ms; for bonus points you could make it a look-ahead limiter so the attack would be 0), and the other being the current gain-reduction method.
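To make the request concrete, here is a minimal sketch of the kind of limiter described above (threshold -1 dBFS, 5 ms attack, 100 ms release, non-look-ahead), assuming 44.1 kHz f32 samples. This is an illustrative one-pole envelope design, not librespot code and not Spotify's actual implementation.

```rust
// Minimal feed-forward peak limiter sketch (illustrative only):
// threshold -1 dBFS, 5 ms attack, 100 ms release, f32 samples in [-1.0, 1.0].
struct Limiter {
    threshold: f32,     // linear amplitude of the -1 dBFS ceiling (~0.891)
    attack_coeff: f32,  // per-sample smoothing toward more gain reduction
    release_coeff: f32, // per-sample smoothing back toward unity gain
    envelope: f32,      // smoothed gain factor (1.0 = no reduction)
}

impl Limiter {
    fn new(sample_rate: f32) -> Self {
        Limiter {
            threshold: 10f32.powf(-1.0 / 20.0), // -1 dBFS as linear amplitude
            // One-pole coefficients: reach ~63% of target within attack/release time.
            attack_coeff: (-1.0 / (0.005 * sample_rate)).exp(),
            release_coeff: (-1.0 / (0.100 * sample_rate)).exp(),
            envelope: 1.0,
        }
    }

    fn process(&mut self, sample: f32) -> f32 {
        // Target gain: unity below threshold, threshold/|x| above it.
        let peak = sample.abs();
        let target = if peak > self.threshold { self.threshold / peak } else { 1.0 };
        // Smooth fast when clamping down (attack), slow when recovering (release).
        let coeff = if target < self.envelope { self.attack_coeff } else { self.release_coeff };
        self.envelope = coeff * self.envelope + (1.0 - coeff) * target;
        sample * self.envelope
    }
}

fn main() {
    let mut lim = Limiter::new(44_100.0);
    // A sustained full-scale burst gets pulled toward the -1 dBFS ceiling...
    let mut out = 0.0f32;
    for _ in 0..1_000 {
        out = lim.process(1.0);
    }
    assert!(out > 0.8 && out <= 0.95);
    // ...while quiet material returns to (nearly) unity gain after release.
    for _ in 0..200_000 {
        out = lim.process(0.1);
    }
    assert!((out - 0.1).abs() < 0.001);
    println!("limited loud sample settles near {out}");
}
```

A look-ahead variant would delay the audio by the attack time and feed the gain computer from the undelayed signal, letting the attack effectively be 0 as suggested above.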
It would also be nice to have a set of args that would directly map to the 3 Spotify presets, applying the appropriate pregain and using the limiter.
For reference:
Here is the ReplayGain spec:
http://wiki.hydrogenaud.io/index.php?title=ReplayGain_specification
This explains Spotify's definition of volume normalization and the specs of their limiter:
https://artists.spotify.com/faq/mastering-and-loudness#what-is-loudness-normalization-and-why-is-it-used
This explains the volume normalization options in the official clients:
https://artists.spotify.com/faq/mastering-and-loudness#can-users-adjust-the-levels-of-my-music
@JasonLG1979 commented on GitHub (Sep 20, 2020):
I'm also curious whether the audio processing is currently done in 16 bit? I ask because I notice that the output of librespot is 16 bit. Lowering the gain of 16-bit audio by several dB with ReplayGain throws away bits. It would be advantageous, audio-quality-wise, to do the audio processing in 24- or 32-bit mode and, if the sound card will accept it, output that directly, or otherwise truncate to 16 bit.
@JasonLG1979 commented on GitHub (Sep 20, 2020):
Not accounting for noise shaping and other tricks, 16 bits gets you a theoretical dynamic range of 96.33 dB and 24 bits gets you 144.49 dB, so if you did processing in 24-bit mode you could lower the gain by up to 48.16 dB before you had to start throwing away bits.
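The figures above follow from the standard formula for the theoretical dynamic range of N-bit PCM, 20·log10(2^N) ≈ 6.02·N dB, which a few lines of Rust can verify:

```rust
// Theoretical dynamic range of N-bit PCM: 20 * log10(2^N) ~= 6.02 * N dB.
fn dynamic_range_db(bits: u32) -> f64 {
    20.0 * (2f64.powi(bits as i32)).log10()
}

fn main() {
    let dr16 = dynamic_range_db(16); // ~96.33 dB
    let dr24 = dynamic_range_db(24); // ~144.49 dB
    println!("16-bit: {dr16:.2} dB, 24-bit: {dr24:.2} dB, headroom: {:.2} dB", dr24 - dr16);
    assert!((dr16 - 96.33).abs() < 0.01);
    assert!((dr24 - 144.49).abs() < 0.01);
    // 8 extra bits buy ~48.16 dB of attenuation before bits are lost.
    assert!((dr24 - dr16 - 48.16).abs() < 0.01);
}
```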
@JasonLG1979 commented on GitHub (Sep 22, 2020):
Another option, of course, is to do the gain reduction in hardware for sound cards that have a hardware volume control, because that's basically all that your implementation of volume normalization does: turn the volume up and down. The ReplayGain spec mentions that as an option. Doing it in hardware at least doesn't throw bits away.
@sashahilton00 commented on GitHub (Sep 24, 2020):
@JasonLG1979 iirc the audio is processed in 16 bits, since the Spotify files are 44,100 Hz, 16-bit. If you want to examine the processing logic and potentially change it to 24/32 bit, that could be worth having. My concern is that it would probably want a usage flag, as I imagine 32-bit processing will put a strain on some of the more memory-constrained devices that librespot supports.
Hardware based normalisation would be good to have, ideally we would just offload to the hardware where possible, otherwise fallback to a software implementation
@JasonLG1979 commented on GitHub (Sep 25, 2020):
That's not how lossy audio like Vorbis works. The source file may have been 16 bit, but in the process of converting it, it was transformed into the frequency domain, sorta like converting PCM to PWM. Lossy audio does not have a bit depth; the bit depth of the resulting PCM is decided by the decoder. I would think the decoder does its work internally in at least 32-bit float, if it's worth a crap anyway.
It would use more memory but not really any more CPU; all you're doing is bit shifting. If the decoder won't output anything but 16 bit, you basically just pad the bottom 8 or 16 bits with zeros and then do your gain adjustment just like before, except now you're not throwing away bits.
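The padding-then-gain idea can be sketched in a few lines of Rust. This is an illustration of the technique being proposed, not librespot's actual code; the helper names are made up for the example.

```rust
// Sketch of the padding idea: widen a 16-bit sample into the top half of an
// i32, then apply gain in the widened domain so attenuation eats the zero
// padding instead of real audio bits. Illustrative only, not librespot code.
fn widen(sample: i16) -> i32 {
    (sample as i32) << 16 // bottom 16 bits become zero padding
}

fn apply_gain(sample: i32, gain: f64) -> i32 {
    // For illustration; a real implementation would also clamp and dither.
    ((sample as f64) * gain) as i32
}

fn main() {
    let s: i16 = 12_345;
    let wide = widen(s);
    let halved = apply_gain(wide, 0.5);
    // Halving 12345 directly in 16 bits would truncate (12345 / 2 = 6172),
    // but in the widened domain the halved value is still exact:
    assert_eq!(halved, 12_345i32 << 15);
    // Only a final truncation back to 16 bits would discard that precision:
    assert_eq!((halved >> 16) as i16, 6_172);
    println!("half of {} in 32-bit: {halved}", wide);
}
```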
That would also have to imply fixed or softvol volume, as in librespot is the only thing that should be turning the hardware volume up or down.
@JasonLG1979 commented on GitHub (Sep 26, 2020):
Only outputting 16 bit also affects the quality of librespot's software volume implementation. The same thing happens when you turn the volume down in 16-bit mode: you're throwing away bits. It would be nice to have "lossless" software volume control also.
Turning S16_LE into S32_LE would be trivial, I would think, since i32 is a native Rust data type. It should give you more than enough space to lower the volume to below the physical noise floor of a device before you have to start throwing away bits, even with gain adjustment. S24_LE and S24_3LE might be a little tricky though.
@roderickvd commented on GitHub (Feb 21, 2021):
Interested in this point, I dug around the source code.
It seems that gain normalization is applied in 32 bit, then converted to 16 bit output:
github.com/librespot-org/librespot@7f705ed148/playback/src/player.rs (L1098)
Same for the software volume control:
github.com/librespot-org/librespot@7f705ed148/playback/src/mixer/softmixer.rs (L42)
For the ALSA sink this is even done in 64 bit:
github.com/librespot-org/librespot@7f705ed148/playback/src/mixer/alsamixer.rs (L168)
So while this does not answer your feature request, at least the volume controls seem to be in HQ order!
@JasonLG1979 commented on GitHub (Feb 21, 2021):
No. The audio is spit out as 16-bit 44.1 kHz by the decoder and then processed. Converting a 16-bit int into a 32-bit float, doing some math on it, and then converting it back to a 16-bit int is in no way HQ, and you gain nothing. You're still throwing away bits. Best case you're wasting time converting back and forth; worst case you're introducing rounding errors/distortion converting an int to a float and back to an int.
The solution is to do the gain normalization in 24 or 32bit (or 64bit or whatever the decoder natively works in) during the decoding process and just leave it 24 or 32bit. That way you can still fit the whole 16bits inside the 24/32bits with room for gain normalization without throwing away bits.
@roderickvd commented on GitHub (Feb 21, 2021):
You are absolutely right. Blame me for missing that glaring point at such a late hour. Output should remain at high bit depth after processing, not be cast back to 16 bit.
@sashahilton00 commented on GitHub (Feb 22, 2021):
Feel free to create a PR if you want to/have time to. I'd be curious to see if the difference is noticeable or if this ends up as more of a case of 'doing it properly'.
@JasonLG1979 commented on GitHub (Feb 22, 2021):
It might be a while. I'd need to learn rust.
The difference would certainly be measurable, I would think, but as with all things audio, depending on the person and/or audio gear it may or may not be perceivable. IMHO it never hurts to do things right though.
@roderickvd commented on GitHub (Feb 22, 2021):
Why 24 bit resolution matters for volume control and normalization is described here: http://archimago.blogspot.com/2019/02/musings-why-bother-with-24-bit-dacs.html
Dialing the volume down to -25 dB in 16 bit decreases dynamic range from 98.9 dBA (CD quality) to 73.7 dBA (3.7 dB higher than vinyl). In comparison, doing the same in 24 bit pretty much maintains CD quality at 96.6 dBA. This is within the 120 dB dynamic range of human hearing and so practically observable.
I am enthusiastic about investing my time in this for the ALSA and Rodio backends. It would mean:
For 24 bit output, the following looks promising: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=3d233fedc8ed595a1e88e815d23cd009
Is this something of interest?
@ashthespy commented on GitHub (Feb 22, 2021):
I was under the impression that the ogg stream from Spotify was encoded in 16-bit 44.1 kHz to begin with? Or do I misunderstand?
I am not well versed with these things -- but now that we use a new version of lewton, it lets you prescribe what format you want to read the samples out as. So you could use read_dec_packet_generic instead of read_dec_packet_itl:
github.com/librespot-org/librespot@ed20f357dc/audio/src/lewton_decoder.rs (L32-L34)
@roderickvd commented on GitHub (Feb 22, 2021):
That's true, it's encoded at 16 bit 44.1 kHz, so that gives a dynamic range of 96.3 dB at 0 dBFS. Now if you go under 0 dBFS (such as when attenuating volume or applying negative replay gain) you are adjusting the magnitude of the encoded wave. For every 6 dB of attenuation you lose 1 bit.
Intuitively: at one point the signal is encoded as 65535 (maximum amplitude, treating the sample as unsigned for illustration). This is encoded as 1111 1111 1111 1111. Now you halve the volume. The signal should then be 32767 (half amplitude), encoded as 0111 1111 1111 1111. You have just lost one bit of the information needed to reconstruct the same signal.
This can be circumvented by taking the 16-bit Ogg Vorbis stream, padding it with 8 or 16 zeros to 24 or 32 bit, then doing volume control and normalization on it and keeping it at that bit depth. You then have 48 or 96 dB of extra headroom, respectively, to do volume control in without losing dynamic range.
Staying with the example, 1111 1111 1111 1111 padded to 32 bit is 1111 1111 1111 1111 0000 0000 0000 0000. Halving the volume makes it 0111 1111 1111 1111 1000 0000 0000 0000. No more information lost.
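The bit patterns in this example can be checked in a few lines of Rust (illustrative only, using the unsigned interpretation from the example above):

```rust
// Halving in 16 bits discards the low bit; halving the 32-bit padded
// version keeps all the information. Unsigned, as in the example above.
fn main() {
    let full: u16 = 0xFFFF;  // 1111 1111 1111 1111
    let halved16 = full / 2; // 0111 1111 1111 1111 -- the low bit is gone
    assert_eq!(halved16, 0x7FFF);

    let padded: u32 = (full as u32) << 16; // 1111...1111 0000...0000
    let halved32 = padded / 2;             // 0111...1111 1000...0000
    assert_eq!(halved32, 0x7FFF_8000);
    // The original value is still exactly recoverable by doubling:
    assert_eq!(halved32 * 2, padded);
    println!("16-bit half: {halved16:#06x}, 32-bit half: {halved32:#010x}");
}
```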
(This does not really concern the title of this issue, should we open a new one?)