mirror of
https://github.com/librespot-org/librespot.git
synced 2026-04-27 08:15:50 +03:00
[GH-ISSUE #524] Feature Request: More Spotify like volume normalization #334
Originally created by @JasonLG1979 on GitHub (Sep 20, 2020).
Original GitHub issue: https://github.com/librespot-org/librespot/issues/524
Librespot already has volume normalization, which I would assume (hopefully) follows the ReplayGain spec, since that's what Spotify uses. But unlike Spotify, it seems to use gain reduction as its clipping-prevention method, whereas Spotify uses limiting. There also seems to be nothing in the librespot docs about how to approximate Spotify's 3 different volume normalisation options.
From what I can tell, to approximate the 3 Spotify volume options the args are:

Loud:
--enable-volume-normalisation --normalisation-pregain 6

Normal (Default):
--enable-volume-normalisation --normalisation-pregain 3

Quiet:
--enable-volume-normalisation --normalisation-pregain -5

The problem is that with gain reduction as the clipping-prevention method, setting a positive pregain value basically breaks volume normalization. A drop in the pregain of a track that would clip can make for a huge drop in perceived volume compared to other tracks.
What I would like to see is a choice of clipping-prevention methods: one being a limiter like what Spotify uses (threshold -1 dB, attack 5 ms, release 100 ms; for bonus points you could make it a look-ahead limiter so the attack would be 0), and the other being the current gain-reduction method.
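To make the request concrete, here is a minimal sketch of the kind of limiter described above (threshold -1 dBFS, 5 ms attack, 100 ms release, non-look-ahead), assuming 44.1 kHz f32 samples. This is an illustrative one-pole envelope design, not librespot code and not Spotify's actual implementation.

```rust
// Minimal feed-forward peak limiter sketch (illustrative only):
// threshold -1 dBFS, 5 ms attack, 100 ms release, f32 samples in [-1.0, 1.0].
struct Limiter {
    threshold: f32,     // linear amplitude of the -1 dBFS ceiling (~0.891)
    attack_coeff: f32,  // per-sample smoothing toward more gain reduction
    release_coeff: f32, // per-sample smoothing back toward unity gain
    envelope: f32,      // smoothed gain factor (1.0 = no reduction)
}

impl Limiter {
    fn new(sample_rate: f32) -> Self {
        Limiter {
            threshold: 10f32.powf(-1.0 / 20.0), // -1 dBFS as linear amplitude
            // One-pole coefficients: reach ~63% of target within attack/release time.
            attack_coeff: (-1.0 / (0.005 * sample_rate)).exp(),
            release_coeff: (-1.0 / (0.100 * sample_rate)).exp(),
            envelope: 1.0,
        }
    }

    fn process(&mut self, sample: f32) -> f32 {
        // Target gain: unity below threshold, threshold/|x| above it.
        let peak = sample.abs();
        let target = if peak > self.threshold { self.threshold / peak } else { 1.0 };
        // Smooth fast when clamping down (attack), slow when recovering (release).
        let coeff = if target < self.envelope { self.attack_coeff } else { self.release_coeff };
        self.envelope = coeff * self.envelope + (1.0 - coeff) * target;
        sample * self.envelope
    }
}

fn main() {
    let mut lim = Limiter::new(44_100.0);
    // A sustained full-scale burst gets pulled toward the -1 dBFS ceiling...
    let mut out = 0.0f32;
    for _ in 0..1_000 {
        out = lim.process(1.0);
    }
    assert!(out > 0.8 && out <= 0.95);
    // ...while quiet material returns to (nearly) unity gain after release.
    for _ in 0..200_000 {
        out = lim.process(0.1);
    }
    assert!((out - 0.1).abs() < 0.001);
    println!("limited loud sample settles near {out}");
}
```

A look-ahead variant would delay the audio by the attack time and feed the gain computer from the undelayed signal, letting the attack effectively be 0 as suggested above.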
It would also be nice to have a set of args that would directly map to the 3 Spotify presets, applying the appropriate pregain and using the limiter.
For reference:
Here is the ReplayGain spec:
http://wiki.hydrogenaud.io/index.php?title=ReplayGain_specification
This explains Spotify's definition of volume normalization and the specs of their limiter:
https://artists.spotify.com/faq/mastering-and-loudness#what-is-loudness-normalization-and-why-is-it-used
This explains the volume normalization options in the official clients:
https://artists.spotify.com/faq/mastering-and-loudness#can-users-adjust-the-levels-of-my-music
@JasonLG1979 commented on GitHub (Sep 20, 2020):
I'm also curious whether the audio processing is currently done in 16 bit? I ask because I notice that the output of librespot is 16 bit. Lowering the gain of 16-bit audio by several dB with ReplayGain throws away bits. It would be advantageous, audio-quality-wise, to do the audio processing in 24- or 32-bit mode and, if the sound card will accept it, output that directly, or otherwise truncate to 16 bit.
@JasonLG1979 commented on GitHub (Sep 20, 2020):
Not accounting for noise shaping and other tricks, 16 bits gets you a theoretical dynamic range of 96.33 dB and 24 bits gets you 144.49 dB, so if you did processing in 24-bit mode you could lower the gain by up to 48.16 dB before you had to start throwing away bits.
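The figures above follow from the standard formula for the theoretical dynamic range of N-bit PCM, 20·log10(2^N) ≈ 6.02·N dB, which a few lines of Rust can verify:

```rust
// Theoretical dynamic range of N-bit PCM: 20 * log10(2^N) ~= 6.02 * N dB.
fn dynamic_range_db(bits: u32) -> f64 {
    20.0 * (2f64.powi(bits as i32)).log10()
}

fn main() {
    let dr16 = dynamic_range_db(16); // ~96.33 dB
    let dr24 = dynamic_range_db(24); // ~144.49 dB
    println!("16-bit: {dr16:.2} dB, 24-bit: {dr24:.2} dB, headroom: {:.2} dB", dr24 - dr16);
    assert!((dr16 - 96.33).abs() < 0.01);
    assert!((dr24 - 144.49).abs() < 0.01);
    // 8 extra bits buy ~48.16 dB of attenuation before bits are lost.
    assert!((dr24 - dr16 - 48.16).abs() < 0.01);
}
```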
@JasonLG1979 commented on GitHub (Sep 22, 2020):
Another option, of course, is to do the gain reduction in hardware for sound cards that have a hardware volume control, because that's basically all that your implementation of volume normalization does: turn the volume up and down. The ReplayGain spec mentions that as an option. Doing it in hardware at least doesn't throw bits away.
@sashahilton00 commented on GitHub (Sep 24, 2020):
@JasonLG1979 iirc the audio is processed in 16 bits, since the Spotify files are 44,100 Hz, 16-bit. If you want to examine the processing logic and potentially change it to 24/32 bit, that could be worth having. My concern is that it would probably want a usage flag, as I imagine 32-bit processing will put a strain on some of the more memory-constrained devices that librespot supports.
Hardware based normalisation would be good to have, ideally we would just offload to the hardware where possible, otherwise fallback to a software implementation
@JasonLG1979 commented on GitHub (Sep 25, 2020):
That's not how lossy audio like Vorbis works. The source file may have been 16 bit, but in the process of converting it, it was transformed into the frequency domain, sorta like converting PCM to PWM. Lossy audio does not have a bit depth; the bit depth of the resulting PCM is decided by the decoder. I would think the decoder does its work internally in at least 32-bit float, if it's worth a crap anyway.
It would use more memory but not really any more CPU; all you're doing is bit shifting. If the decoder won't output anything but 16 bit, you basically just pad the bottom 8 or 16 bits with zeros and then do your gain adjustment just like before, except now you're not throwing away bits.
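The padding-then-gain idea can be sketched in a few lines of Rust. This is an illustration of the technique being proposed, not librespot's actual code; the helper names are made up for the example.

```rust
// Sketch of the padding idea: widen a 16-bit sample into the top half of an
// i32, then apply gain in the widened domain so attenuation eats the zero
// padding instead of real audio bits. Illustrative only, not librespot code.
fn widen(sample: i16) -> i32 {
    (sample as i32) << 16 // bottom 16 bits become zero padding
}

fn apply_gain(sample: i32, gain: f64) -> i32 {
    // For illustration; a real implementation would also clamp and dither.
    ((sample as f64) * gain) as i32
}

fn main() {
    let s: i16 = 12_345;
    let wide = widen(s);
    let halved = apply_gain(wide, 0.5);
    // Halving 12345 directly in 16 bits would truncate (12345 / 2 = 6172),
    // but in the widened domain the halved value is still exact:
    assert_eq!(halved, 12_345i32 << 15);
    // Only a final truncation back to 16 bits would discard that precision:
    assert_eq!((halved >> 16) as i16, 6_172);
    println!("half of {} in 32-bit: {halved}", wide);
}
```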
That would also have to imply fixed or softvol volume, as in librespot is the only thing that should be turning the hardware volume up or down.
@JasonLG1979 commented on GitHub (Sep 26, 2020):
Only outputting 16 bit also affects the quality of librespot's software volume implementation. The same thing happens when you turn the volume down in 16-bit mode: you're throwing away bits. It would be nice to have "lossless" software volume control also.
Turning S16_LE into S32_LE would be trivial, I would think, since i32 is a native Rust data type. It should give you more than enough space to lower the volume to below the physical noise floor of a device before you have to start throwing away bits, even with gain adjustment. S24_LE and S24_3LE might be a little tricky though.
@roderickvd commented on GitHub (Feb 21, 2021):
Interested in this point, I dug around the source code.
It seems that gain normalization is applied in 32 bit, then converted to 16 bit output:
github.com/librespot-org/librespot@7f705ed148/playback/src/player.rs (L1098)
Same for the software volume control:
github.com/librespot-org/librespot@7f705ed148/playback/src/mixer/softmixer.rs (L42)
For the ALSA sink this is even done in 64 bit:
github.com/librespot-org/librespot@7f705ed148/playback/src/mixer/alsamixer.rs (L168)
So while this does not answer your feature request, at least the volume controls seem to be in HQ order!
@JasonLG1979 commented on GitHub (Feb 21, 2021):
No. The audio is spit out as 16-bit 44.1 kHz by the decoder and then processed. Converting a 16-bit int into a 32-bit float, doing some math on it, and then converting it back to a 16-bit int is in no way HQ, and you gain nothing. You're still throwing away bits. Best case you're wasting time converting back and forth; worst case you're introducing rounding errors/distortion converting an int to a float and back to an int.
The solution is to do the gain normalization in 24 or 32bit (or 64bit or whatever the decoder natively works in) during the decoding process and just leave it 24 or 32bit. That way you can still fit the whole 16bits inside the 24/32bits with room for gain normalization without throwing away bits.
@roderickvd commented on GitHub (Feb 21, 2021):
You are absolutely right. Blame me for missing that glaring point at such a late hour. Output should remain at high bit depth after processing, not be cast back to 16 bit.
@sashahilton00 commented on GitHub (Feb 22, 2021):
Feel free to create a PR if you want to/have time to. I'd be curious to see if the difference is noticeable or if this ends up as more of a case of 'doing it properly'.
@JasonLG1979 commented on GitHub (Feb 22, 2021):
It might be a while. I'd need to learn rust.
The difference would certainly be measurable, I would think, but as with all things audio, depending on the person and/or audio gear it may or may not be perceivable. IMHO it never hurts to do things right though.
@roderickvd commented on GitHub (Feb 22, 2021):
Why 24 bit resolution matters for volume control and normalization is described here: http://archimago.blogspot.com/2019/02/musings-why-bother-with-24-bit-dacs.html
Dialing the volume down to -25 dB in 16 bit decreases dynamic range from 98.9 dBA (CD quality) to 73.7 dBA (3.7 dB higher than vinyl). In comparison, doing the same in 24 bit pretty much maintains CD quality at 96.6 dBA. This is within the 120 dB dynamic range of human hearing and so practically observable.
I am enthusiastic about investing my time in this for the ALSA and Rodio backends. It would mean:
For 24 bit output, the following looks promising: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=3d233fedc8ed595a1e88e815d23cd009
Is this something of interest?
@ashthespy commented on GitHub (Feb 22, 2021):
I was under the impression that the ogg stream from Spotify was encoded in 16-bit 44.1 kHz to begin with? Or do I misunderstand?
I am not well versed with these things -- but now that we use a new version of lewton, it lets you prescribe what format you want to read the samples out as. So you could use read_dec_packet_generic instead of read_dec_packet_itl:
github.com/librespot-org/librespot@ed20f357dc/audio/src/lewton_decoder.rs (L32-L34)
@roderickvd commented on GitHub (Feb 22, 2021):
That's true, it's encoded at 16 bit 44.1 kHz, so that gives a dynamic range of 96.3 dB at 0 dBFS. Now if you go under 0 dBFS (such as when attenuating volume or applying negative replay gain) you are adjusting the magnitude of the encoded wave. For every 6 dB of attenuation you lose 1 bit.
Intuitively: at one point the signal is encoded as 65535 (maximum amplitude, treating the sample as unsigned for illustration). This is encoded as 1111 1111 1111 1111. Now you halve the volume. The signal should then be 32767 (half amplitude), encoded as 0111 1111 1111 1111. You have just lost one bit of the information needed to reconstruct the same signal.
This can be circumvented by taking the 16-bit Ogg Vorbis stream, padding it with 8 or 16 zeros to 24 or 32 bit, then doing volume control and normalization on it and keeping it at that bit depth. You then have 48 or 96 dB of extra headroom, respectively, to do volume control in without losing dynamic range.
Staying with the example, 1111 1111 1111 1111 padded to 32 bit is 1111 1111 1111 1111 0000 0000 0000 0000. Halving the volume makes it 0111 1111 1111 1111 1000 0000 0000 0000. No more information lost.
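The bit patterns in this example can be checked in a few lines of Rust (illustrative only, using the unsigned interpretation from the example above):

```rust
// Halving in 16 bits discards the low bit; halving the 32-bit padded
// version keeps all the information. Unsigned, as in the example above.
fn main() {
    let full: u16 = 0xFFFF;  // 1111 1111 1111 1111
    let halved16 = full / 2; // 0111 1111 1111 1111 -- the low bit is gone
    assert_eq!(halved16, 0x7FFF);

    let padded: u32 = (full as u32) << 16; // 1111...1111 0000...0000
    let halved32 = padded / 2;             // 0111...1111 1000...0000
    assert_eq!(halved32, 0x7FFF_8000);
    // The original value is still exactly recoverable by doubling:
    assert_eq!(halved32 * 2, padded);
    println!("16-bit half: {halved16:#06x}, 32-bit half: {halved32:#010x}");
}
```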
(This does not really concern the title of this issue, should we open a new one?)