[GH-ISSUE #164] Use modern compression algorithm #131

Open
opened 2026-02-26 21:34:33 +03:00 by kerem · 1 comment
Owner

Originally created by @jinnatar on GitHub (Dec 26, 2025).
Original GitHub issue: https://github.com/eduardolat/pgbackweb/issues/164

In short: please consider switching to xz archives, which use the LZMA compression algorithm. In my tests this can reduce total backup size by about 30%.

Currently, dumps are stored as zip archives. The old and venerable zip format primarily uses the deflate compression algorithm, which dates from 1990. Zip archives can in theory use LZMA, but most tooling does not support it, including Info-ZIP (what most Linux distros ship), which was last updated in 2008. Info-ZIP does support bzip2, and while that beats deflate, it's only slightly better.
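As it happens, Python's standard zipfile module is one of the few tools that can actually write LZMA-compressed zip entries, which makes the deflate/LZMA gap inside the same container easy to demonstrate. A minimal sketch using made-up stand-in data (real dumps will show different ratios):

```python
import io
import zipfile

# Hypothetical stand-in for a SQL dump; real dumps compress differently.
data = (b"INSERT INTO t VALUES (1, 'example row');\n") * 20000

sizes = {}
for name, method in [("deflate", zipfile.ZIP_DEFLATED), ("lzma", zipfile.ZIP_LZMA)]:
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=method) as zf:
        zf.writestr("dump.sql", data)
    sizes[name] = len(buf.getvalue())

print(sizes)  # the lzma entry comes out noticeably smaller than deflate
```

Note that an archive written with ZIP_LZMA round-trips fine in Python, but Info-ZIP's `unzip` will typically refuse to extract it, which is exactly the tooling problem described above.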

The solution I'm proposing is to switch to a container with better algorithm support. By my quick tests xz is the winner: it uses the LZMA algorithm in a robust, modern container format. For comparison I've also included the legacy lzma container format below [1]. There are potentially further gains to be had by raising the compression level from the default 6 up to 7-9, but that increases the memory requirements. A 7 might be a good compromise.
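Python's lzma module writes the same .xz container as the xz tool, so the preset trade-off is easy to probe: presets correspond to xz levels 0-9, and higher presets use larger dictionaries (hence more memory). A rough sketch with stand-in data:

```python
import lzma

# Hypothetical stand-in for a SQL dump; real dumps compress differently.
data = (b"INSERT INTO t VALUES (1, 'example row');\n") * 20000

for preset in (6, 7, 9):
    out = lzma.compress(data, preset=preset)  # produces a standard .xz stream
    print(f"preset {preset}: {len(out)} bytes")

# Round trip: the output is ordinary xz data.
restored = lzma.decompress(lzma.compress(data, preset=6))
assert restored == data
```

On highly repetitive test data the presets may tie; the level mostly matters for large, varied inputs like real dumps.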

Sample files: the first is the raw dump, the long-named zip is the dump as stored by pgbackweb, and the rest are different ways of compressing the same dump:

3.0M dump.sql
319k dump.sql.gz
317k dump-20251226-020000-6cbade9f-0ef5-45cf-9778-9b6aa4dc7d0a.zip
312k dump.sql.zip
307k dump.sql.bz2
277k dump.sql.zst
218k dump.sql.xz
218k dump.sql.lzma
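The comparison above can be approximated with Python's standard library, which exposes one-shot helpers for gzip, bzip2, and both lzma container formats (stand-in data here, not the actual dump, so the exact numbers will differ):

```python
import bz2
import gzip
import lzma

# Stand-in data; the actual dump.sql is not reproduced here.
data = (b"INSERT INTO t VALUES (1, 'example row');\n") * 20000

results = {
    "gz":   len(gzip.compress(data)),
    "bz2":  len(bz2.compress(data)),
    "xz":   len(lzma.compress(data, format=lzma.FORMAT_XZ)),
    "lzma": len(lzma.compress(data, format=lzma.FORMAT_ALONE)),  # legacy container
}
for name, size in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{size:>8}  dump.sql.{name}")
```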

The file types of the same tests:

dump-20251226-020000-6cbade9f-0ef5-45cf-9778-9b6aa4dc7d0a.zip: Zip archive data, at least v2.0 to extract, compression method=deflate
dump.sql: ASCII text
dump.sql.bz2: bzip2 compressed data, block size = 900k
dump.sql.gz: gzip compressed data, was "dump.sql", last modified: Sun Dec 30 22:00:00 1979, from Unix, original size modulo 2^32 3004661
dump.sql.lzma: LZMA compressed data, streamed
dump.sql.xz: XZ compressed data, checksum CRC64
dump.sql.zip: Zip archive data, at least v4.6 to extract, compression method=bzip2
dump.sql.zst: Zstandard compressed data (v0.8+), Dictionary ID: None


  [1] While legacy lzma is technically the smallest, that's purely because its header is a couple of bytes smaller than modern xz's. xz is superior in all other ways.


@jinnatar commented on GitHub (Dec 29, 2025):

Tests with a much larger database reveal some of the considerations. Performing a single-threaded level 6 compression on a 2.8G dump took almost 19 minutes. Doing the same with 42 threads on hyperthreaded Xeon cores takes only 53 seconds. I'd imagine a setting for thread count would be required, since how many cores an admin can spare, and how that aligns with the cron schedule, will be unique to every deployment.

Size comparison:

2.8G dump.sql
769M dump-20251226-010000-3658b071-6a06-4e65-a88f-e8e73844ba62.zip
552M dump.sql.xz
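For context on why threading helps: `xz -T` splits the input into independently compressed blocks, and the .xz format also permits plain concatenation of complete streams, which decompressors accept. A hedged Python sketch of the same idea (chunk size and worker count are arbitrary; CPython's lzma module releases the GIL while compressing, so threads give real parallelism here):

```python
import lzma
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1 << 20  # arbitrary 1 MiB chunks

def compress_chunk(chunk: bytes) -> bytes:
    # Each chunk becomes a complete, self-contained .xz stream.
    return lzma.compress(chunk, preset=6)

def parallel_xz(data: bytes, workers: int = 4) -> bytes:
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Concatenated .xz streams are themselves valid .xz input.
        return b"".join(pool.map(compress_chunk, chunks))

data = (b"INSERT INTO t VALUES (1, 'example row');\n") * 100000  # ~4 MiB stand-in
packed = parallel_xz(data)
assert lzma.decompress(packed) == data  # decompressors handle multi-stream input
print(f"{len(data)} -> {len(packed)} bytes")
```

The trade-off `xz` itself documents is that independent blocks compress slightly worse than one continuous stream, which is a reasonable price for a 19-minute to 53-second improvement.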
