RFC: Switch source and release tarballs from .xz to .zstd

Hello,

I propose switching our source and release tarballs to be compressed with zstd instead of xz. Most major distributions have already made this transition, since zstd is faster (especially at decompression) and achieves similar compression ratios.

I suggest that for the next 17.x release (17.0.3) I upload both xz and zstd source packages, and that we continue to do so for the rest of the 17.x releases. Then, for 18, we switch to zstd only. This should give people time to convert their scripts, and it will be non-intrusive for the current release.
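
As a rough sketch (hypothetical file names; the real changes would go into export.sh), producing both formats from the same tar archive could look like this:

# Build the plain tar once, then compress it into both formats.
tar -cf llvm-project-17.0.3.src.tar llvm-project-17.0.3.src
xz -9 -T0 -k llvm-project-17.0.3.src.tar    # writes .tar.xz; -k keeps the .tar
zstd -19 -k llvm-project-17.0.3.src.tar     # writes .tar.zst; exact level still to be decided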

Unless there are any objections I will propose the patches to export.sh and test_release.sh later this week.


Generally, I'd be wary of such a change; in my experience, zstd isn't usually quite as readily available out of the box on various systems, compared to xz. I don't have hard data on this though; it's just my anecdotal evidence.

On the other hand, I don't use the tarballs directly myself, so I don't have much stake in the matter anyway.

I am not worried about availability, considering that the Ubuntu, Debian, Arch, and Fedora distributions already use zstd as their default package compression.

Right; what about macOS - is it available out of the box there?

That's a good question. I don't know; someone with a Mac would have to chime in on how complicated that would be.

On Windows you need to download everything anyway! :slight_smile:

On macOS, at least in Ventura (13.6), neither xz nor zstd are available in the base system. So you have to install these from Homebrew or MacPorts anyway.

That said, xz compression is usually a lot better than zstd's, even at zstd's highest settings, so take into account that this will cost more bandwidth and storage. I normally use xz for archiving purposes and zstd for more "realtime" compression.

I just checked as well, and came to the same conclusion.

Sure, it can be argued that one needs to install third-party tools like CMake anyway. But requiring non-default tools just to extract a tarball is kind of annoying, in my experience.

Anyway, I don't have a stake in the matter - just wanted to add my PoV.

I tested this, and with zstd compression level 12 the sizes increase by 8% over our current xz tarballs. To me that's a good tradeoff, considering the speed of zstd.
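
Something along these lines reproduces that comparison (hypothetical file names):

# Recover the raw tar from the existing xz tarball, recompress it with zstd
# at level 12, and compare the resulting sizes.
xz -d -k llvm-project-17.0.1.src.tar.xz
zstd -12 -k llvm-project-17.0.1.src.tar
ls -l llvm-project-17.0.1.src.tar.xz llvm-project-17.0.1.src.tar.zst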

For me, testing the 17.0.1 arm64-apple tarballs with zstd -19 (the highest level without resorting to 'ultra' compression):

% ls -l clang+llvm-17.0.1-arm64-apple-darwin22.0.tar.*
-rw-r--r--  1 dim  staff   803708084 Sep 25 21:32:33 2023 clang+llvm-17.0.1-arm64-apple-darwin22.0.tar.xz
-rw-r--r--  1 dim  staff  1038167033 Oct  3 10:31:34 2023 clang+llvm-17.0.1-arm64-apple-darwin22.0.tar.zstd

So that is about 29% larger... Not insignificant, I would say? About 224 MiB more to download. :slight_smile:

Ah! I only tested the source packages and not the binary packages, which explains the difference in compression ratios.

On Windows, 7zip is commonly used as the multi-format (de)compression tool. Since that doesn't support zstd, that's another reason not to switch.

So far on this thread it also sounds like this would be a regression in terms of tarball size.

If we want to do this, we should at least have some data on the impact on size and decompression speed.
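
Even something as simple as this would give a rough first idea (hypothetical file names; decompressing to stdout so only the decompression itself is timed):

# Time decompression of comparable archives, discarding the output.
time xz -d -c llvm-project-17.0.1.src.tar.xz > /dev/null
time zstd -d -c llvm-project-17.0.1.src.tar.zst > /dev/null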

7-Zip also uses LZMA compression (in fact, xz borrowed that from the 7-Zip author), and its compression ratios have been superior to those of most other tools for quite some time. However, compression takes a LOT more time and memory. That is the tradeoff: more pain for the people who make installers, and less download size for the other 99% of users :slight_smile:

Debian doesn't use zstd for packages. It didn't even support it as an option until January this year; Ubuntu had forked dpkg to add support in a non-upstreamable manner.

Thanks for the correction!


Given the evidence in this thread, I think better compression is more important than fast decompression. I haven't heard anyone complain that xz is slow for either compression or decompression, but perhaps this is a pain point for Tobias.

Yes, it was something that annoyed me, and I thought 8% bigger files in the case of the source tarballs was a good trade-off. Binaries being 30% bigger is another case entirely. I'll table this for now.

FWIW, XZ(1) talks a bit about memory usage:

For example, decompressing a file created with xz -9 currently requires 65 MiB of memory.
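
For a specific file, xz can also report how much memory decompressing it will take, e.g. (file name is just an example):

# The verbose listing includes the memory needed to decompress this particular file.
xz --list --verbose clang+llvm-17.0.1-arm64-apple-darwin22.0.tar.xz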

I think that for users intending to run LLVM tools after such decompression, 65 MiB of memory should never be an issue. :slight_smile:

Agreed that it seems to add additional annoyances without a clear benefit.
@tobiashieta, as I assume you are the person most impacted by this change, what are the decompression numbers for you? Have you looked into speeding up xz decompression instead? I'm pretty sure that xz (at least the one bundled with most distributions) doesn't support multithreaded decompression, but 7zip and pixz do. Perhaps adjusting the xz workflow slightly might bring enough of an improvement to make the switch completely unnecessary.
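
For illustration, a rough sketch of what that could look like (using pixz, or stock xz 5.4 and later, which added threaded decompression for multi-block archives; file names are just examples):

# pixz decompresses .xz archives in parallel; explicit output name given here.
pixz -d clang+llvm-17.0.1-arm64-apple-darwin22.0.tar.xz clang+llvm-17.0.1-arm64-apple-darwin22.0.tar
# xz >= 5.4 can also decompress with multiple threads (-T0 = all cores),
# provided the archive was created with multiple blocks (e.g. compressed with -T0).
xz -d -k -T0 clang+llvm-17.0.1-arm64-apple-darwin22.0.tar.xz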

:100:

That's not great for the release managers, but I think if you multiply that by thousands and thousands of users, the cumulative extra storage and download footprint is a much bigger factor than (de)compression speed, which only happens once anyway.