Why are tar archive formats switching to xz compression to replace bzip2 and what about gzip?
More and more tar archives use the xz format, based on LZMA2, for compression instead of the traditional bzip2 (bz2) compression. In fact, kernel.org made a late “Good-bye bzip2” announcement on 27 Dec. 2013, indicating that kernel sources would from that point on be released in both tar.gz and tar.xz format, and on the main page of the website what’s directly offered is in
Are there any specific reasons why this is happening, and what is the relevance of gzip in this context?
First of all, this question is not directly related to tar. Tar just creates an uncompressed archive; the compression is then applied afterwards.
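To make the "archive first, compress afterwards" point concrete, here is a minimal Python sketch using only the standard library. The file name and contents are placeholders I made up; the point is only that the same uncompressed tar stream can be handed to any compressor.

```python
import gzip
import io
import lzma
import tarfile

# Step 1: build an uncompressed tar archive in memory.
# "hello.txt" and its contents are placeholder values for this sketch.
payload = b"hello, tar\n" * 1000
raw = io.BytesIO()
with tarfile.open(fileobj=raw, mode="w") as tar:
    info = tarfile.TarInfo(name="hello.txt")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
tar_bytes = raw.getvalue()

# Step 2: compress the finished archive with whichever compressor you like.
tar_gz = gzip.compress(tar_bytes)
tar_xz = lzma.compress(tar_bytes)

# Both results wrap the exact same tar stream.
assert gzip.decompress(tar_gz) == tar_bytes
assert lzma.decompress(tar_xz) == tar_bytes
```

The .gz or .xz suffix therefore only says which compressor was applied to the archive, not anything about the archive itself.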
Gzip is known to be relatively fast compared to LZMA2 and bzip2. If speed matters, gzip (especially the multithreaded implementation pigz) is often a good compromise between compression speed and compression ratio, although there are alternatives if speed is an issue (e.g. LZ4).
However, if a high compression ratio is desired, LZMA2 beats bzip2 in almost every aspect. Compression is often slower, but it decompresses much faster and provides a much better compression ratio, at the cost of higher memory usage.
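If you want to check this on your own data, a rough comparison can be sketched with Python's standard gzip, bz2 and lzma modules at their default levels. The sample data below is made up and very repetitive, so treat the numbers it prints as illustrative only; real results depend heavily on the input and the machine.

```python
import bz2
import gzip
import lzma
import time

# Made-up sample data; swap in your own bytes for a meaningful comparison.
data = b"".join(
    b"line %d: the quick brown fox jumps over the lazy dog\n" % i
    for i in range(20_000)
)

codecs = {
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
    "xz": (lzma.compress, lzma.decompress),
}

results = {}
for name, (compress, decompress) in codecs.items():
    t0 = time.perf_counter()
    packed = compress(data)
    t1 = time.perf_counter()
    restored = decompress(packed)
    t2 = time.perf_counter()
    assert restored == data  # sanity check: the round trip is lossless
    results[name] = {
        "size": len(packed),
        "compress_s": t1 - t0,
        "decompress_s": t2 - t1,
    }
    print(f"{name:6s} {len(packed):>8d} bytes  "
          f"compress {t1 - t0:.3f} s  decompress {t2 - t1:.3f} s")
```

A single run like this is noisy; repeating it a few times and on realistic input gives a fairer picture.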
There is not much reason to use bzip2 any more, except for backwards compatibility. Furthermore, LZMA2 was designed with multithreading in mind, and many implementations make use of multicore CPUs by default (unfortunately xz on Linux does not do this yet). This makes sense, since clock speeds won’t increase any more but the number of cores will.
There are multithreaded bzip2 implementations (e.g. pbzip2), but they are often not installed by default. Also note that multithreaded bzip2 only really pays off while compressing: decompression uses a single thread if the file was compressed with a single-threaded bzip2, in contrast to LZMA2. Parallel bzip2 variants can only leverage multicore CPUs during decompression if the file was compressed with a parallel bzip2 version, which is often not the case.
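The general idea behind parallel compression of this kind can be sketched as follows: split the input into chunks, compress each chunk independently, and concatenate the resulting streams. This is my own simplified illustration, not the actual implementation of any parallel bzip2 tool; the chunk size and the `parallel_bz2_compress` helper are assumptions made for the example.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 900_000  # hypothetical chunk size chosen for this sketch


def parallel_bz2_compress(data: bytes, workers: int = 4) -> bytes:
    """Compress fixed-size chunks independently and concatenate the streams."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so the streams line up with the chunks.
        streams = list(pool.map(bz2.compress, chunks))
    return b"".join(streams)


data = bytes(range(256)) * 12_000  # about 3 MB of sample data, i.e. 4 chunks
packed = parallel_bz2_compress(data)

# bz2.decompress accepts concatenated streams, so an ordinary
# single-threaded reader can still unpack the result.
assert bz2.decompress(packed) == data
```

Because each chunk is a self-contained stream, a reader that knows about the chunk boundaries could decompress them in parallel as well, whereas a file written as one long stream gives it nothing to split on.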
For distributing archives over the Internet, the following things are generally a priority:
- Compression ratio (i.e., how small the compressor makes the data);
- Decompression time (CPU requirements);
- Decompression memory requirements; and
- Compatibility (how wide-spread the decompression program is)
Compression memory & CPU requirements aren’t very important, because you can use a large fast machine for that, and you only have to do it once.
Compared to bzip2, xz has a better compression ratio and a lower (better) decompression time. However, at the compression settings typically used, it requires more memory to decompress and is somewhat less widespread. Gzip uses less memory than either.
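The decompression memory point can be probed with the `memlimit` parameter of Python's `lzma.LZMADecompressor`, which makes decompression fail if the stream needs more memory than allowed. The sketch below is a rough illustration: the sample data, the 32 MiB budget and the `fits_in_memory` helper are assumptions of mine. Note that compressing at preset 9 itself asks for a lot of memory.

```python
import lzma

data = b"kernel source stand-in\n" * 10_000  # made-up sample data
LIMIT = 32 * 1024 * 1024  # 32 MiB, an arbitrary budget for this sketch


def fits_in_memory(xz_bytes: bytes, memlimit: int) -> bool:
    """Return True if xz_bytes can be decompressed within memlimit bytes.

    Any LZMAError (including a corrupt stream) is reported as False.
    """
    try:
        lzma.LZMADecompressor(memlimit=memlimit).decompress(xz_bytes)
        return True
    except lzma.LZMAError:
        return False


low = lzma.compress(data, preset=0)   # small dictionary
high = lzma.compress(data, preset=9)  # large dictionary

print("preset 0 fits:", fits_in_memory(low, LIMIT))
print("preset 9 fits:", fits_in_memory(high, LIMIT))
```

The memory the decompressor needs is driven by the dictionary size chosen at compression time, not by the size of the file, which is why the person creating the archive effectively decides how much memory everyone downloading it will need.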
So, both gzip and xz format archives are posted, allowing you to pick:
- Need to decompress on a machine with very limited memory (<32 MB): gzip. Granted, that is not very likely when talking about kernel sources.
- Need to decompress with only minimal tools available: gzip
- Want to save download time and/or bandwidth: xz
There isn’t really a realistic combination of factors that would get you to pick bzip2, so it’s being phased out.
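Written out as code, the decision list above amounts to something like the following. `pick_format` and its parameters are purely hypothetical names for illustration; the 32 MB threshold is the one from the list.

```python
def pick_format(memory_mb: float, has_xz: bool) -> str:
    """Hypothetical helper encoding the decision list above."""
    if memory_mb < 32:   # very limited memory on the decompressing machine
        return "gzip"
    if not has_xz:       # only minimal tools available
        return "gzip"
    return "xz"          # otherwise save download time and bandwidth


print(pick_format(memory_mb=16, has_xz=True))
print(pick_format(memory_mb=4096, has_xz=False))
print(pick_format(memory_mb=4096, has_xz=True))
```

No branch ever returns bzip2, which is the whole argument in miniature.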
I looked at compression comparisons in a blog post. I didn’t attempt to replicate the results, and I suspect some of them have changed (mostly, I expect xz has improved, as it’s the newest).
(There are some specific scenarios where a good bzip2 implementation may be preferable to xz: bzip2 can compress a file with lots of zeros, or genome DNA sequences, better than xz. Newer versions of xz now have an (optional) block mode which allows data recovery after the point of corruption, as well as parallel compression and [in theory] decompression. Previously, only bzip2 offered these. However, none of these are relevant for kernel distribution.)
1: In archive size, xz -3 is around bzip2 -9. Then xz uses less memory to decompress. But xz -9 (as, e.g., used for Linux kernel tarballs) uses much more than bzip2 -9. (And even xz -0 needs more than
Short answer: xz is more efficient in terms of compression ratio.
So it saves disk space and optimizes the transfer through the network.
You can see this Quick Benchmark to discover the difference through practical tests.
Update: The “Quick Benchmark” web page is an elusive moving target.
LZMA2 is a block compression system whereas gzip is not. This means that LZMA2 lends itself to multi-threading. Also, if corruption occurs in an archive, you can generally recover data from subsequent blocks with LZMA2 but you cannot do this with gzip. In practice, you lose the entire archive with gzip subsequent to the corrupted block. With an LZMA2 archive, you only lose the file(s) affected by the corrupted block(s). This can be important in larger archives with multiple files.
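The benefit of independent blocks can be illustrated with a small Python sketch. To be clear about the assumptions: this does not use xz's own block mode; it simply compresses each block as a separate .xz stream to stand in for independently decodable blocks, and the block size and helper names are made up for the example.

```python
import lzma
from typing import List, Optional

BLOCK_SIZE = 64 * 1024  # hypothetical block size for this sketch


def compress_blocks(data: bytes) -> List[bytes]:
    """Compress each block as its own independent .xz stream."""
    return [lzma.compress(data[i:i + BLOCK_SIZE])
            for i in range(0, len(data), BLOCK_SIZE)]


def recover_blocks(blocks: List[bytes]) -> List[Optional[bytes]]:
    """Decompress what can be decompressed; None marks a damaged block."""
    recovered: List[Optional[bytes]] = []
    for block in blocks:
        try:
            recovered.append(lzma.decompress(block))
        except lzma.LZMAError:
            recovered.append(None)
    return recovered


data = bytes(range(256)) * 1024          # 256 KiB of sample data, 4 blocks
blocks = compress_blocks(data)

damaged = bytearray(blocks[1])
damaged[len(damaged) // 2] ^= 0xFF       # flip one byte in the second block
blocks[1] = bytes(damaged)

recovered = recover_blocks(blocks)
print([("lost" if r is None else "ok") for r in recovered])
```

Only the damaged block is lost here, because no block depends on the decoder state left behind by an earlier one. The price is a slightly worse compression ratio, since each block starts with an empty dictionary.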