Why are tar archive formats switching to xz compression to replace bzip2 and what about gzip?

More and more tar archives use the xz format based on LZMA2 for compression instead of the traditional bzip2(bz2) compression. In fact kernel.org made a late “Good-bye bzip2announcement, 27th Dec. 2013, indicating kernel sources would from this point on be released in both tar.gz and tar.xz format – and on the main page of the website what’s directly offered is in tar.xz.

Are there any specific reasons explaining why this is happening and what is the relevance of gzip in this context?

Asked By: user44370

||

First of all, this question is not directly related to tar. Tar just creates an uncompressed archive, the compression is then applied later on.

Gzip is known to be relatively fast when compared to LZMA2 and bzip2. If speed matters, gzip (especially the multithreaded implementation pigz) is often a good compromise between compression speed and compression ratio. Although there are alternatives if speed is an issue (e.g. LZ4).

However, if a high compression ratio is desired LZMA2 beats bzip2 in almost every aspect. The compression speed is often slower, but it decompresses much faster and provides a much better compression ratio at the cost of higher memory usage.

There is not much reason to use bzip2 any more, except of backwards compatibility. Furthermore, LZMA2 was desiged with multithreading in mind and many implementations by default make use of multicore CPUs (unfortunately xz on Linux does not do this, yet). This makes sense since the clock speeds won’t increase any more but the number of cores will.

There are multithreaded bzip2 implementations (e.g. pbzip), but they are often not installed by default. Also note that multithreaded bzip2 only really pay off while compressing whereas decompression uses a single thread if the file was compress using a single threaded bzip2, in contrast to LZMA2. Parallel bzip2 variants can only leverage multicore CPUs if the file was compressed using a parallel bzip2 version, which is often not the case.

Answered By: Marco

For distributing archives over the Internet, the following things are generally a priority:

  1. Compression ratio (i.e., how small the compressor makes the data);
  2. Decompression time (CPU requirements);
  3. Decompression memory requirements; and
  4. Compatibility (how wide-spread the decompression program is)

Compression memory & CPU requirements aren’t very important, because you can use a large fast machine for that, and you only have to do it once.

Compared to bzip2, xz has a better compression ratio and lower (better) decompression time. It, however—at the compression settings typically used—requires more memory to decompress[1] and is somewhat less widespread. Gzip uses less memory than either.

So, both gzip and xz format archives are posted, allowing you to pick:

  • Need to decompress on a machine with very limited memory (<32 MB): gzip. Given, not very likely when talking about kernel sources.
  • Need to decompress minimal tools available: gzip
  • Want to save download time and/or bandwidth: xz

There isn’t really a realistic combination of factors that’d get you to pick bzip2. So its being phased out.

I looked at compression comparisons in a blog post. I didn’t attempt to replicate the results, and I suspect some of it has changed (mostly, I expect xz has improved, as its the newest.)

(There are some specific scenarios where a good bzip2 implementation may be preferable to xz: bzip2 can compresses a file with lots of zeros and genome DNA sequences better than xz. Newer versions of xz now have an (optional) block mode which allows data recovery after the point of corruption and parallel compression and [in theory] decompression. Previously, only bzip2 offered these.[2] However none of these are relevant for kernel distribution)


1: In archive size, xz -3 is around bzip -9. Then xz uses less memory to decompress. But xz -9 (as, e.g., used for Linux kernel tarballs) uses much more than bzip -9. (And even xz -0 needs more than gzip -9).

2: F21 System Wide Change: lbzip2 as default bzip2 implementation

Answered By: derobert

Short answer: xz is more efficient in terms of compression ratio. 
So it saves disk space and optimizes the transfer through the network.

You can see this Quick Benchmark
so as to discover the difference by practical tests.

Update: The “Quick Benchmark” web page is an elusive moving target.

Answered By: Slyx

LZMA2 is a block compression system whereas gzip is not. This means that LZMA2 lends itself to multi-threading. Also, if corruption occurs in an archive, you can generally recover data from subsequent blocks with LZMA2 but you cannot do this with gzip. In practice, you lose the entire archive with gzip subsequent to the corrupted block. With an LZMA2 archive, you only lose the file(s) affected by the corrupted block(s). This can be important in larger archives with multiple files.

Answered By: Mark Warburton
Categories: Answers Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.