Does gz compression ratio improve over time?

I have a process that creates a stream of millions of highly similar lines. I’m piping this through gzip. Does the compression ratio improve over time in such a setup? I.e. is the compression ratio better for 1 million similar lines than for, say, 10,000?

Asked By: Gx1sptDTDa


It does, up to a certain point, and then it levels off. Compression algorithms have a restriction on the size of the blocks they look at (bzip2) and/or on the tables they keep with information about previously seen patterns (gzip).

In the case of gzip, once a table is full, old entries get pushed out and the compression ratio stops improving. Depending on your compression level (-1 to -9) and on how repetitive your input is, filling the table up can of course take a while, so you might not notice.
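A rough way to check this on your own data is to compress increasingly large chunks of the stream and compare the ratios. This is only a sketch using Python’s standard gzip module with a made-up sample line; substitute the real output of your process.

```python
import gzip

# Hypothetical stand-in for one of the "highly similar" lines.
line = b"2024-01-01 12:00:00 INFO worker-7 processed request id=12345 status=ok\n"

for n in (10_000, 100_000, 1_000_000):
    data = line * n
    out = gzip.compress(data, compresslevel=9)
    print(f"{n:>9} lines: {len(data):>9} -> {len(out):>9} bytes, "
          f"ratio {len(data) / len(out):.0f}:1")
```

If the table-filling effect described above applies to your data, the ratio should level off beyond some input size rather than keep improving.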

Answered By: Anthon

Not much. The back-reference “distance” covered by the DEFLATE algorithm, which gzip uses, is limited to 32 KB, so a match can only refer to the last 32 KB of input.

See the Wikipedia article on DEFLATE.
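One way to see the window limit in action is to repeat a chunk larger than 32 KB: gzip cannot refer back to the first copy, while xz, with its much larger dictionary, can. A minimal sketch, assuming Python’s standard gzip and lzma modules:

```python
import gzip
import lzma
import os

# A 64 KiB incompressible block, repeated twice (128 KiB total).
block = os.urandom(64 * 1024)
data = block * 2

# gzip's 32 KB window cannot reach back to the first copy,
# so its output stays close to the full 128 KiB.
print("gzip:", len(gzip.compress(data, compresslevel=9)))

# xz/LZMA uses a dictionary of several megabytes by default,
# so the second copy collapses to a back-reference.
print("xz:  ", len(lzma.compress(data)))
```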

It is worth benchmarking the various gzip compression levels, and also considering bzip2 and xz.
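A simple comparison on a sample of your own stream might look like this (a sketch using Python’s standard bz2, gzip, and lzma modules; the sample line is made up):

```python
import bz2
import gzip
import lzma

line = b"2024-01-01 12:00:00 INFO worker-7 processed request id=12345 status=ok\n"
data = line * 100_000

candidates = [
    ("gzip -1", lambda d: gzip.compress(d, compresslevel=1)),
    ("gzip -9", lambda d: gzip.compress(d, compresslevel=9)),
    ("bzip2",   bz2.compress),
    ("xz",      lzma.compress),
]

for name, compress in candidates:
    out = compress(data)
    print(f"{name:8} {len(out):>8} bytes  ratio {len(data) / len(out):.0f}:1")
```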

Answered By: steve

Here is an overview of gzip’s algorithm.

The short answer is that it will not improve significantly once the initial data needed to fill the hash tables has been taken into account.

Answered By: Pieter