How to modify a gzipped file with sed and then compress it again?

I have a .vcf.gz file that looks like this:

#CHROM  POS     ID      REF     ALT          
chr1    10894   chr1:10894:G:A  G       A         
chr1    10915   chr1:10915:G:A  G       A          
chr1    10930   chr1:10930:G:A  G       A 

I want to strip the ‘chr’ prefix from the CHROM column, so that the result looks like this:

#CHROM  POS     ID      REF     ALT          
1    10894   chr1:10894:G:A  G       A         
1    10915   chr1:10915:G:A  G       A          
1    10930   chr1:10930:G:A  G       A 

Therefore, I wrote the following command line:

zcat input.vcf.gz | sed 's/^chr//' > output.vcf.gz

and it worked. The problem is that I want to save the output as a compressed file with the .vcf.gz extension. Even though I named it ‘output.vcf.gz’, the output file is not compressed.

How can I modify a gzipped file and then save it as a gzipped file again?

Many thanks!

Asked By: Khaleesi95


Simply add gzip to the end of the pipe:

zcat input.vcf.gz | sed 's/^chr//' | gzip > output.vcf.gz
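
To confirm that the result really is compressed, `gzip -t` tests a file's integrity and fails on plain text. A minimal sketch, assuming a made-up two-line sample in a temporary directory (the file names are illustrative, not from the question) and using `gunzip -c`, which behaves like `zcat`:

```shell
# Hypothetical sample data in a scratch directory
tmp=$(mktemp -d)
printf '#CHROM\tPOS\tID\tREF\tALT\nchr1\t10894\tchr1:10894:G:A\tG\tA\n' \
  | gzip > "$tmp/input.vcf.gz"

# Original attempt: the .gz name alone does not compress anything
gunzip -c "$tmp/input.vcf.gz" | sed 's/^chr//' > "$tmp/plain.vcf.gz"

# With gzip at the end of the pipe
gunzip -c "$tmp/input.vcf.gz" | sed 's/^chr//' | gzip > "$tmp/output.vcf.gz"

# gzip -t exits non-zero when the file is not valid gzip data
if gzip -t "$tmp/plain.vcf.gz" 2>/dev/null; then plain_ok=yes; else plain_ok=no; fi
if gzip -t "$tmp/output.vcf.gz" 2>/dev/null; then out_ok=yes; else out_ok=no; fi
echo "plain.vcf.gz is gzip:  $plain_ok"
echo "output.vcf.gz is gzip: $out_ok"
```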
Answered By: treuss

zcat is really just a convenience front end to gzip; to quote the gzip/gunzip/zcat manual page (man zcat):

The zcat command is identical to gunzip -c.

Just as you can use gunzip -c (or zcat) in a piped chain of programs, you can use gzip to compress again:

zcat input.vcf.gz | sed 's/^chr//' | gzip > output.vcf.gz
#                                    ^^^^

or


gunzip -c input.vcf.gz | sed 's/^chr//' | gzip > output.vcf.gz
#^^^^^^^^                                 ^^^^

if you like consistency.

That’s it. That’s all there is to it.
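
One pitfall if the goal is to replace the original file: redirecting to the same name you are reading from (`... | gzip > input.vcf.gz`) can truncate the file before it has been fully read. A common workaround, sketched here with a hypothetical sample file, is to write to a temporary name and rename it afterwards:

```shell
tmp=$(mktemp -d)
f="$tmp/input.vcf.gz"    # hypothetical file to rewrite in place
printf '#CHROM\tPOS\nchr1\t10894\nchr1\t10915\n' | gzip > "$f"

# Write to a temporary name first; rename only if the last
# command of the pipeline (gzip) reported success
if gunzip -c "$f" | sed 's/^chr//' | gzip > "$f.tmp"; then
    mv "$f.tmp" "$f"
else
    rm -f "$f.tmp"
fi

gunzip -c "$f"
```

In bash, `set -o pipefail` additionally makes a failure of gunzip or sed count as a failure of the whole pipeline.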

Oh, taking a bet here: you’re doing bioinformatics, your vcf file is actually a "Variant Call Format" file, and it can be rather large. gzip isn’t a very fast decompressor, and it is a rather slow compressor. If you’re stuck with the gzip file format, use pigz instead:

unpigz -c input.vcf.gz | sed 's/^chr//' | pigz > output.vcf.gz
#^^^^^^^^                                 ^^^^

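pigz reads and writes the same format as gzip, but spreads compression across multiple CPU cores (decompression stays essentially single-threaded, because a deflate stream cannot be split up). Use it.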
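
If the same script has to run on machines where pigz may not be installed, one option is to pick the compressor at run time. A sketch, where the variable name `gz` and the sample data are my own assumptions:

```shell
# Use pigz when available, otherwise fall back to gzip
if command -v pigz >/dev/null 2>&1; then gz=pigz; else gz=gzip; fi

tmp=$(mktemp -d)
printf '#CHROM\tPOS\nchr1\t10894\n' | "$gz" > "$tmp/input.vcf.gz"   # hypothetical sample

# -d decompresses, -c writes to stdout; both tools accept these flags
"$gz" -dc "$tmp/input.vcf.gz" | sed 's/^chr//' | "$gz" > "$tmp/output.vcf.gz"

"$gz" -dc "$tmp/output.vcf.gz"
```

Either way the output is ordinary gzip data, so it can still be read with gzip, gunzip or zcat.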

If you’re not bound to keep these files in a gzip container, but are free to choose a more modern format, zstd is a good choice:

unpigz -c input.vcf.gz | sed 's/^chr//' | zstd -T0 -8 > output.vcf.zst
# unpigz -c   decompress using unpigz instead of gzip/zcat
# sed         modify
# zstd        use zstd instead of gzip
# -T0         use as many threads as CPU cores
# -8          compression level: -1 is very fast, -19 is very compressed;
#             -8 usually compresses much better than gzip --best, but faster
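
Reading the result back works the same way as with gzip: `zstd -dc` decompresses to standard output. A small sketch with made-up sample data, compressed with the same flags as above and guarded so that it does nothing when zstd is not installed (the variable `first_chrom` is purely illustrative):

```shell
tmp=$(mktemp -d)
first_chrom=skipped
if command -v zstd >/dev/null 2>&1; then
    # Hypothetical sample, already passed through the sed step
    printf '#CHROM\tPOS\nchr1\t10894\n' | sed 's/^chr//' \
      | zstd -T0 -8 > "$tmp/output.vcf.zst"

    # -d decompresses, -c writes to stdout
    first_chrom=$(zstd -dc "$tmp/output.vcf.zst" | sed -n 2p | cut -f1)
fi
echo "first CHROM value: $first_chrom"
```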
Answered By: Marcus Müller