How to modify a gzipped file with sed and then gzip it again?
I have a .vcf.gz file that looks like this:
#CHROM POS ID REF ALT
chr1 10894 chr1:10894:G:A G A
chr1 10915 chr1:10915:G:A G A
chr1 10930 chr1:10930:G:A G A
I want to modify the CHROM column to remove ‘chr’ and to replace it with nothing, so I want to get a result like the following:
#CHROM POS ID REF ALT
1 10894 chr1:10894:G:A G A
1 10915 chr1:10915:G:A G A
1 10930 chr1:10930:G:A G A
Therefore, I wrote the following command line:
zcat input.vcf.gz | sed 's/^chr//' > output.vcf.gz
and it worked. The problem is that I want to save the output file as a zipped one, with the vcf.gz extension. Even if I wrote ‘output.vcf.gz’, the output file is not zipped.
How can I modify a zipped file and then save it as a zipped file again?
Many thanks!
Simply add gzip in the pipe:
zcat input.vcf.gz | sed 's/^chr//' | gzip > output.vcf.gz
zcat really is just a convenience function of gzip; to cite the gzip/gunzip/zcat manual page (man zcat):
The zcat command is identical to gunzip -c.
Just as you can use gunzip -c (or zcat) in a piped chain of programs, you can use gzip to compress again:
zcat input.vcf.gz | sed 's/^chr//' | gzip > output.vcf.gz
#                                    ^^^^
or
gunzip -c input.vcf.gz | sed 's/^chr//' | gzip > output.vcf.gz
#^^^^^^^^                                 ^^^^
if you like consistency.
That’s it. That’s all there is to it.
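If you want to convince yourself that the output really is compressed this time, here is a minimal sketch that builds a tiny sample file, runs the same pipeline, and checks the result with gzip -t. The file names sample.vcf.gz and sample.nochr.vcf.gz are made up for the example; swap in your own.

```shell
#!/bin/sh
# Sketch: run the decompress | sed | recompress pipeline on a tiny sample
# and confirm that the result is a valid gzip file.
# sample.vcf.gz and sample.nochr.vcf.gz are hypothetical file names.
set -eu

# Build a small gzipped input resembling the one in the question.
printf '%s\n' \
    '#CHROM POS ID REF ALT' \
    'chr1 10894 chr1:10894:G:A G A' \
    'chr1 10915 chr1:10915:G:A G A' |
    gzip > sample.vcf.gz

# Decompress, strip a leading "chr" from each line, and compress again.
gunzip -c sample.vcf.gz | sed 's/^chr//' | gzip > sample.nochr.vcf.gz

# gzip -t exits non-zero if the file is not a valid gzip stream.
gzip -t sample.nochr.vcf.gz && echo "output is valid gzip"

# Show the modified content.
gunzip -c sample.nochr.vcf.gz
```

Running gzip -t on the file produced by the command in the question (without the final gzip) should fail instead, because that file is plain text with a misleading extension.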
Oh, taking a bet here: you’re doing bioinformatics, and your vcf file is actually a "Variant Call Format" file, and can be rather large. gzip isn’t a very fast decompressor, and it is a rather slow compressor. If you’re stuck using the gzip compression file format, use pigz instead:
unpigz -c input.vcf.gz | sed 's/^chr//' | pigz > output.vcf.gz
#^^^^^^^^                                 ^^^^
pigz does exactly the same as gzip, but its compression scales to multiple CPU cores (decompression remains mostly single-threaded). Use it.
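Since pigz is not installed everywhere, a script that has to run on several machines could pick the compressor at run time. The following is only a sketch under that assumption; the variable name compressor and the demo file names are made up for illustration.

```shell
#!/bin/sh
# Sketch: prefer pigz when it is installed, otherwise fall back to gzip.
# The variable "compressor" and the demo.* file names are hypothetical.
set -eu

if command -v pigz >/dev/null 2>&1; then
    compressor=pigz
else
    compressor=gzip
fi

# Small sample input so the example runs on its own.
printf '%s\n' \
    '#CHROM POS ID REF ALT' \
    'chr1 10894 chr1:10894:G:A G A' |
    gzip > demo.vcf.gz

# Both tools accept -d (decompress) and -c (write to stdout),
# and both write gzip-format output when compressing.
"$compressor" -dc demo.vcf.gz | sed 's/^chr//' | "$compressor" > demo.nochr.vcf.gz

echo "compressed with: $compressor"
```

Either way the result is an ordinary gzip file, so anything that reads .vcf.gz should not care which of the two tools wrote it.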
If you’re not bound to keep these files in a gzip container, but are free to choose a more modern format, zstd is an option:
unpigz -c input.vcf.gz | sed 's/^chr//' | zstd -T0 -8 > output.vcf.zst
# decompress using     |                | ^^^^ ^^^ ^^
# unpigz instead of    |     modify     |  |    |   \ compression ratio:
# gzip/zcat            |                |  |    |     -0=very fast, 18=very compressed
#                      |                |  |    |     -8 is much better compressed
#                      |                |  |    |     than gzip --best, but faster
#                      |                |  |    |
#                      |                |  |    \-- Use as many threads as CPU cores
#                      |                |  |
#                      |                |  \------- Use zstd instead of gzip
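To read such a file back, zstd -dc plays the role that zcat plays for gzip. Below is a small sketch of the round trip, assuming zstd is installed (the script only prints a note otherwise); the demo file name is made up for the example.

```shell
#!/bin/sh
# Sketch: write a zstd-compressed file and read it back.
# demo.nochr.vcf.zst is a hypothetical file name.
set -eu

if command -v zstd >/dev/null 2>&1; then
    # Strip the leading "chr" and compress with zstd:
    # -T0 uses as many threads as CPU cores, -8 is the compression level.
    printf '%s\n' \
        '#CHROM POS ID REF ALT' \
        'chr1 10894 chr1:10894:G:A G A' |
        sed 's/^chr//' | zstd -T0 -8 > demo.nochr.vcf.zst

    # -t checks the integrity of the compressed file; -q keeps it quiet.
    zstd -t -q demo.nochr.vcf.zst

    # -d decompresses, -c writes to stdout (the zstd counterpart of zcat).
    zstd -dc demo.nochr.vcf.zst
else
    echo "zstd is not installed; nothing to demonstrate" >&2
fi
```

Keep in mind that tools which expect a .vcf.gz input may not read .zst directly, in which case you would pipe the output of zstd -dc into them.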