Pipe find list of files into xargs gzip and pipe again into pigz

I need to find files newer than x days and then pack them into a gzipped tar archive, but I want to do the compression with pigz.

For now I’m doing it the slow way; this works:

find /path/to/src -type f -mtime -90 | xargs tar -zcf archive.tar.gz

But pigz is tremendously faster, so I want to run this gzip using pigz instead. I tried this but it isn’t working:

find /path/to/src -type f -mtime -90 | xargs tar -zcf | pigz > archive.tar.gz

It returns an error because I just guessed what to do (and tried a couple ways):

tar (child): /path/to/src: Cannot open: Is a directory
tar (child): Error is not recoverable: exiting now

How can I take the first command, which works, and pipe its output into pigz?

Asked By: user7211

||

Assuming GNU or libarchive’s tar:

find /path/to/src -type f -mtime -90 -print0 |
  tar -cf - --no-recursion --null -T - |
  pigz > archive.tar.gz

(--no-recursion is not strictly necessary here, as the files reported by find are never of type directory.)

Don’t use xargs (which in any case can only be used safely on find’s output if you combine xargs -0 with find’s -print0), as it could end up running tar more than once. Each run would overwrite the archive written by the previous one, so you’d end up with an archive containing only the last batch.
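To see the batching behaviour, here is a small sketch that forces xargs to split its input with -n 2. That limit is artificially low and chosen purely for illustration (by default xargs only splits when the command line would become too long), and echo is used so that the tar commands are printed rather than run:

```shell
# Simulate xargs splitting a long file list into batches.
# -n 2 allows at most two file arguments per command, standing in for
# the much larger size limit that xargs normally applies on its own.
# "echo" prints each tar command line instead of executing it.
batches=$(printf '%s\n' a b c d e | xargs -n 2 echo tar -cf archive.tar)
printf '%s\n' "$batches"
# Prints:
#   tar -cf archive.tar a b
#   tar -cf archive.tar c d
#   tar -cf archive.tar e
```

Each printed line would be a separate tar -c run writing to the same archive.tar, so only the files from the last line (here just e) would survive.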

In the find | tar | pigz pipeline above, we’re passing the list of files to tar directly via a pipe with -T -, so there’s no limit on how many files may be passed that way. That also means tar can start archiving the files as soon as they’re found.
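A minimal way to try the find | tar | compressor pipeline on scratch data might look like the following sketch. The directory layout and file names are made up for illustration, GNU or libarchive tar is assumed, and gzip is used as a stand-in when pigz is not installed (both produce gzip-format output):

```shell
# Hypothetical scratch directory and file names, for illustration only.
work=$(mktemp -d)
mkdir -p "$work/src/sub"
touch "$work/src/plain.txt" "$work/src/with space.txt" "$work/src/sub/nested.txt"

# Fall back to gzip when pigz is not installed; both write gzip format.
comp=$(command -v pigz || command -v gzip)

# find emits NUL-terminated names; tar reads them from stdin (-T -) and
# writes the archive to stdout (-f -), which the compressor then packs.
(cd "$work" &&
  find src -type f -mtime -90 -print0 |
  tar -cf - --no-recursion --null -T - |
  "$comp" > archive.tar.gz)

# Decompress and list the members to confirm all three files made it in.
members=$("$comp" -dc "$work/archive.tar.gz" | tar -tf - | wc -l | tr -d ' ')
echo "archived members: $members"
```

Running find from inside the scratch directory keeps the member names relative (src/...), which is usually what you want when extracting elsewhere.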

star (@schily’s (RIP) tar) also has built-in find functionality:

star cf - -find /path/to/src -type f -mtime -90 |
  pigz > archive.tar.gz

You can also take the same approach as with the other two tar implementations above, using this syntax:

find /path/to/src -type f -mtime -90 -print0 |
  star cf - -read0 list=- |
  pigz > archive.tar.gz

tar is a very unportable command. Even the tar formats are unportable. X/Open / SUSv2 used to specify a tar command (and cpio), but they eventually gave up on it as it was impossible to reconcile the tars from different vendors, and instead POSIX / SUS came up with pax as a replacement for both.

pax takes the list of files from stdin, but unfortunately newline-delimited rather than NUL-delimited, which means it can’t archive arbitrary file names, though some pax implementations support a -0 extension for that (find’s -print0 is also not POSIX, though it can be replaced with -exec printf '%s\0' {} +). So, with those:

find /path/to/src -type f -mtime -90 -print0 |
  pax -0w |
  pigz > archive.tar.gz

(Note that the default output format is not fixed by POSIX, which is another weakness of pax. Its worst weakness is its very low adoption.)
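As a quick check of the -exec printf '%s\0' {} + replacement for -print0 mentioned above, this sketch compares the two byte for byte on a scratch directory. The file names are hypothetical, and it assumes a find that supports -print0 so the comparison can be made at all:

```shell
# Hypothetical scratch directory with one awkward file name.
work=$(mktemp -d)
touch "$work/a.txt" "$work/b c.txt"

# Write both NUL-delimited lists next to (not inside) the directory,
# so the two find runs see exactly the same contents.
find "$work" -type f -print0 > "$work.print0"
find "$work" -type f -exec printf '%s\0' {} + > "$work.printf"

# cmp -s is silent and only reports through its exit status.
if cmp -s "$work.print0" "$work.printf"; then same=yes; else same=no; fi
echo "identical output: $same"
```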

Answered By: Stéphane Chazelas

With GNU tar on any shell that supports process substitution (e.g. bash, ksh, zsh):

tar cf archive.tar.gz -I pigz --null -T <(find /path/to/src -type f -mtime -90 -print0)

This uses pigz to do the compression, and takes the (NUL-separated) list of files to include in the archive from the output of find ... -print0, via the -T or --files-from=FILE option and process substitution.

Alternatively, if you are using a minimalist POSIX-features-only shell (e.g. ash or dash, or bash running as /bin/sh or with --posix or set -o posix or with the POSIXLY_CORRECT environment variable set) you can pipe a NUL-separated list of filenames into GNU tar. The - following the -T option tells tar to read the file list from stdin.

find /path/to/src -type f -mtime -90 -print0 | tar cf archive.tar.gz -I pigz --null -T -

Either of these works with any valid filename, even those containing spaces, newlines and shell metacharacters. They also avoid the problem of too-many-filenames mentioned by @Kusalananda in his comment.
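One way to convince yourself that awkward names survive the round trip is a sketch like the one below. It assumes GNU tar (for -I), uses made-up file names, and falls back to gzip when pigz is not installed:

```shell
# Hypothetical scratch data: names with a space, a newline and a glob character.
work=$(mktemp -d)
mkdir "$work/src" "$work/out"
nl='
'
touch "$work/src/plain" "$work/src/two words" \
      "$work/src/line${nl}break" "$work/src/glob*"

# Fall back to gzip when pigz is not installed.
comp=$(command -v pigz || command -v gzip)

# Same shape as the piped command above: NUL-delimited list on stdin,
# compression delegated to the external program named by -I.
(cd "$work" &&
  find src -type f -print0 |
  tar -cf archive.tar.gz -I "$comp" --null -T -)

# Extract into a separate directory and count NUL terminators, which
# stays accurate even when a name itself contains a newline.
tar -xf "$work/archive.tar.gz" -I "$comp" -C "$work/out"
restored=$(find "$work/out" -type f -print0 | tr -dc '\0' | wc -c | tr -d ' ')
echo "restored files: $restored"
```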

BTW, you may want to investigate using pixz instead of pigz. It does xz compression (which generally compresses much better than gzip, but is slower), and pixz will add an index to speed up extraction of specific files if it detects tar-like input. Both pixz and xz-utils are packaged for most common Linux distributions, so they should be easy to install.

Answered By: cas