How to get both the number of bytes and the sha1sum in a single pass?

I want to get both the number of bytes and the sha1sum of a command’s output.

In principle, one can always do something like:

BYTES="$( somecommand | wc -c )"
DIGEST="$( somecommand | sha1sum | sed 's/ .*//' )"

…but, for the use-case I am interested in, somecommand is rather time-consuming, and produces a ton of output, so I’d prefer to call it only once.

One way that comes to mind would be something like

evil() {
  {
    somecommand | 
      tee >( wc -c | sed 's/^/BYTES=/' ) | 
      sha1sum | 
      sed 's/ .*//; s/^/DIGEST=/'
  } 2>&1
}

eval "$( evil )"

…which seems to work, but makes me die a little inside.

I wonder if there is a better (more robust, more general) way to capture the output of different segments of a pipeline into separate variables.

EDIT: The problem I am working on at the moment is in bash, so I am mostly interested in solutions for this shell, but I do a lot of zsh programming also, so I have some interest in those solutions as well.

EDIT2: I tried to port Stéphane Chazelas’ solution to bash, but it didn’t quite work:

#!/bin/bash

cmd() {
    printf -- '%1000s'
}

bytes_and_checksum() {
    local IFS
    cmd | tee >(sha1sum > $1) | wc -c | read bytes || return
    read checksum rest_ignored < $1 || return
}

set -o pipefail
unset bytes checksum
bytes_and_checksum "$(mktemp)"
printf -- 'bytes=%s\n' $bytes
printf -- 'checksum=%s\n' $checksum

When I run the script above, the output I get is

bytes=
checksum=96d89030c1473585f16ec7a52050b410e44dd332

The value of checksum is correct. I can’t figure out why the value of bytes is not set.

EDIT3: OK, thanks to @muru’s tip, I fixed the problem:

#!/bin/bash

cmd() {
    printf -- '%1000s'
}

bytes_and_checksum() {
    local IFS
    read bytes < <( cmd | tee >(sha1sum > $1) | wc -c ) || return
    read checksum rest_ignored < $1 || return
}

set -o pipefail
unset bytes checksum
bytes_and_checksum "$(mktemp)"
printf -- 'bytes=%s\n' $bytes
printf -- 'checksum=%s\n' $checksum

Now:

bytes=1000
checksum=96d89030c1473585f16ec7a52050b410e44dd332
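For context, here is a minimal sketch of why the first attempt left bytes empty while the second one set it. In bash, every element of a pipeline runs in a subshell by default, so a variable assigned by read at the end of a pipeline disappears with that subshell. The variable names below are made up for the demonstration, and the lastpipe variant assumes bash 4.2 or later with job control off, as in a script.

```bash
#!/bin/bash
# Each part of a bash pipeline runs in its own subshell by default, so a
# variable set by `read` at the end of a pipeline is lost when it exits.
unset via_pipe via_procsub via_lastpipe

printf '1000\n' | read -r via_pipe          # assigned in a subshell, then discarded
read -r via_procsub < <(printf '1000\n')    # read runs in the current shell

# bash 4.2+: run the last pipeline element in the current shell.
# This only takes effect while job control is off, as it is in scripts.
shopt -s lastpipe
printf '1000\n' | read -r via_lastpipe

printf 'via_pipe=%s\n'     "${via_pipe-<unset>}"
printf 'via_procsub=%s\n'  "$via_procsub"
printf 'via_lastpipe=%s\n' "$via_lastpipe"
```

Only the first variable stays unset; the other two forms keep read in the current shell.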

UNFORTUNATELY

…my bytes_and_checksum function stalls (deadlock?) when cmd produces a lot more output than was the case in my toy example above.

Back to the drawing board…

Asked By: kjo

||

Would be easier to use temp files. In zsh:

(){set -o localoptions -o pipefail; local IFS
  {cmd} > >(sha1sum > $1) | wc -c | read bytes || return
  read checksum rest_ignored < $1 || return
} =()

Beware many wc implementations include whitespace around the number they output. read with the default value of $IFS strips them.
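A small bash sketch of that stripping behaviour. The padded string is a stand-in for what a padding wc might print, so the demo does not depend on the local wc implementation.

```bash
#!/bin/bash
# Simulated output of a wc that pads the count with blanks.
padded='     1000 '

IFS=$' \t\n' read -r bytes <<< "$padded"   # default IFS: surrounding blanks are stripped
IFS= read -r raw <<< "$padded"             # empty IFS: blanks are kept as-is

printf 'bytes=[%s]\n' "$bytes"
printf 'raw=[%s]\n' "$raw"
```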

Note that the exit status of sha1sum is lost.

Here, =() creates an empty temp file (it holds the output of an empty command, that is, nothing at all). That temp file is removed automatically when the command it’s given to (here an anonymous function) returns.

In cmd > file | other-cmd, cmd’s output is teed internally by zsh since it’s redirected twice, so here both to sha1sum and to wc. We wrap cmd in {...} to make sure zsh waits for the process-redirections to finish.

Here as the output of both sha1sum and wc is guaranteed to be no larger than a few bytes, they could also be sent to pipes, and you would not have to read from those pipes concurrently (which zsh can do as it has an interface to select()/poll() but not bash). That could be done sequentially without causing deadlocks, so it’s an easy version of tee into different variables.

On Linux-based systems (where /dev/fd/x when x is a fd to a pipe behaves like a named pipe):

{
  IFS=$' \t' read bytes < <(cmd 3<&- | tee >(sha1sum > /dev/fd/3) | wc -c)
  IFS=$' \t' read sum rest <&3
} 3< <(:)

(would even work in bash).
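For bash users who prefer the temp-file route mentioned at the top, here is a hedged sketch rather than a tested recipe. It keeps sha1sum and wc as ordinary pipeline members so that the shell waits for both, and parks only the byte count in a temp file. It assumes /dev/fd is available; the function name is borrowed from the question, and the globals bytes and checksum are simply a convention chosen for this example.

```bash
#!/bin/bash
# Hypothetical helper: runs "$@" once and sets the globals `bytes` and `checksum`.
# Assumes /dev/fd is available (Linux and most modern Unix-likes).
bytes_and_checksum() {
    local tmp status
    tmp=$(mktemp) || return
    checksum=$(
        set -o pipefail
        # fd 3 is a copy of the pipe to sha1sum; tee sends one copy there
        # and the other to wc. All of them are ordinary pipeline members,
        # so the shell waits for every one before the substitution returns.
        { "$@" | tee /dev/fd/3 | wc -c > "$tmp"; } 3>&1 | sha1sum
    )
    status=$?
    read -r bytes < "$tmp"
    rm -f -- "$tmp"
    checksum=${checksum%% *}    # drop the trailing file-name field
    return "$status"
}

cmd() { printf '%1000s'; }      # toy command from the question: 1000 spaces

bytes_and_checksum cmd
printf 'bytes=%s\nchecksum=%s\n' "$bytes" "$checksum"
```

Because both consumers read continuously, tee never blocks on one of them, and the temp file only ever holds a few bytes. With pipefail set inside the command substitution, a failure of the wrapped command is reflected in the function's return status.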

For details about the deadlocks you’d run into with larger outputs, see also tee + cat: use an output several times and then concatenate results.

Answered By: Stéphane Chazelas

I am using a backup bash script which has the following helper "in-between" functions which take a "supposed filename" as an argument (see tar.gz example below):

function pipesum
{
  tee >(sha1sum | awk --assign F="${1##*/}" '$2=F' > "${1?}.sha1")
}
function pipelen
{
  tee >(wc -c > "${1?}.len")
}
function pipesumlen
{
  tee >(sha1sum | awk --assign F="${1##*/}" '$2=F' > "${1?}.sha1") >(wc -c > "${1?}.len")
}
function pipechecksum
{
  tee >(sha1sum --quiet -c <(awk '$2="-"' "${1?}") >&2)
}

Example:

$ echo 123 | pipesumlen filename
123
$ ls filename*
filename.len  filename.sha1
$ cat filename*
4
a8fdc205a9f19cc1c7507a60c4f01b13d11d7fd0 filename
$ echo 123 | pipechecksum filename.sha1
123
$ echo 1234 | pipechecksum filename.sha1
1234
-: FAILED
sha1sum: WARNING: 1 computed checksum did NOT match

I am using it in a script which is very time-, CPU- and IO-intensive, something like this:

tar | 
  pipesumlen mybackup.tar | 
  gzip > mybackup.tar.gz
<mybackup.tar.gz gunzip | 
  pipechecksum mybackup.tar.sha1 | 
  xz > mybackup.tar.xz

Thus I have my backup checked against random memory/disk bit flips. It creates the "mybackup.tar.sha1" file as if the "mybackup.tar" was actually created and checksummed while in truth in this example the uncompressed data is never written on disk.

Caveat: pipechecksum does not terminate the script on error, even with set -euo pipefail. Here is an alternative pipechecksum that does return nonzero on a checksum mismatch:

function pipechecksum
{
  { tee /dev/fd/$N | sha1sum --quiet -c <(awk '$2="-"' "${1?}") >&2; } {N}>&1
}

It seems fine, but I only came up with it today, so I cannot consider it proven.
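One way to gain some confidence is to exercise the function with a matching and a mismatching input and compare the exit statuses. The sketch below does that. It assumes GNU sha1sum (for --quiet), an available /dev/fd, and bash 4.1 or later for the {N}>&1 syntax; the demo.sha1 file is a throwaway created just for the check.

```bash
#!/bin/bash
# The alternative pipechecksum from above, unchanged.
function pipechecksum
{
  { tee /dev/fd/$N | sha1sum --quiet -c <(awk '$2="-"' "${1?}") >&2; } {N}>&1
}

dir=$(mktemp -d) || exit
printf '123\n' | sha1sum > "$dir/demo.sha1"    # hypothetical checksum file

# Passed-through data and sha1sum diagnostics are discarded; only the
# exit status matters here.
printf '123\n'  | pipechecksum "$dir/demo.sha1" > /dev/null
good=$?
printf '1234\n' | pipechecksum "$dir/demo.sha1" > /dev/null 2> /dev/null
bad=$?

rm -rf -- "$dir"
printf 'matching input: exit %s\nmismatching input: exit %s\n' "$good" "$bad"
```

A zero status for the first call and a nonzero one for the second would be consistent with the intended behaviour, though this covers only the simple case.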

Answered By: legolegs