shell: keep trailing newlines ('\n') in command substitution
I want to be able to capture the exact output of a command substitution, including the trailing newline characters.
I realise that they are stripped by default, so some manipulation may be required to keep them, and I want to keep the original exit code.
For example, given a command with a variable number of trailing newlines and exit code:
f(){ for i in $(seq "$((RANDOM % 3))"); do echo; done; return $((RANDOM % 256));}
export -f f
I want to run something like:
exact_output f
And have the output be:
Output: $'\n\n'
Exit: 5
I’m interested in both bash and POSIX sh.
You can output a character after the normal output and then strip it:
#capture the output of "$@" (arguments run as a command)
#into the exact_output variable
exact_output()
{
exact_output=$( "$@" && printf X ) &&
exact_output=${exact_output%X}
}
This is a POSIX compliant solution.
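As a rough illustration of the difference, the sketch below compares a plain command substitution with the function above. The helper three_lines_out is a hypothetical stand-in for any command whose output ends in newlines; it is not part of the question.

```shell
# Function from the answer above: capture the output of "$@" into
# the variable exact_output, keeping trailing newlines.
exact_output() {
    exact_output=$( "$@" && printf X ) &&
    exact_output=${exact_output%X}
}

# Hypothetical test command: prints "a" followed by three newlines.
three_lines_out() { printf 'a\n\n\n'; }

plain=$(three_lines_out)        # trailing newlines are stripped
exact_output three_lines_out    # trailing newlines are kept

printf 'plain: %s characters\n' "${#plain}"
printf 'exact: %s characters\n' "${#exact_output}"
```

Note that when the command fails, the X is never printed and the stripping step is skipped, so the variable then holds the output without its trailing newlines.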
For the new question, this script works:
#!/bin/bash
f() { for i in $(seq "$((RANDOM % 3 ))"); do
echo;
done; return $((RANDOM % 256));
}
exact_output(){ out=$( $1; ret=$?; echo x; exit "$ret" ); ret=$?   # save the exit status right away
unset OldLC_ALL ; [ "${LC_ALL+set}" ] && OldLC_ALL=$LC_ALL
LC_ALL=C ; out=${out%x};
unset LC_ALL ; [ "${OldLC_ALL+set}" ] && LC_ALL=$OldLC_ALL
printf 'Output:%10q\nExit :%2s\n' "${out}" "$ret"
}
exact_output f
echo Done
On execution:
Output:$'\n\n\n'
Exit :25
Done
The longer description
The usual wisdom for POSIX shells to deal with the removal of \n is to add an x:
s=$(printf "%s" "${1}x"); s=${s%?}
That is required because the trailing newlines are removed by command substitution, per the POSIX specification:
removing sequences of one or more <newline> characters at the end of the substitution.
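A small sketch of that rule in action, assuming nothing beyond a POSIX sh with printf and od: two commands whose output differs only in the number of trailing newlines become indistinguishable once captured.

```shell
# Both commands differ only in the number of trailing newlines.
one=$(printf 'a\nb\n')
many=$(printf 'a\nb\n\n\n\n')

# After command substitution they compare equal: the inner newline
# survives, the trailing ones are all gone.
[ "$one" = "$many" ] && echo "same"

# Dump what was actually captured.
printf '%s' "$many" | od -An -c
```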
About a trailing x.
It has been said in this question that an x could be confused with the trailing byte of some character in some encoding. But how are we going to guess which character is better in some language in some possible encoding? That is a difficult proposition, to say the least.
However, that is simply incorrect.
The only rule that we need to follow is to add exactly what we remove.
It should be easy to understand that if we add something to an existing string (or byte sequence) and later we remove exactly the same something, the original string (or byte sequence) must be the same.
Where do we go wrong? When we mix characters and bytes.
If we add a byte, we must remove a byte; if we add a character, we must remove the exact same character.
The second option, adding a character (and later removing the exact same character) may become convoluted and complex, and, yes, code pages and encodings may get in the way.
However, the first option is quite possible, and, after explaining it, it will become plain simple.
Let’s add a byte, an ASCII byte (<127), and to keep things as simple as possible, let’s say an ASCII character in the range a-z. Or, as we should be saying it, a byte in the hex range 0x61 to 0x7a. Let’s choose any of those, maybe an x (really a byte of value 0x78). We can add such a byte by concatenating an x to a string (let’s assume an é):
$ a=é
$ b=${a}x
If we look at the string as a sequence of bytes, we see:
$ printf '%s' "$b" | od -vAn -tx1c
c3 a9 78
303 251 x
A string that ends in an x.
If we remove that x (byte value 0x78), we get:
$ printf '%s' "${b%x}" | od -vAn -tx1c
c3 a9
303 251
It works without a problem.
A slightly more difficult example.
Let’s say that the string we are interested in ends in byte 0xc3:
$ a=$'\x61\x20\x74\x65\x73\x74\x20\x73\x74\x72\x69\x6e\x67\x20\xc3'
And let’s add a byte of value 0xa9:
$ b=$a$'\xa9'
The string has become this now:
$ echo "$b"
a test string é
Exactly what I wanted: the last two bytes are one character in UTF-8 (so anyone can reproduce these results in their UTF-8 console).
If we remove a character, the original string will be changed. But that is not what we added; we added a byte value, which happens to be written as an x, but is a byte anyway.
What we need is to avoid misinterpreting bytes as characters. What we need is an action that removes the byte we used, 0xa9. In fact, ash, bash, lksh and mksh all seem to do exactly that:
$ c=$'\xa9'
$ echo ${b%$c} | od -vAn -tx1c
61 20 74 65 73 74 20 73 74 72 69 6e 67 20 c3 0a
a t e s t s t r i n g 303 \n
But not ksh or zsh.
However, that is very easy to solve: let’s tell all those shells to do byte removal:
$ LC_ALL=C; echo ${b%$c} | od -vAn -tx1c
That’s it: all shells tested work (except yash). Shown here for the last part of the string:
ash : s t r i n g 303 \n
dash : s t r i n g 303 \n
zsh/sh : s t r i n g 303 \n
b203sh : s t r i n g 303 \n
b204sh : s t r i n g 303 \n
b205sh : s t r i n g 303 \n
b30sh : s t r i n g 303 \n
b32sh : s t r i n g 303 \n
b41sh : s t r i n g 303 \n
b42sh : s t r i n g 303 \n
b43sh : s t r i n g 303 \n
b44sh : s t r i n g 303 \n
lksh : s t r i n g 303 \n
mksh : s t r i n g 303 \n
ksh93 : s t r i n g 303 \n
attsh : s t r i n g 303 \n
zsh/ksh : s t r i n g 303 \n
zsh : s t r i n g 303 \n
It is just that simple: tell the shell to remove an LC_ALL=C character, which is exactly one byte for all byte values from 0x00 to 0xff.
Beware that some shells don’t support changing the locale during runtime (even though this is required by POSIX).
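The save, strip and restore steps can be wrapped in a helper. The following is only a sketch for POSIX sh: strip_last_byte and its _slb_* variables are hypothetical names, and it assumes the shell honours a change of LC_ALL at runtime.

```shell
# strip_last_byte VAR SENTINEL (hypothetical helper)
# Removes SENTINEL from the end of the variable named VAR while the
# locale is C, then puts LC_ALL back the way it was (set or unset).
# VAR must be a valid variable name, since it is passed to eval.
strip_last_byte() {
    if [ "${LC_ALL+set}" ]; then
        _slb_old=$LC_ALL; _slb_had=1
    else
        _slb_had=0
    fi
    LC_ALL=C
    eval "$1=\${$1%\"\$2\"}"
    if [ "$_slb_had" -eq 1 ]; then LC_ALL=$_slb_old; else unset LC_ALL; fi
}

# Same bytes as the example above: a string ending in 0xc3 (octal 303)
# with the byte 0xa9 (octal 251) appended.
b=$(printf 'g \303\251')
c=$(printf '\251')
strip_last_byte b "$c"
printf '%s' "$b" | od -An -tx1
```

The dump should end in c3, showing that only the appended byte was removed.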
Solution that should generally work without changing the locale
While the above should work with any (except newline or null) byte as sentinel value, it can be made easier, without changing the locale:
Using . or / should be generally fine, as POSIX requires:
- “The encoded values associated with <period>, <slash>, <newline>, and <carriage-return> shall be invariant across all locales supported by the implementation.”, which means that these will have the same binary representation in any locale/encoding.
- “Likewise, the byte values used to encode <period>, <slash>, <newline>, and <carriage-return> shall not occur as part of any other character in any locale.”, which means that the above cannot happen, as no partial byte sequence could be completed by these bytes/characters to a valid character in any locale/encoding.
(see 6.1 Portable Character Set)
The above does not apply to other characters of the Portable Character Set.
Solution for comments:
For the example discussed in the comments, one possible solution (which fails in zsh) is:
#!/bin/bash
LC_ALL=zh_HK.big5hkscs
a=$(printf '\210\170');
b=$(printf '\170');
unset OldLC_ALL ; [ "${LC_ALL+set}" ] && OldLC_ALL=$LC_ALL
LC_ALL=C ; a=${a%"$b"};
unset LC_ALL ; [ "${OldLC_ALL+set}" ] && LC_ALL=$OldLC_ALL
printf '%s' "$a" | od -vAn -c
That will remove the problem of encoding.
POSIX shells
The usual trick to get the complete stdout of a command is to do:
output=$(cmd; ret=$?; echo .; exit "$ret")
ret=$?
output=${output%.}
The idea is to add an extra .\n to the output. Command substitution will only strip that \n, and you strip the . with ${output%.}.
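Wrapped in a function, the trick might look like the sketch below. The names capture and demo are hypothetical; demo only stands in for a command with trailing newlines and a non-zero exit status.

```shell
# capture CMD [ARGS...] (hypothetical helper)
# Sets $output to the exact stdout of the command and $ret to its
# exit status.
capture() {
    output=$("$@"; ret=$?; echo .; exit "$ret")
    ret=$?
    output=${output%.}
}

# Hypothetical test command: two trailing newlines, exit status 5.
demo() { printf 'hi\n\n'; return 5; }

capture demo
printf 'length=%s ret=%s\n' "${#output}" "$ret"   # prints: length=4 ret=5
```

Because the assignment takes the exit status of the command substitution, a script running under set -e would abort on that line whenever the command fails.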
Note that in shells other than zsh, that will still not work if the output has NUL bytes. With yash, that won’t work if the output is not text.
Also note that in some locales, it matters what character you insert at the end. . should generally be fine (see below), but some others might not. For instance x (as used in some other answers) or @ would not work in a locale using the BIG5, GB18030 or BIG5HKSCS charsets. In those charsets, the encoding of a number of characters ends in the same byte as the encoding of x or @ (0x78, 0x40).
For instance, ū in BIG5HKSCS is 0x88 0x78 (and x is 0x78 like in ASCII; all charsets on a system must have the same encoding for all the characters of the portable character set, which includes English letters, @ and .). So if cmd was printf '\x88' (which by itself is not a valid character in that encoding, but just a byte sequence) and we inserted x after it, ${output%x} would fail to strip that x as $output would actually contain ū (the two bytes making up a byte sequence that is a valid character in that encoding).
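The byte sequences involved can be inspected without having a BIG5HKSCS locale installed. This sketch only dumps the bytes that would end up in $output for each choice of sentinel; the variable names are made up for the illustration.

```shell
# Byte 0x88 (octal 210) followed by the sentinel x (0x78): the same
# two bytes that encode a single character in BIG5HKSCS.
with_x=$(printf '\210x' | od -An -tx1)

# The same byte followed by the sentinel . (0x2e), a byte value that
# POSIX says cannot be part of any other character.
with_dot=$(printf '\210.' | od -An -tx1)

printf 'x sentinel:%s\n. sentinel:%s\n' "$with_x" "$with_dot"
```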
Using . or / should be generally fine, as POSIX requires:
- “The encoded values associated with <period>, <slash>, <newline>, and <carriage-return> shall be invariant across all locales supported by the implementation.”, which means that these will have the same binary representation in any locale/encoding.
- “Likewise, the byte values used to encode <period>, <slash>, <newline>, and <carriage-return> shall not occur as part of any other character in any locale.”, which means that the above cannot happen, as no partial byte sequence could be completed by these bytes/characters to a valid character in any locale/encoding.
(see 6.1 Portable Character Set)
The above does not apply to other characters of the Portable Character Set.
Another approach, as discussed by @Isaac, would be to change the locale to C (which would also guarantee that any single byte can be correctly stripped), only for the stripping of the last character (${output%.}).
It would typically be necessary to use LC_ALL for that (in principle LC_CTYPE would be enough, but that could be accidentally overridden by any already set LC_ALL). Also, it would be necessary to restore the original value (or, e.g., use the non-POSIX local in a function). But beware that some shells don’t support changing the locale while running (though this is required by POSIX).
By using . or /, all that can be avoided.
bash/zsh alternatives
With bash and zsh, assuming the output has no NULs, you can also do:
IFS= read -rd '' output < <(cmd)
To get the exit status of cmd, you can do wait "$!"; ret=$? in bash but not in zsh.
rc/es/akanga
For completeness, note that rc/es/akanga have an operator for that. In them, command substitution, expressed as `cmd (or `{cmd} for more complex commands) returns a list (by splitting on $ifs, space-tab-newline by default). In those shells (as opposed to Bourne-like shells), the stripping of newline is only done as part of that $ifs splitting. So you can either empty $ifs or use the ``(seps){cmd} form where you specify the separators:
ifs = ''; output = `cmd
or:
output = ``()cmd
In any case, the exit status of the command is lost. You’d need to embed it in the output and extract it afterwards which would become ugly.
fish
In fish, command substitution is with (cmd) and doesn’t involve a subshell.
set var (cmd)
Creates a $var array with all the lines in the output of cmd if $IFS is non-empty, or with the output of cmd stripped of up to one (as opposed to all in most other shells) newline character if $IFS is empty.
So there’s still an issue in that (printf 'a\nb') and (printf 'a\nb\n') expand to the same thing even with an empty $IFS.
To work around that, the best I could come up with was:
function exact_output
set -l IFS . # non-empty IFS
set -l ret
set -l lines (
cmd
set ret $status
echo
)
set -g output ''
set -l line
test (count $lines) -le 1; or for line in $lines[1..-2]
set output $output$line\n
end
set output $output$lines[-1]
return $ret
end
Since version 3.4.0 (released in March 2022), you can do instead:
set output (cmd | string collect --allow-empty --no-trim-newlines)
With older versions, you could do:
read -z output < (begin; cmd; set ret $status; end | psub)
With the caveat that $output is an empty list instead of a list with one empty element if there’s no output.
Version 3.4.0 also added support for $(...) which behaves like (...) except that it can also be used inside double quotes, in which case it behaves like in the POSIX shell: the output is not split on lines but all trailing newline characters are removed.
Bourne shell
The Bourne shell did not support the $(...) form nor the ${var%pattern} operator, so it can be quite hard to achieve there. One approach is to use eval and quoting:
eval "
output='`
exec 4>&1
ret=\`
exec 3>&1 >&4 4>&-
(cmd 3>&-; echo \"\$?\" >&3; printf \"'\") |
awk 3>&- -v RS=\\\\' -v ORS= -v b='\\\\\\\\' '
NR > 1 {print RS b RS RS}; {print}; END {print RS}'
\`
echo \";ret=\$ret\"
`"
Here, we’re generating a
output='output of cmd
with the single quotes escaped as '\''
';ret=X
to be passed to eval. As for the POSIX approach, if ' was one of those characters whose encoding can be found at the end of other characters, we’d have a problem (a much worse one as it would become a command injection vulnerability), but thankfully, like ., it’s not one of those, and that quoting technique is generally the one that is used by anything that quotes shell code (note that \ has the issue, so shouldn’t be used (which also excludes "..." inside which you need to use backslashes for some characters). Here, we’re only using it after a ' which is OK).
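The same '\'' quoting technique can be sketched for a POSIX shell as a small function. shquote and its _s and _q variables are hypothetical names used only for this illustration; the Bourne shell itself would not accept the parameter expansions used here.

```shell
# shquote STRING (hypothetical helper)
# Prints STRING wrapped in single quotes, with each embedded single
# quote rewritten as '\'' so that the result can be passed to eval.
shquote() {
    _s=$1 _q=
    while :; do
        case $_s in
            *\'*) _q=$_q${_s%%\'*}"'\\''"; _s=${_s#*\'} ;;
            *)    _q=$_q$_s; break ;;
        esac
    done
    printf "'%s'" "$_q"
}

# A value with an embedded quote and a trailing newline.
value="it's
"
eval "copy=$(shquote "$value")"
[ "$copy" = "$value" ] && echo "round trip ok"
```

The closing quote doubles as the sentinel here: the quoted text never ends in a newline, so command substitution has nothing to strip.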
tcsh
See tcsh preserve newlines in command substitution `…` (not taking care of the exit status, which you could address by saving it in a temporary file (echo $status > $tempfile:q after the command))
Here’s a bash function that encapsulates the LC_ALL=C technique described by @Isaac.
# This function provides a general solution to the problem of preserving
# trailing newlines in a command substitution.
#
# cmdsub <command goes here>
#
# If the command succeeded, the result will be found in variable CMDSUB_RESULT.
cmdsub() {
local -r BYTE=$'\x78'
local result
if result=$("$@"; ret=$?; echo "$BYTE"; exit "$ret"); then
local LC_ALL=C
CMDSUB_RESULT=${result%"$BYTE"}
else
return "$?"
fi
}
Notes:
- $'\x78' was chosen for the dummy byte in order to test the corner case discussed in this Q&A discussion, but any byte could have been used except newline (0x0A) and NUL (0x00).
- Encapsulating it within a function had the added benefit that we could make LC_ALL a local variable, thus avoiding the need to save and restore its value.
- I considered using bash 4.3’s nameref feature to allow the caller to supply the name of the variable into which the result should be stored, but decided it would be better to support older bash.
- In principle, setting LC_CTYPE should be enough; however, if LC_ALL were already set “externally”, that would override the former.
Successfully tested the BIG5HKSCS corner case using bash 4.1:
#!/bin/bash
LC_ALL=zh_HK.big5hkscs
cmdsub() {
local -r BYTE=$'\x78'
local result
if result=$("$@"; ret=$?; echo "$BYTE"; exit "$ret"); then
local LC_ALL=C
CMDSUB_RESULT=${result%"$BYTE"}
else
return "$?"
fi
}
cmd() { echo -n $'\x88'; }
if cmdsub cmd; then
v=$CMDSUB_RESULT
printf '%s' "$v" | od -An -tx1
else
printf "The command substitution had a non-zero status code of %s\n" "$?"
fi
Result was 88 as expected.