Can awk be told to count the character string length rather than byte string length for '%10s' printf formats?
Try this, which prints |Ü| X|:
echo 'Ü X' | awk '{printf("|% 2s|% 2s|\n", $1, $2)}'
Obviously awk counts the byte length, not the character length, of the Ü, so the count is 2 and no left padding with a space is added, as it is for the X.
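To make the byte count concrete, here is a small check, assuming the text is UTF-8 encoded (as it is by default on Ubuntu). It only uses od and wc, and does not depend on the awk variant or on the current locale:

```shell
#!/bin/sh
# Assumption: this script is saved as UTF-8, so 'Ü' is the bytes 0xC3 0x9C.
printf 'Ü' | od -An -tx1   # hex dump of the bytes that make up Ü
printf 'Ü' | wc -c         # number of bytes
```

The dump shows two bytes for the one character, which is the "2" that a byte-counting printf sees when it applies the field width.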
Is it possible to run awk in a mode that counts character lengths, not byte lengths, for the %<count>s printf pattern?
The same question exists for bash's printf. I hope the answer is not the same: "passthrough to libc printf" :-/
I was not using gawk, but whatever version Ubuntu 22.04 (Jammy Jellyfish) had installed for me. It did not occur to me that anything but gawk could be installed these days :-/
GNU awk (and possibly some other awk variants):
$ echo 'Ü X' | LC_ALL='en_US.UTF-8' awk '{printf "|% 2s|% 2s|\n", $1, $2}'
| Ü| X|
Bash 3.0+ (and possibly some other shells, possibly with tweaks):
$ LC_ALL='en_US.UTF-8' a='Ü' b='X'
$ printf '|%*s%s|%*s%s|\n' "$(( 2 - ${#a} ))" '' "$a" "$(( 2 - ${#b} ))" '' "$b"
| Ü| X|
Note that the bash version has to set LC_ALL in the shell that evaluates ${#a}, not just in printf's environment, which is all the awk version needs. So if you don't want LC_ALL to change in the calling shell, you need to either save and restore it, i.e. o="$LC_ALL"; LC_ALL='en_US.UTF-8' ... "$b"; LC_ALL="$o", or do everything in a subshell, i.e. ( LC_ALL='en_US.UTF-8' ... "$b" ).
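As a minimal sketch of the subshell variant, the padding logic could be wrapped in a function whose body is itself a subshell. The name pad_left and its argument order are made up for illustration, and the sketch assumes the en_US.UTF-8 locale is installed; without it, bash falls back to counting bytes:

```shell
#!/usr/bin/env bash
# pad_left WIDTH STRING: right-align STRING in WIDTH character columns.
# (pad_left is a hypothetical helper name, not from the answer above.)
# The body is written as ( ... ), so it runs in a subshell and the
# LC_ALL assignment never reaches the calling shell.
pad_left() (
    LC_ALL='en_US.UTF-8'          # assumption: this locale is installed
    width=$1
    str=$2
    pad=$(( width - ${#str} ))    # ${#str} counts characters in this locale
    if [ "$pad" -lt 0 ]; then     # a negative width would flip %*s to
        pad=0                     # left-justified padding, so clamp it
    fi
    printf '%*s%s' "$pad" '' "$str"
)

printf '|%s|%s|\n' "$(pad_left 2 'Ü')" "$(pad_left 2 'X')"
```

With a working UTF-8 locale this should print | Ü| X|, and LC_ALL in the calling shell is left as it was.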
Explanations:
From the GNU awk documentation:
-b --characters-as-bytes
Cause gawk to treat all input data as single-byte characters. In addition, all output written with print or printf is treated as single-byte characters.
Normally, gawk follows the POSIX standard and attempts to process its
input data according to the current locale (see Where You Are Makes a
Difference). This can often involve converting multibyte characters
into wide characters (internally), and can lead to problems or
confusion if the input data does not contain valid multibyte
characters. This option is an easy way to tell gawk, “Hands off my
data!”
With GNU awk 5.2.2, setting a UTF-8 locale makes awk count each multibyte character as a single character:
$ echo 'Ü X' | LC_ALL='en_US.UTF-8' awk '{printf "|% 2s|% 2s|\n", $1, $2}'
| Ü| X|
whereas using a single-byte locale such as C, or using -b, makes it treat all input as single-byte characters:
$ echo 'Ü X' | LC_ALL='C' awk '{printf "|% 2s|% 2s|\n", $1, $2}'
|Ü| X|
$ echo 'Ü X' | awk -b '{printf "|% 2s|% 2s|\n", $1, $2}'
|Ü| X|
When -b is used, the result is independent of your locale:
$ echo 'Ü X' | LC_ALL='en_US.UTF-8' awk -b '{printf "|% 2s|% 2s|\n", $1, $2}'
|Ü| X|
$ echo 'Ü X' | LC_ALL='C' awk -b '{printf "|% 2s|% 2s|\n", $1, $2}'
|Ü| X|
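For an awk that always counts bytes, one possible workaround (not something the gawk documentation or the output above prescribes) is to compute the padding yourself and skip the UTF-8 continuation bytes when measuring the string. The sketch below assumes valid UTF-8 input and counts code points rather than display columns; pad_cols, charlen and lpad are hypothetical names:

```shell
#!/bin/sh
# pad_cols is a hypothetical wrapper. LC_ALL=C applies to this awk process
# only and makes every awk implementation work on plain bytes.
pad_cols() {
    LC_ALL=C awk '
    # charlen: count UTF-8 characters by dropping continuation bytes
    # (bit pattern 10xxxxxx, octal 200-277) before taking the length.
    function charlen(s,    t) {
        t = s
        gsub(/[\200-\277]/, "", t)
        return length(t)
    }
    # lpad: prepend spaces until s is w characters wide.
    function lpad(s, w,    n, pad) {
        pad = ""
        for (n = w - charlen(s); n > 0; n--)
            pad = pad " "
        return pad s
    }
    { printf "|%s|%s|\n", lpad($1, 2), lpad($2, 2) }'
}

echo 'Ü X' | pad_cols
```

This should print | Ü| X| regardless of which awk is installed, since the width arithmetic no longer relies on printf's own notion of string length.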
As @StéphaneChazelas mentioned in a comment, see Why is printf "shrinking" umlaut? for the related behavior of printf in the shell. There, @Léa Gris's answer suggests the following, which gets the character counts, and so the formatted output, correct in bash 3.0 and later:
$ a='Ü' b='X' LC_ALL='en_US.UTF-8'
$ printf '|%*s%s|%*s%s|\n' "$(( 2 - ${#a} ))" '' "$a" "$(( 2 - ${#b} ))" '' "$b"
| Ü| X|
and that functionality is also affected by locale:
$ LC_ALL='C'
$ printf "|%*s%s|%*s%s|\n" "$(( 2 - ${#a} ))" '' "$a" "$(( 2 - ${#b} ))" '' "$b"
|Ü| X|
See also length-of-string-in-bash for more information on getting the length of a string in characters in bash.