Can awk be told to count the character string length rather than byte string length for '%10s' printf formats?

Try this for an output of |Ü| X|:

echo 'Ü X' | awk '{printf("|% 2s|% 2s|\n", $1, $2)}'

Apparently awk counts the byte length of the Ü, not its character length, so the count is 2 and no left padding with a space is applied, as it is for the X.

Is it possible to run awk in a mode which counts character lengths for the %<count>s printf pattern, not byte length?

The same question exists for bash's printf. I hope the answer is not the same: "passthrough to libc printf" :-/

I was not using gawk, but whatever version Ubuntu 22.04 (Jammy Jellyfish) had installed for me. It did not occur to me that anything but gawk could be installed these days :-/

Asked By: Harald

||

GNU awk (and possibly some other awk variants):

$ echo 'Ü X' | LC_ALL='en_US.UTF-8' awk '{printf "|% 2s|% 2s|\n", $1, $2}'
| Ü| X|

Bash 3.0+ (and possibly some other shells, possibly with tweaks):

$ LC_ALL='en_US.UTF-8' a='Ü' b='X'
$ printf '|%*s%s|%*s%s|\n' "$(( 2 - ${#a} ))" '' "$a" "$(( 2 - ${#b} ))" '' "$b"
| Ü| X|

Note that the bash version has to set LC_ALL in the shell that is executing ${#a}, not just in printf's environment as happens with the awk version. If you don't want LC_ALL to change in the calling shell, either save and restore it, i.e. o="$LC_ALL"; LC_ALL='en_US.UTF-8' ... "$b"; LC_ALL="$o", or do everything in a subshell, i.e. ( LC_ALL='en_US.UTF-8' ... "$b" ).
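The subshell variant can be wrapped in a small function. This is a minimal sketch rather than part of the original answer: the name lpad is hypothetical, and it assumes the en_US.UTF-8 locale is installed (without it, ${#str} falls back to counting bytes). It also clamps the padding at zero, because a negative width passed to %*s is interpreted as left-justification and would add stray spaces for strings longer than the requested width.

```shell
#!/usr/bin/env bash
# lpad WIDTH STRING  (hypothetical helper, not from the original answer)
# Prints STRING left-padded with spaces to WIDTH characters.
# The function body is a subshell, so the LC_ALL assignment never
# leaks into the calling shell.
lpad() (
    LC_ALL='en_US.UTF-8'          # assumption: this locale is installed
    width=$1
    str=$2
    pad=$(( width - ${#str} ))    # ${#str} counts characters under this locale
    if [ "$pad" -lt 0 ]; then     # string already wider than WIDTH: no padding
        pad=0
    fi
    printf '%*s%s' "$pad" '' "$str"
)

printf '|%s|%s|\n' "$(lpad 2 'Ü')" "$(lpad 2 'X')"
```

With the locale available, the final line should reproduce the | Ü| X| output shown above, and LC_ALL in the calling shell is left untouched.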

Explanations:

From the GNU awk documentation:

-b
--characters-as-bytes

Cause gawk to treat all input data as single-byte characters. In addition, all output written with print or printf is treated as
single-byte characters.

Normally, gawk follows the POSIX standard and attempts to process its
input data according to the current locale (see Where You Are Makes a
Difference). This can often involve converting multibyte characters
into wide characters (internally), and can lead to problems or
confusion if the input data does not contain valid multibyte
characters. This option is an easy way to tell gawk, “Hands off my
data!”

With GNU awk 5.2.2, setting an appropriate locale makes awk treat each multibyte character as a single character:

$ echo 'Ü X' | LC_ALL='en_US.UTF-8' awk '{printf "|% 2s|% 2s|\n", $1, $2}'
| Ü| X|

whereas using a single-byte locale such as C, or using -b, will treat all input as single-byte characters:

$ echo 'Ü X' | LC_ALL='C' awk '{printf "|% 2s|% 2s|\n", $1, $2}'
|Ü| X|

$ echo 'Ü X' | awk -b '{printf "|% 2s|% 2s|\n", $1, $2}'
|Ü| X|

When -b is used the result is independent of your locale:

$ echo 'Ü X' | LC_ALL='en_US.UTF-8' awk -b '{printf "|% 2s|% 2s|\n", $1, $2}'
|Ü| X|

$ echo 'Ü X' | LC_ALL='C' awk -b '{printf "|% 2s|% 2s|\n", $1, $2}'
|Ü| X|

As @StéphaneChazelas mentioned in a comment, see Why is printf "shrinking" umlaut? for the related behavior of printf in the shell. There, @Léa Gris's answer suggests the following, which gets the character counts, and so the formatted output, correct in bash 3.0 and later:

$ a='Ü' b='X' LC_ALL='en_US.UTF-8'
$ printf '|%*s%s|%*s%s|\n' "$(( 2 - ${#a} ))" '' "$a" "$(( 2 - ${#b} ))" '' "$b"
| Ü| X|

and that functionality is also affected by locale:

$ LC_ALL='C'
$ printf "|%*s%s|%*s%s|\n" "$(( 2 - ${#a} ))" '' "$a" "$(( 2 - ${#b} ))" '' "$b"
|Ü| X|

See also length-of-string-in-bash for more information on getting the length of characters in bash.

Answered By: Ed Morton