Awk seems to be confused about what $1 is

I use awk a fair bit for parsing logs; I have never seen anything like this:
I have six files, each containing a number of lines; I want the lines containing "100", and to choose which columns to print.

me:~/tmp> grep 100 *.dl.tst

outputs what I expect:

100  139M  100  139M    0     0  6376k      0  0:00:22  0:00:22 --:--:-- 6539k
100  139M  100  139M    0     0  6677k      0  0:00:21  0:00:21 --:--:-- 6579k
100  139M  100  139M    0     0  6022k      0  0:00:23  0:00:23 --:--:-- 6093k
100  139M  100  139M    0     0  13.9M      0  0:00:10  0:00:10 --:--:-- 14.3M
100  139M  100  139M    0     0  14.3M      0  0:00:09  0:00:09 --:--:-- 14.7M
100  139M  100  139M    0     0  13.2M      0  0:00:10  0:00:10 --:--:-- 13.3M

as does:

me:~/tmp> grep 100 *.dl.tst|awk '{print$0}'
100  139M  100  139M    0     0  6376k      0  0:00:22  0:00:22 --:--:-- 6539k
100  139M  100  139M    0     0  6677k      0  0:00:21  0:00:21 --:--:-- 6579k
100  139M  100  139M    0     0  6022k      0  0:00:23  0:00:23 --:--:-- 6093k
100  139M  100  139M    0     0  13.9M      0  0:00:10  0:00:10 --:--:-- 14.3M
100  139M  100  139M    0     0  14.3M      0  0:00:09  0:00:09 --:--:-- 14.7M
100  139M  100  139M    0     0  13.2M      0  0:00:10  0:00:10 --:--:-- 13.3M

Why then does $1 become the file name:

me:~/tmp> grep 100 *.dl.tst|awk '{print$1}'
shpr002.20201124_141036.dl.tst:
shpr003.20201124_141036.dl.tst:
shpr004.20201124_141036.dl.tst:
hipr002.20201124_141036.dl.tst:
hipr003.20201124_141036.dl.tst:
hipr004.20201124_141036.dl.tst:

And $2:

me:~/tmp> grep 100 *.dl.tst|awk '{print$2}'
0
0
0
0
0
0

I logged out and back in in case my shell (bash) was screwed up; no change… what am I doing wrong?

Output from grep 100 *.dl.tst | awk '{print$1}' | head -n1 | od -c
(some of the alpha characters have been substituted by x; the list above had been edited/obfuscated)

0000000   x  s   h   p   r   0   0   2   x   x   x  .   x   x   x   .
0000020   x   x   x   x   .   c   o   m   .   2   0   2   0   -   1   1
0000040   -   2   4   _   1   4   1   0   3   6   .   d   l   .   t   s
0000060   t   :  \r  \n
0000064
Asked By: Iain

||

Those files contain the output from curl downloading files, and curl updates its progress information during downloads by outputting a carriage return (commonly represented as \r, the escape used to produce it in a number of contexts), which causes the cursor to return to the start of the line.

When you run grep 100 *.dl.tst, each line that’s output starts with the file name, but that’s followed by multiple updates which return the cursor to the start of the line, so you don’t see the file name — it’s overwritten by subsequent output. In more detail, the output looks like

shpr002.20201124_141036.dl.tst:

followed by a carriage return, followed by the first progress output from curl,

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0

followed by a carriage return, etc., until the percentage reaches 100. Because all this is only separated by carriage returns, not line feeds, it counts as a single line, and grep matches that in its entirety.

The same effect explains the output of grep 100 *.dl.tst|awk '{print$0}'.
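The effect can be reproduced without curl. The following is a minimal sketch, assuming a made-up progress line and hypothetical file names (sample.dl.tst and other.dl.tst, created in a temporary directory); piping through od -c shows the carriage returns as \r instead of letting the terminal act on them.

```shell
# Hypothetical sample imitating curl's progress output: the updates are
# separated by carriage returns and only the last one ends with a newline.
tmpdir=$(mktemp -d)
printf '  0  139M    0     0\r 50  139M   50 69.5M\r100  139M  100  139M\n' > "$tmpdir/sample.dl.tst"
# A second file, so that grep prefixes each match with the file name.
cp "$tmpdir/sample.dl.tst" "$tmpdir/other.dl.tst"

# Show the raw bytes of the first matching line, carriage returns included.
(cd "$tmpdir" && grep 100 sample.dl.tst other.dl.tst) | od -c | head -n 5

# Count the lines grep emits: one per file, however many updates each contains.
line_count=$( (cd "$tmpdir" && grep 100 sample.dl.tst other.dl.tst) | wc -l | tr -d ' ')
echo "$line_count"

rm -r "$tmpdir"
```

Each file contributes a single matching line, because nothing but carriage returns separates the updates.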

When you ask AWK to output $1, it outputs the first field, and now you can see it: it contains the file name, a colon, a carriage return, and that’s it — the start of curl’s output then starts with a space (to leave room for the percentage count), which is a field separator. When you ask it to output $2, it outputs the second field, which is the first percentage count, 0:

shpr002.20201124_141036.dl.tst:\r  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0

<--          Field 1          -->  !     !    !     !  ...
                                   $2    $3   $4    $5 ...
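
The field splitting described above can be checked directly. This is only a sketch: the input line is a shortened, made-up stand-in for what grep emits, and the file name sample.dl.tst is hypothetical.

```shell
# Hypothetical line as grep would emit it: file name, colon, carriage return,
# then two of curl's updates separated by another carriage return.
line=$(printf 'sample.dl.tst:\r  0     0    0     0\r100  139M  100  139M')

# With the default field separator, a carriage return does not split fields,
# so field 1 runs up to the first space after the file name.
field1=$(printf '%s\n' "$line" | awk '{print $1}')
field2=$(printf '%s\n' "$line" | awk '{print $2}')

# od -c reveals the carriage return hiding at the end of field 1.
printf '%s\n' "$field1" | od -c
echo "$field2"
```

Field 1 is the file name, the colon, and the carriage return; field 2 is the 0 from the first progress update.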
Answered By: Stephen Kitt

Working from Stephen’s description of the issue, a simple way to make the output easier to process would be to translate all the carriage returns into newlines, leaving you with curl’s progress report as a bunch of individual lines, which you can then use awk on:

$ for f in *.dl; do < "$f" tr '\r' '\n' | awk '$1 == "100" {print $0}' ; done
100  720k  100  720k    0     0  22.5M      0 --:--:-- --:--:-- --:--:-- 22.7M
100 23.6M  100 23.6M    0     0   372M      0 --:--:-- --:--:-- --:--:--  369M

(Though, if curl rounds the percentage it prints to the nearest integer instead of rounding down, huge files might show multiple lines with 100 in the first column.)
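
If several lines with 100 in the first column are a concern, one option is to keep only the last of them per file. The sketch below assumes a hypothetical helper named last_complete and a made-up sample file with two updates at 100; it also prints the file name, which the loop above drops.

```shell
# last_complete: hypothetical helper. Prints the file name followed by the
# final update whose first column is exactly 100, or nothing if there is none.
last_complete() {
    tr '\r' '\n' < "$1" |
        awk -v name="$1" '$1 == "100" { last = $0 }
                          END { if (last != "") print name ": " last }'
}

# Made-up sample: two updates show 100, as in the rounding case described above.
sample=$(mktemp)
printf '  0  139M    0     0\r100  139M  100  138M\r100  139M  100  139M\n' > "$sample"

result=$(last_complete "$sample")
echo "$result"

rm -f "$sample"
```

Against real files it could be called as: for f in *.dl; do last_complete "$f"; done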

On the other hand, if it’s known that the files contain nothing but the output from curl, then we might as well just pick the last line instead of looking at the contents:

$ for f in *.dl; do < "$f" tr '\r' '\n' | tail -n1  ; done
100  720k  100  720k    0     0  22.5M      0 --:--:-- --:--:-- --:--:-- 22.7M
100 23.6M  100 23.6M    0     0   372M      0 --:--:-- --:--:-- --:--:--  369M
Answered By: ilkkachu