Why not use backticks with a for loop?

Some time ago, I posted an answer to a question about scripting. Someone pointed out that I shouldn’t use the following command:

for x in $(cat file); do something; done 

but should use this instead:

while read f; do something; done < file

The Useless Use of Cat article is supposed to explain the whole problem, but its only explanation is:

The backticks are outright dangerous, unless you know the result of
the backticks is going to be less than or equal to how long a command
line your shell can accept. (Actually, this is a kernel limitation.
The constant ARG_MAX in your limits.h should tell you how much your
own system can take. POSIX requires ARG_MAX to be at least 4,096
bytes.)

If I understood this correctly, bash(?) should crash if I use the output of a very big file in a command (one that exceeds the ARG_MAX defined in limits.h). So I checked ARG_MAX with this command:

> grep ARG_MAX /usr/src/kernels/$(uname -r)/include/uapi/linux/limits.h
#define ARG_MAX       131072    /* # bytes of args + environ for exec() */

Then I created a file containing text with no spaces:

> ls -l
-rw-r--r--. 1 root root 100000000 Aug 21 15:37 in_file

Then I ran:

for i in $(cat in_file); do echo $i; done

aaaand nothing terrible happened.

So what should I do to check if/how this whole ‘don’t use cat with loop’ thing is dangerous?

Asked By: mrc02_kr

||

@chepner explained the difference in comments:

for i in $(cat in_file) doesn’t iterate over the lines of the file, it iterates over the words resulting from the contents of the file being subjected to word-splitting and pathname expansion.
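To make that difference concrete, here is a minimal sketch. The temporary file and its two sample lines are made up for illustration; it only counts how many items each loop sees.

```shell
#!/bin/sh
# Hypothetical sample input: two lines, each containing a space.
tmp=$(mktemp)
printf 'hello world\nsecond line\n' > "$tmp"

# The unquoted $(cat ...) is word-split, so the loop runs once per word.
for_count=0
for w in $(cat "$tmp"); do
    for_count=$((for_count + 1))
done

# read consumes one line per call, so the loop runs once per line.
while_count=0
while IFS= read -r line; do
    while_count=$((while_count + 1))
done < "$tmp"

echo "for: $for_count items, while: $while_count items"
rm -f "$tmp"
```

With this input the for loop sees four words, while the while loop sees two lines.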

To gauge the impact on performance and resource usage, I ran a small benchmark of both cases, using an input of 1M lines (about 19 MB) and measuring time and memory usage with /usr/bin/time -v:

test1.sh:

#!/bin/bash
while read x
do
    echo $x > /dev/null
done < input

Results:

Command being timed: "./test1.sh"
User time (seconds): 12.41
System time (seconds): 2.03
Percent of CPU this job got: 110%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:13.07
Maximum resident set size (kbytes): 3088

test2.sh:

#!/bin/bash
for x in $(cat input)
do
    echo $x > /dev/null
done

Results:

Command being timed: "./test2.sh"
User time (seconds): 17.19
System time (seconds): 3.13
Percent of CPU this job got: 109%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:18.51
Maximum resident set size (kbytes): 336356

I’ve uploaded the full output of both tests to pastebin. With bash, for i in $(cat ...) uses significantly more memory and also runs slower. However, results might vary if you run the same tests in another shell.

Answered By: sebasth

while loops can be problematic, most notably because they eat standard input by default (hence ssh -n), so if you need standard input for something else, a while loop will fail:

$ find . -name "*.pm" | while read f; do aspell check $f; done
$ 

This does nothing, as aspell wants a terminal, which is instead occupied by the list of Perl module names; a for loop is more suitable (assuming that the filenames won’t be split by POSIX word splitting rules):

$ for f in $(find . -name "*.pm"); do aspell check $f; done
...

since that does not use standard input the way while does by default.
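A rough sketch of the effect, and of one common way around it that keeps the while loop: feed the list on another file descriptor. The consume_stdin helper is a made-up stand-in for any command that reads standard input (ssh, aspell and so on), and /dev/null stands in for the terminal so the demo can run unattended.

```shell
#!/bin/sh
# Hypothetical stand-in for a command that reads all of its standard input.
consume_stdin() { cat > /dev/null; }

# List arrives on stdin: the body swallows everything after the first line.
eaten=$(printf 'a\nb\nc\n' | while read -r f; do
    consume_stdin
    printf '[%s]' "$f"
done)

# List arrives on fd 3: the body's stdin is separate from the list.
tmp=$(mktemp)
printf 'a\nb\nc\n' > "$tmp"
kept=$(while read -r f <&3; do
    consume_stdin
    printf '[%s]' "$f"
done 3< "$tmp" < /dev/null)
rm -f "$tmp"

echo "stdin loop saw: $eaten"
echo "fd 3 loop saw:  $kept"
```

The first loop only ever sees a; the second sees all three entries because read is the only thing touching fd 3.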

Also, while is prone to silent data loss (and for behaves differently for that same input):

$ echo -n mmm silent data loss | while read line; do echo $line; done
$ for i in $(echo -n mmm silent data loss); do echo $i; done
mmm
silent
data
loss
$ 

So arguments can be made that while is dangerous and should not be used, depending on the context.
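For the missing-newline case specifically, a widely used workaround is to also test whether read left anything in the variable. This is only a sketch of that idiom, reusing the sample text above; printf is used in place of echo -n for portability.

```shell
#!/bin/sh
# Input deliberately lacks a trailing newline, as with `echo -n`.
plain=$(printf 'mmm silent data loss' | while IFS= read -r line; do
    printf '[%s]' "$line"
done)

# read returns non-zero at end of input but still fills "$line",
# so the extra test lets the body run for the unterminated last line.
guarded=$(printf 'mmm silent data loss' | while IFS= read -r line || [ -n "$line" ]; do
    printf '[%s]' "$line"
done)

echo "plain:   '$plain'"
echo "guarded: '$guarded'"
```

The plain loop prints nothing, while the guarded loop processes the unterminated line once.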

Answered By: thrig

It depends on what file is meant to contain. If it’s meant to contain an IFS-separated list of shell globs like (assuming the default value of $IFS):

/var/log/*.log /var/adm/*~
/some/dir/*.txt

Then for i in $(cat file) would be the way to go, as that’s what the unquoted $(cat file) does: it applies the split+glob operator to the output of cat file, stripped of its trailing newline characters. So it would loop over each filename resulting from the expansion of those globs (except where a glob doesn’t match any file, in which case the glob is left there unexpanded).
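A small sketch of that behaviour, using a scratch directory and file names invented for the example (two .log files, no .txt files, and a list file holding two globs):

```shell
#!/bin/sh
# Hypothetical layout: a scratch directory with a.log and b.log only.
old=$PWD
dir=$(mktemp -d)
cd "$dir" || exit 1
touch a.log b.log
# The list holds globs separated by IFS characters (here a space).
printf '*.log *.txt\n' > list

matched=""
for i in $(cat list); do
    # Each iteration receives one expanded filename, or the bare glob
    # when nothing matched it.
    matched="${matched}[$i]"
done
echo "$matched"

cd "$old" || exit 1
rm -rf "$dir"
```

Here *.log expands to the two log files, and *.txt, which matches nothing, is passed through literally, so the loop body runs three times.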

If you wanted to loop over each delimited line of file, you’d do:

while IFS= read -r line <&3; do
{
  something with "$line"
} 3<&-
done 3< file

With a for loop, you could loop over every non-empty line with:

IFS='
' # split on newline only (actually sequences of newlines and
  # ignoring leading and trailing ones as newline is a
  # IFS whitespace character)
set -o noglob # disable the glob part of the split+glob operator:
for line in $(cat file); do
   something with "$line"
done
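Since changing IFS and noglob affects the rest of the script, one option is to confine them to a subshell. The each_line helper below is a hypothetical name, and the sample file is made up; treat it as a sketch rather than the one true way.

```shell
#!/bin/sh
# Hypothetical helper: prints each non-empty line of a file in brackets.
# The function body is a subshell ( ... ), so the IFS and noglob changes
# do not leak into the caller.
each_line() (
    IFS='
'
    set -o noglob
    for line in $(cat "$1"); do
        printf '[%s]' "$line"
    done
)

tmp=$(mktemp)
# Sample input: a line with a space, an empty line, a glob, an indented line.
printf 'first line\n\n*\n  indented\n' > "$tmp"
result=$(each_line "$tmp")
rm -f "$tmp"
echo "$result"
```

The empty line is skipped, the * is left alone because globbing is off, and the leading spaces survive because space is no longer in IFS.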

However a:

while read line; do
  something with "$line"
done < file

Makes little sense. That’s reading the content of file in a very convoluted way where characters of $IFS and backslashes are treated specially.
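To illustrate what "treated specially" means here, this sketch reads the same line twice, once with a bare read and once with IFS= read -r. The sample line (a Windows-style path padded with spaces) is invented for the demonstration.

```shell
#!/bin/sh
tmp=$(mktemp)
# One line with leading and trailing blanks and two backslashes.
printf '  C:\\temp\\new  \n' > "$tmp"

# Bare read: strips IFS whitespace at both ends and processes backslashes.
read line < "$tmp"
plain=$line

# IFS= read -r: keeps the line exactly as it appears in the file.
IFS= read -r line < "$tmp"
raw=$line
rm -f "$tmp"

printf 'plain read:   [%s]\n' "$plain"
printf 'IFS= read -r: [%s]\n' "$raw"
```

The bare read yields C:tempnew, with the padding and both backslashes gone, whereas IFS= read -r returns the line unchanged.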

In any case, the ARG_MAX limit that the text you’re quoting refers to is a limit of the execve() system call (on the cumulative size of the arguments and environment variables). It therefore only applies when a command on the filesystem is executed with the possibly very long expansion of the split+glob operator applied to the command substitution (that text is misleading and wrong on several counts).

It would apply for instance in:

cat -- $(cat file) # with shell implementations where cat is not builtin

But not in:

for i in $(cat file)

where there’s no execve() system call involved.

Compare:

bash-4.4$ echo '/*/*/*/*' > file
bash-4.4$ true $(cat file)
bash-4.4$ n=0; for f in $(cat file); do ((n++)); done; echo "$n"
523696
bash-4.4$ /bin/true $(cat file)
bash: /bin/true: Argument list too long

It’s OK with bash’s true builtin command or the for loop, but not when executing /bin/true. Note how the file is just 9 bytes large but the expansion of $(cat file) is several megabytes because the /*/*/*/* glob is being expanded by the shell.
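The same contrast can be reproduced without relying on what happens to be on the filesystem. This sketch asks getconf for ARG_MAX and builds a single argument just over that size. It assumes head -c is available, and it clamps the size to an arbitrary 8 MiB so the demo never builds an enormous string on systems that report a very large limit.

```shell
#!/bin/sh
# getconf reports the limit that applies to execve() (arguments + environment).
limit=$(getconf ARG_MAX)

# One byte over the limit, clamped to 8 MiB (an arbitrary safety cap).
size=$((limit + 1))
[ "$size" -le 8388608 ] || size=8388608
big=$(head -c "$size" /dev/zero | tr '\0' x)

# Builtin: the shell handles the argument itself, no execve() is involved.
if : "$big"; then builtin_status=ok; else builtin_status=failed; fi

# External command: the shell must call execve(), which rejects the argument.
if env true "$big" 2>/dev/null; then external_status=ok; else external_status=failed; fi

echo "ARG_MAX=$limit builtin=$builtin_status external=$external_status"
```

On a typical Linux system the builtin succeeds and the external command fails with "Argument list too long", which is the same split as between true and /bin/true above.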

Answered By: Stéphane Chazelas