Why *not* parse `ls` (and what to do instead)?

I consistently see answers quoting this link stating definitively “Don’t parse ls!” This bothers me for a couple of reasons:

  1. It seems the information in that link has been accepted wholesale with little question, though I can pick out at least a few errors in casual reading.

  2. It also seems as if the problems stated in that link have sparked no desire to find a solution.

From the first paragraph:

…when you ask [ls] for a list
of files, there’s a huge problem: Unix allows almost any character in
a filename, including whitespace, newlines, commas, pipe symbols, and
pretty much anything else you’d ever try to use as a delimiter except
NUL. … ls separates filenames with newlines. This is fine
until you have a file with a newline in its name. And since I don’t
know of any implementation of ls that allows you to terminate
filenames with NUL characters instead of newlines, this leaves us
unable to get a list of filenames safely with ls.

Bummer, right? How ever can we handle a newline-delimited list of data that might itself contain newlines? Well, if the people answering questions on this website didn’t do this kind of thing on a daily basis, I might think we were in some trouble.

The truth is, though, most ls implementations actually provide a very simple API for parsing their output, and we’ve all been doing it all along without even realizing it. Not only can you end a filename with NUL, you can begin one with NUL as well, or with any other arbitrary string you might desire. What’s more, you can assign these arbitrary strings per file type. Please consider:

LS_COLORS='lc=:rc=:ec=:fi=:di=:' ls -l --color=always | cat -A
total 4$
drwxr-xr-x 1 mikeserv mikeserv 0 Jul 10 01:05 ^@^@^@^@dir^@^@^@/$
-rw-r--r-- 1 mikeserv mikeserv 4 Jul 10 02:18 ^@file1^@^@^@$
-rw-r--r-- 1 mikeserv mikeserv 0 Jul 10 01:08 ^@file2^@^@^@$
-rw-r--r-- 1 mikeserv mikeserv 0 Jul 10 02:27 ^@new$
line$
file^@^@^@$
^@

See this for more.

Now it’s the next part of this article that really gets me though:

$ ls -l
total 8
-rw-r-----  1 lhunath  lhunath  19 Mar 27 10:47 a
-rw-r-----  1 lhunath  lhunath   0 Mar 27 10:47 a?newline
-rw-r-----  1 lhunath  lhunath   0 Mar 27 10:47 a space

The problem is that from the output of ls, neither you nor the
computer can tell what parts of it constitute a filename. Is it each
word? No. Is it each line? No. There is no correct answer to this
question other than: you can’t tell.

Also notice how ls sometimes garbles your filename data (in our
case, it turned the \n character in between the words “a” and
“newline” into a ? question mark).

If you just want to iterate over all the files in the current
directory, use a for loop and a glob:

for f in *; do
    [[ -e $f ]] || continue
    ...
done

The author calls it garbling filenames when ls returns a list of filenames containing shell globs and then recommends using a shell glob to retrieve a file list!

Consider the following:

printf 'touch ./"%b"\n' "file\nname" "f i l e n a m e" |
    . /dev/stdin
ls -1q

f i l e n a m e  
file?name

IFS="
" ; printf "'%s'\n" $(ls -1q)

'f i l e n a m e'
'file
name'

POSIX defines the -1 and -q ls operands so:

-q – Force each instance of non-printable filename characters and <tab>s to be written as the question-mark ( '?' ) character. Implementations
may provide this option by default if the output is to a terminal
device.

-1 – (The numeric digit one.) Force output to be one entry per line.

Globbing is not without its own problems – ? matches any character, so multiple ?s in a listed result can match the same file more than once. That’s easily handled, though.
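
That over-matching is easy to demonstrate. Here is a minimal sketch (my addition, not from the original post; any POSIX shell): two names that `ls -q` both renders as a?b, expanded naively:

```sh
# set up a scratch directory with two files that -q displays identically
dir=$(mktemp -d) && cd "$dir"
touch "a$(printf '\t')b" 'a?b'   # a<TAB>b is non-printable; the ? is literal

# `ls -1q` prints the line "a?b" twice; unquoted expansion globs each
# line, and each glob matches BOTH files, so 2 files become 4 words:
set -- $(ls -1q)
count=$#
echo "$count"
```

De-duplicating the glob patterns first doesn’t help here – there is only one distinct pattern, and it still matches both files; restricting the glob’s greediness is what actually fixes it.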

Though how to do this thing is not the point – it doesn’t take much to do, after all, and is demonstrated below – I was more interested in why not. As I consider it, the best answer to that question has been accepted. I would suggest you try to focus more often on telling people what they can do than on what they can’t. You’re a lot less likely, I think, to be proven wrong that way.

But why even try? Admittedly, my primary motivation was that others kept telling me I couldn’t. I know very well that ls output is as regular and predictable as you could wish it so long as you know what to look for. Misinformation bothers me more than do most things.

The truth is, though, with the notable exception of both Patrick’s and Wumpus Q. Wumbley’s answers (despite the latter’s awesome handle), I regard most of the information in the answers here as mostly correct – a shell glob is both more simple to use and generally more effective when searching the current directory than parsing ls is. They are not, however, at least in my regard, reason enough to justify propagating the misinformation quoted in the article above, nor are they acceptable justification to “never parse ls”.

Please note that Patrick’s answer’s inconsistent results are mostly the result of his using zsh and then bash. zsh – by default – does not expand $(command substituted) results in a portable manner. So when he asks where did the rest of the files go? the answer is: your shell ate them. This is why you need to set the SH_WORD_SPLIT option when using zsh and dealing with portable shell code. I regard his failure to note this in his answer as awfully misleading.

Wumpus’s answer doesn’t compute for me – in a list context the ? character is a shell glob. I don’t know how else to say that.

In order to handle a multiple results case you need to restrict the glob’s greediness. The following will just create a test base of awful file names and display it for you:

{ printf %b $(printf \\%04o `seq 0 127`) |
sed "/[^[-b]*/s///g
        s/\(.\)\(.\)/touch '?\v\2' '\1\t\2' '\1\n\2'\n/g" |
. /dev/stdin

echo '`ls` ?QUOTED `-m` COMMA,SEP'
ls -qm
echo ; echo 'NOW LITERAL - COMMA,SEP'
ls -m | cat
( set -- * ; printf "\nFILE COUNT: %s\n" $# )
}

OUTPUT

`ls` ?QUOTED `-m` COMMA,SEP
??, ??^, ??`, ??b, [?, [?, ]?^, ]?^, _?`, _?`, a?b, a?b

NOW LITERAL - COMMA,SEP
?
 , ?
     ^, ?
         `, ?
             b, [       , [
, ]    ^, ]
^, _    `, _
`, a    b, a
b

FILE COUNT: 12

Now I’ll safe every character that isn’t a /slash, -dash, :colon, or alpha-numeric character in a shell glob then sort -u the list for unique results. This is safe because ls has already safed-away any non printable characters for us. Watch:

for f in $(
        ls -1q |
        sed 's|[^-:/[:alnum:]]|[!-\:[:alnum:]]|g' |
        sort -u | {
                echo 'PRE-GLOB:' >&2
                tee /dev/fd/2
                printf '\nPOST-GLOB:\n' >&2
        }
) ; do
        printf "FILE #$((i=i+1)): '%s'n" "$f"
done

OUTPUT:

PRE-GLOB:
[!-:[:alnum:]][!-:[:alnum:]][!-:[:alnum:]]
[!-:[:alnum:]][!-:[:alnum:]]b
a[!-:[:alnum:]]b

POST-GLOB:
FILE #1: '?
           '
FILE #2: '?
           ^'
FILE #3: '?
           `'
FILE #4: '[     '
FILE #5: '[
'
FILE #6: ']     ^'
FILE #7: ']
^'
FILE #8: '_     `'
FILE #9: '_
`'
FILE #10: '?
            b'
FILE #11: 'a    b'
FILE #12: 'a
b'

Below I approach the problem again but I use a different methodology. Remember that – besides null – the / ASCII character is the only byte forbidden in a pathname. I put globs aside here and instead combine the POSIX specified -d option for ls and the also POSIX specified -exec $cmd {} + construct for find. Because find will only ever naturally emit one / in sequence, the following easily procures a recursive and reliably delimited filelist including all dentry information for every entry. Just imagine what you might do with something like this:

#v#note: to do this fully portably substitute an actual newline #v#
#v#for '\n' in the first sed invocation#v#
cd ..
find ././ -exec ls -1ldin {} + |
sed -e '\| *\./\./|{s||\n.///|;i\///' -e \} |
sed 'N;s|\(\n\)///|///\1|;$s|$|///|;P;D'

###OUTPUT

152398 drwxr-xr-x 1 1000 1000        72 Jun 24 14:49
.///testls///

152399 -rw-r--r-- 1 1000 1000         0 Jun 24 14:49
.///testls/?
            ///

152402 -rw-r--r-- 1 1000 1000         0 Jun 24 14:49
.///testls/?
            ^///

152405 -rw-r--r-- 1 1000 1000         0 Jun 24 14:49
.///testls/?
        `///
...

ls -i can be very useful – especially when result uniqueness is in question.

ls -1iq | 
sed '/ .*/s///;s/^/-inum /;$!s/$/ -o /' | 
tr -d '\n' | 
xargs find

These are just the most portable means I can think of. With GNU ls you could do:

ls --quoting-style=WORD
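
For instance – a sketch assuming GNU coreutils ls, where WORD is one of literal, shell, shell-always, shell-escape, c, escape, locale, or clocale – the c style escapes a tab unambiguously instead of collapsing it to ?:

```sh
# scratch directory with one awkward name
dir=$(mktemp -d) && cd "$dir"
touch "a$(printf '\t')b"

# C-style quoting renders the tab as the two characters \t
quoted=$(ls --quoting-style=c)
echo "$quoted"    # prints: "a\tb"
```

Unlike the -q output, this round-trips: the escaped form identifies exactly one filename.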

And last, here’s a much simpler method of parsing ls that I happen to use quite often when in need of inode numbers:

ls -1iq | grep -o '^ *[0-9]*'

That just returns inode numbers – which is another handy POSIX specified option.
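
A quick sanity check of that extraction (my sketch; stat -c %i is a GNU-ism used here only to verify the result independently):

```sh
# scratch directory with a single file
dir=$(mktemp -d) && cd "$dir"
touch file1

# pull the inode number out of the ls -1iq listing
ino=$(ls -1iq | grep -o '^ *[0-9]*' | tr -d ' ')

# cross-check against stat (GNU coreutils)
real=$(stat -c %i file1)
echo "$ino $real"
```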

Asked By: mikeserv

||

That link is referenced a lot because the information is completely accurate, and it has been there for a very long time.


ls replaces non-printable characters with glob characters, yes, but those characters aren’t in the actual filename. Why does this matter? Two reasons:

  1. If you pass that filename to a program, that filename doesn’t actually exist. It would have to expand the glob to get the real file name.
  2. The file glob might match more than one file.

For example:

$ touch a$'\t'b
$ touch a$'\n'b
$ ls -1
a?b
a?b

Notice how we have 2 files which look exactly the same. How are you going to distinguish them if they both are represented as a?b?
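
Expanding the printed glob does hand back real names – but it hands back both of them, which is exactly the ambiguity: you cannot tell which file a given a?b line meant. A sketch (mine; any POSIX shell):

```sh
# scratch directory with the two look-alike files
dir=$(mktemp -d) && cd "$dir"
touch "a$(printf '\t')b" "a
b"                     # second name contains a literal newline

# expand the glob that ls printed: two distinct files come back
set -- a?b
count=$#
first=$1; second=$2
echo "$count"          # 2 files, indistinguishable in ls -q output
```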


The author calls it garbling filenames when ls returns a list of filenames containing shell globs and then recommends using a shell glob to retrieve a file list!

There is a difference here. When you get a glob back, as shown, that glob might match more than one file. However, when you iterate through the results of matching a glob, you get back the exact file, not a glob.

For example:

$ for file in *; do printf '%s' "$file" | xxd; done
0000000: 6109 62                                  a.b
0000000: 610a 62                                  a.b

Notice how the xxd output shows that $file contained the raw \t and \n characters, not ?.

If you use ls, you get this instead:

for file in $(ls -1q); do printf '%s' "$file" | xxd; done
0000000: 613f 62                                  a?b
0000000: 613f 62                                  a?b

“I’m going to iterate anyway, why not use ls?”

The example you gave doesn’t actually work. It looks like it works, but it doesn’t.

I’m referring to this:

 for f in $(ls -1q | tr " " "?") ; do [ -f "$f" ] && echo "./$f" ; done

I’ve created a directory with a bunch of file names:

$ for file in *; do printf '%s' "$file" | xxd; done
0000000: 6120 62                                  a b
0000000: 6120 2062                                a  b
0000000: 61e2 8082 62                             a...b
0000000: 61e2 8083 62                             a...b
0000000: 6109 62                                  a.b
0000000: 610a 62                                  a.b

When I run your code, I get this:

$ for f in $(ls -1q | tr " " "?") ; do [ -f "$f" ] && echo "./$f" ; done
./a b
./a b

Where’d the rest of the files go?

Let’s try this instead:

$ for f in $(ls -1q | tr " " "?") ; do stat --format='%n' "./$f"; done
stat: cannot stat ‘./a?b’: No such file or directory
stat: cannot stat ‘./a??b’: No such file or directory
./a b
./a b
stat: cannot stat ‘./a?b’: No such file or directory
stat: cannot stat ‘./a?b’: No such file or directory

Now lets use an actual glob:

$ for f in *; do stat --format='%n' "./$f"; done
./a b
./a  b
./a b
./a b
./a b
./a
b

With bash

The above example was with my normal shell, zsh. When I repeat the procedure with bash, I get another completely different set of results with your example:

Same set of files:

$ for file in *; do printf '%s' "$file" | xxd; done
0000000: 6120 62                                  a b
0000000: 6120 2062                                a  b
0000000: 61e2 8082 62                             a...b
0000000: 61e2 8083 62                             a...b
0000000: 6109 62                                  a.b
0000000: 610a 62                                  a.b

Radically different results with your code:

for f in $(ls -1q | tr " " "?") ; do stat --format='%n' "./$f"; done
./a b
./a b
./a b
./a b
./a
b
./a  b
./a b
./a b
./a b
./a b
./a b
./a b
./a
b
./a b
./a b
./a b
./a b
./a
b

With a shell glob, it works perfectly fine:

$ for f in *; do stat --format='%n' "./$f"; done
./a b
./a  b
./a b
./a b
./a b
./a
b

The reason bash behaves this way goes back to one of the points I made at the beginning of the answer: “The file glob might match more than one file”.

ls is returning the same glob (a?b) for several files, so each time we expand this glob, we get every single file that matches it.


How to recreate the list of files I was using:

touch 'a b' 'a  b' a$'\xe2\x80\x82'b a$'\xe2\x80\x83'b a$'\t'b a$'\n'b

The hex code ones are the UTF-8 en space and em space characters (U+2002 and U+2003).

Answered By: phemmer

Let’s try and simplify a little:

$ touch a$'\n'b a$'\t'b 'a b'
$ ls
a b  a?b  a?b
$ IFS="
"
$ set -- $(ls -1q | uniq)
$ echo "Total files in shell array: $#"
Total files in shell array: 4

See? That’s already wrong right there. There are 3 files, but bash is reporting 4. This is because set is being given the globs generated by ls, which are expanded by the shell before being passed to set. Which is why you get:

$ for x ; do
>     printf 'File #%d: %s\n' $((i=$i+1)) "$x"
> done
File #1: a b
File #2: a b
File #3: a    b
File #4: a
b

Or, if you prefer:

$ printf ./%s\\0 "$@" |
> od -A n -c -w1 |
> sed -n '/ \{1,3\}/s///;H
> /\\0/{g;s///;s/\n//gp;s/.*//;h}'
./a b
./a b
./a\tb
./a\nb

The above was run on bash 4.2.45.

Answered By: terdon

The answer is simple: The special cases of ls you have to handle outweigh any possible benefit. These special cases can be avoided if you don’t parse ls output.

The mantra here is never trust the user filesystem (the equivalent to never trust user input). If there’s a method that will work always, with 100% certainty, it should be the method you prefer even if ls does the same but with less certainty. I won’t go into technical details since those were covered by terdon and Patrick extensively. I know that due to the risks of using ls in an important (and maybe expensive) transaction where my job/prestige is on the line, I will prefer any solution that doesn’t have a grade of uncertainty if it can be avoided.

I know some people prefer some risk over certainty, but I’ve filed a bug report.

Answered By: Braiam

The output of ls -q isn’t a glob at all. It uses ? to mean “There is a character here that can’t be displayed directly”. Globs use ? to mean “Any character is allowed here”.

Globs have other special characters (* and [] at least, and inside the [] pair there are more). None of those are escaped by ls -q.

$ touch x '[x]'
$ ls -1q
[x]
x

If you treat the ls -1q output there as a set of globs and expand them, not only will you get x twice, you’ll miss [x] completely. As a glob, it doesn’t match itself as a string.
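
A concrete sketch of both failure modes at once (my addition; any POSIX shell):

```sh
# scratch directory with the two files from above
dir=$(mktemp -d) && cd "$dir"
touch x '[x]'

# expanding the listing as globs: the word "[x]" is a bracket
# expression matching exactly "x", so x appears twice and [x] never
set -- $(ls -1q)
count=$#
a=$1; b=$2
echo "$a $b"    # prints: x x
```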

ls -q is meant to save your eyes and/or terminal from crazy characters, not to produce something that you can feed back to the shell.

Answered By: user41515

I am not at all convinced of this, but let’s suppose for the sake of argument that you could, if you’re prepared to put in enough effort, parse the output of ls reliably, even in the face of an "adversary" — someone who knows the code you wrote and is deliberately choosing filenames designed to break it.

Even if you could do that, it would still be a bad idea.

Bourne shell1 is a bad language. It should not be used for anything complicated, unless extreme portability is more important than any other factor (e.g. autoconf).

I claim that if you’re faced with a problem where parsing the output of ls seems like the path of least resistance for a shell script, that’s a strong indication that whatever you are doing is too complicated to be a shell script and you should rewrite the entire thing in Perl, Python, Julia, or any of the other good scripting languages that are readily available. As a demonstration, here’s your last program in Python:

import os, sys
for subdir, dirs, files in os.walk("."):
    for f in dirs + files:
        ino = os.lstat(os.path.join(subdir, f)).st_ino
        sys.stdout.write("%d %s %s\n" % (ino, subdir, f))

This has no issues whatsoever with unusual characters in filenames — the output is ambiguous in the same way the output of ls is ambiguous, but that wouldn’t matter in a "real" program (as opposed to a demo like this), which would use the result of os.path.join(subdir, f) directly.

Equally important, and in stark contrast to the thing you wrote, it will still make sense six months from now, and it will be easy to modify when you need it to do something slightly different. By way of illustration, suppose you discover a need to exclude dotfiles and editor backups, and to process everything in alphabetical order by basename:

import os, sys
filelist = []
for subdir, dirs, files in os.walk("."):
    for f in dirs + files:
        if f[0] == '.' or f[-1] == '~': continue
        lstat = os.lstat(os.path.join(subdir, f))
        filelist.append((f, subdir, lstat.st_ino))

filelist.sort(key = lambda x: x[0])
for f, subdir, ino in filelist:
    sys.stdout.write("%d %s %s\n" % (ino, subdir, f))

1 Yes, extended versions of the Bourne shell are readily available nowadays: bash and zsh are both considerably better than the original. The GNU extensions to the core "shell utilities" (find, grep, etc.) also help a lot. But even with all the extensions, the shell environment is not improved enough to compete with scripting languages that are actually good, so my advice remains "don’t use shell for anything complicated" regardless of which shell you’re talking about.

"What would a good interactive shell that was also a good scripting language look like?" is a live research question, because there is an inherent tension between the conveniences required for an interactive CLI (such as being allowed to type cc -c -g -O2 -o foo.o foo.c instead of subprocess.run(["cc", "-c", "-g", "-O2", "-o", "foo.o", "foo.c"])) and the strictures required to avoid subtle errors in complex scripts (such as not interpreting unquoted words in random locations as string literals). If I were to attempt to design such a thing, I’d probably start by putting IPython, PowerShell, and Lua in a blender, but I have no idea what the result would look like.

Answered By: zwol

The reason people say never do something isn’t necessarily because it absolutely, positively cannot be done correctly. We may be able to do it, but it may be more complicated, or less efficient space- or time-wise. For example, it would be perfectly fine to say “Never build a large e-commerce backend in x86 assembly”.

So now to the issue at hand: As you’ve demonstrated you can create a solution that parses ls and gives the right result – so correctness isn’t an issue.

Is it more complicated? Yes, but we can hide that behind a helper function.

So now to efficiency:

Space-efficiency: Your solution relies on uniq to filter out duplicates, so we cannot generate the results lazily – the glob loop is O(1) where yours is O(n).

Time-efficiency: Best case, uniq uses a hashmap approach, so we still have an O(n) algorithm in the number of elements procured; more probably, though, it’s O(n log n).

Now the real problem: While your algorithm is still not looking too bad, I was really careful to say elements procured and not elements for n. Because that does make a big difference. Say you have a file \n\n that will result in a glob of ?? and so match every 2-character filename in the listing. Funnily enough, if you have another file \n\r, it will also result in ?? and also return all 2-character filenames… see where this is going? Exponential instead of linear behavior certainly qualifies as “worse runtime behavior”; it’s the difference between a practical algorithm and one you write papers in theoretical CS journals about.

Everybody loves examples right? Here we go. Make a folder called “test” and use this python script in the same directory where the folder is.

#!/usr/bin/env python3
import itertools
dir = "test/"
filename_length = 3
options = "abtnvfr"

for filename in itertools.product(options, repeat=filename_length):
        open(dir + ''.join(filename), "a").close()

Only thing this does is generate all products of length 3 for 7 characters. High school math tells us that ought to be 343 files. Well that ought to be really quick to print, so let’s see:

time for f in *; do stat --format='%n' "./$f" >/dev/null; done
real    0m0.508s
user    0m0.051s
sys 0m0.480s

Now let’s try your first solution, because I really can’t get this

eval set -- $(ls -1qrR ././ | tr ' ' '?' |
sed -e '\|^\(\.\{,1\}\)/\.\(/.*\):|{' -e \
        's//\1\2/;\|/$|!s|.*|&/|;h;s/.*//;b}' -e \
        '/..*/!d;G;s/\(.*\)\n\(.*\)/\2\1/' -e \
        "s/'/'\\''/g;s/.*/'&'/;s/?/'[\"?$IFS\"]'/g" |
uniq)

thing here to work on Linux Mint 16 (which I think speaks volumes for the usability of this method).

Anyhow, since the above pretty much only filters the result after it gets it, the earlier solution should be at least as quick as the latter (no inode tricks in that one – but those are unreliable, so you’d give up correctness).

So now how long does

time for f in $(ls -1q | tr " " "?") ; do stat --format='%n' "./$f" >/dev/null; done

take? Well I really don’t know, it takes a while to check 343^343 file names – I’ll tell you after the heat death of the universe.

Answered By: Voo

OP’s Stated Intention Addressed

preface and original answer’s rationale – updated 2015-05-18

mikeserv (the OP) stated in latest update to his question: “I do consider it a shame though that I first asked this question to point out a source of misinformation, and, unfortunately, the most upvoted answer here is in large part misleading.”

Well, okay; I feel it was rather a shame that I spent so much time trying to figure out how to explain my meaning only to find that as I re-read the question. This question ended up “[generating] discussion rather than answers” and ended up weighing in at ~18K of text (for the question alone, just to be clear) which would be long even for a blog post.

But StackExchange is not your soapbox, and it’s not your blog. However, in effect, you have used it as at least a bit of both. People ended up spending a lot of time answering your “To-Point-Out” instead of answering people’s actual questions. At this point I will be flagging the question as not a good fit for our format, given that the OP has stated explicitly that it wasn’t even intended to be a question at all.

At this point I’m not sure whether my answer was to the point, or not; probably not, but it was directed at some of your questions, and maybe it can be a useful answer to someone else; beginners take heart, some of those “do not”s turn into “do sometimes” once you get more experienced. 🙂

As a General Rule…

please forgive remaining rough edges; i have spent far too much time on this already… rather than quote the OP directly (as originally intended) i will try to summarize and paraphrase.

[largely reworked from my original answer]
upon consideration, i believe that i mis-read the emphasis that the OP was placing on the questions i answered; however, the points addressed were brought up, and i have left the answers largely intact as i believe them to be to-the-point and to address issues that i’ve seen brought up in other contexts as well regarding advice to beginners.

The original post asked, in several ways, why various articles gave advice such as «Don’t parse ls output» or «You should never parse ls output», and so forth.

My suggested resolution to the issue is that instances of this kind of statement are simply examples of an idiom, phrased in slightly different ways, in which an absolute quantifier is paired with an imperative [e.g., «don’t [ever] X», «[you should] always Y», «[one should] never Z»] to form statements intended to be used as general rules or guidelines, especially when given to those new to a subject, rather than being intended as absolute truths, the apparent form of those statements notwithstanding.

When you’re beginning to learn new subject matter, and unless you have some good understanding of why you might need to do otherwise, it’s a good idea to simply follow the accepted general rules without exception—unless under guidance from someone more experienced than yourself. With rising skill and experience you become further able to determine when and if a rule applies in any particular situation. Once you do reach a significant level of experience, you will likely understand the reasoning behind the general rule in the first place, and at that point you can begin to use your judgement as to whether and to what level the reasons behind the rule apply in that situation, and also as to whether there are perhaps overriding concerns.

And that’s when an expert, perhaps, might choose to do things in violation of “The Rules”. But that wouldn’t make them any less “The Rules”.

And, so, to the topic at hand: in my view, just because an expert might be able to violate this rule without getting completely smacked down, i don’t see any way that you could justify telling a beginner that “sometimes” it’s okay to parse ls output, because: it’s not. Or, at least, certainly it’s not right for a beginner to do so.

You always put your pawns in the center; in the opening one piece, one move; castle at the earliest opportunity; knights before bishops; a knight on the rim is grim; and always make sure you can see your calculation through to the end! (Whoops, sorry, getting tired, that’s for the chess StackExchange.)

Rules, Meant to Be Broken?

When reading an article on a subject that is targeted at, or likely to be read by, beginners, often you will see things like this:

  • “You should not ever do X.”
  • “Never do Q!”
  • “Don’t do Z.”
  • “One should always do Y!”
  • “C, no matter what.”

While these statements certainly seem to be stating absolute and timeless rules, they are not; instead this is a way of stating general rules [a.k.a. “guidelines”, “rules of thumb”, “the basics”, etc.] that is at least arguably one appropriate way to state them for the beginners that might be reading those articles. However, just because they are stated as absolutes, the rules certainly don’t bind professionals and experts [who were likely the ones who summarized such rules in the first place, as a way to record and pass on knowledge gained as they dealt with recurring issues in their particular craft.]

Those rules certainly aren’t going to reveal how an expert would deal with a complex or nuanced problem, in which, say, those rules conflict with each other; or in which the concerns that led to the rule in the first place simply don’t apply. Experts are not afraid to (or should not be afraid to!) simply break rules that they happen to know don’t make sense in a particular situation. Experts are constantly dealing with balancing various risks and concerns in their craft, and must frequently use their judgement to choose to break those kind of rules, having to balance various factors and not being able to just rely on a table of rules to follow. Take Goto as an example: there’s been a long, recurring, debate on whether they are harmful. (Yeah, don’t ever use gotos. ;D)

A Modal Proposition

An odd feature, at least in English, and I imagine in many other languages, of general rules, is that they are stated in the same form as a modal proposition, yet the experts in a field are willing to give a general rule for a situation, all the while knowing that they will break the rule when appropriate. Clearly, therefore, these statements aren’t meant to be equivalent to the same statements in modal logic.

This is why i say they must simply be idiomatic. Rather than truly being a “never” or an “always” situation, these rules usually serve to codify general guidelines that tend to be appropriate over a wide range of situations, and that, when beginners follow them blindly, are likely to result in far better results than the beginner choosing to go against them without good reason. Sometimes they codify rules simply leading to substandard results rather than the outright failures accompanying incorrect choices when going against the rules.

So, general rules are not the absolute modal propositions they appear to be on the surface, but instead are a shorthand way of giving the rule with a standard boilerplate implied, something like the following:

unless you have the ability to tell that this guideline is incorrect in a particular case, and prove to yourself that you are right, then ${RULE}

where, of course you could substitute “never parse ls output” in place of ${RULE}. 🙂

Oh Yeah! What About Parsing ls Output?

Well, so, given all that… i think it’s pretty clear that this rule is a good one. First of all, the real rule has to be understood to be idiomatic, as explained above…

But furthermore, it’s not just that you have to be very good with shell scripting to know whether the rule can be broken in some particular case. It also takes just as much skill to tell whether you got it wrong when you are testing your attempt to break it! And, I say confidently that a very large majority of the likely audience of such articles (giving advice like «Don’t parse the output of ls!») can’t do those things, and those who do have such skill will likely figure it out on their own and ignore the rule anyway.

But… just look at this question, and how even people who probably do have the skill thought it was a bad call to do so; and how much effort the author of the question spent just getting to the current best example! I guarantee you that on a problem this hard, 99% of the people out there would get it wrong, and with potentially very bad results! Even if the method that is decided on turns out to be a good one; until it (or another) ls-parsing idea becomes adopted by IT/developer folk as a whole, withstands a lot of testing (especially the test of time), and finally manages to graduate to ‘common technique’ status, it’s likely that a lot of people might try it and get it wrong… with disastrous consequences.

So, I will reiterate one last time… that, especially in this case, this is why “never parse ls output!” is decidedly the right way to phrase it.

[UPDATE 2014-05-18: clarified reasoning for answer (above) to respond to a comment from OP; the following addition is in response to the OP’s additions to the question from yesterday]

[UPDATE 2014-11-10: added headers and reorganized/refactored content; and also: reformatting, rewording, clarifying, and um… “concise-ifying”… i intended this to simply be a clean-up, though it did turn into a bit of a rework. i had left it in a sorry state, so i mainly tried to give it some order. i did feel it was important to largely leave the first section intact; so only two minor changes there, redundant ‘but’ removed, and ‘that’ emphasized.]

† I originally intended this solely as a clarification on my original; but decided on other additions upon reflection

‡ see https://unix.stackexchange.com/tour for guidelines on posts

Answered By: shelleybutterfly

Is it possible to parse the output of ls in certain cases? Sure. The idea of extracting a list of inode numbers from a directory is a good example – if you know that your implementation’s ls supports -q, and therefore each file will produce exactly one line of output, and all you need are the inode numbers, parsing them out of ls -Rai1q output is certainly a possible solution. Of course, if the author hadn’t seen advice like “Never parse the output of ls” before, he probably wouldn’t think about filenames with newlines in them, and would probably leave off the ‘q’ as a result, and the code would be subtly broken in that edge case – so, even in cases where parsing ls‘s output is reasonable, this advice is still useful.

The broader point is that, when a newbie to shell scripting tries to have a script figure out (for instance) what’s the biggest file in a directory, or what’s the most recently modified file in a directory, his first instinct is to parse ls‘s output – understandable, because ls is one of the first commands a newbie learns.

Unfortunately, that instinct is wrong, and that approach is broken. Even more unfortunately, it’s subtly broken – it will work most of the time, but fail in edge cases that could perhaps be exploited by someone with knowledge of the code.

The newbie might think of ls -s | sort -n | tail -n 1 | awk '{print $2}' as a way to get the biggest file in a directory. And it works, until you have a file with a space in the name.

OK, so how about ls -s | sort -n | tail -n 1 | sed 's/[^ ]* *[0-9]* *//'? Works fine until you have a file with a newline in the name.
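The first failure mode is easy to reproduce (a sketch; the file names are made up and GNU coreutils is assumed):

```shell
# The biggest file has a space in its name; awk's default field
# splitting truncates it to the first word.
dir=$(mktemp -d); cd "$dir"
head -c 100000 /dev/zero > 'file with spaces'   # clearly the biggest
printf 'x' > small
ls -s | sort -n | tail -n 1 | awk '{print $2}'
# prints "file" rather than "file with spaces"
```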

Does adding -q to ls‘s arguments help when there’s a newline in the filename? It might look like it does, until you have 2 different files that contain a non-printable character in the same spot in the filename, and then ls‘s output doesn’t let you distinguish which of those was biggest. Worse, in order to expand the “?”, he probably resorts to his shell’s eval – which will cause problems if he hits a file named, for instance,

foo`/tmp/malicious_script`bar

Does --quoting-style=shell help (if your ls even supports it)? Nope, still displays ? for nonprintable characters, so it’s still ambiguous which of multiple matches was the biggest. --quoting-style=literal? Nope, same. --quoting-style=locale or --quoting-style=c might help if you just need to print the name of the biggest file unambiguously, but probably not if you need to do something with the file afterwards – it would be a bunch of code to undo the quoting and get back to the real filename so that you can pass it to, say, gzip.

And at the end of all that work, even if what he has is safe and correct for all possible filenames, it’s unreadable and unmaintainable, and could have been done much more easily, safely, and readably in python or perl or ruby.

Or even using other shell tools – off the top of my head, I think this ought to do the trick:

find . -type f -printf "%s %f\0" | sort -nz | awk 'BEGIN{RS="\0"} END{sub(/[0-9]* /, "", $0); print}'

And ought to be at least as portable as --quoting-style is.

Answered By: godlygeek

I "don’t parse ls" because:

  • Filenames can contain any byte except / and NUL (0x00) – not just printable ASCII. ls outputs multi-character representations of such characters, and these must be reversed (undone) before the filename can be passed to another program.

  • ls outputs SPACE (" "), NEWLINE (^J), and other control characters in filenames literally. Special care must be exercised in subsequent processing, and all variables must be quoted.

  • For files older than about six months, ls‘s long-listing date format changes from "mmm dd HH:MM" to "mmm dd yyyy", so a parser keyed to the time-of-day column silently reads a year instead.
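That cutover is easy to see side by side (a sketch assuming GNU ls and GNU touch -d; the C locale keeps the format predictable):

```shell
# Recent files get "mmm dd HH:MM"; files older than about six months
# get "mmm dd  yyyy" in the same columns.
dir=$(mktemp -d)
touch "$dir/recent"
touch -d '2 years ago' "$dir/old"   # GNU touch date parsing
LC_ALL=C ls -l "$dir"
```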

And, the #1 reason NOT to get information about files by "parsing ls": There is a better way!

The find command can be used to select files, and with the -print0 option, produce a list of filenames (strange and form control characters intact), separated by NUL 0x00 bytes.

The xargs command, with the "-0" option consumes the list of NUL separated filenames and passes them (again, intact) to the command specified on the xargs command line. The command could even be a bash script.

The stat command, given a list of filenames, can output any file information, in a format you can specify.

Read man find xargs stat.
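Putting the three together (a sketch; GNU findutils and coreutils assumed, and the demo directory is made up):

```shell
# find selects files and emits NUL-separated names; xargs -0 forwards
# them intact; stat formats each as "size<TAB>name".  A space in the
# name survives the whole pipeline untouched.
dir=$(mktemp -d)
printf 'hello' > "$dir/a file"
find "$dir" -type f -print0 | xargs -0 stat --printf '%s\t%n\n'
```

Note that the final line-oriented output would still be ambiguous for names containing newlines; keep the data NUL-separated for as long as you need it intact.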

For giggles, read man ls and try to work out how you could guarantee parsability.

Answered By: waltinator

Added for completeness’ sake

There’s a trick called “slash dot saves the day” which consists in using /./ as a magic anchor for determining if a newline was produced as a record separator or due to embedded newlines in the path.
When you use it in a glob as argument of ls -d, you’ll be able to parse the output accurately.

Limitations

  • Requires the use of a glob, so ARG_MAX might kick in at some point.

  • Not all options of ls can be parsed. For example ls -dl adds symlink ‑> target to the output, which makes the boundary between the symlink and the target ambiguous.

  • Might be useless for non-POSIX compliant implementations of ls.


Here’s an example that converts the output of ls -dnL, ls -drt, etc… into a TSV with C-style escaping:

{ command -p ls -dnL ././* /./*; echo /./; } |

LANG=C command -p awk -F '/[.]/' '
    function tsv_escape(s) {
        gsub(/\\/,"&&",s)
        gsub(/\n/,"\\n",s)
        gsub(/\t/,"\\t",s)
        gsub(/\r/,"\\r",s)
        return s
    }
    {
        if ( NF == 2 ) {
            if ( NR > 1 )
                print fields "/./" tsv_escape(filename)
            fields = $1
            filename = $2
            gsub(/[[:space:]]+/, "\t", fields)
        } else
            filename = filename "\n" $0
    }
'

note: I added an echo /./ so that awk doesn’t need an END block

As you can see, you need to start a relative path with ././, and an absolute path with /./.


I have to say, this trick would only be useful on systems that have neither GNU tools nor perl/python/zsh/etc. – which is probably a non-existent use case nowadays.

Answered By: Fravadona