Recursively list path of files only

Why

I have two folders that should contain exactly the same files, but their file counts differ. I would like to know which files and folders are present in one but not the other. My plan is to list all the files in each folder and then use comm to find the differences between the two.

Question

How do I recursively list files and folders in the format /path/to/dir and /path/to/dir/file?

Important notes

OS: Windows 11, subsystem Ubuntu 20.04.4 LTS

Folder locations: one on a network drive, one local

Folder size: ~2 TB each

Asked By: Olaf


You don’t need any of that: just use diff -qr dir1 dir2. For example:

$ tree
.
├── dir1
│   ├── file1
│   ├── file3
│   ├── file4
│   ├── file6
│   ├── file7
│   ├── file8
│   └── subdir1
│       ├── dsaf
│       ├── sufile1
│       └── sufile3
└── dir2
    ├── file1
    ├── file2
    ├── file3
    ├── file4
    ├── file9
    └── subdir1
        ├── sufile1
        └── sufile3

4 directories, 16 files

If I now run diff -qr (-r for "recursive" and -q to only report when the files differ, and not show the actual differences) on the two directories, I get:

$ diff -qr dir1/ dir2/
Only in dir2/: file2
Only in dir1/: file6
Only in dir1/: file7
Only in dir1/: file8
Only in dir2/: file9
Only in dir1/subdir1: dsaf

That said, the way to get a list of files is find:

$ find dir1 -type f
dir1/subdir1/dsaf
dir1/subdir1/sufile1
dir1/subdir1/sufile3
dir1/file6
dir1/file1
dir1/file8
dir1/file4
dir1/file7
dir1/file3

Then, you can remove the dir1/ and dir2/ using sed, and compare the output of two directories using process substitution in a shell that supports it:

$ comm -3 <(find dir1 -type f | sed 's|^dir1/||' | sort) <(find dir2 -type f | sed 's|^dir2/||' | sort)
    file2
file6
file7
file8
    file9
subdir1/dsaf

Note that this assumes file names with no newline characters. If you need to handle those, just use the diff -r approach above.

Answered By: terdon

Try:

 cd /path/1
 find . -type d -print | sort > list1.dir
 find . -type f -print | sort > list1.file
 cd /path/2
 find . -type d -print | sort > list2.dir
 find . -type f -print | sort > list2.file
  • sort ensures both lists are in the same order, which comm requires and which keeps the diff output small.
  • you might write the lists to an absolute path outside both trees, so that list1.file and list2.file do not "pollute" the results.
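
The second point can be sketched as follows. This is an illustration under assumptions rather than the answerer's exact method: make_lists, its three arguments, and /tmp/lists are hypothetical names, and the output directory is assumed to exist and to lie outside both trees.

```shell
# make_lists SRC PREFIX OUTDIR (hypothetical helper):
# writes OUTDIR/PREFIX.dir and OUTDIR/PREFIX.file, each holding sorted
# paths relative to SRC. The body runs in a subshell, so the cd does not
# affect the caller.
make_lists() (
  out=$(cd -- "$3" && pwd) || exit   # absolute path, still valid after cd
  cd -- "$1" || exit
  find . -type d -print | LC_ALL=C sort > "$out/$2.dir"
  find . -type f -print | LC_ALL=C sort > "$out/$2.file"
)

# Usage (paths are placeholders):
#   mkdir -p /tmp/lists
#   make_lists /path/1 list1 /tmp/lists
#   make_lists /path/2 list2 /tmp/lists
#   LC_ALL=C comm -3 /tmp/lists/list1.file /tmp/lists/list2.file
#   LC_ALL=C comm -3 /tmp/lists/list1.dir  /tmp/lists/list2.dir
```

Sorting and comparing under the same locale (here LC_ALL=C) avoids comm complaining that its input is not in sorted order.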
Answered By: Archemar

Note that directories on Unix are just one of many types of files. With find, you can search for them with -type d, or use the / qualifier in zsh globs. Other types of files include regular files (-type f, . glob qualifier, maybe what you meant by file), but also symlinks (-type l / @), devices, fifos, sockets…

To get the files of type directory, you can do:

find dir1/ -type d

And for files of any other type:

find dir1/ ! -type d

And same for dir2.

Now that comes with 3 main problems:

  • the printed paths will start with dir1/ for dir1 and dir2/ for dir2 which would make the comparison more difficult.
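  • the order is unspecified (find reports entries in whatever order the directories return them), so the two lists will not line up.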
  • the file paths are written one per line, but the newline character is as valid as any in a file path, or in other words, file paths can be made of several lines, so the output is not post-processable reliably.

Those can be addressed with GNU find and sort by using:

find dir1/ -type f -printf '%P\0' | LC_ALL=C sort -z

Where:

  • %P prints the path of the file relative to dir1
  • we sort the list (in the C locale as file paths don’t have to be made of text)
  • we use NUL-delimited records instead of lines as 0 is the only byte that cannot occur in a file path.
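
As a small illustration of why NUL-delimited records matter, here is a hedged sketch that counts files by counting NUL bytes instead of lines. It assumes GNU find (for -printf); the function name count_files is hypothetical.

```shell
# count_files DIR (hypothetical helper): number of regular files under DIR.
# Requires GNU find for -printf.
count_files() {
  # find emits one NUL per file; tr -cd '\0' deletes every byte that is
  # not NUL, so wc -c counts exactly one byte per file, whatever
  # characters the names contain.
  find "$1" -type f -printf '%P\0' | tr -cd '\0' | wc -c
}

# Usage:
#   count_files dir1/
#   count_files dir2/
```

A line-based count such as find dir1/ -type f | wc -l would instead count a file whose name contains a newline twice.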

Now, you can compare the list with:

list() {
  find "$@" -printf '%P\0' | LC_ALL=C sort -z
}
echo Directory differences:
comm -z3 <(list dir1/ -type d) <(list dir2/ -type d) | tr '\0' '\n'
echo Non-directory differences:
comm -z3 <(list dir1/ ! -type d) <(list dir2/ ! -type d) | tr '\0' '\n'

That output is not reliably post-processable, as we translate the NULs back to newlines for display and comm uses TABs to separate the columns, and both characters are valid in a file path.
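
If the result does need to be post-processed, one option is to ask comm for a single column and keep the NUL delimiters. The sketch below assumes GNU find, sort and comm (for -printf and -z) and uses temporary files rather than process substitution; only_in_first is a hypothetical name.

```shell
# only_in_first DIR1 DIR2 (hypothetical helper): NUL-delimited list of
# non-directories found under DIR1 but not under DIR2.
# The body runs in a subshell so the variables a and b stay local.
only_in_first() (
  a=$(mktemp) && b=$(mktemp) || exit
  find "$1" ! -type d -printf '%P\0' | LC_ALL=C sort -z > "$a"
  find "$2" ! -type d -printf '%P\0' | LC_ALL=C sort -z > "$b"
  # -23 suppresses columns 2 and 3, so comm adds no TAB indentation,
  # and -z keeps the records NUL-terminated.
  LC_ALL=C comm -z -23 "$a" "$b"
  rm -f "$a" "$b"
)

# Usage: swap the arguments to get the other direction.
#   only_in_first dir1/ dir2/ | tr '\0' '\n'    # for display only
```

The raw output can be passed to tools that accept NUL-delimited input, such as xargs -0, without being confused by newlines or TABs in file names.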

Alternatively, you can get the lists in zsh arrays and use its array comparison operators:

dirs_in_dir1=( dir1/**/*(ND/:s:dir1/::) )
dirs_in_dir2=( dir2/**/*(ND/:s:dir2/::) )
nondirs_in_dir1=( dir1/**/*(ND^/:s:dir1/::) )
nondirs_in_dir2=( dir2/**/*(ND^/:s:dir2/::) )

Then:

dirs_only_in_dir1=( ${dirs_in_dir1:|dirs_in_dir2} )
dirs_only_in_dir2=( ${dirs_in_dir2:|dirs_in_dir1} )
nondirs_only_in_dir1=( ${nondirs_in_dir1:|nondirs_in_dir2} )
nondirs_only_in_dir2=( ${nondirs_in_dir2:|nondirs_in_dir1} )

Then do whatever you need with those arrays, such as printing them raw in one column with:

print -rC1 -- $array

(or NUL-delimited so it can be post-processed by adding the -N option).

Answered By: Stéphane Chazelas