Recursively list path of files only
Why
I have two folders that should contain exactly the same files, but the file counts differ. I would like to know which files and folders are present in one but not the other. My plan is to list all the files in each folder and then use comm to find the differences between the two.
Question
How do I recursively list files and folders in the format /path/to/dir and /path/to/dir/file?
Important notes
OS: Windows 11, subsystem Ubuntu 20.04.4 LTS
Folder locations: one on a network drive, one local
Folder size: ~2 TB each
You don’t need any of that; just use diff -qr dir1 dir2. For example:
$ tree
.
├── dir1
│   ├── file1
│   ├── file3
│   ├── file4
│   ├── file6
│   ├── file7
│   ├── file8
│   └── subdir1
│       ├── dsaf
│       ├── sufile1
│       └── sufile3
└── dir2
    ├── file1
    ├── file2
    ├── file3
    ├── file4
    ├── file9
    └── subdir1
        ├── sufile1
        └── sufile3
4 directories, 16 files
If I now run diff -qr (-r for "recursive" and -q to only report when the files differ, and not show the actual differences) on the two directories, I get:
$ diff -qr dir1/ dir2/
Only in dir2/: file2
Only in dir1/: file6
Only in dir1/: file7
Only in dir1/: file8
Only in dir2/: file9
Only in dir1/subdir1: dsaf
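If only presence matters, the "Files ... differ" lines that diff -qr also prints for files whose contents differ can be filtered out. A minimal sketch, assuming GNU diff and file names without newlines; only_in is a hypothetical helper name, not part of diff:

```shell
# Hypothetical helper: list entries that exist in only one of two trees.
only_in() {
    # LC_ALL=C keeps diff's messages untranslated so the pattern matches.
    # diff exits with status 1 when it finds differences; the pipeline's
    # status is grep's (0 if at least one "Only in" line was printed).
    LC_ALL=C diff -qr -- "$1" "$2" | grep '^Only in '
}
```

It would be called as only_in dir1/ dir2/.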
That said, the way to get a list of files is find:
$ find dir1 -type f
dir1/subdir1/dsaf
dir1/subdir1/sufile1
dir1/subdir1/sufile3
dir1/file6
dir1/file1
dir1/file8
dir1/file4
dir1/file7
dir1/file3
Then, you can remove the dir1/ and dir2/ prefixes using sed, and compare the output for the two directories using process substitution in a shell that supports it:
$ comm -3 <(find dir1 -type f | sed 's|dir1/||' | sort) <(find dir2 -type f | sed 's|dir2/||' | sort)
	file2
file6
file7
file8
	file9
subdir1/dsaf
Note that this assumes file names with no newline characters. If you need to handle those, just use the diff -r approach above.
Try:
cd /path/1
find . -type d -print | sort > list1.dir
find . -type f -print | sort > list1.file
cd /path/2
find . -type d -print | sort > list2.dir
find . -type f -print | sort > list2.file
- sort is used to ensure both lists are in the same order, which keeps the output of diff or comm small.
- You might write the lists to an absolute path outside the two trees, so that list1.file and list2.file do not "pollute" the results.
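The comparison step can then be sketched as follows. This is only an illustration under some assumptions: mktemp is available, file names contain no newlines, and files_only_in_one is a hypothetical helper name. The lists go to temporary files outside both trees, so they cannot show up in the results:

```shell
# Hypothetical helper: print relative paths of regular files that exist
# in only one of two directory trees.
files_only_in_one() {
    list1=$(mktemp) && list2=$(mktemp) || return 1
    # Running find from inside each tree yields paths relative to it,
    # so no prefix needs stripping afterwards.
    (cd -- "$1" && find . -type f | LC_ALL=C sort) > "$list1"
    (cd -- "$2" && find . -type f | LC_ALL=C sort) > "$list2"
    # -3 hides lines common to both lists; lines unique to the second
    # tree are indented with a TAB.
    LC_ALL=C comm -3 "$list1" "$list2"
    rm -f -- "$list1" "$list2"
}
```

It would be called as files_only_in_one /path/1 /path/2. Replacing -type f with -type d gives the same comparison for directories.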
Note that directories on Unix are just one of many types of files. With find, you can search for them with -type d, or use the / qualifier in zsh globs. Other types of files include regular files (-type f, . glob qualifier, maybe what you meant by file), but also symlinks (-type l / @), devices, fifos, sockets…
To get the files of type directory, you can do:
find dir1/ -type d
And for files of any other type:
find dir1/ ! -type d
And the same for dir2.
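To see which types of files account for a difference in counts, the entries can be tallied per type. A rough sketch, assuming GNU find (for -printf and its %y directive, which prints the type letter as shown by ls -l); count_by_type is a hypothetical helper name:

```shell
# Hypothetical helper: count the entries under a directory by file type
# (d = directory, f = regular file, l = symlink, ...).
count_by_type() {
    # Only one letter is printed per entry, so unusual characters in
    # file names cannot affect the count. The starting directory itself
    # is included in the "d" count.
    find "$1" -printf '%y\n' | LC_ALL=C sort | uniq -c
}
```

Running count_by_type dir1/ and count_by_type dir2/ and comparing the two tallies shows whether the mismatch is in directories, regular files, or something else.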
Now that comes with 3 main problems:
- the printed paths will start with dir1/ for dir1 and dir2/ for dir2, which would make the comparison more difficult.
- the order is unspecified: find prints entries in whatever order it reads them from each directory, not sorted.
- the file paths are written one per line, but the newline character is as valid as any other in a file path. In other words, a file path can span several lines, so the output cannot be post-processed reliably.
Those can be addressed with GNU find and sort by using:
find dir1/ -type f -printf '%P\0' | LC_ALL=C sort -z
Where:
- %P prints the path of the file relative to dir1
- we sort the list (in the C locale as file paths don’t have to be made of text)
- we use NUL-delimited records instead of lines as 0 is the only byte that cannot occur in a file path.
Now, you can compare the list with:
list() {
  find "$@" -printf '%P\0' | LC_ALL=C sort -z
}
echo Directory differences:
comm -z3 <(list dir1/ -type d) <(list dir2/ -type d) | tr '\0' '\n'
echo Non-directory differences:
comm -z3 <(list dir1/ ! -type d) <(list dir2/ ! -type d) | tr '\0' '\n'
That output cannot be post-processed reliably, as we translate the NULs back to newlines for display, and comm uses TABs to separate the columns, which again are valid in a file path.
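One way to keep the result machine-readable is to ask comm for a single column at a time and leave the records NUL-delimited. A minimal sketch, assuming GNU find, sort and comm; only_in_first is a hypothetical helper name, and the temporary file stands in for process substitution so that it also runs in plain sh:

```shell
# Hypothetical helper: NUL-delimited list of paths found under $1 but
# not under $2. Remaining arguments are passed to find as its expression.
only_in_first() {
    first=$1 second=$2
    shift 2
    second_list=$(mktemp) || return 1
    find "$second" "$@" -printf '%P\0' | LC_ALL=C sort -z > "$second_list"
    # -23 keeps only the records unique to the first input (read from
    # stdin via "-"), so there is no TAB-indented second column.
    find "$first" "$@" -printf '%P\0' | LC_ALL=C sort -z |
        LC_ALL=C comm -z -23 - "$second_list"
    rm -f -- "$second_list"
}
```

For example, only_in_first dir1/ dir2/ ! -type d | tr '\0' '\n' displays the result, while leaving out the tr keeps it safe to feed into xargs -0 or similar.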
Alternatively, you can get the lists in zsh arrays and use its array comparison operators:
dirs_in_dir1=( dir1/**/*(ND/:s:dir1/::) )
dirs_in_dir2=( dir2/**/*(ND/:s:dir2/::) )
nondirs_in_dir1=( dir1/**/*(ND^/:s:dir1/::) )
nondirs_in_dir2=( dir2/**/*(ND^/:s:dir2/::) )
Then:
dirs_only_in_dir1=( ${dirs_in_dir1:|dirs_in_dir2} )
dirs_only_in_dir2=( ${dirs_in_dir2:|dirs_in_dir1} )
nondirs_only_in_dir1=( ${nondirs_in_dir1:|nondirs_in_dir2} )
nondirs_only_in_dir2=( ${nondirs_in_dir2:|nondirs_in_dir1} )
And do what you have to do with those arrays, like print them raw on 1 Column with:
print -rC1 -- $array
(or NUL-delimited, so it can be post-processed, by adding the -N option).