Rsync filter: copying one pattern only
I am trying to create a directory that will house all and only my PDFs compiled from LaTeX. I like keeping each project in a separate folder, all housed in a big folder called LaTeX. So I tried running:
rsync -avn *.pdf ~/LaTeX/ ~/Output/
which should find all the pdfs in ~/LaTeX/ and transfer them to the output folder. This doesn’t work. It tells me it’s found no matches for “*.pdf”. If I leave out this filter, the command lists all the files in all the project folders under LaTeX. So it’s a problem with the *.pdf filter. I tried replacing ~/ with the full path to my home directory, but that didn’t have an effect.
I’m using zsh. I tried doing the same thing in bash, and even with the filter, that listed every single file in every subdirectory… What’s going on here?
Why isn’t rsync understanding my pdf only filter?
OK. So update: Now I’m trying
rsync -avn --include="*/" --include="*.pdf" LaTeX/ Output/
And this gives me the whole file list. I guess because everything matches the first pattern…
How about this:
rsync -avn --include="*.pdf" ~/LaTeX/ ~/Output/
If you use a pattern like *.pdf, the shell “expands” that pattern, i.e. it replaces the pattern with all matches in the current directory. The command you are running (in this case rsync) is unaware of the fact that you tried to use a pattern.
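The expansion described above can be seen without rsync at all. The following is a minimal sketch, assuming a scratch directory created with mktemp (the file names are made up for illustration); it counts how many arguments a command would receive for an unquoted and a quoted pattern.

```shell
# Sketch: the shell, not the command, expands an unquoted glob.
demo=$(mktemp -d)            # hypothetical scratch directory
cd "$demo" || exit 1
touch a.pdf b.pdf notes.txt  # made-up example files

# Unquoted: the pattern is replaced by the matching file names.
set -- *.pdf
unquoted_count=$#

# Quoted: the literal string *.pdf is passed through unchanged.
set -- "*.pdf"
quoted_count=$#
quoted_arg=$1

printf 'unquoted: %s argument(s)\n' "$unquoted_count"
printf 'quoted:   %s argument(s): %s\n' "$quoted_count" "$quoted_arg"
```

This is why patterns meant for rsync's own filter options should be quoted.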
When you are using zsh, there is an easy solution, though: The ** pattern can be used to match folders recursively. Try this:
rsync -avn ~/LaTeX/**/*.pdf ~/Output/
Judging by the “INCLUDE/EXCLUDE PATTERN RULES” section of the manpage, the way to do this is
rsync -avn --include="*/" --include="*.pdf" ~/LaTeX/ ~/Output/
The critical difference between this and kbrd’s answer is the --include="*/" flag, which tells rsync to go ahead and copy any directories it finds, whatever they are named. This is needed because rsync will not recurse into a subdirectory unless it has been instructed to copy that subdirectory.
Also, note that the quotation marks prevent the shell from trying to expand the patterns to filenames relative to the current directory, and doing one of the following:
- Succeeding and messing up your filter (not too likely in the middle of a flag like that, though you really never know when someone will make a file named --include=foo.pdf…)
- Failing, and potentially producing an error instead of running the command (as you’ve discovered zsh does by default).
You can use find and an intermediate list of files (files_to_copy) to solve your issue. Make sure you’re in your home directory, then:
find LaTeX/ -type f -a -iname "*.pdf" > files_to_copy && rsync -avn --files-from=files_to_copy ~/ ~/Output/ && rm files_to_copy
Tested with Bash.
Here is something that should work without using find. The difference from answers already posted is the order of the filter rules. Filter rules in an rsync command work a lot like iptables rules: the first rule that a file matches is the one that is used. From the manual page:
As the list of files/directories to transfer is built, rsync checks each name to be transferred against the list of include/exclude patterns in turn, and the first matching pattern is acted on: if it is an exclude pattern, then that file is skipped; if it is an include pattern then that filename is not skipped; if no matching pattern is found, then the filename is not skipped.
Thus, you need a command as follows:
rsync -avn --include="**.pdf" --exclude="*" ~/LaTeX/ ~/Output/
Note the “**.pdf” pattern. According to the man page:
if the pattern contains a / (not counting a trailing /) or a “**”, then it is matched against the full pathname, including any leading directories. If the pattern doesn’t contain a / or a “**”, then it is matched only against the final component of the filename. (Remember that the algorithm is applied recursively so “full filename” can actually be any portion of a path from the starting directory on down.)
In my small test, this does work recursively down the directory tree and only selects the pdfs.
TL;DR:
rsync -am --include='*.pdf' --include='*/' --exclude='*' ~/LaTeX/ ~/Output/
Rsync copies the source(s) to the destination. If you pass *.pdf as sources, the shell expands this to the list of files with the .pdf extension in the current directory. No recursive traversal happens because you didn’t pass any directory as a source.
So you need to run rsync -a ~/LaTeX/ ~/Output/, but with a filter to tell rsync to copy .pdf files only. Rsync’s filter rules can seem daunting when you read the manual, but you can construct many examples with just a few simple rules.
- Inclusions and exclusions:
  - Excluding files by name or by location is easy: --exclude=*~, --exclude=/some/relative/location (relative to the source argument, e.g. this excludes ~/LaTeX/some/relative/location).
  - If you only want to match a few files or locations, include them, include every directory leading to them (for example with --include=*/), then exclude the rest with --exclude='*'. This is because:
    - If you exclude a directory, this excludes everything below it. The excluded files won’t be considered at all.
    - If you include a directory, this doesn’t automatically include its contents. In recent versions, --include='directory/***' will do that.
    - For each file, the first matching rule applies (and anything never matched is included).
- Patterns:
  - If a pattern doesn’t contain a /, it applies to the file name sans directory.
  - If a pattern ends with /, it applies to directories only.
  - If a pattern starts with /, it applies to the whole path from the directory that was passed as an argument to rsync.
  - * matches any substring of a single directory component (i.e. never matches /); ** matches any path substring.
- If a source argument ends with a /, its contents are copied (rsync -r a/ b creates b/foo for every a/foo). Otherwise the directory itself is copied (rsync -r a b creates b/a).
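The trailing-slash rule is easy to check on a throwaway tree. This is a small sketch, assuming scratch directories made with mktemp (the names a, b1, b2 and foo are placeholders), and it skips the rsync calls if rsync is not installed.

```shell
# Sketch: effect of a trailing slash on the source argument.
work=$(mktemp -d)                   # hypothetical scratch directory
mkdir -p "$work/a" "$work/b1" "$work/b2"
touch "$work/a/foo"

if command -v rsync >/dev/null 2>&1; then
    rsync -r "$work/a/" "$work/b1"  # copies the contents of a into b1
    rsync -r "$work/a"  "$work/b2"  # copies the directory a itself into b2
    # Show where foo ended up in each destination.
    (cd "$work" && find b1 b2 -type f | sort)
else
    echo "rsync not installed; skipping" >&2
fi
```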
Thus here we need to include *.pdf, include directories containing them, and exclude everything else.
rsync -a --include='*.pdf' --include='*/' --exclude='*' ~/LaTeX/ ~/Output/
Note that this copies all directories, even the ones that contain no matching file or subdirectory containing one. This can be avoided with the --prune-empty-dirs option (it’s not a universal solution since you then can’t copy a directory even by matching it explicitly, but that’s a rare requirement).
rsync -am --include='*.pdf' --include='*/' --exclude='*' ~/LaTeX/ ~/Output/
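To try the command above safely, one option is to run it against a scratch tree first. The sketch below assumes temporary directories from mktemp and invented project and file names; it is only an illustration of the filter, not part of the original answer.

```shell
# Sketch: run the PDF-only filter on a throwaway tree.
src=$(mktemp -d)    # stands in for ~/LaTeX (hypothetical)
dst=$(mktemp -d)    # stands in for ~/Output (hypothetical)
mkdir -p "$src/thesis/figures" "$src/notes"
touch "$src/thesis/thesis.pdf" "$src/thesis/thesis.tex" \
      "$src/thesis/figures/plot.pdf" "$src/notes/todo.txt"

if command -v rsync >/dev/null 2>&1; then
    # Include PDFs, include every directory, exclude the rest;
    # -m (--prune-empty-dirs) drops directories left without files.
    rsync -am --include='*.pdf' --include='*/' --exclude='*' "$src/" "$dst/"
    # List what arrived, relative to the destination root.
    (cd "$dst" && find . -type f | sort)
else
    echo "rsync not installed; skipping" >&2
fi
```

Only the two PDFs should show up, and the notes directory, which contains no PDF, should not be created.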
rsync -av --include="*/" --include="*.pdf" --exclude="*" ~/LaTeX/ ~/Output/ --dry-run
The default is to include everything, so you must explicitly exclude everything after including the files you want to transfer.
Remove the --dry-run to actually transfer the files.
If you start off with:
--exclude '*' --include '*.pdf'
Then the first rule already matches everything, so every file is excluded right off and the include is never reached.
If you try:
--include '*.pdf' --exclude '*'
Then only pdf files in the top level folder will be transferred. It won’t follow any directories, since those are excluded by ‘*’.
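Both orderings can be compared on a scratch tree. This is a rough sketch under the assumption of mktemp directories and made-up file names (top.pdf, sub/deep.pdf); the rsync calls are skipped if rsync is missing.

```shell
# Sketch: the order of --include and --exclude changes the result.
src=$(mktemp -d)      # hypothetical source tree
dst_a=$(mktemp -d)    # receives the exclude-first run
dst_b=$(mktemp -d)    # receives the include-first run
mkdir -p "$src/sub"
touch "$src/top.pdf" "$src/sub/deep.pdf"

if command -v rsync >/dev/null 2>&1; then
    # Exclude first: '*' matches every name before the include is consulted.
    rsync -a --exclude='*' --include='*.pdf' "$src/" "$dst_a/"
    # Include first, but no '*/' rule: sub/ is excluded, so deep.pdf is never seen.
    rsync -a --include='*.pdf' --exclude='*' "$src/" "$dst_b/"

    echo "exclude first:"; (cd "$dst_a" && find . -type f | sort)
    echo "include first:"; (cd "$dst_b" && find . -type f | sort)
else
    echo "rsync not installed; skipping" >&2
fi
```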
This is my preferred solution:
find source_dir -iname '*.jpg' -print0 | rsync -0 -v --files-from=- . destination_dir/
The find command is easier to understand than the include/exclude rules of rsync 🙂
If you want to copy only pdf files, just change .jpg to .pdf.
To generate a directory containing only headers (../include) from inside the source directory:
rsync -avh --prune-empty-dirs --exclude="build" --include="*/" --include="*.h" --exclude="*" ./* ../include/
This skips all empty directories as well as the directory build.
In an update to @Giles’ answer, please consider that the order of the include and exclude options must be changed with current versions (>=3.x.x) so that the include options come before the exclude options in order to build the correct file list. It is also my personal best practice to put the "include all subdirectories" rule first and then the file pattern:
rsync -avh --include='*/' --include='file-pattern' --exclude='*' /sourcedir/ /targetdir/
i.e. in your case:
rsync -avh --include='*/' --include='*.pdf' --exclude='*' ~/LaTeX/ ~/Output/
Further explanation can also be drawn from the manual at https://www.samba.org/ftp/rsync/rsync.html under the headline "FILTER RULES":
Note that, when using the --recursive (-r) option (which is implied by -a), every subdir component of every path is visited left to right, with each directory having a chance for exclusion before its content. In this way include/exclude patterns are applied recursively to the pathname of each node in the filesystem’s tree (those inside the transfer). The exclude patterns short-circuit the directory traversal stage as rsync finds the files to send.
For instance, to include "/foo/bar/baz", the directories "/foo" and "/foo/bar" must not be excluded. Excluding one of those parent directories prevents the examination of its content, cutting off rsync’s recursion into those paths and rendering the include for "/foo/bar/baz" ineffectual (since rsync can’t match something it never sees in the cut-off section of the directory hierarchy).
The concept path exclusion is particularly important when using a trailing ‘*’ rule. For instance, this won’t work:
+ /some/path/this-file-will-not-be-found
+ /file-is-included
- *
This fails because the parent directory "some" is excluded by the ‘*’ rule, so rsync never visits any of the files in the "some" or "some/path" directories. One solution is to ask for all directories in the hierarchy to be included by using a single rule: "+ */" (put it somewhere before the "- *" rule), and perhaps use the --prune-empty-dirs option. Another solution is to add specific include rules for all the parent dirs that need to be visited. For instance, this set of rules works fine:
+ /some/
+ /some/path/
+ /some/path/this-file-is-found
+ /file-also-included
- *
Here are some examples of exclude/include matching:
"- *.o" would exclude all names matching *.o
"- /foo" would exclude a file (or directory) named foo in the transfer-root directory
"- foo/" would exclude any directory named foo
"- /foo/*/bar" would exclude any file named bar which is at two levels below a directory named foo in the transfer-root directory
"- /foo/**/bar" would exclude any file named bar two or more levels below a directory named foo in the transfer-root directory
The combination of "+ */", "+ *.c", and "- *" would include all directories and C source files but nothing else (see also the --prune-empty-dirs option)
The combination of "+ foo/", "+ foo/bar.c", and "- *" would include only the foo directory and foo/bar.c (the foo directory must be explicitly included or it would be excluded by the "*")
The following modifiers are accepted after a "+" or "-":
A / specifies that the include/exclude rule should be matched against the absolute pathname of the current item. For example, "-/ /etc/passwd" would exclude the passwd file any time the transfer was sending files from the "/etc" directory, and "-/ subdir/foo" would always exclude "foo" when it is in a dir named "subdir", even if "foo" is at the root of the current transfer.
A ! specifies that the include/exclude should take effect if the pattern fails to match. For instance, "-! */" would exclude all non-directories.
A C is used to indicate that all the global CVS-exclude rules should be inserted as excludes in place of the "-C". No arg should follow.
An s is used to indicate that the rule applies to the sending side. When a rule affects the sending side, it prevents files from being transferred. The default is for a rule to affect both sides unless --delete-excluded was specified, in which case default rules become sender-side only. See also the hide (H) and show (S) rules, which are an alternate way to specify sending-side includes/excludes.
An r is used to indicate that the rule applies to the receiving side. When a rule affects the receiving side, it prevents files from being deleted. See the s modifier for more info. See also the protect (P) and risk (R) rules, which are an alternate way to specify receiver-side includes/excludes.
A p indicates that a rule is perishable, meaning that it is ignored in directories that are being deleted. For instance, the -C option's default rules that exclude things like "CVS" and "*.o" are marked as perishable, and will not prevent a directory that was removed on the source from being deleted on the destination.
An x indicates that a rule affects xattr names in xattr copy/delete operations (and is thus ignored when matching file/dir names). If no xattr-matching rules are specified, a default xattr filtering rule is used (see the --xattrs option).
For those who want a solution that does not copy the original directory structure (i.e. dumps all PDFs into one directory), this should work:
find SRC_DIR/ -type f -name '*.pdf' -print0 | xargs -0 -I {} cp {} DEST_DIR/
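A flattening copy of this kind can be tried on a scratch tree before pointing it at real data. The sketch below assumes mktemp directories and invented file names, and uses find's -exec instead of a pipe so that names with spaces are handled without extra options.

```shell
# Sketch: copy every PDF under a tree into one flat directory.
src=$(mktemp -d)     # hypothetical source tree
dest=$(mktemp -d)    # hypothetical flat destination
mkdir -p "$src/proj one" "$src/proj2"
touch "$src/proj one/a.pdf" "$src/proj2/b.pdf" "$src/proj2/b.tex"

# The quoted pattern is interpreted by find, not expanded by the shell.
find "$src" -type f -name '*.pdf' -exec cp {} "$dest/" \;

ls "$dest"
```

Because the directory structure is discarded, two PDFs with the same name in different projects would overwrite each other in the destination.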