Slow down a `split`
I have a really large archive consisting of really small files concatenated into a single text file, with a "" delimiter. For smaller archives, I would split the archive using "" as a pattern and then work on the resulting files. In this archive, however, there are on the order of a hundred million such files, which is clearly too many to put into a single directory. I have created folders `aa`, `ab`, etc. to try to move the files into them as they are created. However, I ran into issues. Things I've tried:
- There is no option for `split` to execute a command on each resulting file, so I have to do it by hand.
- Moving the files into the `aa` directory using `find . -name "xaa*" -exec mv {} aa +` does not work, because `{}` is not at the end of the line (see the sketch after this list).
- `mv`'s `-t` flag, for reversing the source and the destination, is not available in my version of Unix.
- I had to pipe the output of `find` into `xargs` for it to work at all. However, this is too slow; files are being created far faster than they are moved away.
- I suspect that `xargs` is processing fewer files at a time than a `+` after `find -exec` would. I tried adding a `-R 6000` flag, for running 6000 entries at a time; however, I don't think it made a difference.
- I decreased the priority of `split` to the lowest possible. There was no change in the amount of CPU it consumed, so probably no effect either.
- I opened up to seven command prompts for running the `mv` commands (last four letters per command prompt); however, this is still not nearly enough. I would open more, but once the system gets to seven, the response is so slow that I have to stop the `split`. For example, the source archive finished copying to a USB drive while I was still waiting for an `ls -l | tail` command to return anything.
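
Roughly what the failed `-exec` attempt and the `xargs` fallback look like (the `xaa*` pattern and `aa` target stand in for each letter pair):

```sh
# Fails: with "-exec ... +", the "{}" must come immediately before the "+",
# so the target directory cannot be placed after the file names.
find . -name 'xaa*' -exec mv {} aa +

# Works, but runs one mv process per file, which is far too slow here.
find . -name 'xaa*' | xargs -I {} mv {} aa
```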
So what I've been doing is stopping the `split` when the system bogs down like that, waiting for the `mv` commands to catch up, and then restarting the `split`. On the restarted run I use `find -exec rm {} +` to delete the files I already have; this is a bit faster, so when it gets to the files I don't have, there are fewer files around.
So the first such iteration lasted ~3 million files, the next one ~2 million, and the next ~1.5 million. I am sure there must be a better way, though. Any ideas for what else to try?
Something like `xargs -I {} ... mv {} aa` is still going to run `mv` once per line of input. From the POSIX specification of the `-I` option of `xargs`:

> Insert mode: utility is executed for each logical line from standard input.
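
As a quick illustration of the difference (not taken from the question itself):

```sh
printf '%s\n' one two three | xargs -I {} echo run {}
# run one
# run two
# run three          <- three separate invocations of echo, one per line

printf '%s\n' one two three | xargs echo run
# run one two three  <- a single invocation with the arguments batched
```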
You'd need something like `xargs -r sh -c 'mv "$@" aa' _` (or at that point, just `find ... -exec sh -c 'mv "$@" aa' _ {} +`) to really run a single `mv` for multiple files. With this, you're using the shell to insert the arguments between `mv` and the target directory.
"$@"
is substituted by the shell with all arguments without any field splitting or globbing.- The
_
is acting as$0
for the script specified tosh -c
. The arguments after that will be$1
,$2
, etc., or collectively,$@
.
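
Put together, a sketch of both forms (using the `xaa*` pattern and `aa` target from the question; the other prefixes work the same way):

```sh
# find batches the file names itself and hands each batch to one sh/mv pair:
find . -name 'xaa*' -exec sh -c 'mv "$@" aa' _ {} +

# Or feed xargs from find's output (file names must not contain newlines);
# -r avoids running mv at all when there is no input:
find . -name 'xaa*' | xargs -r sh -c 'mv "$@" aa' _
```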
Even with this, I think your `find` will be involved in race conditions. It might end up finishing reading the directory list before `split` ends, and so may not process all files. It might also end up recursing into the subdirectories you've made, detecting the files that were previously moved there, and trying to move `aa/xaa` to `aa/` again and erroring out (however, `-exec ... {} +` ignores the command's exit status).