Slow down a `split`

I have a really large archive consisting of really small files, concatenated into a
single text file with a "" delimiter. For smaller archives, I would split the archive
using "" as a pattern and then work on the resulting files. In this archive, however,
there are on the order of a hundred million such files, clearly too many to put into a
single directory. I have created folders aa, ab, etc. so that I can move the files into
them as they are created. However, I ran into issues. Things I’ve tried:

  1. split has no option to run a command on each file it produces, so I have to do
    it by hand.

  2. Moving the files into the aa directory using find . -name "xaa*" -exec mv {} aa + does not work, because with -exec ... + the {} must be the last argument before the +.

  3. The -t flag of mv, which lets you give the destination directory first, is not
    available in my version of Unix.

  4. So I ended up piping the output of find into xargs to get something that works
    (roughly as sketched below).

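To make that concrete, here is roughly the shape of what runs in each command prompt (the prefix and the exact flags are approximate, not my literal commands):

    # rough reconstruction: move each finished xaa* chunk into the aa directory
    find . -name 'xaa*' -print | xargs -I {} mv {} aa
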
However, this is too slow — files are being created way faster than they are moved
away.

  1. I suspect that xargs is processing fewer files at a time than find -exec with a +
    would (see the small test after this list). I tried adding a -R 6000 flag to run
    6000 entries at a time, but I don’t think it made a difference.

  2. I decreased the priority of split to the lowest possible. No change in the
    amount of CPU it consumed, so probably no effect either.

  3. I open up to seven command prompts for running the mv commands (the last four
    letters per command prompt), but this is still not nearly enough. I would open more,
    but once the system gets to seven, the response is so slow that I have to stop the
    split. For example, the source archive got copied to a USB drive entirely while I
    was waiting for an ls -l | tail command to return anything.
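
For what it’s worth, a quick test with echo standing in for mv seems to confirm the suspicion in point 1 that this style of xargs runs one command per input line:

    # one invocation per input line:
    printf '%s\n' f1 f2 f3 | xargs -I {} echo mv {} aa
    # a single batched invocation (but then the target directory
    # cannot simply be appended, which is the original problem):
    printf '%s\n' f1 f2 f3 | xargs echo mv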

So what I’ve been doing is stopping the split at that point, waiting for the mv
commands to catch up, and then restarting the split. At that point I use
find -exec rm {} + to delete the files I already have; this is a bit faster, so by the
time it gets to the files I don’t have yet, there are fewer files around.

So the first such iteration lasted for ~3 million files, the next one for ~2 million,
and the next for ~1.5 million. I am sure there must be a better way, though. Any ideas
for what else to try?

Asked By: Alex


Something like xargs -I {} ... mv {} aa is still going to run mv once for each line of input. From the POSIX specification of the -I option of xargs:

Insert mode: utility is executed for each logical line from standard input.

You’d need something like xargs -r sh -c 'mv "$@" aa' _ (or at that point, just find ... -exec sh -c 'mv "$@" aa' _ {} +) to really run a single mv for multiple files. With this, you’re using the shell to insert the arguments between mv and the target directory.

  • "$@" is substituted by the shell with all arguments without any field splitting or globbing.
  • The _ is acting as $0 for the script specified to sh -c. The arguments after that will be $1, $2, etc., or collectively, $@.
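
As a rough illustration of how the arguments line up (the file names here are only placeholders):

    # _ becomes $0; the names collected by find become "$@"
    sh -c 'echo "$0 got: $@"' _ xaaaaaa xaaaaab xaaaaac
    # prints: _ got: xaaaaaa xaaaaab xaaaaac

    # put together with the -name test from your question (a sketch,
    # not a drop-in command):
    find . -name 'xaa*' -exec sh -c 'mv "$@" aa' _ {} +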

Even with this, I think your find will run into race conditions. It might finish reading the directory listing before split ends, and so may not process all the files. It might also recurse into the subdirectories you’ve made, pick up the files that were previously moved there, and then try to move aa/xaa to aa/ again and error out (although -exec ... {} + carries on regardless of the command’s exit status).
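
If the recursion does turn out to be a problem, one possible tweak (only a sketch, reusing the same assumed names) is to prune every subdirectory, so that find never descends into the destination folders:

    # prune all directories below . ; only top-level xaa* files reach -exec
    find . -type d ! -name . -prune -o -name 'xaa*' -exec sh -c 'mv "$@" aa' _ {} +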

Answered By: muru