split a file, pass each piece as a param to a script, run each script in parallel

I have a words.txt with 10000 words (one to a line). I have 5,000 documents. I want to see which documents contain which of those words (with a regex pattern around the word). I have a script.sh that greps the documents and outputs hits. I want to (1) split my input file into smaller files (2) feed each of the files to script.sh as a parameter and (3) run all of this in parallel.

My attempt based on the tutorial is hitting errors

$ parallel ./script.sh ::: split words.txt # ./script.sh: line 22: split: No such file or directory

My script.sh looks like this

#!/usr/bin/env bash

line 1 while read line
line 2  do
        some stuff
line 22 done < $1

I guess I could have split write its output to a directory and then loop through the files in that directory, launching grep commands, but how can I do this elegantly and concisely (using parallel)?

Asked By: bernie2436


You can use the split tool:

split -l 1000 words.txt words-

will split your words.txt file into files of at most 1000 lines each, named words-aa, words-ab, and so on.


If you omit the prefix (words- in the above example), split uses x as the default prefix.
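To make the naming concrete, here is a small sketch using a generated stand-in for words.txt:

```shell
# generate a 10000-line stand-in for words.txt
seq 10000 > words.txt

# split into chunks of at most 1000 lines, prefixed with "words-"
split -l 1000 words.txt words-

ls words-??        # words-aa words-ab ... words-aj (10 files)
wc -l < words-aa   # 1000
```

If you ever need more than 676 chunks, increase the suffix length with split's -a option.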

For using the generated files with parallel you can make use of a glob:

split -l 1000 words.txt words-
parallel ./script.sh ::: words-[a-z][a-z]
Answered By: Joseph R.

You probably do not need the temporary files at all, since your while-read loop can just as well read from STDIN (drop the "< $1" after done). Then there is really no reason to use split. Get rid of the files by using --pipe:

cat words.txt | parallel --pipe -L1000 -N1 ./script.sh
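For instance, here is a toy stand-in for script.sh (the name count.sh and its line counting are invented for illustration) that reads its chunk from STDIN; this assumes GNU parallel is installed:

```shell
# hypothetical stand-in for script.sh that reads its chunk from STDIN
cat > count.sh <<'EOF'
#!/usr/bin/env bash
n=0
while read -r line; do
    n=$((n+1))          # "some stuff" would go here
done
echo "$n"
EOF
chmod +x count.sh

# 3000 input lines, records of 1000 lines, one record per job:
# prints "1000" three times (job order may vary)
seq 3000 | parallel --pipe -L1000 -N1 ./count.sh
```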

If it is really just a grep you want:

find dir-with-5000-files -type f | parallel -X grep -f words.txt 

If words.txt is too big to fit in memory, you can split that up:

find dir-with-5000-files -type f | parallel -X "cat words.txt | parallel --pipe grep -f -"

The man page of GNU Parallel covers how to most efficiently grep n lines for m regular expressions: https://www.gnu.org/software/parallel/parallel_examples.html#example-grepping-n-lines-for-m-regular-expressions

The simplest solution to grep a big file for a lot of regexps is:

grep -f regexps.txt bigfile

Or if the regexps are fixed strings:

grep -F -f regexps.txt bigfile
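A tiny made-up illustration of both forms (file names and contents are invented):

```shell
# made-up pattern file and data file
printf 'foo\nbar\n' > regexps.txt
printf 'a foo here\nnothing\nbar there\n' > bigfile

grep -f regexps.txt bigfile      # prints the first and third lines
grep -F -f regexps.txt bigfile   # same here, but patterns are taken as literal strings
```

The -F form is worth using whenever the patterns contain no regex metacharacters, as fixed-string matching is typically much faster.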

There are 2 limiting factors: CPU and disk I/O. CPU is easy to measure: If the grep takes >90% CPU (e.g. when running top), then the CPU is a limiting factor, and parallelization will speed this up. If not, then disk I/O is the limiting factor, and depending on the disk system it may be faster or slower to parallelize. The only way to know for certain is to measure.

If the CPU is the limiting factor parallelization should be done on the regexps:

cat regexp.txt | parallel --pipe -L1000 --round-robin grep -f - bigfile

This will start one grep per CPU and read bigfile one time per CPU, but as that is done in parallel, all reads except the first will be cached in RAM. Depending on the size of regexp.txt it may be faster to use --block 10m instead of -L1000. If regexp.txt is too big to fit in RAM, remove --round-robin and adjust -L1000. This will cause bigfile to be read more times.
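A toy sketch of the round-robin splitting (made-up data; assumes GNU parallel is installed). Each grep gets a share of the patterns and scans the whole file, so the union of their outputs is the full result:

```shell
# toy data: 4 anchored patterns, handed out 2 per record
seq 100 > bigfile
printf '^1$\n^2$\n^3$\n^4$\n' > regexp.txt

# each worker greps bigfile with its subset of the patterns;
# sort merges the interleaved outputs: prints 1 2 3 4, one per line
cat regexp.txt | parallel --pipe -L2 --round-robin grep -f - bigfile | sort -n
```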

Some storage systems perform better when reading multiple chunks in parallel. This is true for some RAID systems and for some network file systems. To parallelize the reading of bigfile:

parallel --pipepart --block 100M -a bigfile grep -f regexp.txt

This will split bigfile into 100MB chunks and run grep on each of these chunks. To parallelize both the reading of bigfile and of regexp.txt, combine the two using --fifo:

parallel --pipepart --block 100M -a bigfile --fifo cat regexp.txt \| parallel --pipe -L1000 --round-robin grep -f - {}

Note that the pipe is escaped (\|) so that it becomes part of the command each outer job runs, with {} replaced by the fifo for that chunk of bigfile.
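Here is a scaled-down sketch of the plain --pipepart variant with invented toy files (assumes GNU parallel is installed; a tiny --block forces several chunks, and --pipepart keeps chunk boundaries on line boundaries):

```shell
# toy demo of --pipepart: ~4KB file cut into ~1KB chunks
seq 1000 > bigfile
printf '^500$\n' > regexp.txt

parallel --pipepart --block 1k -a bigfile grep -f regexp.txt   # prints 500
```

Chunks without a match make their grep exit non-zero, so parallel's overall exit status may be non-zero even though the output is correct.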
Answered By: Ole Tange