A GNU Parallel Job Queue Script

I found a script on GitHub which I’ve slightly modified to fit the needs of the program I’m trying to run in a queue.

It is not working, however, and I’m not sure why. It never actually echoes the jobs to the queue file.

Here is a link to the GitHub page:

https://gist.github.com/tubaterry/c6ef393a39cfbc82e13b8716c60f7824

Here is the version I modified:

#!/bin/sh

END="END"
true > queue

tail -n+0 -f queue | parallel -j 16 -E "$END"

while read i; do
    echo "command [args] > ${i%.inp}.log 2> ${i%.inp}.err" > queue
done < "jobs.txt"

echo "$END" >> queue
echo "Waiting for jobs to complete"

while [ "$(pgrep 'perl /usr/local/bin/parallel' | grep -evc 'grep' | tr -d " ")" -gt "0" ]; do
    sleep 1
done

touch killtail
mv killtail queue
rm queue

The only thing I can think of is that one of these steps isn’t operating as expected on OpenBSD. When I rearranged one step, everything executed without errors, but only one job was submitted. The change was moving tail -n+0 -f queue | parallel -j 16 -E "$END" after the first while loop and changing true > queue to touch queue, since I’m not quite sure what true > queue means.

Any help would be appreciated.

EDIT:

I have a jobs.txt file filled with the paths to the input files for the command I plan to run. Each file in jobs.txt would be one of the arguments to command, and I then write the results of the calculation to a log file and any errors to an error file.

My expectation is that each job will be added to queue and parallel will execute up to 16 jobs, one per core as one of the arguments to command is the utilisation of one core per calculation. This will continue until it reaches the “END” which is signified by the -E argument to parallel.

As written, nothing echoes from jobs.txt to queue. I will try again with a >>.

I have questioned quite a few things in the original script. I changed the things I’m sure about but some of the functionality I was very confused by and decided to leave it as is.

One of those things I’m not clear on is tail -n+0

I have no idea what that is doing

EDIT2:

${PROGRAM} ${JOB}.inp ${NCPU} > ${JOB}.log 2> ${JOB}.err

${JOB} is a reference to anywhere between 1 and ∞ calculations depending on how many I need to do at a given time. Currently, jobs.txt has 374 individual tests that I need to run. ${PROGRAM} is the software that takes the parameters from ${JOB}.inp and calculates accordingly. ${NCPU} is how many cores I want to use per job; currently I am trying to run each job in serial on a 16-core processor.

The goal is to queue as many calculations as I need to without ever typing that full command in. I just want to generate a list using find calculations -name '*.inp' -print > jobs.txt and then run a script such as SerialRun.sh or ParallelRun.sh and let it crank out results. The jobs may be nested in many different directories depending on how different users choose to organise their work and this method using find allows me to very quickly submit a job and generate results to the correct paths. As each calculation finishes, I can then analyse the data while the system continues to run through the tests.

The script may well be overcomplicated. I was looking for a job queue system and found nqs, which became the GNU Parallel project. I cannot find many examples of queueing jobs with parallel, but I came across that script on GitHub and decided to give it a shot. I have quite a few issues with how it is written, but I don’t understand parallel well enough to question it.

I figured it should be a bit simpler than this to build a queue for it.

EDIT3:

Maybe the correct way to go about this is to just do:

while read i; do
    command "$i" > "${i%.inp}".log 2> "${i%.inp}".err | parallel -j 16
done < "jobs.txt"

Would that work?

Asked By: brokaryote

||

You don’t need this complex script; parallel can do what you want by itself. Just remove the .inp extension from your list of files using sed or any other tool of your choice, and feed the base name to parallel like this:

sed 's/\.inp$//' jobs.txt | parallel -j 16 "${PROGRAM} {}.inp > {}.log 2> {}.err"

The {} notation is part of parallel’s basic functionality, described in man parallel as follows:

{} Input line.

This replacement string will be replaced by a full line read from the input source. The input source is normally stdin (standard input), but can also
be given with --arg-file, :::, or ::::.

So it is simply replaced by whatever you pass to parallel, in this case the list of file names with their extension removed by sed.
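As a quick check of the sed step on its own, the sketch below strips the extension from two made-up paths. The paths and the helper name strip_inp are only illustrative and do not come from the question. It assumes every entry ends in .inp; escaping the dot and anchoring the pattern with $ keeps sed from touching a directory name that happens to contain something like ".inp" earlier in the path.

```shell
#!/bin/sh
# strip_inp: hypothetical helper that removes a trailing ".inp" from each
# line on stdin. The dot is escaped and the pattern is anchored with "$",
# so only a literal ".inp" at the end of the line is removed.
strip_inp() {
    sed 's/\.inp$//'
}

# Made-up sample paths standing in for the contents of jobs.txt.
printf '%s\n' 'calculations/a/test1.inp' 'calculations/b.inputs/test2.inp' |
    strip_inp
```

The two lines printed are calculations/a/test1 and calculations/b.inputs/test2, which is what parallel would then receive as {}.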

Alternatively, you can use {.} which is:

{.} Input line without extension.

This replacement string will be replaced by the input with the
extension removed. If the input line contains . after the last /,
the last . until the end of the string will be removed and {.}
will be replaced with the remaining. E.g. foo.jpg becomes foo,
subdir/foo.jpg becomes subdir/foo, sub.dir/foo.jpg becomes
sub.dir/foo, sub.dir/bar remains sub.dir/bar. If the input line
does not contain . it will remain unchanged.

The replacement string {.} can be changed with --extensionreplace
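One way to see what {.} expands to before running anything is parallel's --dry-run option, which prints each command instead of executing it. This is a minimal sketch, assuming GNU parallel is installed; preview_jobs, the program name myprog, and the sample path are placeholders, not names from the question.

```shell
#!/bin/sh
# preview_jobs: hypothetical helper. Prints the command line that parallel
# would build for each input file given as an argument, without running it.
# "myprog" is a placeholder for the real program.
preview_jobs() {
    parallel --dry-run 'myprog {.}.inp > {.}.log 2> {.}.err' ::: "$@"
}

# Only run the example if the installed "parallel" is GNU parallel.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    preview_jobs calculations/a/test1.inp
fi
```

For the path shown, {.} is replaced by calculations/a/test1 in all three places, so the .inp, .log and .err names all sit next to the input file.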

With this, you don’t even need the jobs.txt file. If all of your files are in the same directory, you can do:

parallel -j 16 "${PROGRAM} {.}.inp > {.}.log 2> {.}.err" ::: *.inp

Or, to make it recursively descend into subdirectories, assuming you are using bash, you can do:

shopt -s globstar
parallel -j 16 "${PROGRAM} {.}.inp > {.}.log 2> {.}.err" ::: **/*.inp
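If bash is not available (the script in the question runs under /bin/sh), the same recursive search can probably be done with find, which the question already uses to build jobs.txt. This is a sketch under those assumptions: run_all is a hypothetical wrapper, PROGRAM and NCPU are the variables from the question, and file names are assumed not to contain newlines.

```shell
#!/bin/sh
# run_all: hypothetical wrapper. Finds every *.inp file below the directory
# given as $1 and runs up to 16 jobs at a time through GNU parallel.
# PROGRAM and NCPU are expected to be set by the caller, as in the question.
run_all() {
    find "$1" -name '*.inp' -print |
        parallel -j 16 "${PROGRAM} {.}.inp ${NCPU} > {.}.log 2> {.}.err"
}

# Example call (not run here), with a placeholder program name:
# PROGRAM=myprog; NCPU=1; run_all calculations
```

Because find feeds the paths straight into parallel, the intermediate jobs.txt file becomes optional; it can still be kept if a record of the submitted jobs is useful.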
Answered By: terdon