Combining a large number of files

I have ±10,000 files (res.1 … res.10000), all consisting of one column and an equal number of rows.
What I want is, in essence, simple: merge all the files column-wise into a new file final.res. I have tried using:

paste res.*

However, although this seems to work for a small subset of the result files, it gives the following error when run on the whole set: Too many open files.

There must be an ‘easy’ way to get this done, but unfortunately I’m quite new to Unix. Thanks in advance!

PS: To give you an idea of what (one of my) datafile(s) looks like:

Asked By: mats


Try executing it this way:

ls res.* | xargs paste >final.res

You can also split the batch in parts and try something like:

paste `echo res.{1..100}` >final.100
paste `echo res.{101..200}` >final.200

and at the end combine the final files:

paste final.* >final.res
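The batching above can also be scripted rather than written out by hand. A minimal sketch, scaled down to 20 demo files in batches of 5 (the file and batch counts are illustrative; zero-padding the suffix keeps final.* in column order):

```sh
#!/bin/sh
# Scaled-down demo of batched pasting: 20 one-column files, batches of 5.
dir=$(mktemp -d) && cd "$dir" || exit 1
for i in $(seq 1 20); do seq 3 > "res.$i"; done    # demo data: 3 rows each

batch=5
for start in $(seq 1 "$batch" 20); do
    end=$((start + batch - 1))
    # Paste one batch; zero-pad the suffix so final.* globs in column order.
    paste $(seq -f 'res.%g' "$start" "$end") > "final.$(printf '%03d' "$end")"
done
paste final.* > final.res
```

With the real data the same loop would run with batch=1000 over res.1 … res.10000; each paste then only has a batch’s worth of files open at a time.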
Answered By: Romeo Ninov

If you have root permissions on that machine you can temporarily increase the “maximum number of open file descriptors” limit:

ulimit -Hn 10240 # The hard limit
ulimit -Sn 10240 # The soft limit

And then

paste res.* >final.res

After that you can set it back to the original values.
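To be able to set them back, record the original values first. A small sketch (the printed values vary per system; note that a non-root shell cannot raise a hard limit it has lowered):

```sh
#!/bin/sh
orig_soft=$(ulimit -Sn)    # current soft limit on open file descriptors
orig_hard=$(ulimit -Hn)    # current hard limit (may be "unlimited")
echo "soft=$orig_soft hard=$orig_hard"
```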

A second solution, if you cannot change the limit:

for f in res.*; do cat final.res | paste - $f >temp; cp temp final.res; done; rm temp

It calls paste once for each file, and at the end there is one huge file with all the columns (this takes a while).

Edit: Useless use of cat? Not!

As mentioned in the comments, the use of cat here (cat final.res | paste - $f >temp) is not useless. The first time the loop runs, the file final.res doesn’t exist yet; paste would then fail, and the file would never be created or filled. With my solution, only cat fails the first time with No such file or directory, while paste just reads an empty file from stdin and continues. The error can be ignored.
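If you would rather avoid the error message entirely, an equivalent variant (a sketch, not part of the original answer) seeds an empty final.res first; like the original, it leaves a leading tab from the initially empty first column:

```sh
#!/bin/sh
dir=$(mktemp -d) && cd "$dir" || exit 1
for i in 1 2 3; do seq 2 > "res.$i"; done    # demo data: 3 files, 2 rows each

: > final.res                                # seed an empty file so paste never fails
for f in res.*; do
    paste final.res "$f" > temp && mv temp final.res
done
```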

Answered By: chaos

If chaos’ answer isn’t applicable (because you don’t have the required permissions), you can batch up the paste calls as follows:

ls -1 res.* | split -l 1000 -d - lists
for list in lists*; do paste $(cat $list) > merge${list##lists}; done
paste merge* > final.res

This lists the files 1000 at a time in files named lists00, lists01 etc., then pastes the corresponding res. files into files named merge00, merge01 etc., and finally merges all the resulting partially merged files.

As mentioned by chaos, you can increase the number of files used at once; the limit is the value given by ulimit -n minus however many files you already have open, so you’d say

ls -1 res.* | split -l $(($(ulimit -n)-10)) -d - lists

to use the limit minus ten.

If your version of split doesn’t support -d, you can remove it: all it does is tell split to use numeric suffixes. By default the suffixes will be aa, ab etc. instead of 01, 02 etc.

If there are so many files that ls -1 res.* fails (“argument list too long”), you can replace it with find which will avoid that error:

find . -maxdepth 1 -type f -name 'res.*' | split -l 1000 -d - lists

(As pointed out by don_crissti, -1 shouldn’t be necessary when piping ls’s output; but I’m leaving it in to handle cases where ls is aliased with -C.)
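Scaled down, the whole pipeline can be exercised end to end. A sketch with 20 demo files in batches of 5 (illustrative sizes; -d requires GNU split, as noted above):

```sh
#!/bin/sh
# Scaled-down run: 20 one-column files, batches of 5.
dir=$(mktemp -d) && cd "$dir" || exit 1
for i in $(seq 1 20); do seq 3 > "res.$i"; done    # demo data: 3 rows each

ls -1 res.* | split -l 5 -d - lists                # lists00 .. lists03
for list in lists*; do paste $(cat "$list") > "merge${list##lists}"; done
paste merge* > final.res
```

Note that ls sorts the names lexically (res.1, res.10, res.11, …), so with numeric suffixes the columns come out in lexical rather than numeric order.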

Answered By: Stephen Kitt

Given the number of files, line sizes, etc. involved, I think it would exceed the default limits of the usual tools (awk, sed, paste, etc.).

I would create a small program for this: it neither keeps 10,000 files open nor builds a line hundreds of thousands of characters long (10,000 files × 10, the maximum line size in the example). It only needs an array of ~10,000 integers to store the number of bytes already read from each file. The disadvantage is that it uses only one file descriptor, reused for each file on each line, and this could be slow.

The definitions of FILES and ROWS should be changed to the actual exact values. The output is sent to the standard output.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define FILES 10000 /* number of files */
#define ROWS 500    /* number of rows  */

int main(void) {
   long positions[FILES + 1];
   FILE *file;
   int r, f;
   char filename[100];
   size_t linesize = 100;
   char *line = malloc(linesize);

   for (f = 1; f <= FILES; positions[f++] = 0); /* sets the initial positions to zero */

   for (r = 1; r <= ROWS; ++r) {
      for (f = 1; f <= FILES; ++f) {
         sprintf(filename, "res.%d", f);                  /* creates the name of the current file */
         file = fopen(filename, "r");                     /* opens the current file */
         fseek(file, positions[f], SEEK_SET);             /* seeks to the saved position */
         positions[f] += getline(&line, &linesize, file); /* reads a line and saves the new position */
         line[strlen(line) - 1] = 0;                      /* removes the trailing newline */
         printf("%s ", line);                             /* prints the field and a single space */
         fclose(file);                                    /* closes the current file */
      }
      printf("\n");  /* after getting a line from each file, prints a newline */
   }
   free(line);
   return 0;
}
Answered By: Laurence R. Ugalde

{ paste res.? res.?? res.???
while paste ./res."$((i+=1))"[0-9][0-9][0-9]
do :; done; } >outfile

I don’t think this is as complicated as all that – you’ve already done the hard work by ordering the filenames. Just don’t open all of them at the same time, is all.

Another way:

pst()      if   shift "$1"
           then paste "$@"
           fi
set ./res.*
while  [ -n "${1024}" ] ||
     ! paste "$@"
do     pst "$(($#-1023))" "$@"
       shift 1024
done >outfile

…but I think that does them backwards… This might work better:

i=0;  echo 'while paste \'
until [ "$((i+=1))" -gt 1023 ] &&
      printf '%s\n' '"${1024}"' 'do shift 1024; done'
do    echo '"${'"$i"'-/dev/null}" \'
done | sh -s -- ./res.* >outfile

And here is yet another way:

tar --no-recursion -c ./ |
{ printf \\0; tr -s \\0; }    |
cut -d '' -f-2,13              |
tr '\n' '\n\t' >outfile

That lets tar gather all of the files into a null-delimited stream for you, parses out all of its header metadata except the filename, and transforms all lines in all files to tabs. It relies on the input being actual text files, though – meaning each ends with a newline and there are no null bytes in the files. Oh – and it also relies on the filenames themselves being newline-free (though that might be handled robustly with GNU tar’s --xform option). Provided these conditions are met, it should make very short work of any number of files – and tar will do almost all of it.

The result is a set of lines that look like:


And so on.

I tested it by first creating 5 test files. I didn’t really feel like generating 10,000 files just now, so I just went a little bigger for each – and also ensured that the file lengths differed by a great deal. This is important when testing tar scripts, because tar will block its input out to fixed lengths – if you don’t try at least a few different lengths, you’ll never know whether you’ll actually handle only the one.

Anyway, for the test files I did:

for f in 1 2 3 4 5; do : >./"$f"
seq "${f}000" | tee -a [12345] >>"$f"
done

ls afterward reported:

ls -sh [12345]
68K 1 68K 2 56K 3 44K 4 24K 5

…then I ran…

tar --no-recursion -c ./ |
{ printf \\0; tr -s \\0; }|
cut -d '' -f-2,13          |
tr '\n' '\n\t' | cut -f-25

…just to show only the first 25 tab-delimited fields per line (because each file is a single line – there are a lot)

The output was:

1    2    3    4    5    6    7    8    9    10    11    12    13    14    15    16    17    18    19    20    21    22    23    24    25
1    2    3    4    5    6    7    8    9    10    11    12    13    14    15    16    17    18    19    20    21    22    23    24    25
1    2    3    4    5    6    7    8    9    10    11    12    13    14    15    16    17    18    19    20    21    22    23    24    25
1    2    3    4    5    6    7    8    9    10    11    12    13    14    15    16    17    18    19    20    21    22    23    24    25
1    2    3    4    5    6    7    8    9    10    11    12    13    14    15    16    17    18    19    20    21    22    23    24    25
Answered By: mikeserv

In case someone stumbles upon this looking to accomplish something similar with cat – I had no luck raising the file limit, as zsh complained about the argument list being too long regardless.

But this can be easily and safely avoided using find (not ls, as that produces edge cases when combined with xargs):

find DIR -type f -print0 | xargs -0 cat >OUT
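A quick sanity check of that pipeline (demo directory and file names are illustrative; remember that cat concatenates row-wise, unlike paste):

```sh
#!/bin/sh
dir=$(mktemp -d) || exit 1
printf '1\n' > "$dir/a"
printf '2\n' > "$dir/b"
# -print0 / -0 keep unusual file names (spaces, even newlines) intact.
find "$dir" -type f -print0 | xargs -0 cat > "$dir.out"
sort "$dir.out"    # find's traversal order is unspecified, so sort for display
```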
Answered By: xeruf