Merging large data set in fastest possible way

I have a large data set ~100 GBs having smaller csv files ~ 100,000. I wish to merge all .csv files together in the fastest possible way. Files are located in two different folders.

  1. Someone suggested me to create two files having file names of the
    file from both the folders, and using the file name, iterate in the
    directory to append the output files, rather than using glob ls *.
    How can I achieve this using Linux. I’m fairly new to this, if
    someone can help me out, I would appreciate.
  2. If there any other way, considering the data set, to achieve the
    result
Asked By: user4943236

||

Try something like this:

find /path/to/dir1 /path/to/dir2 -type f -name '*.csv' -exec cat {} + >/path/to/merged.csv

Make sure not to put the merged file in either of the input directories. 🙂

Answered By: lcd047

If I understand you correctly, you want to create a single (100GB-sized?) file from all of your files. In that case cat might do the trick:

cat foo1 foo2 bar1 bar2 > newfile

or in your case

cat folder1/* folder2/* > newfile

which will take all files from folder1 and folder2 and merge them (as text) into newfile. note that cat just appends the files at the end (i.e. adding lines)

This will however keep the old files and thus you’ll need twice the space, so you might also treat them individually:

for f in folder1/* folder2/* ; do
    cat $f >> newfile && rm $f
done

that way each file is added and then removed.

Is that what you were searching for?

Answered By: FelixJN

If all the CSV files have the same structure (header), and placed in 2-levels sub-directories, you might want to use DuckDB which can handle such load (100k files / GB of data) with reasonably recent computers that have 32GB or 64GB of RAM

Here is a way to ingest all files:

CREATE TABLE merge AS SELECT * from read_csv_auto('*/*/*.csv');

a quick glance of the table generated

SUMMARIZE SELECT * from merge;

and finally, to export to a single csv file

COPY merge TO 'export.csv' (HEADER, DELIMITER ',');
Answered By: Cyril Chaboisseau
Categories: Answers Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.