Read nth column iteratively in two large files, cut and paste side by side to create new nth file
I have two files with around 100,000 space-separated columns each. I want to extract the nth column from each of the two files, write the two columns side by side to a separate file, and then run a second command on that file.
Example for a file with two columns.
cat test1.txt
rr1 rr2
1 2
1 2
1 1
2 1
cat test2.txt
rr1 rr2
2 2
1 1
2 1
2 2
I want to take the first column of test1.txt and the first column of test2.txt and place them side by side in a new file called out1 (and likewise for the remaining columns). This is the code I have tried so far:
awk -F' ' '{
for(i=1; i<=NF; i++){ # iterate over each column
paste -d' ' <(sed 1d test1.txt | awk -v var1="$i" '{print $var1}') <(sed 1d test2.txt | awk -v var2="$i" '{print $var2}' ) > out$i
# write to file named with column name or i
# do command2 for out$1
}
}'
However, the code produces an error.
awk: cmd. line:3: paste -d
awk: cmd. line:3: ^ unexpected newline or end of string
Expected output
cat out1
1 2
1 1
1 2
2 2
cat out2
2 2
2 1
1 1
1 2
I have spent a lot of time on this and cannot figure out how to fix this. Any help would be appreciated. Is there a better way to do this as well?
Like this, using bash, with awk to get the number of columns and the usual commands from the toolchest:
#!/bin/bash
for i in $(seq 1 $(awk '{print NF;exit}' test1.txt)); do
paste <(sed 1d test1.txt | cut -d ' ' -f"$i") \
      <(sed 1d test2.txt | cut -d ' ' -f"$i") > "out.$i"
done
or
#!/bin/bash
numcols=$(awk '{print NF;exit}' test1.txt)
for ((i=1; i<=numcols; i++)); do
paste <(sed 1d test1.txt | cut -d ' ' -f"$i") \
      <(sed 1d test2.txt | cut -d ' ' -f"$i") > "out.$i"
done
or using ksh:
#!/bin/ksh
numcols=$(awk '{print NF;exit}' test1.txt)
for i in {1..$numcols}; do
paste <(sed 1d test1.txt | cut -d ' ' -f"$i") \
      <(sed 1d test2.txt | cut -d ' ' -f"$i") > "out.$i"
done
Then:
cat out.1
cat out.2
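Putting the first variant together with the sample data from the question gives a runnable sketch (note that paste without -d separates the two columns with a tab):

```shell
#!/bin/bash
# Recreate the sample input files from the question
printf 'rr1 rr2\n1 2\n1 2\n1 1\n2 1\n' > test1.txt
printf 'rr1 rr2\n2 2\n1 1\n2 1\n2 2\n' > test2.txt
# One output file per column: out.1, out.2, ...
for i in $(seq 1 "$(awk '{print NF; exit}' test1.txt)"); do
    paste <(sed 1d test1.txt | cut -d ' ' -f"$i") \
          <(sed 1d test2.txt | cut -d ' ' -f"$i") > "out.$i"
done
cat out.1
```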
As explained in the comments, you can't mix awk and shell code the way you do. And if you are not used to shell scripting, it is worth learning the basic shell commands I use here.
Please read documentation on these basic commands:
tr
paste
sed
seq
and not so basic (used in a simple way here):
awk
Process substitution: >(command ...) or <(...) is replaced by a temporary filename. Writing to or reading from that file causes bytes to be piped to or from the command inside. It is often used in combination with file redirection: cmd1 2> >(cmd2).
See:
http://mywiki.wooledge.org/ProcessSubstitution
http://mywiki.wooledge.org/BashFAQ/024
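For example, this feeds the output of two sed commands to paste as if they were files (a minimal sketch with throwaway input files):

```shell
#!/bin/bash
# Create two small throwaway input files
printf 'header\nb\n' > ps_demo_1.txt
printf 'header\ny\n' > ps_demo_2.txt
# Each <(...) expands to a filename such as /dev/fd/63;
# paste reads from those "files", which are really pipes
# carrying the output of the two sed commands
paste -d' ' <(sed 1d ps_demo_1.txt) <(sed 1d ps_demo_2.txt)
# prints: b y
```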
The error comes from the fact that you can't use single quotes within a single-quoted string. The awk command sees the program up to paste -d as the awk program (which contains syntax errors due to being truncated), and the rest of your code, up to the next unquoted space, as the first filename to process, etc. You also can't use shell commands within an awk program.
The following pipeline uses paste to feed the two files side by side into an awk command. The awk command outputs the pairs of columns from each file to a different output file per column.
$ paste test1.txt test2.txt | awk 'NR > 1 { for (i = 1; i <= NF/2; ++i) print $i, $(NF/2+i) >("out" i) }'
$ cat out1
1 2
1 1
1 2
2 2
$ cat out2
2 2
2 1
1 1
1 2
The awk code, pretty-printed:
NR > 1 {
for (i = 1; i <= NF/2; ++i)
print $i, $(NF/2+i) > ("out" i)
}
After skipping the headers on the first line of input, this code iterates over the fields of one of the files (we assume that both files have the same number of fields and that the fields should be paired in the same order from both files), i.e. NF/2 fields, half of the total. It then prints the ith field together with the field obtained by adding NF/2 to that index, to a file named out followed by the field number i.
With a small modification, you can name your output files according to the headers in the first file (we ignore the headers in the second file and assume they are the same in the same order):
NR == 1 {
for (i = 1; i <= NF/2; ++i) head[i] = $i
next
}
{
for (i = 1; i <= NF/2; ++i)
print $i, $(NF/2+i) > head[i]
}
With the data given in the question, this would create the two files rr1 and rr2 (or overwrite them if they already exist).
As correctly pointed out in the comments below (comment now deleted), the above would likely cause a "too many open files" error for 100,000 columns with awk implementations that do not intelligently manage a pool of open file descriptors (as GNU awk does). In other awk implementations, you will need to close the output file after each print and do the output with >> (for appending) rather than with >.
This is the adapted variant of the last awk snippet from above:
NR == 1 {
for (i = 1; i <= NF/2; ++i) head[i] = $i
next
}
{
for (i = 1; i <= NF/2; ++i) {
print $i, $(NF/2+i) >> head[i]
close(head[i])
}
}
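As a self-contained sketch with the question's sample data (the rm -f matters here: since >> appends, output files left over from a previous run must be removed first):

```shell
#!/bin/bash
# Recreate the sample input files from the question
printf 'rr1 rr2\n1 2\n1 2\n1 1\n2 1\n' > test1.txt
printf 'rr1 rr2\n2 2\n1 1\n2 1\n2 2\n' > test2.txt
# >> appends, so clear the header-named files from any previous run
rm -f rr1 rr2
paste test1.txt test2.txt | awk '
NR == 1 {
    for (i = 1; i <= NF/2; ++i) head[i] = $i
    next
}
{
    for (i = 1; i <= NF/2; ++i) {
        print $i, $(NF/2+i) >> head[i]
        close(head[i])
    }
}'
```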
Assumptions:
- all files have at least one row (the header)
- all files have the same number of rows
- all files have the same number of columns
- all files can fit into memory (via an awk array)
General approach:
- with GNU awk we could use a true multi-dimensional array, but with the side effect that we'd use significantly more memory than a single-dimensional index
- store the data in a one-dimensional array with an index made up of the column number, the row number (FNR) and the file count
- in the END{...} block, loop through the array printing the data to the out{1..NF} files
Using just awk:
$ cat merge.awk
FNR==1 { fcnt++ } # keep track of number of files
FNR>1 { for (i=1; i<=NF; i++) # loop through columns
lines[i,FNR,fcnt]=$i # index = column # + row number + file count
}
END { for (i=1; i<=NF; i++) { # loop through columns
for (j=2; j<=FNR; j++) # loop through rows
for (k=1; k<=fcnt; k++) # loop through filecount
printf "%s%s", lines[i,j,k], (k<fcnt ? OFS : ORS) > ("out" i)
close ("out" i)
}
}
Running against OP’s two files:
$ awk -f merge.awk test1.txt test2.txt
$ head out?
==> out1 <==
1 2
1 1
1 2
2 2
==> out2 <==
2 2
2 1
1 1
1 2
Three new files:
$ head t?.txt
==> t1.txt <==
rr1 rr2 rr3
1 2 3
4 5 6
7 8 9
==> t2.txt <==
rr1 rr2 rr3
a b c
d e f
g h i
==> t3.txt <==
rr1 rr2 rr3
X XX XXX
Y YY YYY
Z ZZ ZZZ
Running against these three files:
$ awk -f merge.awk t1.txt t2.txt t3.txt
$ head out?
==> out1 <==
1 a X
4 d Y
7 g Z
==> out2 <==
2 b XX
5 e YY
8 h ZZ
==> out3 <==
3 c XXX
6 f YYY
9 i ZZZ
col_co=$(awk 'END{print NF}' f1.txt)
for ((i=1; i<=col_co; i++))
do
    awk -v i="$i" 'NR>1{print $i}' f1.txt >file_1.txt
    awk -v i="$i" 'NR>1{print $i}' f2.txt >file_2.txt
    paste file_1.txt file_2.txt >"out_new_$i.txt"
done
Output:
cat out_new_1.txt
1 2
1 1
1 2
2 2
cat out_new_2.txt
2 2
2 1
1 1
1 2