Replace a column in file with look up table from another file
I have a tab delimited file with a column that has multiple values separated by comma where i wanted to replace values from a look up table
Lookup_file:
ID Name
g_00001 g_00001
g_00002 cytA
g_00003 g_00003
g_00004 mntB
g_00005 recF
g_00006 gyaN
g_00007 traR
g_00008 g_00008
g_00009 g_00009
g_00010 hypE
input_file:
Name Start Stop Strand Number of Genes Genes
op00001 1544 5454 + 2 cytA, g_00001
op00002 7026 12012 + 2 recF, mntB
op00003 15215 16854 - 3 g_00010,cytA, g_00009
op00004 19856 25454 - 2 hypE, g_00020
op00005 20791 23568 + 2 gyaN, g_00005
Output file:
Name Start Stop Strand Number of Genes Genes
op00001 1544 5454 + 2 g_00002, g_00001
op00002 7026 12012 + 2 g_00005, g_00004
op00003 15215 16854 - 3 g_00010, g_00002, g_00009
op00004 19856 25454 - 2 g_00010, g_00020
op00005 20791 23568 + 2 g_00006, g_00005
Based on some examples here , i have tried the following code
awk -F';' 'NR==FNR{a[$2]=$1;next}{$6=a[$1]}1' lookup input
and its not changing anything.
Another way is by trying one by one using sed -i ‘s/cytA/g_00002/’ and create sed files for each line and run it in loop is what i thought of but i wanted to check if there is better way of doing it.
$ cat tst.awk
BEGIN { FS=OFS="t" }
NR == FNR {
map[$2] = $1
next
}
{
n = split($NF,g,/[[:space:]]*,[[:space:]]*/)
out = ""
for ( i=1; i<=n; i++ ) {
out = (i>1 ? out ", " : "") (g[i] in map ? map[g[i]] : g[i])
}
$NF = out
print
}
$ awk -f tst.awk lookup_file input_file
Name Start Stop Strand Number of Genes Genes
op00001 1544 5454 + 2 g_00002, g_00001
op00002 7026 12012 + 2 g_00005, g_00004
op00003 15215 16854 - 3 g_00010, g_00002, g_00009
op00004 19856 25454 - 2 g_00010, g_00020
op00005 20791 23568 + 2 g_00006, g_00005
Those "multiple values separated by comma" are separated by comma and (in most but not all of the cases) spaces which doesn’t make it easier to handle them. Try adapting the field separator and operating on every gene as a single field:
awk -F"[, t]*" '
NR==FNR {a[$2] = $1
next
}
{for (i=6; i<=NF; i++) if ($i in a) sub($i, a[$i])
}
1
' OFS="t" Lookup_file input_file
Name Start Stop Strand Number of Genes Genes
op00001 1544 5454 + 2 g_00002, g_00001
op00002 7026 12012 + 2 g_00005, g_00004
op00003 15215 16854 - 3 g_00010,g_00002, g_00009
op00004 19856 25454 - 2 g_00010, g_00020
op00005 20791 23568 + 2 g_00006, g_00005