Replace a column in file with look up table from another file

I have a tab delimited file with a column that has multiple values separated by comma where i wanted to replace values from a look up table

Lookup_file:

  ID    Name
g_00001 g_00001
g_00002 cytA
g_00003 g_00003
g_00004 mntB
g_00005 recF
g_00006 gyaN
g_00007 traR
g_00008 g_00008
g_00009 g_00009
g_00010 hypE

input_file:

Name    Start   Stop    Strand  Number of Genes Genes
op00001 1544    5454    +   2   cytA, g_00001
op00002 7026    12012   +   2   recF, mntB
op00003 15215   16854   -   3   g_00010,cytA, g_00009
op00004 19856   25454   -   2   hypE, g_00020
op00005 20791   23568   +   2   gyaN, g_00005

Output file:

Name    Start   Stop    Strand  Number of Genes Genes
op00001 1544    5454    +   2   g_00002, g_00001
op00002 7026    12012   +   2   g_00005, g_00004
op00003 15215   16854   -   3   g_00010, g_00002, g_00009
op00004 19856   25454   -   2   g_00010, g_00020
op00005 20791   23568   +   2   g_00006, g_00005

Based on some examples here , i have tried the following code

awk -F';' 'NR==FNR{a[$2]=$1;next}{$6=a[$1]}1' lookup input

and its not changing anything.

Another way is by trying one by one using sed -i ‘s/cytA/g_00002/’ and create sed files for each line and run it in loop is what i thought of but i wanted to check if there is better way of doing it.

Asked By: Empyrean rocks

||
$ cat tst.awk
BEGIN { FS=OFS="t" }
NR == FNR {
    map[$2] = $1
    next
}
{
    n = split($NF,g,/[[:space:]]*,[[:space:]]*/)
    out = ""
    for ( i=1; i<=n; i++ ) {
        out = (i>1 ? out ", " : "") (g[i] in map ? map[g[i]] : g[i])
    }
    $NF = out
    print
}

$ awk -f tst.awk lookup_file input_file
Name    Start   Stop    Strand  Number of Genes Genes
op00001 1544    5454    +       2       g_00002, g_00001
op00002 7026    12012   +       2       g_00005, g_00004
op00003 15215   16854   -       3       g_00010, g_00002, g_00009
op00004 19856   25454   -       2       g_00010, g_00020
op00005 20791   23568   +       2       g_00006, g_00005
Answered By: Ed Morton

Those "multiple values separated by comma" are separated by comma and (in most but not all of the cases) spaces which doesn’t make it easier to handle them. Try adapting the field separator and operating on every gene as a single field:

awk -F"[, t]*" '
NR==FNR         {a[$2] = $1
                 next
                }
                {for (i=6; i<=NF; i++)  if ($i in a) sub($i, a[$i])
                }
1
' OFS="t" Lookup_file input_file
Name    Start   Stop    Strand  Number of Genes Genes
op00001 1544    5454    +   2   g_00002, g_00001
op00002 7026    12012   +   2   g_00005, g_00004
op00003 15215   16854   -   3   g_00010,g_00002, g_00009
op00004 19856   25454   -   2   g_00010, g_00020
op00005 20791   23568   +   2   g_00006, g_00005
Answered By: RudiC
Categories: Answers Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.