Delete rows from a simplified CSV where some columns matching specific pattern

I have the following simplified CSV file (no separators or newlines embedded in fields):

ID,PDBID,FirstResidue,FirstChain,SecondResidue,SecondChain,ThirdResidue,ThirdChain,FourthResidue,FourthChain,Pattern
RZ_AUTO_505,1hmh,A22L,C,A22L,A,G21L,A,A23L,A,AA/GA Naked ribose
RZ_AUTO_506,1hmh,A22L,C,A22L,A,G114,A,A23L,A,AA/GA Naked ribose
RZ_AUTO_507,1hmh,A130,E,A90,A,G80,A,A130,A,AA/GA Naked ribose
RZ_AUTO_508,1hmh,A140,E,A90,E,G120,A,A90,A,AA/GA Naked ribose
RZ_AUTO_509,1hmh,G102,A,C103,A,G102,E,A90,E,GC/GA Single ribose
RZ_AUTO_510,1hmh,G102,A,C103,A,G120,E,A90,E,GC/GA Single ribose
RZ_AUTO_511,1hmh,G113,C,C112,C,G21L,A,A23L,A,GC/GA Single ribose
RZ_AUTO_512,1hmh,G113,C,C112,C,G114,A,A23L,A,GC/GA Single ribose
RZ_AUTO_513,1hnw,C1496,A,G1497,A,A1518,A,A1519,A,CG/AA Canonical ribose
RZ_AUTO_514,1hnw,C1496,A,G1497,A,A1519,A,A1518,A,CG/AA Canonical ribose
RZ_AUTO_515,1hnw,C221,A,U222,A,A195,A,A196,A,CU/AA Canonical ribose
RZ_AUTO_516,1hnw,C221,A,U222,A,A196,A,A195,A,CU/AA Canonical ribose

I need to remove the CSV rows if the value of FirstResidue or SecondResidue or ThirdResidue or FourthResidue doesn’t end with an integer.
The output should look something like below.

RZ_AUTO_507,1hmh,A130,E,A90,A,G80,A,A130,A,AA/GA Naked ribose
RZ_AUTO_508,1hmh,A140,E,A90,E,G120,A,A90,A,AA/GA Naked ribose
RZ_AUTO_509,1hmh,G102,A,C103,A,G102,E,A90,E,GC/GA Single ribose
RZ_AUTO_510,1hmh,G102,A,C103,A,G120,E,A90,E,GC/GA Single ribose
RZ_AUTO_513,1hnw,C1496,A,G1497,A,A1518,A,A1519,A,CG/AA Canonical ribose
RZ_AUTO_514,1hnw,C1496,A,G1497,A,A1519,A,A1518,A,CG/AA Canonical ribose
RZ_AUTO_515,1hnw,C221,A,U222,A,A195,A,A196,A,CU/AA Canonical ribose
RZ_AUTO_516,1hnw,C221,A,U222,A,A196,A,A195,A,CU/AA Canonical ribose

So I’m just wondering how to achieve this using awk. I’m using Mac OSX.

Asked By: Sri

||

You want to print only lines for which the third, fifth, seventh, and ninth fields end with a digit. In that case:

$ awk -F, '$3 ~/[[:digit:]]$/ && $5 ~/[[:digit:]]$/ && $7 ~/[[:digit:]]$/ && $9 ~ /[[:digit:]]$/' file
RZ_AUTO_507,1hmh,A130,E,A90,A,G80,A,A130,A,AA/GA Naked ribose
RZ_AUTO_508,1hmh,A140,E,A90,E,G120,A,A90,A,AA/GA Naked ribose
RZ_AUTO_509,1hmh,G102,A,C103,A,G102,E,A90,E,GC/GA Single ribose
RZ_AUTO_510,1hmh,G102,A,C103,A,G120,E,A90,E,GC/GA Single ribose
RZ_AUTO_513,1hnw,C1496,A,G1497,A,A1518,A,A1519,A,CG/AA Canonical ribose
RZ_AUTO_514,1hnw,C1496,A,G1497,A,A1519,A,A1518,A,CG/AA Canonical ribose
RZ_AUTO_515,1hnw,C221,A,U222,A,A195,A,A196,A,CU/AA Canonical ribose
RZ_AUTO_516,1hnw,C221,A,U222,A,A196,A,A195,A,CU/AA Canonical ribose

How it works

Typical awk commands consist of a condition and an action. Here we have a condition that consists of four parts. Because the action that we want is the default action (print the line), we don’t actually need to specify it. Each part of the condition looks like:

$3 ~/[[:digit:]]$/

This is true if field 3 ends in a digit. This is “and”-ed with three others, one each for fields 5, 7, and 9. If all are true, then the line is printed.

Answered By: John1024

You can also try the following Python2 solution:

#!/usr/bin/env python2
import csv, re
with open('file.txt', 'rb') as f:
    for line in csv.reader(f):
        if re.search(r'[0-9]$', line[2]) and re.search(r'[0-9]$', line[4]) and re.search(r'[0-9]$', line[6]) and re.search(r'[0-9]$', line[8]):
            print ' '.join(line)
Answered By: heemayl

Using Miller (mlr) and testing the four named fields using a regular expression:

$ mlr --csvlite filter '$FirstResidue =~ "[0-9]$" && $SecondResidue =~ "[0-9]$" && $ThirdResidue =~ "[0-9]$" && $FourthResidue =~ "[0-9]$"' file
ID,PDBID,FirstResidue,FirstChain,SecondResidue,SecondChain,ThirdResidue,ThirdChain,FourthResidue,FourthChain,Pattern
RZ_AUTO_507,1hmh,A130,E,A90,A,G80,A,A130,A,AA/GA Naked ribose
RZ_AUTO_508,1hmh,A140,E,A90,E,G120,A,A90,A,AA/GA Naked ribose
RZ_AUTO_509,1hmh,G102,A,C103,A,G102,E,A90,E,GC/GA Single ribose
RZ_AUTO_510,1hmh,G102,A,C103,A,G120,E,A90,E,GC/GA Single ribose
RZ_AUTO_513,1hnw,C1496,A,G1497,A,A1518,A,A1519,A,CG/AA Canonical ribose
RZ_AUTO_514,1hnw,C1496,A,G1497,A,A1519,A,A1518,A,CG/AA Canonical ribose
RZ_AUTO_515,1hnw,C221,A,U222,A,A195,A,A196,A,CU/AA Canonical ribose
RZ_AUTO_516,1hnw,C221,A,U222,A,A196,A,A195,A,CU/AA Canonical ribose
Answered By: Kusalananda
Categories: Answers Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.