syntax to delete lines

I use this command to remove from 1.txt every line that also appears in 2.txt:

awk 'NR==FNR{a[$0]=1;next}!a[$0]' 2.txt 1.txt  > lines.txt

My lines are in this format:

email@email.com:something

If a line is identical in both files, it is left out of lines.txt, which is what I want. BUT I also want to delete a line whenever the email@email.com part is identical, ignoring whatever comes after the :.

Asked By: Abracadabra

Use this:

awk -F: 'NR==FNR{a[$1]=1;next}!a[$1]' 2.txt 1.txt > lines.txt

-F: sets the field separator to : (colon); the script then uses only the first field ($1) for the comparison.
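
To see the effect, here is a minimal sketch with made-up data (the addresses and file contents are hypothetical, not taken from the question):

```shell
# Work in a scratch directory so existing 1.txt/2.txt files are not overwritten
cd "$(mktemp -d)" || exit 1

# Hypothetical sample data in the email:something format
printf '%s\n' 'alice@example.com:foo' 'bob@example.com:bar' 'carol@example.com:baz' > 1.txt
printf '%s\n' 'bob@example.com:different' > 2.txt

# Compare on the first colon-separated field only
awk -F: 'NR==FNR{a[$1]=1;next}!a[$1]' 2.txt 1.txt > lines.txt

cat lines.txt
# alice@example.com:foo
# carol@example.com:baz
```

The bob@example.com line is dropped even though the text after the colon differs between the two files; the other lines are printed unchanged.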

Answered By: aviro

You need to:

  1. Tell awk you’re using : as the field separator, and
  2. Use a field as the array index instead of the whole line, and
  3. Test for the index being present in the array rather than testing its value

i.e. do this:

awk -F':' 'NR==FNR{a[$1]; next} !($1 in a)' 2.txt 1.txt  > lines.txt

When you do NR==FNR{a[$1]=1; next} !a[$1], you’re filling up memory with 1s unnecessarily while reading 2.txt to populate a[]. Then, while reading 1.txt, you’re adding every $1 from that file to a[], because merely referencing a["foo"] creates an entry in a[] indexed by "foo". That eats up (usually a lot) more memory unnecessarily, which slows your script down and could cause it to fail if that second file is large enough.
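
The growth of a[] can be observed directly. This is only a sketch with made-up keys (k1 to k5 and the variable names by_value and by_index are hypothetical); it counts the entries left in the array at the end of each variant:

```shell
# Work in a scratch directory so existing 1.txt/2.txt files are not overwritten
cd "$(mktemp -d)" || exit 1

# Hypothetical data: 2 keys to exclude, 5 lines to filter
printf '%s\n' 'k1:x' 'k2:x' > 2.txt
printf '%s\n' 'k1:a' 'k3:b' 'k4:c' 'k5:d' 'k2:e' > 1.txt

# The empty action {} suppresses the default print, so only the count is output.
# Entries are counted with a for-in loop, which works in any POSIX awk.

# Value test: referencing a[$1] creates an entry for every key not yet present
by_value=$(awk -F: 'NR==FNR{a[$1]=1; next} !a[$1]{} END{for (k in a) n++; print n+0}' 2.txt 1.txt)

# Membership test: "in" looks the key up without creating it
by_index=$(awk -F: 'NR==FNR{a[$1]; next} !($1 in a){} END{for (k in a) n++; print n+0}' 2.txt 1.txt)

echo "entries after value test: $by_value"   # 5
echo "entries after in test:    $by_index"   # 2
```

With the value test the array ends up holding every distinct key from both files; with the membership test it holds only the keys from 2.txt.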

With this type of problem, the first file usually has far fewer values than the second one. To give you an idea of the time difference between the two approaches, let’s say you want to print the lines from file2 that are, or are not, among the lines from file1, where file1 has 1,000 lines and file2 has 10 million. We can create the input with:

$ awk 'BEGIN{for (i=1; i<=1000; i++) print "foo"i}' > file1
$ awk 'BEGIN{for (i=1; i<=10000000; i++) print "foo"i}' > file2

and then test for printing the lines from file2 that are in file1:

$ time awk 'NR==FNR{a[$0]=1;next}a[$0]' file1 file2 >/dev/null

real    0m4.279s
user    0m3.375s
sys     0m0.796s

$ time awk 'NR==FNR{a[$0];next}$0 in a' file1 file2 >/dev/null

real    0m1.453s
user    0m1.343s
sys     0m0.046s

and test for printing the lines from file2 that are not in file1:

$ time awk 'NR==FNR{a[$0]=1;next}!a[$0]' file1 file2 >/dev/null

real    0m5.549s
user    0m4.828s
sys     0m0.656s

$ time awk 'NR==FNR{a[$0];next}!($0 in a)' file1 file2 >/dev/null

real    0m2.701s
user    0m2.640s
sys     0m0.000s
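
The timings only matter if both variants print the same lines. A scaled-down sketch to confirm that for the "not in file1" case (the sizes 10 and 1000 and the output names out_value and out_in are arbitrary choices for illustration):

```shell
# Work in a scratch directory so existing file1/file2 are not overwritten
cd "$(mktemp -d)" || exit 1

# Same generators as the benchmark input, but much smaller
awk 'BEGIN{for (i=1; i<=10; i++) print "foo"i}' > file1
awk 'BEGIN{for (i=1; i<=1000; i++) print "foo"i}' > file2

# Lines from file2 that are not in file1, once per variant
awk 'NR==FNR{a[$0]=1;next}!a[$0]' file1 file2 > out_value
awk 'NR==FNR{a[$0];next}!($0 in a)' file1 file2 > out_in

# cmp -s is silent and succeeds only if the two files are byte-identical
cmp -s out_value out_in && echo "same output"

# 1000 lines minus the 10 that also appear in file1 leaves 990
wc -l < out_in
```
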

Answered By: Ed Morton