Can some command or script subtract lines in one file from another faster than grep?

I have a shell script that runs regularly and the following part of it causes a slowdown.

grep -v -f RemoveTheseGoodIPs.txt FromTheseShadyIPs.txt > RemainingBadIPs.txt

It works. It just takes 156 seconds to give me an output. I’m hoping to figure out a quicker way to do this that is still easy to understand and elegant.

For context: The "FromTheseShadyIPs.txt" is a list of 200k shady IP addresses and the "RemoveTheseGoodIPs.txt" is a whitelist file of 3k IPs that are good. Ultimately, I’m producing a blacklist that my external firewall can reference but I don’t want my 3k good IPs in my blacklist. If it helps, the order of the IPs in either file doesn’t matter and they’re already deduplicated within each file. Processing server runs Debian 9 on decent specs.

Asked By: Darius Dauer

||

Try adding the -F option. With it, grep does no regexp processing and treats each pattern as a literal string.

Answered By: Anthony De Vellis

There are more important issues with your script than execution speed. It will also produce false matches in 2 ways:

  1. Regexp vs String: You’re using a regexp comparison when you should be using a string comparison. As written, each . in an IP address from RemoveTheseGoodIPs.txt will match any character in FromTheseShadyIPs.txt, and
  2. Partial vs Full: You’re using a partial-line comparison when you should be using a full-line comparison. As written, a shorter IP address in RemoveTheseGoodIPs.txt will match any IP address in FromTheseShadyIPs.txt that contains it.

Given that, your current script is almost certainly removing IP addresses from FromTheseShadyIPs.txt that are not present in RemoveTheseGoodIPs.txt, thereby effectively breaking your firewall.

For example, if RemoveTheseGoodIPs.txt contained 1.2.3.4 and FromTheseShadyIPs.txt contained 911.253.456.789, then your grep would remove that 2nd IP address (the regexp 1.2.3.4 matches the substring 1.253.4 inside it) because you’re doing a partial-line regexp match instead of the full-line string match which you need:

$ head RemoveTheseGoodIPs.txt FromTheseShadyIPs.txt
==> RemoveTheseGoodIPs.txt <==
1.2.3.4

==> FromTheseShadyIPs.txt <==
9.8.7.6
911.253.456.789
6.7.8.9

$ grep -v -f RemoveTheseGoodIPs.txt FromTheseShadyIPs.txt
9.8.7.6
6.7.8.9

You should be using

$ grep -vFxf RemoveTheseGoodIPs.txt FromTheseShadyIPs.txt
9.8.7.6
911.253.456.789
6.7.8.9

to make your script work. That’s -F for string instead of regexp comparison, and -x for full-line instead of partial comparison. That will probably also be faster than your current script but the far more important difference is it’ll work robustly.
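To see what each of the two options contributes on its own, here is a minimal sketch. The file names and sample lines are made up for illustration (1x2x3x4 is a deliberately malformed entry, chosen only because it trips the regexp dots), so treat it as a demonstration rather than a test of your real lists:

```shell
#!/bin/sh
# Hypothetical sample data, not the real lists from the question.
tmp=$(mktemp -d)
printf '%s\n' '1.2.3.4' > "$tmp/good.txt"
printf '%s\n' '1.2.3.4' '1.2.3.45' '1x2x3x4' '9.8.7.6' > "$tmp/shady.txt"

# -F only: dots are literal, but 1.2.3.4 is still found inside 1.2.3.45,
# so that longer address is wrongly removed too.
f_only=$(grep -vF -f "$tmp/good.txt" "$tmp/shady.txt")

# -x only: the whole line must match, but each dot is still a regexp
# wildcard, so the malformed line 1x2x3x4 is wrongly removed too.
x_only=$(grep -vx -f "$tmp/good.txt" "$tmp/shady.txt")

# -F and -x together: only the exact line 1.2.3.4 is removed.
both=$(grep -vFx -f "$tmp/good.txt" "$tmp/shady.txt")

printf 'with -F only:\n%s\n\n' "$f_only"
printf 'with -x only:\n%s\n\n' "$x_only"
printf 'with -F and -x:\n%s\n' "$both"

rm -rf "$tmp"
```

With -F alone the output keeps 1x2x3x4 and 9.8.7.6, with -x alone it keeps 1.2.3.45 and 9.8.7.6, and only the combination keeps all three lines that are not in the whitelist.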

If your grep doesn’t support any of those options and you can’t get a version that does then use the following with any awk instead:

$ awk 'NR==FNR{a[$0]; next} !($0 in a)' RemoveTheseGoodIPs.txt FromTheseShadyIPs.txt
9.8.7.6
911.253.456.789
6.7.8.9

As @Paul_Pedant mentions in a comment, using awk may be faster than grep for this anyway, depending on your grep and awk implementations.
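Before swapping one command for the other, it may be worth confirming that both give the same result on your data. The sketch below does that on a small made-up sample (the temp-file names and sample lines are assumptions, not the real lists), and the comments spell out how the awk one-liner works:

```shell
#!/bin/sh
# Hypothetical sample data standing in for the two real files.
tmp=$(mktemp -d)
printf '%s\n' '1.2.3.4' '5.6.7.8' > "$tmp/good.txt"
printf '%s\n' '9.8.7.6' '1.2.3.4' '911.253.456.789' '5.6.7.8' '6.7.8.9' > "$tmp/shady.txt"

# Full-line, fixed-string removal with grep.
grep -vFxf "$tmp/good.txt" "$tmp/shady.txt" > "$tmp/by_grep.txt"

# Same job with awk:
#   NR==FNR is true only while the first file is being read, so each
#   whitelist line becomes a key of array a and "next" skips to the next line.
#   For the second file, !($0 in a) prints a line only if it is not a key.
awk 'NR==FNR{a[$0]; next} !($0 in a)' "$tmp/good.txt" "$tmp/shady.txt" > "$tmp/by_awk.txt"

# cmp -s is silent and exits 0 only if the two files are byte-identical.
if cmp -s "$tmp/by_grep.txt" "$tmp/by_awk.txt"; then
    same=yes
else
    same=no
fi
echo "outputs identical: $same"

remaining=$(cat "$tmp/by_awk.txt")
rm -rf "$tmp"
```

Prefixing each of the two commands with time on your real files would show which one is faster on your system. One caveat with the awk idiom: if the whitelist file is empty, NR==FNR stays true while the second file is read, so nothing is printed, whereas grep with an empty pattern file prints every line.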

Answered By: Ed Morton