Remove all words that appear n+ times without deleting the lines or changing the file order

I have a text file that consists of sections, each with two header lines and a body line consisting of space-separated words. An example looks as follows:

Shares for DED-SHD-ED-1:
    [--- Listable Shares ---]
        backup      backup2
Shares for DED-SHD-ED-2:
    [--- Listable Shares ---]
        ConsoleSetup        REMINST     SCCMContentLib$     SCCMContentLibC$        SEFPKGC$        SEFPKGD$        SEFPKGE$        SEFSIG$     Source      UpdateServicesPackages      WsusContent     backup      backup2
Shares for DED-SHD-BE-03:
    [--- Listable Shares ---]
        backup      backup2     print$

I want to delete from the body lines all words that occur three or more times.

  • I want to delete all occurrences, not just "all occurrences after the first two".
  • The tokens to be matched are the space-separated "words", e.g. print$ as a whole and not just the alphanumeric part print.
  • The matching should only apply to "whole words", i.e. no partial string matches. E.g. any occurrence of backup would only count towards removal of backup, but not of backup2.
  • The header lines (Shares for ... and [--- Listable Shares ---]) are not to be considered.

The desired output for the above input would look as follows:

Shares for DED-SHD-ED-1:
    [--- Listable Shares ---]
                
Shares for DED-SHD-ED-2:
    [--- Listable Shares ---]
        ConsoleSetup        REMINST     SCCMContentLib$     SCCMContentLibC$        SEFPKGC$        SEFPKGD$        SEFPKGE$        SEFSIG$     Source      UpdateServicesPackages      WsusContent
Shares for DED-SHD-BE-03:
    [--- Listable Shares ---]
                        print$

As you can see, only the words backup and backup2 are removed as they occur in the body lines of three sections. However, Shares, for and Listable in the section header lines remain untouched because the header lines should not be considered for editing.

Some notes:

  • The files have sizes in the range of 100kB to 1MB.
  • I have found solutions like awk '++A[$0] < 3', but that keeps the first two occurrences and looks at entire lines (see the short demo after this list).
  • I am not looking specifically for an Awk-based solution, anything (except Perl ;)) is fine.
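
For example, with a made-up input this is what that line-based approach does: it keeps the first two occurrences of each whole line rather than dropping every occurrence of a repeated word:

$ printf 'backup\nbackup\nbackup\nbackup2\n' | awk '++A[$0] < 3'
backup
backup
backup2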
Asked By: pwrsheller


As the requirement covers files of up to 1MB, there are a couple of array inversions in this to improve efficiency. Because we are deleting words, I did not think it was important to retain exact spacing, so each word in a rebuilt row is just preceded by a TAB.

It is a Bash script containing a single shell function, which itself contains a single awk program. It takes one input file argument, and outputs to stdout.

I am not sure how you might want to verify the results. I had plenty of debug during development: it would be easy (for example) to log the deleted words to stderr, with their frequency (a sketch of that follows the script).

#! /bin/bash

delByFreq () {

    local Awk='
BEGIN { SEP = "|"; Freq = 3; }
#.. Store every input line.
{ Line[NR] = $0; }
#.. Do not look for words on header lines.
/^Shares for / { next; }
/--- Listable Shares ---/ { next; }

#.. Keep an index to row/column of every unique word.
#.. So like: Ref ["backup2"] = "|2|3|5|1|5|7";
function Refer (row, txt, Local, f) {
    for (f = 1; f <= NF; ++f)
        Ref[$(f)] = Ref[$(f)] SEP row SEP f;
}
{ Refer( NR, $0); }

#.. Rearrange field indexes by line.
#.. So like: Del[row] = "|3|7|11"; for field numbers.
function refByLine (Local, word, j, n, V) {
    for (word in Ref) {
        n = split (Ref[word], V, SEP);
        if (n <= 2 * Freq) continue;
        for (j = 2; j < n; j += 2)
            Del[V[j]] = Del[V[j]] SEP (V[j+1]);
    }
}
#.. For every line with deletions, cross off the frequent words.
function Deletions (Local, row, j, f, n, V, X) {
    for (row in Del) {
        split (Del[row], V, SEP);
        split ("", X, FS); for (j = 2; j in V; ++j) X[V[j]];
        #.. Rebuild the line in field order. 
        split (Line[row], V, FS); Line[row] = "";
        for (j = 1; j in V; ++j)
            if (! (j in X)) Line[row] = Line[row] "\t" V[j];
    }
}
function Output (Local, r) {
    for (r = 1; r in Line; ++r) printf ("%s\n", Line[r]);
}
END { refByLine( ); Deletions( ); Output( ); }
'
    awk -f <( printf '%s' "${Awk}" ) "${1}"
}

    delByFreq "${1}"
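
As an example of the debug mentioned above, here is a sketch of how refByLine() could report each removed word and its count on stderr (the message wording is mine, not part of Paul's script; writing to /dev/stderr works in GNU awk and on systems that expose it as a device file):

function refByLine (Local, word, j, n, V) {
    for (word in Ref) {
        n = split (Ref[word], V, SEP);
        if (n <= 2 * Freq) continue;
        #.. Debug: report each word being removed, with its frequency.
        printf ("deleting %s (%d occurrences)\n", word, (n - 1) / 2) > "/dev/stderr";
        for (j = 2; j < n; j += 2)
            Del[V[j]] = Del[V[j]] SEP (V[j+1]);
    }
}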

Answered By: Paul_Pedant

Using GNU awk for the 4th arg to split() to save the strings that match the FS so we can have the same spacing in the output as was present in the input:
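
As a quick aside, here is a rough illustration of what gawk puts into that seps array (the sample line is made up; the layout follows the gawk manual's description of split()):

$ echo '    a   b' | gawk '{
      n = split($0, flds, FS, seps)
      printf "seps[0]=<%s>\n", seps[0]
      for (i = 1; i <= n; i++) printf "flds[%d]=<%s>  seps[%d]=<%s>\n", i, flds[i], i, seps[i]
  }'
seps[0]=<    >
flds[1]=<a>  seps[1]=<   >
flds[2]=<b>  seps[2]=<>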

$ cat tst.awk
{ begFld = 1 }                                     # by default, count/edit every field
/^Shares for/ { begFld = 3 }                       # leave the words "Shares" and "for" alone
/\[--- Listable Shares ---\]/ { begFld = NF+1 }    # leave this header line alone entirely
NR == FNR {                                        # first pass over the file: just count
    for ( i=begFld; i<=NF; i++ ) {
        cnt[$i]++
    }
    next
}
{
    split($0,unused,FS,seps)
    out = seps[0]
    for ( i=1; i<=NF; i++ ) {
        out = out ( (i >= begFld) && (cnt[$i] >= 3) ? "" : $i ) seps[i]
    }
    print out
}

$ awk -f tst.awk file file
Shares for DED-SHD-ED-1:
    [--- Listable Shares ---]

Shares for DED-SHD-ED-2:
    [--- Listable Shares ---]
        ConsoleSetup        REMINST     SCCMContentLib$     SCCMContentLibC$        SEFPKGC$        SEFPKGD$        SEFPKGE$        SEFSIG$     Source      UpdateServicesPackages      WsusContent
Shares for DED-SHD-BE-03:
    [--- Listable Shares ---]
                   print$

You can do the same in any awk by using a while ( match(...) ) loop instead of the split(...) plus for (...) approach; it'd just be a couple more lines of code, e.g. this will work in any awk:

$ cat tst.awk
{ begFld = 1 }
/^Shares for/ { begFld = 3 }
/\[--- Listable Shares ---\]/ { begFld = NF+1 }
NR == FNR {
    for ( i=begFld; i<=NF; i++ ) {
        cnt[$i]++
    }
    next
}
{
    i = 0
    out = ""
    while ( match($0,/[^ \t]+/) ) {
        sep = substr($0,1,RSTART-1)
        fld = substr($0,RSTART,RLENGTH)
        out = out sep ( (++i >= begFld) && (cnt[fld] >= 3) ? "" : fld )
        $0 = substr($0,RSTART+RLENGTH)
    }
    print out $0
}

$ awk -f tst.awk file file
Shares for DED-SHD-ED-1:
    [--- Listable Shares ---]

Shares for DED-SHD-ED-2:
    [--- Listable Shares ---]
        ConsoleSetup        REMINST     SCCMContentLib$     SCCMContentLibC$        SEFPKGC$        SEFPKGD$        SEFPKGE$        SEFSIG$     Source      UpdateServicesPackages      WsusContent
Shares for DED-SHD-BE-03:
    [--- Listable Shares ---]
                   print$

EDIT: @Paul_Pedant and I were chatting in the comments about the pros/cons of reading the input into an array and then processing it in the END section, as his script does, vs. reading the input file twice, as my script above does, so I put mine in a shell script and added a bash shebang:

#!/usr/bin/env bash

awk '
    { begFld = 1 }
    ...
        print out
    }
' "$1" "$1"

then created an input file that's 1 million copies of the OP's 9-line input file by doing this:

$ awk '{r=(NR>1 ? r ORS : "") $0} END{for (i=1; i<=1000000; i++) print r}' file > file1m

then timed executing my script on it:

$ time ./tst_ed.sh file1m > ed.out

real    1m3.814s
user    0m57.781s
sys     0m0.265s

but when I tried to run Paul's script on it:

$ time ./tst_paul.sh file1m > paul.out

my laptop started to sound like a helicopter taking off so after 5 minutes I interrupted it then waited about 3 more minutes for my laptop to settle down again.

I then tried both on a 100k reps file:

$ awk '{r=(NR>1 ? r ORS : "") $0} END{for (i=1; i<=100000; i++) print r}' file > file100k

$ time ./tst_ed.sh file100k > ed.out                                            
real    0m6.035s
user    0m5.875s
sys     0m0.031s

$ time ./tst_paul.sh file100k > paul.out

but I again had to eventually interrupt Paul's (I gave this one 10 minutes).

I then tried a 10k reps file:

$ awk '{r=(NR>1 ? r ORS : "") $0} END{for (i=1; i<=10000; i++) print r}' file > file10k

$ time ./tst_ed.sh file10k > ed.out                                             
real    0m0.783s
user    0m0.609s
sys     0m0.045s

$ time ./tst_paul.sh file10k > paul.out

real    0m1.039s
user    0m0.921s
sys     0m0.031s

This time I got output from both, so I ran diff -b on them and discovered that the output is different:

$ diff -b ed.out paul.out |head
1c1
< Shares for
---
> Shares for DED-SHD-ED-1:
4c4
< Shares for
---
> Shares for DED-SHD-ED-2:
7c7
< Shares for

Mine removes the duplicate values at the end of the Shares for ... lines while Paul's doesn't (my script still counts and edits the hostname field on those lines, whereas Paul's skips the header lines entirely). I don't know which would be the desired behavior for the OP or if it even matters; it may just be unrealistic input.

I then tried 1k reps:

$ awk '{r=(NR>1 ? r ORS : "") $0} END{for (i=1; i<=1000; i++) print r}' file > file1k

$ time ./tst_ed.sh file1k > ed.out

real    0m0.133s
user    0m0.077s
sys     0m0.015s

$ time ./tst_paul.sh file1k > paul.out

real    0m0.133s
user    0m0.046s
sys     0m0.046s

and 100 reps:

$ awk '{r=(NR>1 ? r ORS : "") $0} END{for (i=1; i<=100; i++) print r}' file > file100

$ time ./tst_ed.sh file100 > ed.out

real    0m0.080s
user    0m0.000s
sys     0m0.015s

$ time ./tst_paul.sh file100 > paul.out

real    0m0.081s
user    0m0.000s
sys     0m0.000s

so it looks like for around 1k or fewer repetitions of the OP's data (i.e. up to about a 10k-line input file), whether you store the data in memory and parse it in the END section or read the input file twice is about a wash regarding execution speed (once you're into tenths of a second of execution time, who cares?). At about 10k repetitions (about 100k input lines) the read-twice approach is a bit faster, but both are still fast at around 1 second of execution time. Once you get to input files larger than that, though, you really don't want to be trying to store the whole thing in memory.
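
Stripped down to the two shapes being compared, the contrast is roughly this (a sketch only: it ignores the header-line handling and spacing preservation of the real scripts above, and the variable names are mine):

# Read the file twice: count words on the first pass, rebuild lines on the second.
# Memory use grows only with the number of distinct words.
awk '
    NR == FNR { for (i=1; i<=NF; i++) cnt[$i]++; next }
    { out = ""
      for (i=1; i<=NF; i++) if (cnt[$i] < 3) out = (out == "" ? "" : out OFS) $i
      print out }
' file file

# Read the file once: buffer every line while counting, then rebuild in END.
# Memory use grows with the size of the whole file.
awk '
    { line[NR] = $0; for (i=1; i<=NF; i++) cnt[$i]++ }
    END {
        for (r=1; r<=NR; r++) {
            n = split(line[r], f)
            out = ""
            for (i=1; i<=n; i++) if (cnt[f[i]] < 3) out = (out == "" ? "" : out OFS) f[i]
            print out
        }
    }
' file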

Answered By: Ed Morton