How to preserve order of appearance when counting words in a word list file

I have two files:

  • file1 contains a list of unique words
  • file2 contains several sentences

I want to output a tab-separated file with the number of occurrences in file2 of each word listed in file1, while preserving the order in which the words are listed in file1.

For example:

  • file1:
    dog
    apple
    cat
    
  • file2:
    the dog played with the cat and the cat was white.
    the boy ate the apple.
    
  • Desired output:
    dog 1
    apple 1
    cat 2
    

I tried existing answers in the community, but they all sort the output.

Asked By: M.A.G

||

Using any POSIX awk in any shell on every Unix box:

$ cat tst.awk
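BEGIN { OFS="\t" }                  # tab-separated output
NR==FNR {                           # first file (file1): the word list
    words[NR] = $1                  # store each word under its line number
    next
}
{                                   # second file (file2): the text
    $0 = " " $0 " "                 # pad so the first and last words are delimited too
    gsub(/[^[:alpha:]]+/,"  ")      # 2 spaces per separator so adjacent words do not share one
    for ( i in words ) {
        word = words[i]
        cnts[word] += gsub(" "word" ","&")   # gsub() returns the number of matches
    }
}
END {
    for ( i=1; i in words; i++ ) {  # numeric loop preserves file1 order
        word = words[i]
        print word, cnts[word]+0    # +0 prints 0 for words that never occurred
    }
}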

$ awk -f tst.awk file1 file2
dog     1
apple   1
cat     2

The above assumes that "words" consist only of alphabetic characters, that you want the matches to be case-sensitive (or that the input is all lower case, as in your example), and that the words in file1 are unique, as in your example.
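
For comparison, here is a minimal Python sketch of the same idea under the same assumptions (alphabetic words, case-sensitive matching, unique words in file1). It is not part of the awk answer: the function name count_listed_words is hypothetical, the inputs are inlined instead of being read from file1 and file2, and the ASCII-only pattern [A-Za-z]+ is a simplification of awk's locale-aware [:alpha:] class.

```python
import re
from collections import Counter

def count_listed_words(listed_words, sentences):
    """Count whole-word occurrences of each listed word, keeping list order."""
    counts = Counter()
    for line in sentences:
        # Every run of ASCII letters is one word; anything else is a separator
        counts.update(re.findall(r"[A-Za-z]+", line))
    # Iterate over the list (not the counter) so the file1 order is preserved;
    # a Counter returns 0 for words that never occurred
    return [(word, counts[word]) for word in listed_words]

# Inlined stand-ins for file1 and file2 from the question
words = ["dog", "apple", "cat"]
text = [
    "the dog played with the cat and the cat was white.",
    "the boy ate the apple.",
]

for word, n in count_listed_words(words, text):
    print(f"{word}\t{n}")
```

The order is kept because the final loop walks the word list itself and only looks each count up, which is the same reason the awk END block loops over numeric indices rather than using for ( i in words ).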

Answered By: Ed Morton

Using Raku (formerly known as Perl_6)

Below is a general solution for matching lines of one file (saved as @a array) against lines of a second file (saved as @b array), counting occurrences (i.e. Bagging):

raku -e 'my  @a =  dir(test => "alphabet.txt").IO.lines.reverse; my @b = $*ARGFILES.lines;  
         for @a -> $a {@b.grep(/<$a>/).Bag.pairs.say};'  alphabet.txt alphabet.txt

In constructing @a, Raku is given a dir() location and a test => "…" filename. In constructing @b, one-or-more files are entered on the command line, and read off via Raku’s $*ARGFILES dynamic variable.

General Input is alphabet.txt, which holds one letter per line and is reversed immediately upon reading into Raku, so that the array is in "z".."a" order.

General Output (when two copies of "a".."z" alphabet.txt are entered on the command-line):

(z => 2)
(y => 2)
(x => 2)
(w => 2)
(v => 2)
(u => 2)
(t => 2)
(s => 2)
(r => 2)
(q => 2)
(p => 2)
(o => 2)
(n => 2)
(m => 2)
(l => 2)
(k => 2)
(j => 2)
(i => 2)
(h => 2)
(g => 2)
(f => 2)
(e => 2)
(d => 2)
(c => 2)
(b => 2)
(a => 2)

Note how the return stays in the same order as the @a array, and how Raku doesn’t require a sort call to produce the output above.

Finally, to solve the OP’s issue, the only change needed in the code above is to use my @b = $*ARGFILES.lines.words instead of my @b = $*ARGFILES.lines.

[To obtain tab-separated output, use .put instead of .say in the code above. This drops the surrounding parens and the => arrow between the two columns.]

Final Code:

~$ raku -e 'my @a = dir(test => "dog_apple_cat.txt").IO.lines.grep(*.chars);  
            my @b = $*ARGFILES.lines.words; for @a -> $a {  
            @b.grep(/<$a>/).Bag.pairs.put};' text.txt
dog 1
apple.  1
cat 2
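
The "apple." key in the output above appears to come from splitting the text on whitespace only, which leaves trailing punctuation attached to the word. The following Python sketch (not Raku) illustrates that effect and one possible way around it; the helper name tokens_without_punctuation is hypothetical, and stripping ASCII punctuation from both ends of each token is an assumption about what counts as a "word".

```python
import string
from collections import Counter

def tokens_without_punctuation(line):
    """Split on whitespace, then strip ASCII punctuation from both ends of each token."""
    return [token.strip(string.punctuation) for token in line.split()]

line = "the boy ate the apple."

# Whitespace splitting alone keeps the full stop attached to the last word
raw_bag = Counter(line.split())

# Stripping punctuation first lets the listed word match the token exactly
clean_bag = Counter(tokens_without_punctuation(line))

print(raw_bag["apple"], raw_bag["apple."], clean_bag["apple"])
```

With exact token matching after this kind of cleanup, the counted key is the listed word itself rather than whichever token happened to contain it.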

https://docs.raku.org/type/Bag
https://raku.org

Answered By: jubilatious1