How to preserve order of appearance when counting words in a word list file
I have two files:
file1 contains a list of unique words
file2 contains several sentences
I want to output a tab-separated file with the number of occurrences in file2 of each word listed in file1, while preserving the order in which the words are listed in file1.
For example:
file1:
dog
apple
cat
file2:
the dog played with the cat and the cat was white. the boy ate the apple.
Desired output:
dog 1
apple 1
cat 2
I tried existing answers in the community, but they all sort the output.
Using any POSIX awk in any shell on every Unix box:
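$ cat tst.awk
BEGIN { OFS="\t" }
# First file: remember each word at its line number so order is preserved.
NR==FNR {
    words[NR] = $1
    next
}
# Second file: turn every non-alphabetic run into two spaces so that
# adjacent words do not share a delimiter, then count " word " matches.
{
    $0 = " " $0 " "
    gsub(/[^[:alpha:]]+/,"  ")
    for ( i in words ) {
        word = words[i]
        cnts[word] += gsub(" "word" ","&")
    }
}
# Print in file1 order; +0 turns an unset count into 0.
END {
    for ( i=1; i in words; i++ ) {
        word = words[i]
        print word, cnts[word]+0
    }
}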
$ awk -f tst.awk file1 file2
dog 1
apple 1
cat 2
The above assumes that every "word" consists only of alphabetic characters, that the words in file1 are unique (as in your example), and that matching should be case-sensitive (or that the input is all lower case, as in your example).
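For comparison, here is a minimal Python sketch of the same idea under the same assumptions (alphabetic words, case-sensitive matching, unique list entries). The function name `count_in_order` and the inline sample data are only illustrative, and `[A-Za-z]+` stands in for awk's locale-dependent `[[:alpha:]]`. Order is preserved simply because the result is built by walking the word list, not the counts.

```python
import re
from collections import Counter

def count_in_order(words, text):
    """Return (word, count) pairs in the order the words are listed."""
    # Assumption: a "word" is a maximal run of ASCII letters, matched case-sensitively.
    tokens = re.findall(r"[A-Za-z]+", text)
    counts = Counter(tokens)
    # Iterate over the word list (not the counter) so the list order is kept;
    # a Counter returns 0 for words that never occur.
    return [(w, counts[w]) for w in words]

# Sample data copied from the question (stand-ins for file1 and file2).
words = ["dog", "apple", "cat"]
text = "the dog played with the cat and the cat was white. the boy ate the apple."

for word, n in count_in_order(words, text):
    print(f"{word}\t{n}")
```

Reading the real files would only change how `words` and `text` are filled, for example one list entry per line of file1.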
Using Raku (formerly known as Perl_6)
Below is a general solution for matching lines of one file (saved as the @a array) against lines of a second file (saved as the @b array), counting occurrences (i.e. Bagging):
raku -e 'my @a = dir(test => "alphabet.txt").IO.lines.reverse; my @b = $*ARGFILES.lines;
for @a -> $a {@b.grep(/<$a>/).Bag.pairs.say};' alphabet.txt alphabet.txt
In constructing @a, Raku is given a dir() location and a test => "…" filename. In constructing @b, one or more files are entered on the command line and read via Raku's $*ARGFILES dynamic variable.
General Input is alphabet.txt, one letter per line, reversed immediately upon reading into Raku to place the array in "z".."a" order.
General Output (when two copies of the "a".."z" alphabet.txt are entered on the command line):
(z => 2)
(y => 2)
(x => 2)
(w => 2)
(v => 2)
(u => 2)
(t => 2)
(s => 2)
(r => 2)
(q => 2)
(p => 2)
(o => 2)
(n => 2)
(m => 2)
(l => 2)
(k => 2)
(j => 2)
(i => 2)
(h => 2)
(g => 2)
(f => 2)
(e => 2)
(d => 2)
(c => 2)
(b => 2)
(a => 2)
Note how the output stays in the same order as the @a array, and how Raku doesn't require a sort call to produce the output above.
Finally, to solve the OP's issue, all that has to change in the code above is to use my @b = $*ARGFILES.lines.words instead of my @b = $*ARGFILES.lines.
[To obtain tab-separated output, use .put instead of .say in the code above. This drops the surrounding parens and the => arrow between the two columns.]
Final Code:
~$ raku -e 'my @a = dir(test => "dog_apple_cat.txt").IO.lines.grep(*.chars);
my @b = $*ARGFILES.lines.words; for @a -> $a {
@b.grep(/<$a>/).Bag.pairs.put};' text.txt
dog 1
apple. 1
cat 2
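The `apple.` key in the output above appears because the text is split on whitespace only, so the trailing period stays attached to the token that matches. The Python sketch below illustrates that effect and one possible remedy, stripping punctuation before matching. It is not a translation of the Raku code: the helper `bag_matches` is hypothetical, and a plain substring test stands in for the regex match.

```python
import string
from collections import Counter

# Sample data copied from the question.
words = ["dog", "apple", "cat"]
text = "the dog played with the cat and the cat was white. the boy ate the apple."

def bag_matches(word, tokens):
    """Count each distinct token that contains `word` as a substring."""
    return Counter(tok for tok in tokens if word in tok)

raw_tokens = text.split()  # whitespace split keeps "apple." intact
# Assumption: stripping leading/trailing ASCII punctuation is enough here.
clean_tokens = [tok.strip(string.punctuation) for tok in raw_tokens]

# Walk the word list so rows come out in list order.
raw_rows = [(tok, n) for w in words for tok, n in bag_matches(w, raw_tokens).items()]
clean_rows = [(tok, n) for w in words for tok, n in bag_matches(w, clean_tokens).items()]

for tok, n in clean_rows:
    print(f"{tok}\t{n}")
```

Because the match is a substring test, a listed word such as cat would also count tokens like "cats"; exact matching would need an equality test instead.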