awk, sed, or grep command to output a boolean for each word in file1, indicating whether it exists in file2
I have two files. For each word in file 1, I want to write true on the corresponding line of a new file (say, file 3) if the word exists in file 2, and false if it does not.
file 1:
a
b
c
d
file 2:
a
d
c
e
t
y
file 3:
true
false
true
true
Is there a way to do that using awk/sed/grep commands?
Assuming file2 isn’t empty and isn’t so huge that it can’t fit in memory without causing problems:
awk 'NR==FNR{a[$0]; next} {print ($0 in a ? "true" : "false")}' file2 file1
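If file2 might be empty, NR==FNR is true for every line of file1, so nothing is printed. One common workaround is to compare FILENAME with ARGV[1] instead; a minimal sketch follows. The scratch directory, the sample data, and the lookup helper name are assumptions made for illustration and are not part of the answer above.

```shell
# Hypothetical sample data in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' a b c d > "$dir/file1"
printf '%s\n' a d c e t y > "$dir/file2"
: > "$dir/empty"    # an empty stand-in for file2

# lookup LIST WORDS: print true/false for each line of WORDS.
# FILENAME == ARGV[1] is never true when LIST has no records,
# so every line of WORDS falls through to the print rule.
lookup() {
  awk 'FILENAME == ARGV[1] { seen[$0]; next }
       { print (($0 in seen) ? "true" : "false") }' "$1" "$2"
}

lookup "$dir/file2" "$dir/file1" > "$dir/file3"   # results go to file3
cat "$dir/file3"
lookup "$dir/empty" "$dir/file1"                  # prints false four times
```

This sketch assumes the two arguments are different paths; passing the same file twice would treat both reads as the lookup list.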
Read man grep and man bash, and do something like this (UNTESTED):
for pat in $(cat "file 1") ; do
ans="false"
grep --quiet "^$pat$" "file 2" &&
ans="true"
echo -e "$pat\t$ans" >>"file 3"
done
$ perl -le '
# construct a partial regex from the first filename argument
my $re = join("|", split /\n+/, do { local(@ARGV,$/) = shift; <> });
# complete and pre-compile the regex
$re = qr/\b(?:$re)\b/;
# read and process stdin and/or remaining filename args
while(<>) {
print /$re/ ? "true" : "false"
}' file2 file1
true
false
true
true
This perl script reads in the entire file indicated by the first argument (file2) and constructs a single regular expression that matches the individual words on each line of the file. Each word in the regex is separated by the | alternation character. It assumes that the file contains one word per line and that lines are separated by one or more newline characters (which has the useful side-effect of ignoring blank lines).
NOTE: Each word of file2 will be interpreted as a regular expression. If you want them to be interpreted as fixed strings, change the my $re ... line to:
my $re = join("|", map { quotemeta $_ } split /\n+/, do { local(@ARGV,$/) = shift; <> });
The quotemeta function "quotes" all regexp metacharacters in a string so that they lose their special meaning and are treated as literal characters. See perldoc -f quotemeta. The map function causes the { quotemeta $_ } block to be applied to each element of the list returned by split. See perldoc -f map and perldoc -f split.
BTW, do { local(@ARGV,$/) = shift; <> } is a fairly common perl idiom for "slurping" a file, i.e. reading in an entire file at once. There are many other methods to do the same thing, including modules like File::Slurper, but this is simple and portable and doesn’t require any library modules to be used or installed.
The script then uses the qr quoting operator to pre-compile the regex to improve performance (recompiling the same regex on every pass through the loop would be an enormous waste of CPU time). Word-boundary markers, \b, are used to prevent partial matches, and ?: is used to prevent capture of matches (which would just waste time since we only need to detect that a match occurred and don’t need to use the matches for anything). See perldoc -f qr and man perlre.
The regex is case-sensitive, but could be made case-insensitive with the i regex modifier:
$re = qr/\b(?:$re)\b/i;
The script then reads any remaining input and, for each line of input, prints "true" if the line matches the regex and "false" otherwise. Because it uses while(<>), it will read data from stdin and/or any filename arguments. In this example, it reads the input from file1.
Memory usage is proportional to the size of the first file: the more words in file2, the more RAM it will use. Run-time is, of course, proportional to the size of the first file and all other input.
Convert your second file into a sed program:
$ sort -u file2 | sed 's:.*:s/^&$/true/; t:'
s/^a$/true/; t
s/^c$/true/; t
s/^d$/true/; t
s/^e$/true/; t
s/^t$/true/; t
s/^y$/true/; t
This sed program replaces the lines from file2 with the string true. The script immediately skips to the end using t if a substitution is performed; otherwise, the next substitution is attempted.
For this to work, the input in file2 obviously can’t contain anything that could be interpreted as a regular expression. It also can’t contain the character /.
Executing this on the first file, and adding a last substitution that defaults to replacing the line with the string false, we get:
$ sort -u file2 | sed 's:.*:s/^&$/true/; t:' | sed -f /dev/stdin -e 's/.*/false/' file1
true
false
true
true
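The restriction on regular-expression characters and / in file2 can be relaxed by backslash-escaping those characters before building the sed program. The following is a minimal sketch of that idea, not part of the answer above: the extra escaping step and the sample words are assumptions chosen to exercise ., / and *, and it relies on basic regular expression syntax and on sed -f /dev/stdin just as the original pipeline does.

```shell
# Hypothetical data: file2 holds words containing regex characters and a slash.
dir=$(mktemp -d)
printf '%s\n' 'a.c' 'x/y' 'd*' > "$dir/file2"
printf '%s\n' 'abc' 'a.c' 'x/y' 'd*' 'dd' > "$dir/file1"

result=$(
  sort -u "$dir/file2" |
  # Put a backslash before ] [ \ . * ^ $ and / so each is matched literally.
  sed 's|[][\.*^$/]|\\&|g' |
  # Turn each escaped word into "s/^word$/true/; t", as in the original.
  sed 's:.*:s/^&$/true/; t:' |
  # Run the generated program on file1; unmatched lines become false.
  sed -f /dev/stdin -e 's/.*/false/' "$dir/file1"
)
printf '%s\n' "$result"
```

With the escaping in place, abc no longer matches the word a.c, and dd does not match d*.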
tmp=$(mktemp)
comm -2 <(sort file1) <(sort file2) \
| sed -e 's/^\t.*/true/;t
c false' > "$tmp"
paste <(cat -n < "$tmp") \
<(cat -n file1 | sort -bk2) \
| sort -bk3,3n | cut -f2
Output:-
true
false
true
true
Notes:
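- First, we sort both files and run them through comm with the second column (lines unique to file2) suppressed.
- Next, sed marks the tab-indented lines (those common to both files) as true and all other lines as false.
- Finally, to recover the original order, we paste that output alongside the numbered, sorted input and sort by the original line numbers.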
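To make the first two steps easier to follow, here is a small sketch of what comm -2 emits for the sample data. It uses temporary files instead of process substitution so that it also runs in a plain POSIX shell; the scratch directory and variable names are assumptions for illustration.

```shell
# Hypothetical scratch copies of the sample files from the question.
dir=$(mktemp -d)
printf '%s\n' a b c d > "$dir/file1"
printf '%s\n' a d c e t y > "$dir/file2"
sort "$dir/file1" > "$dir/sorted1"
sort "$dir/file2" > "$dir/sorted2"

# -2 hides the column of lines found only in file2. Lines only in file1
# start at column one; lines common to both files are indented by one tab.
marked=$(comm -2 "$dir/sorted1" "$dir/sorted2")

# Make the leading tab visible for display.
tab=$(printf '\t')
printf '%s\n' "$marked" | sed "s/^$tab/<TAB>/"
```

The leading tab is what the sed step keys on: tab-indented lines become true, and the remaining line (b) becomes false.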
$ for i in $(<file1.txt); do grep -oq "$i" file2.txt && echo true || echo false; done
true
false
true
true
$ xargs -I{} sh -c 'grep -oq "{}" file2.txt && echo true || echo false' < file1.txt
true
false
true
true
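One caveat with the two grep commands above: grep treats each word as a pattern and reports a match anywhere in a line, so a word that is merely a substring of a longer word in file2.txt also yields true. A minimal sketch of the difference, using grep's -x (whole line) and -F (fixed string) options, is below; the one-word file and the variable names are assumptions for illustration.

```shell
# Hypothetical one-word list.
dir=$(mktemp -d)
printf '%s\n' banana > "$dir/file2.txt"

# Pattern match, as in the commands above: "an" occurs inside "banana".
if grep -q -- an "$dir/file2.txt"; then substring=true; else substring=false; fi

# -x requires the whole line to match; -F treats the word as a fixed string.
if grep -qxF -- an "$dir/file2.txt"; then whole_line=true; else whole_line=false; fi

echo "substring: $substring"
echo "whole line: $whole_line"
```

For the sample data in the question both forms give the same answers, so the stricter options only matter when the words can overlap or contain regex characters.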
#!/bin/bash
find_in_array() {
local string=$1
shift
for element in "$@"; do [[ "$element" == "$string" ]] && return 0; done
return 1
}
readarray -t arr1 <file1.txt
readarray -t arr2 <file2.txt
for str in "${arr1[@]}"; do
find_in_array "$str" "${arr2[@]}" && echo true || echo false
done
output
true
false
true
true