What makes grep consider a file to be binary?
I have some database dumps from a Windows system on my box. They are text files. I'm using Cygwin to grep through them. These appear to be plain text files; I open them with text editors such as Notepad and WordPad and they look legible. However, when I run grep on them, it says: Binary file foo.txt matches.
I have noticed that the files contain some ASCII NUL characters, which I believe are artifacts from the database dump.
So what makes grep consider these files to be binary? The NUL character? Is there a flag on the filesystem? What do I need to change to get grep to show me the line matches?
The file /etc/magic or /usr/share/misc/magic has a list of byte sequences that the file command uses for determining the file type.
Note that binary may just be a fallback solution. Sometimes files with strange encoding are considered binary too.
grep on Linux has some options for handling binary files, such as --binary-files or -U / --binary.
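To illustrate the --binary-files option: GNU grep accepts the values binary (the default), text, and without-match. The following is a minimal sketch, assuming GNU grep and a throwaway sample file created with mktemp; other grep implementations may behave differently.

```shell
tmp=$(mktemp)
printf 'foo\000bar\n' > "$tmp"

# text: treat the file as text and print the matching line (same as -a).
# tr strips the NUL only so the output is readable on a terminal.
grep --binary-files=text 'foo' "$tmp" | tr -d '\000'

# without-match: assume a binary file never matches (same as -I),
# so nothing is printed and grep exits with a non-zero status.
grep --binary-files=without-match 'foo' "$tmp" || echo 'no match reported'

rm -f "$tmp"
```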
If there is a NUL character anywhere in the file, grep will consider it a binary file.
A possible workaround is cat file | tr -d '\000' | yourgrep, which removes all NUL characters first and then searches through the result.
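The workaround can be wrapped in a small function. This is only a sketch: grep_without_nul is a made-up helper name, and it assumes that dropping the NUL bytes does not change the meaning of the data.

```shell
# grep_without_nul is a hypothetical helper name, not a standard tool.
# Usage: grep_without_nul PATTERN FILE
grep_without_nul() {
    # tr deletes every NUL byte (octal 000); grep then sees plain text.
    tr -d '\000' < "$2" | grep -- "$1"
}

# Demo on a throwaway file that contains a NUL byte.
tmp=$(mktemp)
printf 'foo\000bar\nbaz\n' > "$tmp"
grep_without_nul 'foo' "$tmp"   # prints: foobar
rm -f "$tmp"
```

Reading the file with a redirect instead of cat saves one process; the result is the same.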
You can use the strings utility to extract the text content from any file and then pipe it through grep, like this: strings file | grep pattern.
One of my text files was suddenly being seen as binary by grep:
$ file foo.txt
foo.txt: ISO-8859 text
The solution was to convert it using iconv:
iconv -t UTF-8 -f ISO-8859-1 foo.txt > foo_new.txt
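To see what the conversion does to the bytes, here is a small sketch. It assumes an iconv that knows the ISO-8859-1 and UTF-8 encodings, and it uses a made-up one-word sample rather than a real dump.

```shell
# Create a Latin-1 sample: byte 0xF1 (octal 361) is "ñ" in ISO-8859-1.
src=$(mktemp)
dst=$(mktemp)
printf 'ni\361o\n' > "$src"

# Convert to UTF-8; "ñ" becomes the two bytes 0xC3 0xB1.
iconv -f ISO-8859-1 -t UTF-8 "$src" > "$dst"

# Dump the converted bytes in hex to confirm the conversion.
od -An -tx1 "$dst"

rm -f "$src" "$dst"
```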
I had the same problem. I used vi -b [filename] to see the added characters. I found the control characters ^@ and ^M. Then in vi type :1,$s/^@//g to remove the ^@ characters. Repeat this command for ^M.
Warning: To get the “blue” control characters press Ctrl+v then Ctrl+M or Ctrl+@. Then save and exit vi.
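The same cleanup can be done without opening an editor. A minimal sketch, assuming the only unwanted bytes are NUL (shown as ^@ in vi) and carriage return (shown as ^M); clean_dump is a hypothetical helper name.

```shell
# clean_dump is a hypothetical helper name.
# Usage: clean_dump INPUT OUTPUT
clean_dump() {
    # Delete NUL (^@ in vi) and carriage return (^M in vi) bytes.
    tr -d '\000\r' < "$1" > "$2"
}

# Demo: show the control characters, clean them, show the result.
in=$(mktemp)
out=$(mktemp)
printf 'row1\000\r\nrow2\r\n' > "$in"
od -c "$in"      # the dump shows \0 and \r among the text
clean_dump "$in" "$out"
od -c "$out"     # only the text and \n remain
rm -f "$in" "$out"
```

Writing to a separate output file keeps the original dump intact in case the control characters turn out to matter.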
To actually answer the question “What makes grep consider a file to be binary?”, you can use iconv:
$ iconv < myfile.java
iconv: (stdin):267:70: cannot convert
In my case there were Spanish characters that showed up correctly in text editors, but grep treated the file as binary because of them; the iconv output pointed me to the line and column numbers of those characters.
In the case of NUL characters, iconv considers them normal and does not print that kind of output, so this method is not suitable for finding them.
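Since iconv stays silent about NUL characters, a separate check is needed for them. One possible sketch counts the NUL bytes with tr and wc; count_nul is a made-up helper name.

```shell
# count_nul is a hypothetical helper name.
# Usage: count_nul FILE
count_nul() {
    # tr -cd deletes everything except NUL bytes; wc -c counts what is left.
    # The arithmetic expansion strips any padding that wc may add.
    echo $(( $(tr -cd '\000' < "$1" | wc -c) ))
}

tmp=$(mktemp)
printf 'a\000b\000\n' > "$tmp"
count_nul "$tmp"   # prints: 2
rm -f "$tmp"
```

A result greater than zero means the file contains at least one NUL byte.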
grep -a worked for me:
$ grep --help
[...]
-a, --text equivalent to --binary-files=text
One of my students had this problem. There is a bug in grep in Cygwin. If the file has non-ASCII characters, grep and egrep see it as binary.
GNU grep 2.24 RTFS
Conclusion: two and only two cases:
- NUL, e.g.: printf 'a\0' | grep 'a'
- encoding error according to the C99 mbrlen(), e.g.:
  export LC_CTYPE='en_US.UTF-8'
  printf 'a\x80' | grep 'a'
  because \x80 cannot be the first byte of a UTF-8 Unicode code point: UTF-8 – Description | en.wikipedia.org
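The second case can be checked outside of grep as well. The sketch below is not what grep does internally; it assumes an iconv that rejects malformed input with a non-zero exit status (glibc's does), and is_valid_utf8 is a hypothetical helper name. Octal escapes are used instead of \x so that the printf calls work in any POSIX shell.

```shell
# is_valid_utf8 is a hypothetical helper name; it reads standard input.
is_valid_utf8() {
    # Converting UTF-8 to UTF-8 makes iconv fail on malformed sequences.
    iconv -f UTF-8 -t UTF-8 > /dev/null 2>&1
}

# 0xC3 0xA9 (octal 303 251) is a complete two-byte sequence.
printf 'caf\303\251\n' | is_valid_utf8 && echo 'valid'

# A lone 0x80 (octal 200) is a continuation byte with no lead byte.
printf 'a\200\n' | is_valid_utf8 || echo 'invalid'
```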
Those checks are only done up to the Nth byte of the input, where N = TODO (32KiB in one test system). If the check would fail after the Nth byte, the file is still considered a text file. (mentioned by Stéphane Chazelas).
Only up to the first buffer read
So if a NUL or encoding error occurs in the middle of a very large file, the file might still be grepped as text.
I imagine this is for performance reasons.
E.g.: this prints the line:
printf '%10000000s\n\x80a' | grep 'a'
but this does not:
printf '%10s\n\x80a' | grep 'a'
The actual buffer size depends on how the file is read. E.g. compare:
export LC_CTYPE='en_US.UTF-8'
(printf '\n\x80a') | grep 'a'
(printf '\n'; sleep 1; printf '\x80a') | grep 'a'
With the sleep, the first line gets passed to grep even though it is only 1 byte long, because the writing process goes to sleep, and the second read does not check whether the file is binary.
RTFS
git clone git://git.savannah.gnu.org/grep.git
cd grep
git checkout v2.24
Find where the message is printed:
git grep 'Binary file'
This leads us to /src/grep.c:
if (!out_quiet && (encoding_error_output
                   || (0 <= nlines_first_null && nlines_first_null < nlines)))
  {
    printf (_("Binary file %s matches\n"), filename);
If those variables were well named, we basically reached the conclusion.
encoding_error_output
Quick grepping for encoding_error_output shows that the only code path that can modify it goes through buf_has_encoding_errors:
clen = mbrlen (p, buf + size - p, &mbs);
if ((size_t) -2 <= clen)
  return true;
Then just see man mbrlen.
nlines_first_null and nlines
Initialized as:
intmax_t nlines_first_null = -1;
nlines = 0;
so when a NUL is found, 0 <= nlines_first_null becomes true.
TODO: when can nlines_first_null < nlines ever be false? I got lazy.
POSIX
Does not define any binary options (see grep – search a file for a pattern | pubs.opengroup.org), and GNU grep does not document the detection rule, so RTFS is the only way.
I also had this problem, but in my case it happened when a matched line was too long.
file myfile.txt
myfile.txt: UTF-8 Unicode text, with very long lines
grep would run through the entire file fine with many patterns, but when a pattern matched a "very long line" it stopped with Binary file myfile.txt matches.
Adding -a also solves this problem, but pre-parsing the file for NUL or other invalid characters would have no effect (there are none, otherwise grep would not complete for other patterns). In this case the offending line had 25k+ characters!
What I don’t understand is why it only happens when grep tries to return the line and not when it is processing it looking for other patterns.
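To check whether a file has lines of that size, the longest line can be measured with awk. This sketch only reports a length; it does not confirm that line length is what triggers the message, and longest_line is a made-up helper name.

```shell
# longest_line is a hypothetical helper name.
# Usage: longest_line FILE
longest_line() {
    # Track the largest length seen; "+ 0" prints 0 for an empty file.
    awk '{ if (length($0) > max) max = length($0) } END { print max + 0 }' "$1"
}

tmp=$(mktemp)
printf 'ab\nabcde\nabc\n' > "$tmp"
longest_line "$tmp"   # prints: 5
rm -f "$tmp"
```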