How can I grep in PDF files?

Is there a way to search PDF files using grep, without converting to text first in Ubuntu?

Asked By: Dervin Thunk


You could pipe it through strings first:

strings file.pdf | grep <...etc...>
Answered By: Andy Smith

If you have poppler-utils installed (default on Ubuntu Desktop), you could “convert” it on the fly and pipe it to grep:

pdftotext my.pdf - | grep 'pattern'

This won’t create a .txt file.

Answered By: wag

gpdf might be what you need if you’re using Gnome! Check this in case you’re not using Gnome. It’s got a list of CLI pdf viewers. Then you can use grep to find some pattern.

Answered By: Dharmit

No.

A PDF consists of chunks of data: some are text, some are pictures, and some are really magical fancy XYZ (e.g. .u3d files). Those chunks are usually compressed (e.g. with Flate; see http://www.verypdf.com/pdfinfoeditor/compression.htm). In order to ‘grep’ a .pdf you have to reverse the compression, i.e. extract the text.

You can do that either per file, with a tool such as pdftotext, and grep the result, or you can run an ‘indexer’ (look at xapian.org or Lucene) that builds a searchable index out of your .pdf files; you then use the indexer’s search tools to query the content of the PDFs.

But no, you cannot grep PDF files and hope for reliable answers without extracting the text first.
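
A minimal sketch of why the raw bytes do not match. It uses gzip as a stand-in for the Flate compression inside a PDF (both are based on deflate, but the container differs, so this is an illustration and not a real PDF stream). The sample text and variable names are made up for the example.

```shell
# Sample "page" text containing the word we want to find.
text=$(printf 'The quick brown fox\nfinds the needle here\n')

# Search the compressed bytes directly (-a treats binary as text).
# grep -c exits non-zero on zero matches, hence the "|| true".
raw_hits=$(printf '%s\n' "$text" | gzip -c | grep -a -c 'needle' || true)

# Search again after reversing the compression.
decoded_hits=$(printf '%s\n' "$text" | gzip -c | gzip -dc | grep -c 'needle' || true)

echo "matches in compressed bytes: $raw_hits"
echo "matches after decompression: $decoded_hits"
```

The same text is present in both cases; only the decompressed form contains the literal bytes that grep looks for.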

Answered By: akira

Install the package pdfgrep, then use the command:

find /path -iname '*.pdf' -exec pdfgrep pattern {} +

——————

Simplest way to do that:

pdfgrep 'pattern' *.pdf
pdfgrep 'pattern' file.pdf 
Answered By: enzotib

Try this:

find /path -iname '*.pdf' -print0 | while IFS= read -r -d '' i; do echo "$i";
    pdftotext "$i" - | grep pattern; done

This prints each file name, followed by the lines in that PDF that match the pattern.

Answered By: harish.venkat

Recoll can search PDFs. It doesn’t support regular expressions, but it has lots of other search options, so it might fit your needs.

Answered By: user39336

Take a look at the common resource grep tool crgrep which supports searching within PDF files.

It also allows searching other resources like content nested in archives, database tables, image meta-data, POM file dependencies and web resources – and combinations of these including recursive search.

Answered By: Craig

There is a duplicate question on StackOverflow. The people there suggest a variation of harish.venkat’s answer:

find /path -name '*.pdf' -exec sh -c 'pdftotext "$1" - | grep --with-filename --label="$1" --color "your pattern"' sh {} \;

The advantage over the similar answer here is the --with-filename flag for grep. This is somewhat superior to pdfgrep as well, because the standard grep has more features.

https://stackoverflow.com/questions/4643438/how-to-search-contents-of-multiple-pdf-files
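
The same idea can be wrapped in a small function, sketched here under a few assumptions: pdftotext is on the PATH, grep is GNU grep (for --label), and the function name search_pdfs and its arguments are made up for this example. It batches files with `-exec ... {} +` so that one shell handles many PDFs.

```shell
# search_pdfs PATTERN [DIR]
# Hypothetical helper: extract text from every PDF under DIR (default: .)
# and print matching lines prefixed with the PDF's file name.
search_pdfs() {
  pattern=$1
  dir=${2:-.}
  find "$dir" -type f -iname '*.pdf' -exec sh -c '
    pattern=$1
    shift
    for f in "$@"; do
      # --label makes grep report the PDF name instead of "(standard input)".
      pdftotext "$f" - 2>/dev/null | grep -H --label="$f" -e "$pattern"
    done
    # Exit 0 so a non-matching last file does not make find report failure.
    exit 0
  ' sh "$pattern" {} +
}
```

Usage would look like `search_pdfs 'your pattern' /path`. Passing the file names as arguments, instead of embedding {} in the script text, keeps names with quotes or spaces from being interpreted by the shell.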

Answered By: user7610

cd to the folder containing your PDF file and then run:

pdfgrep 'pattern' your.pdf

or, if you want to search in more than one PDF file (e.g. in all PDF files in your folder), let the shell expand the wildcard. There is no need to wrap it in ls, which breaks on file names containing spaces:

pdfgrep 'pattern' *.pdf
Answered By: Rasmuss Rall

pdfgrep was written for exactly this purpose and is available in Ubuntu.

It tries to be mostly compatible with grep and thus provides “the power of grep”, only specialized for PDFs. That includes common grep options, such as --recursive, --ignore-case or --color.

In contrast to pdftotext | grep, pdfgrep can output the page number of a match in a performant way and is generally faster when it doesn’t have to search the whole document (e.g. --max-count or --quiet).

The basic usage is:

pdfgrep PATTERN FILE...

where PATTERN is your search string and FILE is a list of filenames (or wildcards expanded by the shell).

See the manpage for more information.
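
For comparison, here is a rough sketch of recovering page numbers from the pdftotext | grep approach. It relies on pdftotext separating pages with a form feed character by default; the function name grep_pages is hypothetical, and the pattern is treated as an awk regular expression rather than a grep one.

```shell
# grep_pages PATTERN
# Hypothetical helper: reads pdftotext output on stdin and prints
# "PAGE:LINE" for every line that matches PATTERN (an awk regex).
grep_pages() {
  PATTERN=$1 awk '
    BEGIN { page = 1; pat = ENVIRON["PATTERN"] }
    {
      # A form feed inside a line marks the start of the next page.
      n = split($0, parts, "\f")
      for (i = 1; i <= n; i++) {
        if (i > 1) page++
        if (parts[i] ~ pat) print page ":" parts[i]
      }
    }'
}
```

Usage would look like `pdftotext my.pdf - | grep_pages 'pattern'`. Unlike pdfgrep, this always converts the whole document before it can report anything.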

Answered By: hpdeifel

Here is a quick script to search the PDFs in the current directory and its subdirectories:

#!/bin/bash

if [ $# -ne 1 ]; then
  echo "usage: $0 VALUE" 1>&2
  exit 1
fi

echo 'SEARCH IS CASE SENSITIVE' 1>&2

find . -name '*.pdf' -exec /bin/bash -c 'pdftotext "$1" - | grep --with-filename --label="$1" --color "$0"' "$1" {} \;
Answered By: Nico

I assume you mean not converting it on disk. You can convert the files to stdout with pdftotext and then grep the output. Grepping a PDF without any sort of conversion is not practical, since PDF is mostly a binary format.

In the directory:

ls -1 ./*.pdf | xargs -L1 -I {} pdftotext {}  - | grep "keyword"

or in the directory and its subdirectories:

tree -fai . | grep -P "\.pdf$" | xargs -L1 -I {} pdftotext {}  - | grep "keyword"

Also, because some PDFs are scans, they need to be OCRed first. Here is a fairly simple way to find all PDFs that cannot be grepped and OCR them.

I noticed that if a PDF file doesn’t have any fonts, it is usually not searchable, so we can use pdffonts to check.

The first two lines of pdffonts output are the table header, so a searchable file produces more than two lines of output. Knowing this, we can create:

gedit check_pdf_searchable.sh

then paste this

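#!/bin/bash
#set -vx
# pdffonts prints a two-line header, so fewer than 3 lines means no fonts.
if (( $(pdffonts "$1" | wc -l) < 3 )); then
  echo "$1"
  # No fonts found: treat the file as a scan and OCR it.
  pypdfocr "$1"
fi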

then make it executable

chmod +x check_pdf_searchable.sh

then list all non-searchable pdfs in the directory:

ls -1 ./*.pdf | xargs -L1 -I {} ./check_pdf_searchable.sh {}

or in the directory and its subdirectories:

tree -fai . | grep -P "\.pdf$" | xargs -L1 -I {} ./check_pdf_searchable.sh {}
Answered By: Eduard Florinescu

If you just want to search for PDF names/properties, or simple strings that are not compressed or encoded, then instead of strings you can use the following:

grep -a STRING file.pdf
cat -v file.pdf | grep STRING

From grep --help:

      --binary-files=TYPE   assume that binary files are TYPE;
                            TYPE is 'binary', 'text', or 'without-match'
  -a, --text                equivalent to --binary-files=text

and cat --help:

  -v, --show-nonprinting   use ^ and M- notation, except for LFD and TAB
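
As a rough illustration, the sketch below builds a tiny stand-in file (not a valid PDF; the contents and the title are invented for the example) that mixes binary bytes with an uncompressed /Title entry, then pulls that entry out with grep -a. Real PDFs may store this metadata compressed, in which case this finds nothing.

```shell
# Create a throwaway file with a PDF-like header, some binary bytes,
# and an uncompressed /Title entry.
sample=$(mktemp)
printf '%%PDF-1.4\n\000\001binary\n/Title (Quarterly Report)\n' > "$sample"

# -a forces grep to treat the binary file as text; -o prints only the match.
title=$(grep -a -o '/Title ([^)]*)' "$sample")
echo "$title"

rm -f "$sample"
```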
Answered By: phuclv

The quickest way, which only finds text stored uncompressed in the PDF, is:

grep -rinw "pattern" --include='*.pdf' .
Answered By: Parth

pdfgrep -r --include "*.pdf" -i 'pattern'
Answered By: Gavin Gao

Put this in your .bashrc:

LESSOPEN="|/usr/bin/lesspipe %s"; export LESSOPEN    

Then you can use less:

less mypdf.pdf | grep "Hello, World"

See https://www.zeuthen.desy.de/~friebel/unix/lesspipe.html for more about this.

Answered By: user7343148

ripgrep-all (or rga) enables ripgrep functionality on multiple file types, including PDFs.

Answered By: Sjoerd