Searching PDF Documents on Debian with PDFGrep

I want to search several PDF documents for a specific word from the console on Debian Wheezy. The tool I will use for this is PDFGrep.

PDFGrep

PDFGrep is a command-line tool for searching one or more PDF documents. Besides individual files, you can also point it at a directory containing many PDF documents, which it then searches recursively (-r). The search term can be a single word or a pattern (an extended regular expression). Matches can be highlighted in color, the page on which a match was found can be printed, and case can be ignored. PDFGrep works much like the familiar grep, but it operates on pages rather than on lines.

Installation

aptitude install pdfgrep
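
Before searching, it can be useful to confirm that the installation worked. The following is a minimal sketch, not part of the original setup: have_cmd is a hypothetical helper name, and the only pdfgrep option used is --version from the manual below.

```shell
# Return success if the given command can be found on PATH.
# (have_cmd is a hypothetical helper name used only for this sketch.)
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd pdfgrep; then
    # Print the installed version, as documented under -V, --version
    pdfgrep --version
else
    echo "pdfgrep not found; install it with: aptitude install pdfgrep" >&2
fi
```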

PDFGrep manual

SYNOPSIS
pdfgrep [OPTION…] PATTERN FILE…

DESCRIPTION
Search for PATTERN in each FILE. PATTERN is an extended regular expression.

pdfgrep works much like grep, with one distinction: It operates on pages and not on lines.

OPTIONS
-i, --ignore-case
Ignore case distinctions in both the PATTERN and the input files.

-H, --with-filename
Print the file name for each match. This is the default setting when there is more than one
file to search.

-h, --no-filename
Suppress the prefixing of file name on output. This is the default setting when there is only
one file to search.

-n, --page-number
Prefix each match with the number of the page where it was found.

-c, --count
Suppress normal output. Instead print the number of matches for each input file. Note that
unlike grep, multiple matches on the same page will be counted individually.

-C, --context NUM
Print at most NUM characters of context around each match. The exact number will vary,
because pdfgrep tries to respect word boundaries. If NUM is "line", the whole line will be
printed. If this option is not set, pdfgrep tries to print lines that are not longer than the
terminal width.

--color WHEN
Surround file names, page numbers and matched text with escape sequences to display them in
color on the terminal. (The default setting is auto).

WHEN can be:

always Always use colors, even when stdout is not a terminal.

never Do not use colors.

auto Use colors only when stdout is a terminal.

-R, -r, --recursive
Recursively search all files (restricted by --include and --exclude) under each directory.

--exclude=GLOB
Skip files whose base name matches GLOB. See glob(7) for wildcards you can use. You can use
this option multiple times to exclude more patterns. It takes precedence over --include.
Note, that in- and excludes apply only to files found via --recursive and not to the argument
list.

--include=GLOB
Only search files whose base name matches GLOB. See --exclude for details. The default is
*.pdf.

--unac Remove accents and ligatures from both the search pattern and the PDF documents. This is
useful if you want to search for a word containing 'ae', but the PDF uses the single character
'æ' instead. See unac(3) and unaccent(1) for details.

[This option is experimental and only available if pdfgrep is compiled with unac support.]

-q, --quiet
Suppress all normal output to stdout. Errors will be printed and the exit codes will be
returned (see below).

--help Print a short summary of the options.

-V, --version
Show version information

ENVIRONMENT VARIABLES
The behavior of pdfgrep is affected by the following environment variable.

GREP_COLORS
Specifies the colors and other attributes used to highlight various parts of the output. The
syntax and values are like GREP_COLORS of grep. See grep(1) for more details. Currently
only the capabilities mt, ms, mc, fn, ln and se are used by pdfgrep, where mt, ms and mc have
the same effect on pdfgrep.

EXIT STATUS
Normally, the exit status is 0 if at least one match is found, 1 if no match is found and 2 if an
error occurred. But if the --quiet or -q option is used and a match was found, pdfgrep will return
0 regardless of errors.

AUTHOR
Hans-Peter Deifel <hpdeifel at gmx.de>
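
The exit status described in the manual makes pdfgrep usable in scripts. The following is a rough sketch under that description: pdf_contains is a hypothetical function name, and the mapping of 0, 1 and anything else to "match", "no match" and "error" follows the EXIT STATUS section above. Note that with -q a match returns 0 regardless of errors, so "match" does not guarantee that every file could be read.

```shell
# Report the result of a quiet pdfgrep search based on its exit status.
# Usage: pdf_contains PATTERN FILE...
# (pdf_contains is a hypothetical helper name used only for this sketch.)
pdf_contains() {
    status=0
    # -q suppresses normal output; only the exit status is of interest here
    pdfgrep -q "$@" || status=$?
    case $status in
        0) echo "match" ;;
        1) echo "no match" ;;
        *) echo "error" ;;
    esac
}

# Example call (file names taken from the examples in this post):
#   pdf_contains Taste PDF-Dok1.pdf PDF-Dok2.pdf
```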

Example 1: search for Taste in PDF-Dok1 and PDF-Dok2

pdfgrep Taste PDF-Dok1.pdf PDF-Dok2.pdf

Example 2: search for Taste in all files ending in .pdf, ignoring case (-i)

pdfgrep -i Taste *.pdf

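Example 3: search for Taste in the current directory including all subdirectories (-r), ignoring case (-i), and print the file name and the page number (-n) of each match. Since -r only descends into directories named on the command line, the starting point is . rather than *.pdf.

pdfgrep -rni Taste .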
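
If only the pages matter and not the matched text, the output can be reduced further. The following sketch assumes that pdfgrep prints one line per match in the form FILE:PAGE:TEXT when -H and -n are set; that format is an assumption and is not stated in the manual above. matching_pages is a hypothetical helper name, and the approach breaks if a file name itself contains a colon.

```shell
# Reduce "FILE:PAGE:TEXT" lines (assumed pdfgrep -Hn output format) to
# unique "FILE:PAGE" pairs. Reads from stdin.
# (matching_pages is a hypothetical helper name used only for this sketch.)
matching_pages() {
    cut -d: -f1,2 | sort -u
}

# Example call (file names taken from the examples in this post):
#   pdfgrep -Hni Taste PDF-Dok1.pdf PDF-Dok2.pdf | matching_pages
```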

Example 4: as in example 3, but only the number of matches per document is shown (-c)

pdfgrep -ric Taste .
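
With many documents, the count output gets long, and most lines may report zero matches. The following sketch assumes that -c prints one FILE:COUNT line per document, as grep does; that format is an assumption. rank_counts is a hypothetical helper name, and file names containing a colon would confuse the sort key.

```shell
# Drop "FILE:COUNT" lines with a count of 0 and sort the rest by count,
# highest first. Reads from stdin.
# (rank_counts is a hypothetical helper name used only for this sketch.)
rank_counts() {
    grep -v ':0$' | sort -t: -k2,2nr
}

# Example call:
#   pdfgrep -ric Taste . | rank_counts
```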

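Example 5: search for Taste in the current directory and all subdirectories, ignoring case, but skip files that have "abc" in their name or are called datei-taste-1.pdf (--exclude). According to the manual, excludes only apply to files found via -r and not to files named directly on the command line, so the directory . is passed instead of *.

pdfgrep -ri --exclude=datei-taste-1.pdf --exclude='*abc*' Taste .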

JARVIS

