## You have a PDF, you need data
## You've tried Tabula
## Next, try `pdftotext`
### Step 1: `pipenv shell --three`
### Step 2: `pip install pdftotext` (you may need to run `brew install poppler` first)
### Step 3: `pdftotext /Users/slamm/Downloads/EvalsWithViolationsRpt_Co_ID_6776.pdf`
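To run the same conversion from a script, here is a minimal Python sketch. It assumes the `pdftotext` executable (the one Poppler provides) is on your PATH; the function names `pdf_to_text` and `default_output_path` are made up for illustration. When no output file is given, as in Step 3, pdftotext writes a `.txt` file next to the PDF; passing `-` as the output file sends the text to stdout instead.

```python
import shutil
import subprocess
from pathlib import Path


def default_output_path(pdf_path):
    """Where pdftotext writes when no output file is given.

    For a path ending in .pdf, the extension is swapped for .txt.
    """
    return Path(pdf_path).with_suffix(".txt")


def pdf_to_text(pdf_path, extra_args=()):
    """Run the pdftotext CLI and return the extracted text as a string.

    extra_args is an optional sequence of command-line options,
    for example ["-layout"].
    """
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH; install poppler first")
    # "-" as the output file tells pdftotext to write to stdout.
    cmd = ["pdftotext", *extra_args, str(pdf_path), "-"]
    result = subprocess.run(cmd, check=True, capture_output=True)
    # pdftotext emits UTF-8 by default (see the -enc option).
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `pdf_to_text("report.pdf", ["-layout"])` would return the text of a hypothetical `report.pdf` with its physical layout preserved as far as possible.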
### For options: https://www.systutorials.com/docs/linux/man/1-pdftotext/
-f number
Specifies the first page to convert.
-l number
Specifies the last page to convert.
-r number
Specifies the resolution, in DPI. The default is 72 DPI.
-x number
Specifies the x-coordinate of the crop area's top-left corner.
-y number
Specifies the y-coordinate of the crop area's top-left corner.
-W number
Specifies the width of the crop area in pixels (default is 0).
-H number
Specifies the height of the crop area in pixels (default is 0).
-layout
Maintain (as best as possible) the original physical layout of the text. The default is to 'undo' physical layout (columns, hyphenation, etc.) and output the text in reading order.
-fixed number
Assume fixed-pitch (or tabular) text, with the specified character width (in points). This forces physical layout mode.
-raw
Keep the text in content stream order. This is a hack which often "undoes" column formatting, etc. Use of raw mode is no longer recommended.
-htmlmeta
Generate a simple HTML file, including the meta information. This simply wraps the text in <pre> and </pre> and prepends the meta headers.
-bbox
Generate an XHTML file containing bounding box information for each word in the file.
-enc encoding-name
Sets the encoding to use for text output. This defaults to "UTF-8".
-listenc
Lists the available encodings.
-eol unix | dos | mac
Sets the end-of-line convention to use for text output.
-nopgbrk
Don't insert page breaks (form feed characters) between pages.
-opw password
Specify the owner password for the PDF file. Providing this will bypass all security restrictions.
-upw password
Specify the user password for the PDF file.
-q
Don't print any messages or errors.
-v
Print copyright and version information.
-h
Print usage information. (-help and --help are equivalent.)
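
The options combine on one command line, and they go before the PDF path. A rough sketch of assembling them in Python follows; `build_args` and `split_pages` are hypothetical helper names, and only a handful of the flags listed above are covered. `split_pages` relies on the form feed characters that pdftotext puts between pages unless `-nopgbrk` is given.

```python
def build_args(first_page=None, last_page=None, layout=False,
               crop=None, no_page_breaks=False):
    """Translate keyword arguments into a list of pdftotext options.

    crop, if given, is an (x, y, width, height) tuple in pixels.
    """
    args = []
    if first_page is not None:
        args += ["-f", str(first_page)]
    if last_page is not None:
        args += ["-l", str(last_page)]
    if crop is not None:
        x, y, width, height = crop
        args += ["-x", str(x), "-y", str(y), "-W", str(width), "-H", str(height)]
    if layout:
        args.append("-layout")
    if no_page_breaks:
        args.append("-nopgbrk")
    return args


def split_pages(text):
    """Split pdftotext output into pages on the form feed character."""
    pages = text.split("\f")
    # A trailing form feed, if present, leaves an empty last element.
    if pages and pages[-1] == "":
        pages.pop()
    return pages
```

For example, `["pdftotext", *build_args(first_page=1, last_page=2, layout=True), "report.pdf", "-"]` is an argument list that would convert the first two pages of a placeholder `report.pdf` in layout mode and send the text to stdout.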