Commit

Merge branch 'main' of github.com:ocrmypdf/OCRmyPDF
jbarlow83 committed Jul 9, 2024
2 parents 4dde378 + d544342 commit 51c618e
Showing 1 changed file with 53 additions and 0 deletions.
53 changes: 53 additions & 0 deletions docs/advanced.rst
@@ -228,6 +228,59 @@ then run ocrmypdf as follows (along with any other desired arguments):
Some combinations of control parameters will break Tesseract or break
assumptions that OCRmyPDF makes about Tesseract's output.

Changing page segmentation mode
-------------------------------

The option ``--tesseract-pagesegmode Nmode`` forwards the page segmentation mode
``Nmode`` to Tesseract OCR. The default is 3.

Changing the page segmentation mode can improve OCR results when you know that a
PDF ought to be analyzed in a particular way, for example when every page contains
only a single line of text. For the vast majority of users, changing the page
segmentation mode will only make things worse.
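
As a minimal sketch of how the option is passed, the following Python snippet
builds an ``ocrmypdf`` command line for a PDF whose pages each hold a single line
of text (mode 7). The helper name ``build_command`` and the file names are
hypothetical, introduced only for illustration. The snippet only constructs the
argument list and leaves running it to the caller.

```python
def build_command(input_pdf, output_pdf, pagesegmode=3):
    """Return an argv list that runs ocrmypdf with a page segmentation mode.

    build_command is a hypothetical helper, not part of OCRmyPDF.
    The default of 3 mirrors the default stated in the text.
    """
    return [
        "ocrmypdf",
        "--tesseract-pagesegmode", str(pagesegmode),
        input_pdf,
        output_pdf,
    ]


# Hypothetical file names; mode 7 treats each page image as a single text line.
cmd = build_command("labels.pdf", "labels_ocr.pdf", pagesegmode=7)
print(" ".join(cmd))

# To actually run it (requires ocrmypdf to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting
problems when file names contain spaces.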

As of June 2024, the Tesseract page segmentation modes are:

====  ==========================================================================
ID    Description
====  ==========================================================================
0     Orientation and script detection (OSD) only.
1     Automatic page segmentation with OSD.
2     Automatic page segmentation, but no OSD, or OCR. (not implemented)
3     Fully automatic page segmentation, but no OSD. (Default)
4     Assume a single column of text of variable sizes.
5     Assume a single uniform block of vertically aligned text.
6     Assume a single uniform block of text.
7     Treat the image as a single text line.
8     Treat the image as a single word.
9     Treat the image as a single word in a circle.
10    Treat the image as a single character.
11    Sparse text. Find as much text as possible in no particular order.
12    Sparse text with OSD.
13    Raw line. Treat the image as a single text line, bypassing hacks that are
      Tesseract-specific.
====  ==========================================================================

Modes 0, 1, and 12, which enable orientation and script detection (OSD), are not
compatible with OCRmyPDF, which performs OSD in a separate step from OCR. Mode 2,
which is not implemented in Tesseract, is not compatible either. Use of any of
these modes may interfere with ``--rotate-pages`` and other features.
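
The restriction above could be enforced before invoking OCRmyPDF with a small
check such as the following sketch. The function name ``check_pagesegmode`` and
the two constants are hypothetical and not part of OCRmyPDF; the set of rejected
modes is taken directly from the paragraph above, and the valid range from the
table of mode IDs.

```python
# Modes the text above describes as incompatible with OCRmyPDF.
INCOMPATIBLE_MODES = frozenset({0, 1, 2, 12})
# Tesseract mode IDs 0 through 13, per the table of page segmentation modes.
VALID_MODES = range(14)


def check_pagesegmode(mode):
    """Return mode if it is usable with OCRmyPDF, otherwise raise ValueError.

    Hypothetical helper for illustration; OCRmyPDF does not provide it.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown page segmentation mode: {mode}")
    if mode in INCOMPATIBLE_MODES:
        raise ValueError(
            f"page segmentation mode {mode} is not compatible with OCRmyPDF"
        )
    return mode


print(check_pagesegmode(7))
```

Rejecting these modes up front gives a clear error message instead of relying on
whatever happens when ``--rotate-pages`` or other features are disrupted.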

It is currently not possible to use advanced Tesseract OCR features, such as creating
OCR information, when using Tesseract through OCRmyPDF.

Changing the PDF renderer
=========================
