Merge branch 'main' of github.com:ocrmypdf/OCRmyPDF

ocrmypdf · Jul 9, 2024 · 51c618e
2 parents 4dde378 + d544342
commit 51c618e
Showing 1 changed file with 53 additions and 0 deletions.
diff --git a/docs/advanced.rst b/docs/advanced.rst
@@ -228,6 +228,59 @@ then run ocrmypdf as follows (along with any other desired arguments):
    Some combinations of control parameters will break Tesseract or break
    assumptions that OCRmyPDF makes about Tesseract's output.
 
+Changing page segmentation mode
+-------------------------------
+
+The option ``--tesseract-pagesegmode Nmode`` passes the page segmentation mode
+``Nmode`` (an ID from the table below) to Tesseract OCR. The default is 3.
+
+Changing the page segmentation mode can improve OCR results when you know that a
+PDF ought to be analyzed in a particular way, such as a PDF whose pages each contain
+only a single line of text. For the vast majority of users, changing the page
+segmentation mode will only make things worse.
+
+As of June 2024, the Tesseract page segmentation modes are:
+
++-----+----------------------------------------------------------------------+
+| ID  | Description                                                          |
++=====+======================================================================+
+|  0  | Orientation and script detection (OSD) only.                         |
++-----+----------------------------------------------------------------------+
+|  1  | Automatic page segmentation with OSD.                                |
++-----+----------------------------------------------------------------------+
+|  2  | Automatic page segmentation, but no OSD, or OCR. (not implemented)   |
++-----+----------------------------------------------------------------------+
+|  3  | Fully automatic page segmentation, but no OSD. (Default)             |
++-----+----------------------------------------------------------------------+
+|  4  | Assume a single column of text of variable sizes.                    |
++-----+----------------------------------------------------------------------+
+|  5  | Assume a single uniform block of vertically aligned text.            |
++-----+----------------------------------------------------------------------+
+|  6  | Assume a single uniform block of text.                               |
++-----+----------------------------------------------------------------------+
+|  7  | Treat the image as a single text line.                               |
++-----+----------------------------------------------------------------------+
+|  8  | Treat the image as a single word.                                    |
++-----+----------------------------------------------------------------------+
+|  9  | Treat the image as a single word in a circle.                        |
++-----+----------------------------------------------------------------------+
+| 10  | Treat the image as a single character.                               |
++-----+----------------------------------------------------------------------+
+| 11  | Sparse text. Find as much text as possible in no particular order.   |
++-----+----------------------------------------------------------------------+
+| 12  | Sparse text with OSD.                                                |
++-----+----------------------------------------------------------------------+
+| 13  | Raw line. Treat the image as a single text line, bypassing hacks     |
+|     | that are Tesseract-specific.                                         |
++-----+----------------------------------------------------------------------+
+
+Modes 0, 1, and 12 enable orientation and script detection (OSD) and are not compatible
+with OCRmyPDF, which performs OSD in a separate step from OCR; mode 2 is not implemented
+in Tesseract. Use of these modes may interfere with ``--rotate-pages`` and other features.
+
+It is currently not possible to use advanced Tesseract OCR features, such as creating
+OCR information, when using Tesseract through OCRmyPDF.
+
 Changing the PDF renderer
 =========================