Add support for structure tree and marked content sections #937

dhdaines · 2023-07-19T15:47:03Z

Implements #909 with what is, IMHO, a rather convenient interface for marked content: marked content IDs (MCIDs) are listed in the structure tree, then propagated to the objects on each page.

codecov · 2023-07-19T16:13:25Z

Codecov Report

Merging #937 (42ec17e) into develop (f6887b5) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           develop      #937    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files           18        19     +1     
  Lines         1588      1716   +128     
==========================================
+ Hits          1588      1716   +128

Files Changed	Coverage Δ
pdfplumber/page.py	`100.00% <100.00%> (ø)`
pdfplumber/structure.py	`100.00% <100.00%> (ø)`

dhdaines · 2023-07-21T18:52:19Z

Note that you can link the structure tree and text from marked content sections like this:

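from collections import deque

def get_text_by_mcid(page):
    # Concatenate the text of all chars per MCID; the list index is the MCID.
    mcids = []
    for c in page.chars:
        mcid = c.get("mcid")
        if mcid is None:
            continue
        # Grow the list until index `mcid` exists.
        while len(mcids) <= mcid:
            mcids.append("")
        mcids[mcid] += c["text"]
    return mcids

def get_structure_tree_with_text(page):
    texts = get_text_by_mcid(page)
    st = page.structure_tree
    # Breadth-first walk over the tree; each element's MCIDs are
    # replaced (in place) by the corresponding text.
    d = deque(st)
    while d:
        el = d.popleft()
        if "children" in el:
            d.extend(el["children"])
        if "mcids" in el:
            el["mcids"] = [texts[mcid] for mcid in el["mcids"] if mcid < len(texts)]
    return st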

I'm not sure whether this would be helpful to have as a method on the Page class. One thing I notice is that MCIDs are not reliably aligned to word breaks: they often change in the middle of a word for no apparent reason.
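The word-break observation can be checked with a small helper. This is only a sketch: `words_spanning_mcids` is a hypothetical name, and it assumes the chars arrive in reading order, carry `text` and `mcid` keys as in the snippets above, and include whitespace characters between words.

```python
def words_spanning_mcids(chars):
    """Return (word, mcids) for each whitespace-delimited word whose
    characters carry more than one distinct MCID."""
    results = []
    word, mcids = "", []
    # A trailing whitespace sentinel flushes the last word.
    sentinel = {"text": " ", "mcid": None}
    for c in list(chars) + [sentinel]:
        if c["text"].isspace():
            if word and len(set(mcids)) > 1:
                # Keep the MCIDs in first-seen order, without repeats.
                results.append((word, list(dict.fromkeys(mcids))))
            word, mcids = "", []
        else:
            word += c["text"]
            mcids.append(c.get("mcid"))
    return results
```

Calling it as `words_spanning_mcids(page.chars)` would list the words in which the MCID changes partway through.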

dhdaines · 2023-07-21T19:10:34Z

Another helpful example, for instance if you want to get the bounding box of a Table element on a page:

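import itertools
from collections import deque

import pdfplumber

def get_tables(page):
    # Breadth-first walk over the structure tree, yielding Table elements.
    st = page.structure_tree
    d = deque(st)
    while d:
        el = d.popleft()
        if "children" in el:
            d.extend(el["children"])
        if el["type"] == "Table":
            yield el

def get_child_mcids(el):
    # Yield every MCID found in the element and its descendants.
    d = deque([el])
    while d:
        el = d.popleft()
        if "children" in el:
            d.extend(el["children"])
        if "mcids" in el:
            yield from el["mcids"]

# `page` is an already-open pdfplumber page.
t = next(get_tables(page))  # raises StopIteration if there is no Table element
mcids = set(get_child_mcids(t))
tbox = pdfplumber.utils.objects_to_bbox([
    c for c in itertools.chain(page.chars, page.images) if c.get("mcid") in mcids
])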
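To make the last step concrete, here is a minimal sketch of the bounding-box union written against plain dicts. `bbox_for_mcids` is a hypothetical helper, not part of pdfplumber; it assumes each object has the usual `x0`, `top`, `x1` and `bottom` keys and returns `None` when no object matches.

```python
def bbox_for_mcids(objects, mcids):
    """Return (x0, top, x1, bottom) enclosing every object whose MCID
    is in `mcids`, or None if no object matches."""
    matched = [o for o in objects if o.get("mcid") in mcids]
    if not matched:
        return None
    return (
        min(o["x0"] for o in matched),
        min(o["top"] for o in matched),
        max(o["x1"] for o in matched),
        max(o["bottom"] for o in matched),
    )
```

Returning `None` for the empty case is a design choice for this sketch, so the caller can tell "no tagged objects" apart from a real box.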

dhdaines · 2023-07-24T16:01:27Z

Another note: there is a small problem with this PR. There can be marked content sections that aren't referenced by the structure tree; this is specifically the case for headers and footers. The PR retains their MCIDs but nothing else, so there isn't any way to detect them. I'll add a marked_content_sections property which contains this information.
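Until such a property exists, the gap can at least be located by comparing the MCIDs seen on page objects with those referenced by the structure tree. This is a rough sketch with hypothetical helper names, assuming the tree uses the `children` and `mcids` keys shown in the earlier snippets. It only identifies the orphaned MCIDs; it cannot say what kind of content they mark.

```python
from collections import deque

def referenced_mcids(structure_tree):
    """Collect every MCID referenced anywhere in the structure tree."""
    seen = set()
    d = deque(structure_tree)
    while d:
        el = d.popleft()
        d.extend(el.get("children", []))
        seen.update(el.get("mcids", []))
    return seen

def unreferenced_mcids(objects, structure_tree):
    """Return MCIDs present on page objects but absent from the tree."""
    on_page = {o["mcid"] for o in objects if o.get("mcid") is not None}
    return on_page - referenced_mcids(structure_tree)
```

For example, `unreferenced_mcids(page.chars, page.structure_tree)` would return the MCIDs of chars that no structure element points to.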

dhdaines · 2023-08-05T02:33:30Z

Another note: because this uses pypdfium2 to get the structure tree, it introduces quite a lot of overhead, though only if you choose to access it, since you are now reading each page (at least partially) twice, once with pdfminer.six and once with pypdfium2. I will perhaps try to reimplement it with pdfminer.six. The logic of resolving the structure tree is slightly complicated, but the pdf.js implementation is a good guide.
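The "only if you choose to access it" behaviour is the usual lazy-property pattern. The sketch below illustrates that pattern in general and is not the code from this PR: `LazyPage` and its `loader` argument are hypothetical stand-ins for the page class and the expensive second parse.

```python
from functools import cached_property

class LazyPage:
    """Toy page whose structure tree is only built on first access."""

    def __init__(self, loader):
        # `loader` stands in for the expensive second read of the page.
        self._loader = loader

    @cached_property
    def structure_tree(self):
        # Runs once, on first access; the result is cached afterwards.
        return self._loader()
```

A page that never touches `structure_tree` never pays for the second parse, and repeated accesses reuse the cached result.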

jsvine · 2023-08-08T14:59:41Z

pdfplumber/page.py

+    def tag_cur_item(self, item_type: Any) -> None:
+        # Implementation Inheritance Considered Harmful
+        cur_obj = self.cur_item._objs[-1]
+        assert isinstance(cur_obj, item_type)


I'm curious: What is the bigger-picture purpose of this assert statement?

Ah, it (and the associated item_type parameter) is just there to confirm my understanding of how pdfminer.six works internally. As the comment above it notes, this depends on the internal behaviour of the render_* methods. In an ideal world they would simply return the objects they create instead of modifying some internal state.
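The pattern under discussion can be shown with toy classes. Everything here is a stand-in rather than pdfminer.six's real API: `BaseDevice` mimics a renderer that appends to internal state and returns something else, and the `mcid` assignment in `tag_cur_item` is an assumption about what the tagging step does.

```python
class Char:
    def __init__(self, text):
        self.text = text
        self.mcid = None

class Container:
    def __init__(self):
        self._objs = []

    def add(self, obj):
        self._objs.append(obj)

class BaseDevice:
    """Stand-in renderer: mutates internal state, returns the advance."""

    def __init__(self):
        self.cur_item = Container()

    def render_char(self, text):
        self.cur_item.add(Char(text))
        return 1.0  # the advance width, not the object that was created

class TaggingDevice(BaseDevice):
    def __init__(self):
        super().__init__()
        self.cur_mcid = None

    def tag_cur_item(self, item_type):
        # Relies on the parent having just appended the new object.
        cur_obj = self.cur_item._objs[-1]
        # Guards against the parent's internals changing underneath us.
        assert isinstance(cur_obj, item_type)
        cur_obj.mcid = self.cur_mcid

    def render_char(self, text):
        adv = super().render_char(text)
        self.tag_cur_item(Char)
        return adv
```

If the parent method ever stopped appending a `Char`, the assert would fail immediately instead of silently tagging the wrong object.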

jsvine · 2023-08-08T15:01:23Z

pdfplumber/page.py

+        super().render_image(*args, **kwargs)
+        self.tag_cur_item(LTImage)
+
+    def end_figure(self, *args, **kwargs) -> None:  # type: ignore


pdfplumber does not return LTFigure objects, which are somewhat confusingly named. Like LTTextBoxHorizontal (and similar), they refer not to discrete objects on the page, but rather layout-analyzed agglomerations of them. So I think we can skip handling these, but perhaps I misunderstand the intent here?

I'll have to get back to you on this one ... in the structure tree, the MCID is associated with the Figure, so the expected behaviour would be to have all the objects inside it be tagged with that MCID. This works for my PDFs (see for example https://dhdaines.github.io/serafim/?idx=1057 and associated PDF behind the "PDF" button) but as noted above I need to re-test it with more interesting figures.

jsvine · 2023-08-08T15:02:04Z

pdfplumber/page.py

+        self.tag_cur_item(LTChar)
+        return adv
+
+    def render_image(self, *args, **kwargs) -> None:  # type: ignore


What is the reason for tagging image and char objects, but not line, rect, or curve objects?

Ah, this could be an omission on my part (I didn't have a PDF with lines/rects/curves in a Figure handy to test with, but I should be able to create one myself with LibreOffice). Thanks for catching it, I'll add a test case.

Yes, indeed, this was wrong. I've fixed it to tag the lines/rects/curves with the MCID.

jsvine · 2023-08-08T15:09:51Z

Another note - because this uses pypdfium2 to get the structure tree it introduces quite a lot of overhead, though only if you choose to access it, since now you are reading (at least partially) each page twice, once with pdfminer.six and once with pypdfium2.

Thanks for flagging. As I understand it, this PR introduces two separate-but-related features:

1. Adding the mcid attribute to each parsed object, which the PR currently handles entirely through subclassing pdfminer.six's PDFPageAggregator.
2. Extracting a PDF's structure tree, which depends on pypdfium2, which introduces the overhead you mentioned, and appears to fail on Python 3.8 (something related to _ctypes.PyCArrayType but I haven't investigated closely).

Would it be possible to separate this PR into those two distinct features? Is that something you'd be open to?

dhdaines · 2023-08-08T15:12:15Z

Extracting a PDF's structure tree, which depends on pypdfium2, which introduces the overhead you mentioned, and appears to fail on Python 3.8 (something related to _ctypes.PyCArrayType but I haven't investigated closely).

Would it be possible to separate this PR into those two distinct features? Is that something you'd be open to?

Absolutely, especially since part 2 should probably be rewritten to use pdfminer.six.

Actually part 1 ought to be a PR for pdfminer.six itself but it doesn't seem likely that it could be merged anytime soon.

dhdaines · 2023-08-08T15:13:31Z

Extracting a PDF's structure tree, which depends on pypdfium2, which introduces the overhead you mentioned, and appears to fail on Python 3.8 (something related to _ctypes.PyCArrayType but I haven't investigated closely).

Ah, this can be fixed easily: it's just an issue with type-checking constructs introduced in Python 3.9, which slipped in accidentally.
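One way this class of failure is commonly avoided is sketched below. It assumes the problem came from subscripting a ctypes array type inside an annotation, which the `_ctypes.PyCArrayType` mention suggests but the thread does not confirm. `first_byte` is a hypothetical function used only for illustration.

```python
import ctypes

def first_byte(buf: "ctypes.Array[ctypes.c_ubyte]") -> int:
    # The quoted annotation is never evaluated at runtime, so older
    # interpreters do not try to subscript ctypes.Array when the
    # function is defined; type checkers still read it.
    return buf[0]
```

Usage would look like `first_byte((ctypes.c_ubyte * 3)(7, 8, 9))`; the annotation stays visible to mypy while remaining inert at import time.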

dhdaines · 2023-08-09T19:59:57Z

Closing this PR and making two new ones! (The MCID one is there already: #961.)

dhdaines force-pushed the issue-909 branch 2 times, most recently from 5049bfe to c4670ad Compare July 19, 2023 16:04

dhdaines marked this pull request as ready for review July 19, 2023 17:39

jsvine mentioned this pull request Jul 25, 2023

Incorrect extraction in tables with overlapping columns #912

Open

jsvine reviewed Aug 8, 2023


dhdaines added 15 commits August 9, 2023 14:12

feat: preliminary structure tree extractor using pypdfium2

6dc9be5

test: basic test of structure tree

2098ca3

chore: add contributor

3c0e9c2

refactor: fix imports and isorts and types, oh my

8de9fc2

docs: initial documentation for structure tree

da46833

feat: extract MCID and add it to chars

f9d01a2

fix: types

3018536

test: test mcid on extract_words

297ddac

test: disable coverage for untestable attributes

8e74285

fix: actually get the right attributes (doh)

9de3200

fix: always add mcid so it works in extra_attrs

4741d36

test: add test data

084cf40

fix: fix types, again

06025a7

fix: tagstack not needed

b58947b

fix: skip empty (not on this page) children

ac158cb

dhdaines added 10 commits August 9, 2023 14:12

fix: fix fix to fix

13b0e92

fix: remove unnecessary pdf

8bd484a

docs: note about lang and image tags

dbccaab

test: test and document alt_text and mcid on images

e595a71

docs: minimally document structure and mcid here

47010a2

fix: give default value to cur_mcid

20dc157

test: add test of structured PDF from Word 365

827726c

fix: pragma nocover no longer needed (thanks, word365)

a5de5f8

test: fix CSV tests to include/exclude mcid field

cc7a378

test: sample pdf with weird tables and stuff

0077bf0

dhdaines force-pushed the issue-909 branch from d8588b1 to 0077bf0 Compare August 9, 2023 18:12

dhdaines added 3 commits August 9, 2023 14:48

fix: ctypes/mypy/py3.8 errors

2fd9f80

fix: really fix py38 (hardcoded venv/ in makefile! argh!)

b774531

fix: put mcids on lines and curves in figure

42ec17e

dhdaines mentioned this pull request Aug 9, 2023

Support for marked content section IDs #961

Merged

dhdaines closed this Aug 9, 2023

dhdaines deleted the issue-909 branch September 5, 2023 20:37
