Add support for structure tree and marked content sections #937

Closed
wants to merge 28 commits into from

Contributor

@dhdaines dhdaines commented Jul 19, 2023

Implements #909 with what is, IMHO, a rather convenient interface for marked content: MCIDs are listed in the structure tree, then propagated to the objects on each page.

@dhdaines dhdaines force-pushed the issue-909 branch 2 times, most recently from 5049bfe to c4670ad Compare July 19, 2023 16:04
@codecov

codecov bot commented Jul 19, 2023

Codecov Report

Merging #937 (42ec17e) into develop (f6887b5) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           develop      #937    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files           18        19     +1     
  Lines         1588      1716   +128     
==========================================
+ Hits          1588      1716   +128

Files Changed               Coverage Δ
pdfplumber/page.py          100.00% <100.00%> (ø)
pdfplumber/structure.py     100.00% <100.00%> (ø)

@dhdaines dhdaines marked this pull request as ready for review July 19, 2023 17:39
@dhdaines
Contributor Author

dhdaines commented Jul 21, 2023

Note that you can link the structure tree and text from marked content sections like this:

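from collections import deque

def get_text_by_mcid(page):
    # Build a list indexed by MCID; each entry is the concatenated text
    # of the chars tagged with that MCID.
    mcids = []
    for c in page.chars:
        mcid = c.get("mcid")
        if mcid is None:
            continue
        # Grow the list so that index `mcid` exists.
        while len(mcids) <= mcid:
            mcids.append("")
        mcids[mcid] += c["text"]
    return mcids

def get_structure_tree_with_text(page):
    texts = get_text_by_mcid(page)
    st = page.structure_tree
    # Breadth-first walk over the tree.
    d = deque(st)
    while d:
        el = d.popleft()
        if "children" in el:
            d.extend(el["children"])
        if "mcids" in el:
            # Replace each MCID with its text, skipping MCIDs with no chars.
            el["mcids"] = [texts[mcid] for mcid in el["mcids"] if mcid < len(texts)]
    return st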

Not sure if this is helpful to have as a method in the Page class? One thing I notice is that MCIDs are not reliably aligned to word breaks; they often change in the middle of a word for no apparent reason.
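
To check how often MCIDs change in the middle of a word, something like the following could be used. This is only a sketch: `find_midword_mcid_changes` and `sample_chars` are hypothetical names, the sample list stands in for `page.chars`, and it assumes the chars are in reading order with whitespace characters present between words, which is not guaranteed for every PDF.

```python
def find_midword_mcid_changes(chars):
    """Return (index, previous_mcid, mcid) for every place where the MCID
    changes between two adjacent non-whitespace characters."""
    changes = []
    for i in range(1, len(chars)):
        prev, cur = chars[i - 1], chars[i]
        # A change next to whitespace is at a word boundary; ignore it.
        if prev["text"].isspace() or cur["text"].isspace():
            continue
        if prev.get("mcid") != cur.get("mcid"):
            changes.append((i, prev.get("mcid"), cur.get("mcid")))
    return changes


# Hypothetical stand-in for page.chars: "Hello w" with an MCID change
# between "e" and "l".
sample_chars = [
    {"text": "H", "mcid": 0},
    {"text": "e", "mcid": 0},
    {"text": "l", "mcid": 1},
    {"text": "l", "mcid": 1},
    {"text": "o", "mcid": 1},
    {"text": " ", "mcid": 1},
    {"text": "w", "mcid": 2},
]

midword_changes = find_midword_mcid_changes(sample_chars)
```

With real data, `find_midword_mcid_changes(page.chars)` would be the equivalent call.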

@dhdaines
Contributor Author

dhdaines commented Jul 21, 2023

Another helpful example, for instance if you want to get the bounding box of a Table element on a page:

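import itertools
from collections import deque

import pdfplumber

def get_tables(page):
    # Breadth-first walk yielding every Table element in the tree.
    st = page.structure_tree
    d = deque(st)
    while d:
        el = d.popleft()
        if "children" in el:
            d.extend(el["children"])
        if el["type"] == "Table":
            yield el

def get_child_mcids(el):
    # Yield the MCIDs of an element and of all its descendants.
    d = deque([el])
    while d:
        el = d.popleft()
        if "children" in el:
            d.extend(el["children"])
        if "mcids" in el:
            yield from el["mcids"]

# First table on the page (raises StopIteration if there is none).
t = next(get_tables(page))
mcids = set(get_child_mcids(t))
tbox = pdfplumber.utils.objects_to_bbox([
    c for c in itertools.chain(page.chars, page.images) if c.get("mcid") in mcids
])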
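
For readers who want to see what the last step amounts to, here is a minimal sketch of the bounding-box union written without pdfplumber. `bbox_for_mcids` and `sample_objects` are hypothetical names; the dictionaries imitate pdfplumber objects with `x0`, `top`, `x1` and `bottom` keys, and returning `None` for an empty selection is a choice made for this sketch, not a statement about what `objects_to_bbox` does.

```python
def bbox_for_mcids(objects, mcids):
    """Return (x0, top, x1, bottom) covering all objects whose MCID is in
    `mcids`, or None if nothing matches."""
    selected = [o for o in objects if o.get("mcid") in mcids]
    if not selected:
        return None
    return (
        min(o["x0"] for o in selected),
        min(o["top"] for o in selected),
        max(o["x1"] for o in selected),
        max(o["bottom"] for o in selected),
    )


# Hypothetical objects; the last one has no MCID at all.
sample_objects = [
    {"mcid": 4, "x0": 10.0, "top": 20.0, "x1": 30.0, "bottom": 40.0},
    {"mcid": 5, "x0": 5.0, "top": 25.0, "x1": 50.0, "bottom": 35.0},
    {"mcid": 9, "x0": 0.0, "top": 0.0, "x1": 100.0, "bottom": 100.0},
    {"x0": 1.0, "top": 1.0, "x1": 2.0, "bottom": 2.0},
]

table_bbox = bbox_for_mcids(sample_objects, {4, 5})
```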

@dhdaines
Contributor Author

dhdaines commented Jul 24, 2023

Another note - there is a small problem with this PR: there can be marked content sections that aren't referenced by the structure tree, which is specifically the case for headers and footers. The PR retains their MCIDs but nothing else, so there isn't any way to detect them. I'll add a marked_content_sections property that contains this information.
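
Until such a property exists, one rough way to spot these sections is to compare the MCIDs found on the page's objects with the MCIDs the structure tree refers to. The sketch below assumes the element layout used in the earlier snippets (dicts with optional `children` and `mcids` keys); `referenced_mcids`, `unreferenced_mcids` and the sample data are hypothetical.

```python
from collections import deque


def referenced_mcids(structure_tree):
    """Collect every MCID mentioned anywhere in the structure tree."""
    seen = set()
    d = deque(structure_tree)
    while d:
        el = d.popleft()
        d.extend(el.get("children", []))
        seen.update(el.get("mcids", []))
    return seen


def unreferenced_mcids(objects, structure_tree):
    """MCIDs carried by page objects but absent from the structure tree."""
    on_page = {o["mcid"] for o in objects if o.get("mcid") is not None}
    return on_page - referenced_mcids(structure_tree)


# Hypothetical data: MCID 7 might be a page footer the tree never mentions.
sample_tree = [
    {
        "type": "Document",
        "children": [
            {"type": "P", "mcids": [0, 1]},
            {"type": "Table", "children": [{"type": "TD", "mcids": [2]}]},
        ],
    }
]
sample_objects = [
    {"text": "a", "mcid": 0},
    {"text": "b", "mcid": 1},
    {"text": "c", "mcid": 2},
    {"text": "7", "mcid": 7},
    {"text": "x"},
]

orphans = unreferenced_mcids(sample_objects, sample_tree)
```

This only recovers the MCID numbers, not the tag or properties of the marked content section, which is the gap described above.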

@dhdaines
Contributor Author

dhdaines commented Aug 5, 2023

Another note - because this uses pypdfium2 to get the structure tree it introduces quite a lot of overhead, though only if you choose to access it, since now you are reading (at least partially) each page twice, once with pdfminer.six and once with pypdfium2. I will perhaps try to reimplement it with pdfminer.six - the logic of resolving the structure tree is slightly complicated, but the pdf.js implementation is a good guide.
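
The "only if you choose to access it" behaviour is the usual lazy-property pattern. A minimal sketch, assuming a hypothetical `LazyPage` class and `expensive_loader` function (neither is pdfplumber's actual code), could look like this:

```python
from functools import cached_property


class LazyPage:
    """Hypothetical stand-in for a page object; not pdfplumber's Page."""

    def __init__(self, loader):
        # `loader` represents the expensive second parse of the page.
        self._loader = loader

    @cached_property
    def structure_tree(self):
        # Runs on first access only; the result is cached on the instance.
        return self._loader()


calls = []


def expensive_loader():
    calls.append("parsed")
    return [{"type": "Document", "children": []}]


page = LazyPage(expensive_loader)
calls_before_access = len(calls)
first = page.structure_tree
second = page.structure_tree
```

Code that never touches `structure_tree` never pays for the second parse, and repeated accesses reuse the cached result.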

    def tag_cur_item(self, item_type: Any) -> None:
        # Implementation Inheritance Considered Harmful
        cur_obj = self.cur_item._objs[-1]
        assert isinstance(cur_obj, item_type)
Owner

I'm curious: What is the bigger-picture purpose of this assert statement?

Contributor Author

Ah, it (and the associated item_type parameter) is just there to confirm my understanding of how pdfminer.six works internally. As the comment above notes, this depends on the internal behaviour of the render_* methods. In an ideal world they would simply return the objects they create instead of modifying some internal state.
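
The pattern under discussion can be illustrated in isolation. In this sketch `BaseDevice` and `TaggingDevice` are hypothetical classes that only imitate the shape of the problem (a render method that appends to internal state and returns something else); they are not pdfminer.six classes.

```python
class BaseDevice:
    """Hypothetical base class whose render method mutates internal state."""

    def __init__(self):
        self.objs = []

    def render_char(self, text):
        self.objs.append({"kind": "char", "text": text})
        return 1.0  # returns an advance width, not the created object


class TaggingDevice(BaseDevice):
    def __init__(self):
        super().__init__()
        self.mcid = None

    def tag_cur_item(self, kind):
        # Relies on the base class having just appended the new object.
        cur_obj = self.objs[-1]
        # Sanity check on that assumption about the base class internals.
        assert cur_obj["kind"] == kind
        cur_obj["mcid"] = self.mcid

    def render_char(self, text):
        adv = super().render_char(text)
        self.tag_cur_item("char")
        return adv


device = TaggingDevice()
device.mcid = 3
advance = device.render_char("a")
```

If the base class ever stopped appending the object it creates, the assertion would fail loudly instead of the wrong object being tagged silently.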

        super().render_image(*args, **kwargs)
        self.tag_cur_item(LTImage)

    def end_figure(self, *args, **kwargs) -> None:  # type: ignore
Owner

pdfplumber does not return LTFigure objects, which are somewhat confusingly named. Like LTTextBoxHorizontal (and similar), they refer not to discrete objects on the page, but rather layout-analyzed agglomerations of them. So I think we can skip handling these, but perhaps I misunderstand the intent here?

Contributor Author

I'll have to get back to you on this one ... in the structure tree, the MCID is associated with the Figure, so the expected behaviour would be to have all the objects inside it be tagged with that MCID. This works for my PDFs (see for example https://dhdaines.github.io/serafim/?idx=1057 and associated PDF behind the "PDF" button) but as noted above I need to re-test it with more interesting figures.

        self.tag_cur_item(LTChar)
        return adv

    def render_image(self, *args, **kwargs) -> None:  # type: ignore
Owner

What is the reason for tagging image and char objects, but not line, rect, or curve objects?

Contributor Author

Ah, this could be an omission on my part (I didn't have a PDF with lines/rects/curves in a Figure handy to test with, but I should be able to create one myself with LibreOffice). Thanks for catching it, I'll add a test case.

Contributor Author

Yes, indeed, this was wrong. I've fixed it to tag the lines/rects/curves with the MCID.

@jsvine
Owner

jsvine commented Aug 8, 2023

> Another note - because this uses pypdfium2 to get the structure tree it introduces quite a lot of overhead, though only if you choose to access it, since now you are reading (at least partially) each page twice, once with pdfminer.six and once with pypdfium2.

Thanks for flagging. As I understand it, this PR introduces two separate-but-related features:

  1. Adding the mcid attribute to each parsed object, which the PR currently handles entirely through subclassing pdfminer.six's PDFPageAggregator.
  2. Extracting a PDF's structure tree, which depends on pypdfium2, which introduces the overhead you mentioned, and appears to fail on Python 3.8 (something related to _ctypes.PyCArrayType but I haven't investigated closely).

Would it be possible to separate this PR into those two distinct features? Is that something you'd be open to?

@dhdaines
Contributor Author

dhdaines commented Aug 8, 2023

> 2. Extracting a PDF's structure tree, which depends on pypdfium2, which introduces the overhead you mentioned, and appears to fail on Python 3.8 (something related to _ctypes.PyCArrayType but I haven't investigated closely).
>
> Would it be possible to separate this PR into those two distinct features? Is that something you'd be open to?

Absolutely, especially since part 2 should probably be rewritten to use pdfminer.six.

Actually part 1 ought to be a PR for pdfminer.six itself but it doesn't seem likely that it could be merged anytime soon.

@dhdaines
Contributor Author

dhdaines commented Aug 8, 2023

> 2. Extracting a PDF's structure tree, which depends on pypdfium2, which introduces the overhead you mentioned, and appears to fail on Python 3.8 (something related to _ctypes.PyCArrayType but I haven't investigated closely).

Ah, this can be fixed easily; it's just an issue with type-checking constructs introduced in Python 3.9 that slipped in accidentally.
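
The thread does not say which constructs were involved. One common example of this kind of problem, shown here purely as an assumption, is using built-in generics such as `list[int]` in annotations, which Python 3.8 rejects when the function is defined. The `typing` equivalents work on both 3.8 and 3.9+. `StructElement` and `direct_mcids` are hypothetical names for illustration.

```python
from typing import Any, Dict, List

# On Python 3.8, writing `list[dict[str, Any]]` in a signature raises
# "TypeError: 'type' object is not subscriptable" at definition time.
# The typing-module aliases below are accepted by 3.8 and later.
StructElement = Dict[str, Any]


def direct_mcids(elements: List[StructElement]) -> List[int]:
    """Collect the MCIDs attached directly to the given elements."""
    out: List[int] = []
    for el in elements:
        out.extend(el.get("mcids", []))
    return out


sample_elements = [
    {"type": "P", "mcids": [0, 1]},
    {"type": "Span"},
    {"type": "P", "mcids": [2]},
]
collected = direct_mcids(sample_elements)
```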

@dhdaines
Contributor Author

dhdaines commented Aug 9, 2023

Closing this PR and making two new ones! (the MCID one is there already: #961 )

@dhdaines dhdaines closed this Aug 9, 2023
@dhdaines dhdaines deleted the issue-909 branch September 5, 2023 20:37