Add support for structure tree and marked content sections #937
Codecov Report
@@            Coverage Diff            @@
##           develop     #937    +/-   ##
=========================================
  Coverage   100.00%  100.00%
=========================================
  Files           18       19     +1
  Lines         1588     1716   +128
=========================================
+ Hits          1588     1716   +128
Note that you can link the structure tree and text from marked content sections like this:

from collections import deque

def get_text_by_mcid(page):
    # Build a list indexed by MCID holding the text of each marked content section
    mcids = []
    for c in page.chars:
        mcid = c.get("mcid")
        if mcid is None:
            continue
        # Grow the list until it has a slot for this MCID
        while len(mcids) <= mcid:
            mcids.append("")
        mcids[mcid] += c["text"]
    return mcids

def get_structure_tree_with_text(page):
    texts = get_text_by_mcid(page)
    st = page.structure_tree
    # Breadth-first walk over the structure elements
    d = deque(st)
    while d:
        el = d.popleft()
        if "children" in el:
            d.extend(el["children"])
        if "mcids" in el:
            # Replace each MCID with its text (done in place)
            el["mcids"] = [texts[mcid] for mcid in el["mcids"] if mcid < len(texts)]
    return st

Not sure if this is helpful to have as a method in the
Another helpful example, if you want to, for instance, get the bounding box of a

import itertools
from collections import deque

import pdfplumber

def get_tables(page):
    # Breadth-first walk that yields every "Table" structure element
    st = page.structure_tree
    d = deque(st)
    while d:
        el = d.popleft()
        if "children" in el:
            d.extend(el["children"])
        if el["type"] == "Table":
            yield el

def get_child_mcids(el):
    # Yield the MCIDs of an element and of all its descendants
    d = deque([el])
    while d:
        el = d.popleft()
        if "children" in el:
            d.extend(el["children"])
        if "mcids" in el:
            yield from el["mcids"]

# `page` is assumed to be a page from an already opened PDF
t = next(get_tables(page))
mcids = set(get_child_mcids(t))
tbox = pdfplumber.utils.objects_to_bbox([
    c for c in itertools.chain(page.chars, page.images) if c.get("mcid") in mcids
])
Another note - there is a small problem with this PR: there can be marked content sections which aren't referenced by the structure tree. This is specifically the case for headers and footers. The PR retains their MCIDs but nothing else, so there isn't any way to detect them. I'll add a
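Since the MCIDs of those sections are retained, one possible way to at least list the affected objects might be to compare the MCIDs found on the page's objects with the ones the structure tree refers to. The following is only a sketch over plain dicts shaped like the examples above; the helper names and the sample data are made up for illustration, and it does not tell you whether an unreferenced section really is a header or footer.

```python
from collections import deque


def referenced_mcids(structure_tree):
    """Collect every MCID that some structure element refers to."""
    seen = set()
    queue = deque(structure_tree)
    while queue:
        el = queue.popleft()
        queue.extend(el.get("children", []))
        seen.update(el.get("mcids", []))
    return seen


def unreferenced_objects(objs, structure_tree):
    """Return objects that carry an MCID the structure tree never mentions."""
    known = referenced_mcids(structure_tree)
    return [o for o in objs if o.get("mcid") is not None and o["mcid"] not in known]


# Hypothetical stand-in data (not taken from a real PDF)
tree = [{"type": "Document", "children": [{"type": "P", "mcids": [0, 1]}]}]
chars = [
    {"text": "H", "mcid": 0},
    {"text": "i", "mcid": 1},
    {"text": "7", "mcid": 2},  # e.g. a page number in a footer
    {"text": "x"},             # not inside any marked content section
]
orphans = unreferenced_objects(chars, tree)
```

With a real page, `chars` and `tree` would presumably be `page.chars` and `page.structure_tree` as used in the earlier comments.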
Another note - because this uses
pdfplumber/page.py
Outdated
    def tag_cur_item(self, item_type: Any) -> None:
        # Implementation Inheritance Considered Harmful
        cur_obj = self.cur_item._objs[-1]
        assert isinstance(cur_obj, item_type)
I'm curious: What is the bigger-picture purpose of this assert statement?
Ah, it (and the associated item_type parameter) is just there to confirm my understanding of how pdfminer.six works internally - as the comment above notes, this depends on the internal behaviour of the render_* methods. In an ideal world they would simply return the objects they create instead of modifying some internal state.
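To illustrate the pattern being discussed, here is a toy sketch that does not use pdfminer.six at all. `BaseDevice` is a made-up stand-in for a class whose render methods append to internal state rather than returning what they create, and `TaggingDevice` shows how a subclass can tag the most recently appended object while asserting that it is of the expected kind. All names and the dict layout are assumptions for illustration only.

```python
class BaseDevice:
    """Toy stand-in: render methods mutate internal state instead of returning objects."""

    def __init__(self):
        self.objs = []

    def render_char(self, text):
        self.objs.append({"kind": "char", "text": text})
        return 1.0  # pretend advance width

    def render_image(self, name):
        self.objs.append({"kind": "image", "name": name})


class TaggingDevice(BaseDevice):
    """Tags whatever the parent class just appended with the current MCID."""

    def __init__(self):
        super().__init__()
        self.mcid = None

    def tag_cur_item(self, expected_kind):
        cur = self.objs[-1]
        # Sanity check on our assumption about the parent's internal behaviour
        assert cur["kind"] == expected_kind
        cur["mcid"] = self.mcid

    def render_char(self, text):
        adv = super().render_char(text)
        self.tag_cur_item("char")
        return adv

    def render_image(self, name):
        super().render_image(name)
        self.tag_cur_item("image")


dev = TaggingDevice()
dev.mcid = 3
dev.render_char("a")
dev.mcid = None
dev.render_image("logo")
```

The assert only guards the assumption that the parent appended exactly one object of the expected kind; if the parent's internals change, it fails loudly instead of tagging the wrong object.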
pdfplumber/page.py
Outdated
        super().render_image(*args, **kwargs)
        self.tag_cur_item(LTImage)

    def end_figure(self, *args, **kwargs) -> None:  # type: ignore
pdfplumber does not return LTFigure objects, which are somewhat confusingly named. Like LTTextBoxHorizontal (and similar), they refer not to discrete objects on the page, but rather layout-analyzed agglomerations of them. So I think we can skip handling these, but perhaps I misunderstand the intent here?
I'll have to get back to you on this one ... in the structure tree, the MCID is associated with the Figure, so the expected behaviour would be to have all the objects inside it be tagged with that MCID. This works for my PDFs (see for example https://dhdaines.github.io/serafim/?idx=1057 and associated PDF behind the "PDF" button) but as noted above I need to re-test it with more interesting figures.
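As a rough illustration of that expected behaviour, the toy sketch below keeps a stack of open marked content sections and tags each new object with the MCID of the nearest enclosing section that has one. It is independent of pdfminer.six and pdfplumber; the class, its method names, and the rule that nested sections without an MCID inherit the enclosing one are assumptions made for the example.

```python
class MarkedContentTracker:
    """Toy model: objects inherit the MCID of the enclosing marked content section."""

    def __init__(self):
        self._stack = []   # open sections as (tag, mcid) pairs
        self.objects = []  # objects created so far

    def begin(self, tag, mcid=None):
        self._stack.append((tag, mcid))

    def end(self):
        self._stack.pop()

    def current_mcid(self):
        # Innermost open section that actually carries an MCID wins
        for _tag, mcid in reversed(self._stack):
            if mcid is not None:
                return mcid
        return None

    def add(self, kind):
        self.objects.append({"kind": kind, "mcid": self.current_mcid()})


tracker = MarkedContentTracker()
tracker.begin("Figure", mcid=5)
tracker.add("image")
tracker.add("line")
tracker.begin("Span")  # nested section without its own MCID
tracker.add("char")
tracker.end()
tracker.end()
tracker.add("char")    # outside any marked content section
```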
        self.tag_cur_item(LTChar)
        return adv

    def render_image(self, *args, **kwargs) -> None:  # type: ignore
What is the reason for tagging image and char objects, but not line, rect, or curve objects?
Ah, this could be an omission on my part (I didn't have a PDF with lines/rects/curves in a Figure handy to test with, but I should be able to create one myself with LibreOffice). Thanks for catching it, I'll add a test case.
Yes, indeed, this was wrong. I've fixed it to tag the lines/rects/curves with the MCID.
Thanks for flagging. As I understand it, this PR introduces two separate-but-related features:
Would it be possible to separate this PR into those two distinct features? Is that something you'd be open to?
Absolutely, especially since part 2 should probably be rewritten to use Actually part 1 ought to be a PR for
Ah, this can be fixed easily, it's just an issue with type-checking constructs that were introduced in Python 3.9 which slipped in accidentally.
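The comment does not say which constructs were involved. One common example of this kind of problem is subscripting built-in collection types in annotations (`list[...]`, `dict[...]`), which only became possible in Python 3.9 and raises a TypeError when evaluated on earlier versions. A minimal sketch of the backwards-compatible spelling, using a made-up helper function:

```python
from typing import Any, Dict, List


# On Python 3.9+ the parameter could be annotated as list[dict[str, Any]];
# the typing.List / typing.Dict spelling also works on older versions.
def collect_mcids(elements: List[Dict[str, Any]]) -> List[int]:
    """Flatten the "mcids" lists of a sequence of structure elements."""
    found: List[int] = []
    for el in elements:
        found.extend(el.get("mcids", []))
    return found


sample = [{"type": "P", "mcids": [0, 1]}, {"type": "Figure"}, {"type": "P", "mcids": [4]}]
result = collect_mcids(sample)
```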
Closing this PR and making two new ones! (the MCID one is there already: #961 )
Implements #909 with an IMHO rather convenient interface for marked content - IDs are listed in the structure tree, then propagated to objects in each page.