I am actively working on this topic and happy to share experiences/code. It's generally not straightforward and depends on the style of the author and their tools. I am aiming at a result that (in HTML) looks something like:
The biggest problem is that headers and paragraphs are not well defined and often depend on context/content. Here are two examples from our work on parsing the UN IPCC reports on climate change:

Here there are several levels of headers. In the first page the headers are indicated by bold and a terminating colon. (Note there is no whitespace after the header.) In the second page there is a running title (not a header) and then a large alpha-numbered header.

In the next example we see a large header followed by decimal sections with no explicit header, though clearly they are separate. (I turn the number into a header and also use it as an id.)

In some cases the first sentence of the following paragraph is bold, and this could be used as a header: I might keep the para intact and duplicate the first sentence (perhaps truncated) as a header. Note that here it's a figure caption with a regular structure. But is this sentence a header? And is this a paragraph?

I think it would be possible to come up with a set of templates which are fairly general and might give medium recall/precision on a range of document types. But it will never be 100%. For large corpora created with the same tools it's probably worth customising templates. For random small ones it may be that LLMs give useful results. Or they may garble it.

BTW are you (or anyone) interested in extracting the paragraphs into flowable text (i.e. without hard line breaks)? Because I'm also working on that and made good progress a year back. If no one else is I'll re-do it over the next day or two.
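To make the template idea concrete, here is a minimal sketch of three of the patterns described above: a decimal or alpha-decimal section number, a leading bold run ending in a colon, and a bold first sentence. It is an illustration under assumptions, not the py4ami code. The names `Header`, `classify`, `slug` and `leading_bold` are hypothetical, and the input is assumed to be already reduced to `(text, is_bold)` spans for one paragraph; deriving `is_bold` from font names is left to the caller.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

Span = Tuple[str, bool]  # (text, is_bold), in reading order


class Header(NamedTuple):
    kind: str   # which template matched
    ident: str  # string usable as an HTML id
    title: str  # header text to emit
    body: str   # paragraph text that goes with the header


# Section numbers such as "2.3" or "A.1.1". Limiting each part to one or
# two digits is an assumption that avoids matching years like "1990".
SECTION_NUMBER = re.compile(r"^(?P<num>(?:[A-Z]\.)?\d{1,2}(?:\.\d{1,2})*)\s+")


def slug(text: str) -> str:
    """Turn a title into a lowercase, hyphen-separated id."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def leading_bold(spans: List[Span]) -> str:
    """Concatenate the bold spans that open the paragraph."""
    bold = ""
    for text, is_bold in spans:
        if not is_bold:
            break
        bold += text
    return bold


def classify(spans: List[Span], max_title: int = 60) -> Optional[Header]:
    text = "".join(t for t, _ in spans)
    match = SECTION_NUMBER.match(text)
    if match:
        # Decimal section: the number serves as both header and id.
        num = match.group("num")
        return Header("numbered", num, num, text[match.end():].strip())
    bold = leading_bold(spans)
    stripped = bold.rstrip()
    if stripped.endswith(":"):
        # Bold header with terminating colon; body may follow with no space.
        title = stripped[:-1].strip()
        return Header("bold-colon", slug(title), title, text[len(bold):].strip())
    if stripped.endswith(".") and len(bold) < len(text):
        # Bold first sentence: keep the paragraph intact and duplicate
        # the (possibly truncated) sentence as the header.
        title = stripped[:max_title].rstrip()
        return Header("bold-sentence", slug(title), title, text.strip())
    return None
```

The order of the checks is itself a template decision: a numbered paragraph wins over a bold one here, which may be wrong for other document families. Running titles and figure captions are not handled and would need their own rules.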
The test/s are in TestPDFPumberTest in test/test_pdf.py in github.com/petermr/py4ami, branch pmr15. But I would wait for a day and it should be clearer. I'll also try to create a discussion on the site.
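For anyone unsure of the term: "hard line breaks" are the newlines that PDF extraction leaves at the end of every visual line, so a paragraph comes out as many short lines instead of one flowing string. A rough sketch of removing them follows. It is not the py4ami implementation; the function name `reflow` is hypothetical, and it assumes paragraphs are separated by blank lines and that a trailing hyphen followed by a lowercase letter marks a word split across lines.

```python
import re


def reflow(text: str) -> str:
    """Join hard-wrapped lines so each paragraph becomes a single line."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    out = []
    for para in paragraphs:
        lines = [ln.strip() for ln in para.splitlines() if ln.strip()]
        joined = ""
        for ln in lines:
            if joined.endswith("-") and ln[:1].islower():
                # Probably a word hyphenated across the break: rejoin it.
                # This also merges genuine compounds such as "well-" / "known".
                joined = joined[:-1] + ln
            elif joined:
                joined += " " + ln
            else:
                joined = ln
        out.append(joined)
    return "\n\n".join(out)
```

In real PDFs the paragraph boundaries usually have to be inferred from vertical gaps or indentation rather than blank lines, so this only covers the final joining step.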
On Sun, Apr 23, 2023 at 12:03 PM Daniel Leong wrote:
@petermr thanks for the reply. Yes, I’m actually doing something that extracts texts for LLMs. Sorry, didn’t quite get what you meant by “hard line breaks”, but I would love to see what you have done
--
Peter Murray-Rust
Founder ContentMine.org
and
Reader Emeritus in Molecular Informatics
Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
I am currently working on a project which takes PDF files as input. One of the use cases requires the extracted text to be segmented into headers and their corresponding paragraphs. I'm wondering if anybody has done something similar using either pdfplumber or pdfminer.six (I am more or less limited to these two due to licensing) and would be able to share some code to get me started.
My current code uses the font size and the font itself to detect headers, but the precision and recall aren't great. I am open to other solutions as well.
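For reference, a font-based heuristic of this kind might look roughly like the sketch below. It is an illustration, not the project's actual code: the function names (`group_lines`, `label_lines`, `to_sections`), the 1.15 size ratio and the 2-point line tolerance are all made up. The input is assumed to be a list of word dicts with `text`, `size`, `fontname`, `top` and `x0` keys, which is what pdfplumber's `page.extract_words(extra_attrs=["fontname", "size"])` should give.

```python
from collections import Counter
from typing import Dict, List, Tuple

Word = Dict[str, object]


def group_lines(words: List[Word], tol: float = 2.0) -> List[List[Word]]:
    """Group words into visual lines by their 'top' coordinate."""
    lines: List[List[Word]] = []
    for w in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(lines[-1][0]["top"] - w["top"]) <= tol:
            lines[-1].append(w)
        else:
            lines.append([w])
    return lines


def label_lines(words: List[Word], ratio: float = 1.15) -> List[Tuple[str, str]]:
    """Label each line 'header' or 'para' from font size and boldness."""
    if not words:
        return []
    # Body size = the font size covering the most characters on the page.
    sizes: Counter = Counter()
    for w in words:
        sizes[round(w["size"], 1)] += len(w["text"])
    body_size = sizes.most_common(1)[0][0]
    labelled = []
    for line in group_lines(words):
        line.sort(key=lambda w: w["x0"])
        text = " ".join(w["text"] for w in line)
        biggest = max(w["size"] for w in line)
        all_bold = all("bold" in w["fontname"].lower() for w in line)
        is_header = biggest >= body_size * ratio or all_bold
        labelled.append(("header" if is_header else "para", text))
    return labelled


def to_sections(labelled: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Attach paragraph lines to the most recent header."""
    sections: List[list] = []
    for kind, text in labelled:
        if kind == "header":
            sections.append([text, []])
        else:
            if not sections:
                sections.append(["", []])  # text before any header
            sections[-1][1].append(text)
    return [(header, " ".join(body)) for header, body in sections]
```

Checking for "bold" in the font name is only a convention; some PDFs embed fonts whose names carry no weight information, which is one likely reason this kind of detection has weak recall.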
Thanks in advance :)