PDF header and paragraph detection #868

danltw · 2023-04-20T06:44:15Z


I am currently working on a project that takes PDF files as input. One of the use cases requires the extracted text to be segmented into headers and their corresponding paragraphs. I am wondering whether anybody has done something similar using either pdfplumber or pdfminer.six (I am more or less limited to these two due to licensing) and would be able to share some code to get me started.

My current code uses the font size and the font itself to detect headers, but the precision and recall aren't great. I am open to other solutions as well.
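For reference, a font-based heuristic of this kind could look roughly like the sketch below. It is only an illustration, not the code from my project: `classify_lines`, the `ratio` threshold and the 80-character cut-off are made-up names and values. It assumes each line has already been reduced to a dict with `text`, `size` and `fontname`, which is roughly what you would get from grouping the output of pdfplumber's `page.extract_words(extra_attrs=["fontname", "size"])` by line.

```python
from collections import Counter


def classify_lines(lines, ratio=1.15):
    """Label each line 'header' or 'para' from its font size and font name.

    `lines` is a list of dicts with 'text', 'size' and 'fontname' keys
    (an assumed shape, not something pdfplumber returns directly).
    """
    if not lines:
        return []
    # Treat the font size that covers the most characters as the body size.
    weight = Counter()
    for line in lines:
        weight[round(line["size"], 1)] += len(line["text"])
    body_size = weight.most_common(1)[0][0]

    labelled = []
    for line in lines:
        bigger = line["size"] >= body_size * ratio       # hypothetical threshold
        bold = "bold" in line["fontname"].lower()        # relies on the font's name
        short = len(line["text"]) < 80                   # hypothetical cut-off
        kind = "header" if short and (bigger or bold) else "para"
        labelled.append((kind, line["text"]))
    return labelled
```

The weak spot is visible in the code: anything short and bold is called a header, which is one reason precision suffers.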

Thanks in advance :)


petermr · 2023-04-21T07:59:58Z


I am actively working on this topic and happy to share experiences/code.

It's generally not straightforward and depends on the style of the author and their tools. I am aiming at a result that (in HTML) looks something like:

<div>
  <h3>header</h3>
  <p>para 1...</p>
  <p>para 2...</p>
</div>
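As a rough sketch of how that target could be produced once lines have been labelled, the function below groups a flat sequence of `("header", text)` and `("para", text)` pairs into `<div>` sections. The name `to_html`, the pair format and the fixed `h3` level are assumptions for illustration, not what my code actually does.

```python
from html import escape


def to_html(blocks, header_tag="h3"):
    """Wrap each header and the paragraphs that follow it in one <div>."""
    sections = []
    current = None
    for kind, text in blocks:
        # A header always opens a new section; so do paragraphs that
        # appear before any header has been seen.
        if kind == "header" or current is None:
            current = []
            sections.append(current)
        tag = header_tag if kind == "header" else "p"
        current.append(f"  <{tag}>{escape(text)}</{tag}>")
    return "\n".join("<div>\n" + "\n".join(s) + "\n</div>" for s in sections)


print(to_html([("header", "header"), ("para", "para 1..."), ("para", "para 2...")]))
```

Nested header levels would need a stack rather than a flat list of sections; this sketch only handles one level.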

The biggest problem is that headers and paragraphs are not well defined and often depend on context/content. Here are two examples from our work on parsing the UN IPCC reports on climate change:

Here there are several levels of headers. On the first page the headers are indicated by bold type and a terminating colon. (Note that there is no whitespace after the header.)
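A minimal sketch of the bold-plus-colon case, assuming the line is available as a list of per-character dicts with `text` and `fontname` keys (pdfplumber's `page.chars` entries carry both). The function name is hypothetical, and checking for "bold" in the font name is itself an assumption that will not hold for every PDF.

```python
def split_bold_colon_header(chars):
    """Split a line into (header, body) if it opens with a bold run ending in ':'.

    Returns (None, full_text) when the pattern does not apply.
    """
    # Count the leading characters whose font name looks bold.
    n = 0
    for c in chars:
        if "bold" not in c["fontname"].lower():
            break
        n += 1
    header = "".join(c["text"] for c in chars[:n]).rstrip()
    body = "".join(c["text"] for c in chars[n:])
    if n and header.endswith(":"):
        # The font change marks the boundary, so no whitespace is needed.
        return header[:-1].strip(), body.strip()
    return None, "".join(c["text"] for c in chars)
```

Because the split is driven by the font change rather than by spacing, it still works when the body text starts immediately after the colon.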

On the second page there is a running title (not a header) and then a large alpha-numbered header.

In the next example we see a large header followed by decimal sections with no explicit header, though clearly they are separate. (I turn the number into a header and also use it as an id.)
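Turning the leading number into a header and an id can be sketched with a regular expression. The pattern, the optional single-letter prefix (for numbers such as "B.1.2") and the `sec-` id scheme below are assumptions for illustration only.

```python
import re

# A decimal section number, optionally prefixed by one capital letter,
# followed by whitespace and the paragraph text.
SECTION_RE = re.compile(r"^((?:[A-Z]\.)?\d+(?:\.\d+)*)\s+(\S.*)$")


def split_numbered_section(line):
    """Return (number, section_id, body), or None if the line has no leading number."""
    m = SECTION_RE.match(line.strip())
    if m is None:
        return None
    number, body = m.groups()
    section_id = "sec-" + number.replace(".", "-")  # hypothetical id scheme
    return number, section_id, body


print(split_numbered_section("B.1.2 Observed warming is driven by emissions"))
```

A paragraph that happens to start with a number ("3.5 million people ...") would be matched too, so in practice this needs a check on font, position or numbering sequence as well.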

In some cases the first sentence of the following paragraph is bold, and this could be used as a header:

I might keep the para intact and duplicate the first sentence (perhaps truncated) as a header. Note that here it's a figure caption with a regular structure.

But is this sentence a header?

And is this a paragraph?

I think it would be possible to come up with a set of templates which are fairly general and might give medium recall/precision on a range of document types. But it will never be 100%. For large corpora created with the same tools it's probably worth customising templates. For random small ones it may be that LLMs give useful results. Or they may garble it.

BTW are you (or anyone) interested in extracting the paragraphs into flowable text (i.e. without hard line breaks)? Because I'm also working on that and made good progress a year back. If no one else is, I'll re-do it over the next day or two.
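By "hard line breaks" I mean the newline at the end of every printed line in the PDF, which extraction tools tend to preserve. A minimal sketch of removing them, assuming plain extracted text in which blank lines separate paragraphs (the function name and the de-hyphenation rule are illustrative, not my actual code):

```python
def reflow(text):
    """Join hard-wrapped lines into one string per paragraph."""
    paragraphs = []
    current = ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line closes the current paragraph.
            if current:
                paragraphs.append(current)
                current = ""
            continue
        if not current:
            current = line
        elif current.endswith("-") and line[0].islower():
            # Assume a word was split across the line break and rejoin it.
            current = current[:-1] + line
        else:
            current += " " + line
    if current:
        paragraphs.append(current)
    return paragraphs
```

The hyphen rule is a guess: a genuine compound broken at its hyphen ("well-" / "known") would be merged into "wellknown", so a dictionary check may be needed for clean results.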

1 reply

danltw Apr 23, 2023
Author

@petermr thanks for the reply. Yes, I’m actually working on something that extracts text for LLMs. Sorry, I didn’t quite get what you meant by “hard line breaks”, but I would love to see what you have done.

petermr · 2023-04-23T11:47:08Z


The test(s) are in TestPDFPumberTest in test/test_pdf.py, on branch pmr15 of github.com/petermr/py4ami. But I would wait for a day, by which time it should be clearer. I'll also try to create a discussion on the site.


-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK

1 reply

danltw May 3, 2023
Author

Thanks @petermr, I've managed to get what I want without reference to your code. However, I would like to say that your work is very interesting. Cheers!
