Extracting Tables Split by Extracted Text #593

WinstonDoodle · 2022-01-28T16:52:48Z

I'm working through extracting tables from a PDF that has semi-consistent structure between pages. My goal is to extract each of the 2 tables under each section header (denoted by navy blue). After I achieve this first goal, I can comfortably loop through pages and append tables.

PDF: PDF Example Page.pdf

Components of page that I'm interested in

I initially tried an approach based only on extract_text(), without extract_table(), but extract_text() did not recognize the gray sections of the table. The output text had the column headers for Facility 1 through 5, but the next line of text was just "Row 1, $1,000", so the null entries under Facility 1 through Facility 4 were omitted.

Challenges

- How to identify which table is associated with each section header and description.
  - On the page I've attached, there are two tables under the section header, which is the common scenario. Sometimes the section header starts at the bottom of the page and the tables don't start until the next page. Fortunately, there aren't any cases where a table "spills" across pages.
- How to identify whether the extracted table is "Table1" or "Table2" based on the text in the gray subheader box. Sometimes there is only a "Table1".
- Using the text-only approach, I manually created a list of the 60+ section names so that the code could flag each section of the PDF, but I would strongly prefer not to rely on this manual checklist in case the sections change.
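For the case where a section header sits at the bottom of one page and its tables start on the next, one possible approach is to sort everything found in the document by page and vertical position, then carry the most recent section header forward. The sketch below is only an illustration: the `(page_number, top, kind, payload)` tuple layout and the `assign_sections` name are assumptions made for this example, not part of pdfplumber.

```python
def assign_sections(events):
    """Pair each table with the most recent section header that precedes it.

    `events` is an iterable of (page_number, top, kind, payload) tuples, where
    `kind` is "section" or "table". This layout is hypothetical; the tuples
    would be built from whatever header and table detection is used per page.
    """
    current_section = None
    paired = []
    # Sort by page first, then by distance from the top of the page, so a
    # header at the bottom of page N precedes the tables at the top of page N+1.
    for page_number, top, kind, payload in sorted(events, key=lambda e: (e[0], e[1])):
        if kind == "section":
            current_section = payload
        else:
            paired.append((current_section, payload))
    return paired
```

Because the section name is read from the page itself, this would not depend on a manually maintained list of section names.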

End Goal

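| Section | Desc | Table_Num | Row | Facility 1 | Facility 2 | Facility 3 | Facility 4 | Facility 5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Header_1 | Descr_1 | Table 1 | Row 1 | | | | | $1,000 |
| Header_1 | Descr_1 | Table 1 | Row 2 | | | | | 20% |
| Header_1 | Descr_1 | Table 2 | Row 1 | | | | | $5,000 |
| Header_1 | Descr_1 | Table 2 | Row 2 | | | | | 50% |
| Header_2 | Descr_2 | Table 1 | Row 1 | $1,000 | | $2,000 | | |
| Header_2 | Descr_2 | Table 1 | Row 2 | 20% | | 30% | | |
| Header_2 | Descr_2 | Table 2 | Row 1 | $5,000 | | $7,500 | | |
| Header_2 | Descr_2 | Table 2 | Row 2 | 100% | | 80% | | |
| Header_3 | ... | ... | ... | ... | ... | ... | ... | ... |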
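As a rough illustration of how one extracted table could be flattened into rows of this shape, here is a minimal sketch. It assumes the first row of the extracted table holds the facility names and the first cell of every other row holds the row label; the function name `to_records` and its arguments are hypothetical and would need adapting to the real table layout.

```python
def to_records(section, desc, table_num, table):
    """Flatten one extracted table (a list of rows) into End Goal-style dicts.

    Assumption: table[0] is the header row ("", "Facility 1", ...) and the
    first cell of each remaining row is the row label.
    """
    header, *body = table
    records = []
    for row in body:
        record = {
            "Section": section,
            "Desc": desc,
            "Table_Num": table_num,
            "Row": row[0],
        }
        for name, cell in zip(header[1:], row[1:]):
            # Empty cells may come back as None or ""; store both as "".
            record[name] = cell or ""
        records.append(record)
    return records
```

The resulting list of dicts can be passed straight to `pd.DataFrame(records)`, and the lists from several tables can be concatenated before building the frame.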

Code:
`import pandas as pd
import pdfplumber

pdf_file = "PDF Example Page.pdf"  # the attached example page

# Using "lines_strict" avoids detecting false tables.
table_settings = {
    "vertical_strategy": "lines_strict",
    "horizontal_strategy": "lines_strict",
}

with pdfplumber.open(pdf_file) as pdf:
    page = pdf.pages[0]
    text = page.extract_text()
    # extract_tables() returns one list of rows per detected table.
    tbl_index = page.extract_tables(table_settings)
    tbl0 = pd.DataFrame(tbl_index[0])`

jsvine (Maintainer) · 2022-01-29T20:05:08Z

Hi @WinstonDoodle, and thanks for sharing the PDF and a description of your goals. My initial thought is to suggest the following:

1. Filter through page.rects to identify the gray rectangles outlining the headers.
2. Use page.crop(bbox).extract_text() to get the headers' text, where bbox is the (x0, top, x1, bottom) tuple built from the coordinates found in the step above.
3. Instead of page.extract_tables(...), use page.find_tables(...), which returns Table objects that contain information about their location. (You can then run table.extract() to get the table data.)
4. Compare the headers' locations with the tables' locations to determine which tables are associated with which headers.

Does that help?
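A minimal sketch of those four steps might look like the following. Several parts are assumptions rather than facts about this PDF: that the gray fill shows up in each rect's `non_stroking_color` as a one-value grayscale or three-value RGB color, that gray boxes without text (such as empty gray cells) can be ignored, and that a table belongs to the nearest labeled gray box starting at or above its top edge. The helper names and the tolerance values are made up for illustration.

```python
GRAY_TOLERANCE = 0.02  # assumption: max spread between channels to count as gray


def is_gray(color):
    """Guess whether a fill color is a neutral gray (not black or white)."""
    if color is None:
        return False
    values = list(color) if isinstance(color, (list, tuple)) else [color]
    if len(values) not in (1, 3):  # CMYK and pattern fills are skipped in this sketch
        return False
    if not all(isinstance(v, (int, float)) for v in values):
        return False
    if max(values) - min(values) > GRAY_TOLERANCE:
        return False
    return 0.05 < values[0] < 0.95


def labeled_boxes(page):
    """Return (top, text) for every gray rect that contains text, top to bottom."""
    boxes = []
    for rect in page.rects:
        if not is_gray(rect.get("non_stroking_color")):
            continue
        bbox = (rect["x0"], rect["top"], rect["x1"], rect["bottom"])
        text = (page.crop(bbox).extract_text() or "").strip()
        if text:  # gray boxes with no text are not treated as headers
            boxes.append((rect["top"], text))
    return sorted(boxes)


def tables_with_labels(page, table_settings=None, tolerance=2):
    """Pair each table on the page with the closest labeled gray box at or above it."""
    boxes = labeled_boxes(page)
    results = []
    for table in page.find_tables(table_settings):
        table_top = table.bbox[1]  # Table.bbox is (x0, top, x1, bottom)
        candidates = [text for top, text in boxes if top <= table_top + tolerance]
        label = candidates[-1] if candidates else None
        results.append((label, table.extract()))
    return results
```

With the `page` and `table_settings` from the question, `tables_with_labels(page, table_settings)` would return a list of `(label, rows)` pairs. If the filter picks up the wrong rectangles, printing a few entries of `page.rects` should show which color values the gray boxes actually use.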
