Extracting Tables Split by Extracted Text #593
WinstonDoodle
started this conversation in
Ask for help with specific PDFs
Replies: 1 comment
-
Hi @WinstonDoodle, and thanks for sharing the PDF and a description of your goals. My initial thought is to suggest the following:
Does that help?
-
I'm working through extracting tables from a PDF that has semi-consistent structure between pages. My goal is to extract each of the 2 tables under each section header (denoted by navy blue). After I achieve this first goal, I can comfortably loop through pages and append tables.
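One possible way to split a page by its navy section headers is sketched below. It is only a sketch under assumptions: the header bars are assumed to be filled rectangles whose `non_stroking_color` is an RGB tuple of floats in 0..1 (PDFs may instead use grayscale or CMYK values), and `is_navy`, `section_bands`, and `tables_by_section` are hypothetical helper names, not part of pdfplumber.

```python
def is_navy(color, tol=0.2):
    """Heuristic (assumption): a dark-blue fill given as RGB floats in 0..1."""
    if not isinstance(color, (tuple, list)) or len(color) != 3:
        return False  # grayscale, CMYK, or missing colors are not handled here
    r, g, b = color
    return r <= tol and g <= tol and b >= 0.3


def section_bands(header_boxes, page_bottom):
    """Map each header bar (top, bottom) to the vertical band beneath it."""
    boxes = sorted(header_boxes)
    bands = []
    for i, (_, bottom) in enumerate(boxes):
        # A band ends where the next header starts, or at the page bottom.
        next_top = boxes[i + 1][0] if i + 1 < len(boxes) else page_bottom
        bands.append((bottom, next_top))
    return bands


def tables_by_section(page, table_settings):
    """Return one list of extracted tables per section band on a pdfplumber page."""
    headers = [
        (rect["top"], rect["bottom"])
        for rect in page.rects
        if is_navy(rect.get("non_stroking_color"))
    ]
    x0, _, x1, page_bottom = page.bbox
    sections = []
    for top, bottom in section_bands(headers, page_bottom):
        if bottom <= top:
            continue  # skip empty bands, e.g. a header drawn as several rects
        region = page.crop((x0, top, x1, bottom))
        sections.append(region.extract_tables(table_settings))
    return sections
```

If the headers turn out to be drawn some other way (for example as text with a colored background image), the filter in `tables_by_section` would need to change, but the cropping step should stay the same.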
PDF: PDF Example Page.pdf
Components of page that I'm interested in
I initially tried an `extract_text()` approach that doesn't involve `extract_table()`, but the gray sections of the table were not recognized by `extract_text()`. The output text had 6 columns for Facility 1 through 5, but the next line of text was just "Row 1, $1,000", so it omitted the null entries under Fac1-Fac4.
Challenges
End Goal
Code:
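```python
import pandas as pd
import pdfplumber

pdf_file = "PDF Example Page.pdf"  # assumed local path to the attached PDF

# Using 'lines_strict' has resolved capturing false tables
table_settings = {
    "vertical_strategy": "lines_strict",
    "horizontal_strategy": "lines_strict",
}

with pdfplumber.open(pdf_file) as pdf:
    page = pdf.pages[0]
    text = page.extract_text()
    # extract_tables() returns a list of tables, each a list of rows
    tbl_index = page.extract_tables(table_settings)
    tbl0 = pd.DataFrame(tbl_index[0])
```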
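For the later step of looping through pages and appending tables, a minimal sketch might look like the following. It assumes the first row of each extracted table is its header row, which may not hold for this PDF; `rows_to_records` and `collect_tables` are hypothetical helper names.

```python
def rows_to_records(table):
    """Turn one extracted table (list of rows) into a list of dicts.

    Assumption: the first row holds the column headers.
    """
    if not table:
        return []
    header, *body = table
    return [dict(zip(header, row)) for row in body]


def collect_tables(pdf_path, table_settings):
    """Return (page_number, table_index, records) for every table in the PDF."""
    import pdfplumber  # imported here so rows_to_records works without it

    collected = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            tables = page.extract_tables(table_settings)
            for index, table in enumerate(tables):
                records = rows_to_records(table)
                collected.append((page.page_number, index, records))
    return collected
```

Each `records` list can be passed to `pd.DataFrame(records)`. Keeping the page number and table index alongside it makes it easier to trace a row back to its page. Note that repeated or empty header cells would collide as dict keys, so in that case it may be safer to keep the rows as plain lists.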