Table Extraction Adds 0 to cell #1203
flickp started this conversation in Ask for help with specific PDFs
Replies: 2 comments
-
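Interesting! Doing this:
im = page.to_image(resolution=150)
im.reset().draw_rects([ c for c in page.chars if c["text"] == "0" ])
... produces this:
image.png (view on web): https://github.com/user-attachments/assets/6a3104f2-5c0f-4ee6-849f-032ecb41b77f
Looking toward the bottom of the page, there does seem to be an extra 0 on the page that isn't otherwise visible. Characters can appear non-visible for various reasons, such as being the same color as the background (not the issue in this case), being overdrawn by another graphical element, or being masked.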
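If the hidden 0 needs to be kept out of the extracted table, one option is to drop it before calling extract_table. The following is only a minimal sketch, not a confirmed fix for this PDF: the helper names (chars_matching, inside_bbox, make_keep_test) and the bounding box are hypothetical, and the real coordinates would have to be read from page.chars. It assumes the usual x0/x1/top/bottom and object_type keys on pdfplumber objects, and that the predicate is passed to Page.filter, which keeps the objects for which the test function returns True.

```python
def chars_matching(chars, text):
    """Return the char dicts whose text equals `text` (e.g. page.chars, "0")."""
    return [c for c in chars if c.get("text") == text]


def inside_bbox(obj, bbox):
    """True if obj's box lies entirely within bbox = (x0, top, x1, bottom)."""
    x0, top, x1, bottom = bbox
    return (
        obj["x0"] >= x0
        and obj["x1"] <= x1
        and obj["top"] >= top
        and obj["bottom"] <= bottom
    )


def make_keep_test(text, bbox):
    """Build a predicate that rejects chars equal to `text` inside `bbox`.

    Every other object (other chars, lines, rects, ...) is kept.
    """
    def keep(obj):
        is_target = (
            obj.get("object_type") == "char"
            and obj.get("text") == text
            and inside_bbox(obj, bbox)
        )
        return not is_target
    return keep


# Synthetic objects standing in for page.objects; coordinates are made up.
sample = [
    {"object_type": "char", "text": "0", "x0": 10, "x1": 15, "top": 700, "bottom": 710},
    {"object_type": "char", "text": "0", "x0": 10, "x1": 15, "top": 100, "bottom": 110},
    {"object_type": "char", "text": "7", "x0": 10, "x1": 15, "top": 700, "bottom": 710},
]
keep = make_keep_test("0", (0, 690, 50, 720))
kept = [obj for obj in sample if keep(obj)]
```

With a real page, something like page.filter(make_keep_test("0", bbox)).extract_table(table_settings={"snap_tolerance": 6}) should then run the same extraction without the stray character, where bbox comes from inspecting chars_matching(page.chars, "0").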
-
Ah, did not see that!
On Thursday, October 3, 2024 at 10:27 AM, Jeremy Singer-Vine wrote:
> Interesting! Doing this:
> im = page.to_image(resolution=150)
> im.reset().draw_rects([ c for c in page.chars if c["text"] == "0" ])
> ... produces this:
> image.png (view on web): https://github.com/user-attachments/assets/6a3104f2-5c0f-4ee6-849f-032ecb41b77f
> Looking toward the bottom of the page, there does seem to be an extra 0 on the page that isn't otherwise visible. Characters can appear non-visible for various reasons, such as being the same color as the background (not the issue in this case), being overdrawn by another graphical element, or masking.
-
I have a PDF that can be found here. When I extract a table on page 16, an extra 0 is added to a cell in the last row of the table. In the PDF, no 0 appears at this location.
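import pdfplumber

pdf_file = "PDFs\\WorkersCompAnnualReport2011.pdf"
pdf = pdfplumber.open(pdf_file)
# pdf.pages is zero-indexed, so index 16 is the 17th page in the file
page = pdf.pages[16]
# Render the page and overlay what the table finder detects
im = page.to_image()
im.debug_tablefinder()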
When I extract the table, there is an extra column with a 0 in it that is not present in the table:
# Extract the table from the current page
# snap_tolerance needed to combine the newline-separated years in the header
table = page.extract_table(table_settings={'snap_tolerance': 6})
table[32]
Any advice on what may be happening?
Thanks