Extract table merged cells #979

Open
John-Peter-R opened this issue Aug 30, 2023 · 6 comments
Labels
feature-request All feature requests receive this label initially, can be upgraded to "enhancement"

Comments

@John-Peter-R

Please describe, in as much detail as possible, your proposal and how it would improve your experience with pdfplumber.
When extracting tables from PDFs, some tables contain merged cells. In that case, the table extraction method fails to return the merged cells in merged form. The quality of merged-cell extraction needs to be improved.

@John-Peter-R John-Peter-R added the feature-request All feature requests receive this label initially, can be upgraded to "enhancement" label Aug 30, 2023
@samkit-jain
Collaborator

Hi @John-Peter-R Appreciate your interest in the library. Could you please provide an example PDF, the output you are getting and the output you expected?

@John-Peter-R
Author

Thanks for your response.
I am looking for a generic way to extract tables from PDFs regardless of table structure, since a PDF may contain differently merged cells. So I am researching a generic approach.

@samkit-jain
Collaborator

One generic way of handling merged cells that you could try:

  1. Find a table.
  2. Reject all horizontal lines that don't span the table's full width and all vertical lines that don't span its full height.
    That way, you discard the horizontal and vertical lines that are part of a merged cell, and instead of getting 2 cells you get a single cell.

If I have misunderstood your requirement, please provide additional information with examples.
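
The two steps above could be sketched roughly as follows. This is a minimal illustration, not a built-in pdfplumber feature: `full_span_lines` and its `tolerance` parameter are hypothetical names, and the helper only assumes that each line is a dict with `x0`, `x1`, `top` and `bottom` keys, as pdfplumber's line objects have.

```python
def full_span_lines(lines, bbox, tolerance=2.0):
    """Keep only lines that cross the whole table.

    lines: iterable of dicts with "x0", "x1", "top", "bottom" keys.
    bbox:  (x0, top, x1, bottom) of the table.
    Returns (verticals, horizontals).
    """
    x0, top, x1, bottom = bbox
    verticals, horizontals = [], []
    for ln in lines:
        width = ln["x1"] - ln["x0"]
        height = ln["bottom"] - ln["top"]
        if height > width:
            # Vertical line: must lie inside the table and span its full height.
            inside = x0 - tolerance <= ln["x0"] <= x1 + tolerance
            spans = ln["top"] <= top + tolerance and ln["bottom"] >= bottom - tolerance
            if inside and spans:
                verticals.append(ln)
        else:
            # Horizontal line: must lie inside the table and span its full width.
            inside = top - tolerance <= ln["top"] <= bottom + tolerance
            spans = ln["x0"] <= x0 + tolerance and ln["x1"] >= x1 - tolerance
            if inside and spans:
                horizontals.append(ln)
    return verticals, horizontals


# Possible usage with pdfplumber (untested sketch; "example.pdf" is a placeholder):
#
# import pdfplumber
# with pdfplumber.open("example.pdf") as pdf:
#     page = pdf.pages[0]
#     table = page.find_tables()[0]
#     v, h = full_span_lines(page.lines, table.bbox)
#     rows = page.extract_table({
#         "vertical_strategy": "explicit",
#         "horizontal_strategy": "explicit",
#         "explicit_vertical_lines": v,
#         "explicit_horizontal_lines": h,
#     })
```

Note that the "explicit" strategies need at least two lines in each direction, and that a PDF which draws a full-height border as many short segments would have every segment rejected by this filter, so the approach may need merging of collinear segments first.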

@Pk13055

Pk13055 commented Nov 20, 2023

@John-Peter-R As far as I have tested, the library in its current state is already able to extract tables with merged cells.

@yoursock

H2_AN202404251631316496_1.pdf
Here's an example. You can use page 8 for a test (screenshot attached).
The table has only 9 columns, but 15 columns were extracted instead. The extracted output:
[[['持股5%以上股东、前10名股东及前10名无限售流通股股东参与转融通业务出借股份情况', None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None], ['股东名称\n(全称)', '', '期初普通账户、信用账户持', None, None, '', '', '期初转融通出借股份', None, None, None, '', '', '期末普通账户、信用账户', None, None, '', '', '期末转融通出借股份', None, None, None, ''], [None, None, '股', None, None, None, None, '且尚未归还', None, None, None, None, None, '持股', None, None, None, None, '且尚未归还', None, None, None, None], [None, '数量合计', None, '', '占总股本的', '', '', '数量', '', '', '占总股本', '', '数量合计', None, '', '占总股本', '', '', '数量', '', '', '占总股本', ''], [None, None, None, None, '比例', None, None, '合计', None, None, '的比例', None, None, None, None, '的比例', None, None, '合计', None, None, '的比例', None], ['华润东阿阿\n胶有限公司', '151,351,731', None, '23.50%', None, None, '0', None, None, '0.00%', None, None, '151,351,731', None, '23.50%', None, None, '0', None, None, '0.00%', None, None], ['香港中央结\n算有限公司', '72,926,439', None, '11.32%', None, None, '0', None, None, '0.00%', None, None, '63,067,676', None, '9.79%', None, None, '0', None, None, '0.00%', None, None], ['华润医药投\n资有限公司', '57,935,116', None, '9.00%', None, None, '0', None, None, '0.00%', None, None, '57,935,116', None, '9.00%', None, None, '0', None, None, '0.00%', None, None], ['中国工商银\n行股份有限\n公司-中欧\n医疗健康混\n合型证券投\n资基金', '11,823,465', None, '1.84%', None, None, '0', None, None, '0.00%', None, None, '21,508,141', None, '3.34%', None, None, '0', None, None, '0.00%', None, None], ['中国建设银\n行股份有限\n公司-工银\n瑞信前沿医\n疗股票型证\n券投资基金', '10,000,022', None, '1.55%', None, None, '0', None, None, '0.00%', None, None, '11,300,020', None, '1.75%', None, None, '0', None, None, '0.00%', None, None], ['中国农业银\n行股份有限\n公司-易方\n达消费行业\n股票型证券\n投资基金', '12,798,173', None, '1.99%', None, None, '0', None, None, '0.00%', None, None, '8,887,373', None, '1.38%', None, None, '0', None, None, '0.00%', None, None], ['张弦', '8,232,033', None, '1.28%', None, None, '0', None, None, '0.00%', None, None, 
'8,232,033', None, '1.28%', None, None, '0', None, None, '0.00%', None, None], ['中国农业银\n行股份有限\n公司-嘉实\n新兴产业股\n票型证券投\n资基金', '7,113,293', None, '1.10%', None, None, '0', None, None, '0.00%', None, None, '7,677,893', None, '1.19%', None, None, '0', None, None, '0.00%', None, None], ['中国农业银\n行股份有限\n公司-嘉实\n核心成长混\n合型证券投\n资基金', '5,824,900', None, '0.90%', None, None, '0', None, None, '0.00%', None, None, '6,206,300', None, '0.96%', None, None, '0', None, None, '0.00%', None, None], ['中国农业银\n行股份有限\n公司-中证\n500交易型开\n放式指数证\n券投资基金', '2,514,400', None, '0.39%', None, None, '730,900', None, None, '0.11%', None, None, '5,124,095', None, '0.80%', None, None, '513,900', None, None, '0.08%', None, None]], [['股东名称'], ['(全称)']]]

@jsvine
Owner

jsvine commented Aug 2, 2024

Hi @yoursock, running page.to_image().debug_tablefinder(...), you'll see that there are some hidden lines in the header:

(image: debug_tablefinder output for the table header)

You can use some of the strategies described here to deal with this issue:
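
The linked strategies are not reproduced in this thread. As a separate, hedged illustration of how one might track down hidden lines, the sketch below counts line objects by drawing style, so that an unusual style (for example a zero width or a white stroke) stands out. `line_style_counts` is a hypothetical helper; it assumes only that line objects are dicts carrying `stroking_color` and `linewidth` keys, as pdfplumber's do.

```python
from collections import Counter


def _hashable(value):
    # Colors may arrive as lists or tuples; normalise them so they can be counted.
    if isinstance(value, (list, tuple)):
        return tuple(_hashable(v) for v in value)
    return value


def line_style_counts(lines):
    """Count lines by (stroking_color, linewidth)."""
    return Counter(
        (_hashable(ln.get("stroking_color")), ln.get("linewidth"))
        for ln in lines
    )


# Possible usage with pdfplumber (untested sketch; the filter rule is an
# assumption and depends on what the counts reveal for a given PDF):
#
# page.to_image(resolution=150).debug_tablefinder().save("debug.png")
# print(line_style_counts(page.lines))
# visible = page.filter(
#     lambda obj: obj["object_type"] != "line" or (obj.get("linewidth") or 0) > 0
# )
# rows = visible.extract_table()
```

If the table is drawn with rectangles rather than lines, the same count can be run on `page.rects`.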
