original_path extraction error regarding LTCurve #1057

Open
KaboChow opened this issue Dec 8, 2023 · 8 comments
Labels
feature-request All feature requests receive this label initially, can be upgraded to "enhancement"

Comments

@KaboChow

KaboChow commented Dec 8, 2023

While extracting shape data from a PDF, I ran a test in which I converted a text letter 'o' into a shape object.

image
Here is the curve data I obtained.

image
Normally, there should only be one set of curve data.
However, it seems that there are two in this case. Here is the graphic created on the canvas using the obtained data:

image
The filling color obtained for the second set of curve data is incorrect.

This is the PDF I conducted the test on:
LTCurve.pdf

Is there any way to resolve this?
Thank you very much.

@KaboChow KaboChow added the feature-request All feature requests receive this label initially, can be upgraded to "enhancement" label Dec 8, 2023
@jsvine
Owner

jsvine commented Dec 21, 2023

Hi @KaboChow, and thanks for providing this interesting example. It appears to relate to pdfplumber's main dependency, pdfminer.six.

It seems that there's some discussion of this general issue here: pdfminer/pdfminer.six#861 (comment)

As it happens, however, the piece of pdfminer.six code it likely relates to is code I've contributed. Just brainstorming here, I think the issue is that folks generally want to decompose paths with multiple subpaths, for the purpose of rectangle detection. (See this test for an example.) As the issue comments above correctly point out, this makes it difficult/impossible to correctly handle more complex paths, such as shapes with holes in them.

One solution would be to propose reverting the behavior so that it does not decompose complex paths, with the downside being that some clearly rectangle-like things do not get recognized as such.

Another would be to tweak the behavior so that it mostly does not decompose complex paths, except for those composed entirely of rectangles. The downside would be that this may be a confusing rule, and also that some all-rectangle complex paths are still intended to be understood as shapes with holes in them.
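To make the second option concrete, here is a minimal sketch of such a rule. It is not pdfminer.six code: `is_axis_aligned_rect` and `should_decompose` are hypothetical names, and each subpath is assumed to be already reduced to a plain list of `(x, y)` corner points.

```python
def is_axis_aligned_rect(points):
    """Return True if the points trace an axis-aligned rectangle."""
    # Drop an explicit closing point if the subpath repeats its start.
    if len(points) == 5 and points[0] == points[4]:
        points = points[:4]
    if len(points) != 4:
        return False
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = points
    # The first edge is either horizontal or vertical; check both orders.
    horizontal_first = y0 == y1 and x1 == x2 and y2 == y3 and x3 == x0
    vertical_first = x0 == x1 and y1 == y2 and x2 == x3 and y3 == y0
    return horizontal_first or vertical_first


def should_decompose(subpaths):
    """Hypothetical rule: split a complex path only if every subpath is a rectangle."""
    return len(subpaths) > 1 and all(is_axis_aligned_rect(s) for s in subpaths)


rect_a = [(0, 0), (10, 0), (10, 5), (0, 5)]
rect_b = [(20, 0), (20, 5), (30, 5), (30, 0), (20, 0)]
triangle = [(0, 0), (4, 0), (2, 3)]

print(should_decompose([rect_a, rect_b]))    # both rectangles: decompose
print(should_decompose([rect_a, triangle]))  # mixed: keep as one path
```

The sketch also shows the downside noted above: a rectangular frame (a small rectangle inside a larger one) would pass this check and still be split, even though it is meant as a shape with a hole.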

Thanks again. Will keep thinking on this, and welcome suggestions from others, too.

@KaboChow
Author

@jsvine Thank you for your answer.
As a workaround, I post-process the extracted data: when the 'evenodd' value of two objects is false, I check whether the boundaries of the two objects coincide. If they do, the smaller one is the subpath. This method works for me, and I hope it helps others who run into the same confusion.
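A minimal sketch of this kind of post-processing, under some assumptions: the curves are pdfplumber-style dicts with `x0`, `top`, `x1`, `bottom`, and `evenodd` keys, and "boundaries coincide" is interpreted here as one bounding box lying inside the other. The helper names and the `tol` parameter are hypothetical, not part of pdfplumber.

```python
def bbox(obj):
    return (obj["x0"], obj["top"], obj["x1"], obj["bottom"])


def area(obj):
    x0, top, x1, bottom = bbox(obj)
    return (x1 - x0) * (bottom - top)


def contains(outer, inner, tol=0.01):
    """True if inner's bounding box lies within outer's, allowing a small tolerance."""
    ox0, otop, ox1, obottom = bbox(outer)
    ix0, itop, ix1, ibottom = bbox(inner)
    return (
        ox0 - tol <= ix0
        and otop - tol <= itop
        and ix1 <= ox1 + tol
        and ibottom <= obottom + tol
    )


def find_holes(curves):
    """Return (outer_index, inner_index) pairs where the smaller curve is treated as a subpath."""
    pairs = []
    for i, inner in enumerate(curves):
        for j, outer in enumerate(curves):
            if i == j:
                continue
            # Only consider pairs where both objects report evenodd as false.
            if inner.get("evenodd") or outer.get("evenodd"):
                continue
            if area(inner) < area(outer) and contains(outer, inner):
                pairs.append((j, i))
    return pairs


curves = [
    {"x0": 0, "top": 0, "x1": 10, "bottom": 10, "evenodd": False},   # outer ring of the 'o'
    {"x0": 2, "top": 2, "x1": 8, "bottom": 8, "evenodd": False},     # inner ring (the hole)
    {"x0": 20, "top": 20, "x1": 30, "bottom": 30, "evenodd": False}, # unrelated shape
]
print(find_holes(curves))
```

Bounding boxes are only a heuristic: a small shape drawn on top of a larger one would also be flagged, so this fits cases like converted letters better than arbitrary artwork.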

@KaboChow
Author

Hello @jsvine, I found a problem regarding the 'evenodd' value of the object.
image
When I try to get the data of this porous shape, the 'evenodd' values obtained are all true.
image
This is the PDF I used for testing:
Spin-City-Letters-6fae9bb1b9a6b3dd0f5811b066e9ed8e.pdf

@KaboChow
Author

When I convert letters or numbers to shapes, the recognized data is correct, and the value of 'evenodd' is false.
But when I use a custom shape, the recognized values of 'evenodd' are all true.

image

@jsvine
Owner

jsvine commented Jul 14, 2024

Thank you for these additional examples, @KaboChow. I'm still unsure of the best solution, given the tradeoffs described above and that any changes would have to be made to pdfminer.six, but these examples are still helpful.

@KaboChow
Author

Thank you, @jsvine. The incorrect 'evenodd' value has a big impact on my project, and I have recently been looking for ways to solve it. If a new solution comes up, please let me know.

@dhdaines
Contributor

dhdaines commented Jul 31, 2024

See discussion at pdfminer above. The issue is that pdfminer doesn't apply any fill rules in layout analysis. Ideally, you should be looking at the fill attribute, not the evenodd one, but it isn't getting set usefully, because pdfminer isn't applying fill rules. As a workaround obviously you could apply the fill rule yourself ;-)
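A rough sketch of what applying the fill rule yourself could look like, assuming each subpath has already been flattened into a polygon given as a list of `(x, y)` points (curved segments would need to be approximated first). The function names are hypothetical; the nonzero winding and even-odd rules themselves are the two fill rules defined by PDF.

```python
def is_left(a, b, p):
    """Positive if p is left of the directed line a->b, negative if right."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (p[0] - a[0]) * (b[1] - a[1])


def winding_number(point, polygon):
    """Signed number of times the polygon winds around the point."""
    wn = 0
    n = len(polygon)
    for i in range(n):
        a, b = polygon[i], polygon[(i + 1) % n]
        if a[1] <= point[1]:
            if b[1] > point[1] and is_left(a, b, point) > 0:
                wn += 1  # upward crossing with the point on the left
        elif b[1] <= point[1] and is_left(a, b, point) < 0:
            wn -= 1  # downward crossing with the point on the right
    return wn


def is_filled(point, subpaths, evenodd):
    """Apply the even-odd or nonzero winding rule across all subpaths of one path."""
    total = sum(winding_number(point, sp) for sp in subpaths)
    # The parity of the winding number equals the parity of the crossing count.
    return total % 2 == 1 if evenodd else total != 0


outer = [(0, 0), (10, 0), (10, 10), (0, 10)]
inner_same = [(3, 3), (7, 3), (7, 7), (3, 7)]   # same direction as outer
inner_reversed = list(reversed(inner_same))     # opposite direction

center = (5, 5)
print(is_filled(center, [outer, inner_reversed], evenodd=False))  # hole under nonzero
print(is_filled(center, [outer, inner_same], evenodd=False))      # filled under nonzero
print(is_filled(center, [outer, inner_same], evenodd=True))       # hole under even-odd
```

This only gives the right answer when all subpaths of the original path are tested together, which is why decomposing the path before the fill rule is applied loses the information about holes.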

@KaboChow
Author

KaboChow commented Aug 1, 2024

You're right, @dhdaines: the "fill" property is a good way to determine whether a subpath is a hole. However, the "fill" value of each LTCurve child object that is split out is currently inherited from the parent object, so it is incorrect for the child. I was not able to solve this myself, so in the end I chose to remove the rule that splits the LTCurve shape.


3 participants