
Functions that can be multi-threaded - Enhancement to documentation #995

Open
sandzone opened this issue Sep 22, 2023 · 5 comments
Labels
feature-request All feature requests receive this label initially, can be upgraded to "enhancement"

Comments

@sandzone

With reference to #91

Is extract_tables the only function with this issue?

I am using multiprocessing with extract_words and haven't faced this issue so far. I wonder if this is just luck, or if extract_words doesn't depend on the document-wide ._tokens state that @jsvine mentioned in #91.

It would be very helpful if this aspect were mentioned in the documentation.

@sandzone sandzone added the feature-request All feature requests receive this label initially, can be upgraded to "enhancement" label Sep 22, 2023
@jsvine
Owner

jsvine commented Sep 22, 2023

Interesting. My best guess is "just luck," since they use the same underlying PDF-parsing process.

@Pk13055

Pk13055 commented Nov 11, 2023

I was able to use multi-threading with no problem :) You need to use ThreadPoolExecutor instead of the lower-level threading.Thread.

@jsvine
Owner

jsvine commented Nov 13, 2023

Thanks for the note, @Pk13055! Are you able to share some code that demonstrates your approach?

@Pk13055

Pk13055 commented Nov 16, 2023

Here's a small example I put together. It may not run off the bat, but should provide a general idea:

```python
from asyncio import ensure_future, gather, run

import pdfplumber


async def process_page(page):
    processed = page.extract_tables()
    # do other stuff with page
    return processed


async def main():
    with pdfplumber.open("test.pdf") as pdf:
        futures = []
        for page in pdf.pages:
            futures.append(ensure_future(process_page(page)))
        await gather(*futures)


if __name__ == "__main__":
    run(main())
```

I found this approach to be much faster than using a ThreadPoolExecutor, but here's an example anyway:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import pdfplumber


def process_page(page):
    processed = page.extract_tables()
    # do other stuff with page
    return processed


def main():
    with pdfplumber.open("test.pdf") as pdf:
        futures = []
        with ThreadPoolExecutor() as executor:
            for page in pdf.pages:
                futures.append(executor.submit(process_page, page))

        for res in as_completed(futures):
            processed = res.result()
            # do something with processed


if __name__ == "__main__":
    main()
```

@jsvine
Owner

jsvine commented Nov 17, 2023

Thanks! @sandzone: Does @Pk13055's approach work for you?
