
Functions that can be multi-threaded - Enhancement to documentation #995

Open
sandzone opened this issue Sep 22, 2023 · 5 comments
Labels
feature-request All feature requests receive this label initially, can be upgraded to "enhancement"

Comments

@sandzone

With reference to #91

Is extract_tables the only function with this issue?

I am using multiprocessing with extract_words and haven't faced this issue so far. I wonder if this is just luck, or if extract_words doesn't depend on the document-wide ._tokens state that @jsvine mentioned in #91.

It would be very helpful if this aspect were mentioned in the documentation.

@sandzone sandzone added the feature-request All feature requests receive this label initially, can be upgraded to "enhancement" label Sep 22, 2023
@jsvine
Owner

jsvine commented Sep 22, 2023

Interesting. My best guess is "just luck," since they use the same underlying PDF-parsing process.

@Pk13055

Pk13055 commented Nov 11, 2023

I was able to use multi-threading with no problem :) You need to use ThreadPoolExecutor instead of the lower-level threading.Thread.

@jsvine
Owner

jsvine commented Nov 13, 2023

Thanks for the note, @Pk13055! Are you able to share some code that demonstrates your approach?

@Pk13055

Pk13055 commented Nov 16, 2023

Here's a small example I put together. It may not run off the bat, but should provide a general idea:

```python
from asyncio import ensure_future, gather, run

import pdfplumber


async def process_page(page):
    processed = page.extract_tables()
    # do other stuff with page
    return processed


async def main():
    with pdfplumber.open("test.pdf") as pdf:
        futures = []
        for page in pdf.pages:
            futures.append(ensure_future(process_page(page)))
        await gather(*futures)


if __name__ == "__main__":
    run(main())
```

I found this approach to be much faster than using a ThreadPoolExecutor, but here's an example anyway:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import pdfplumber


def process_page(page):
    processed = page.extract_tables()
    # do other stuff with page
    return processed


def main():
    with pdfplumber.open("test.pdf") as pdf:
        futures = []
        with ThreadPoolExecutor() as executor:
            for page in pdf.pages:
                futures.append(executor.submit(process_page, page))

        for res in as_completed(futures):
            processed = res.result()
            # do something with processed


if __name__ == "__main__":
    main()
```

@jsvine
Owner

jsvine commented Nov 17, 2023

Thanks! @sandzone: Does @Pk13055's approach work for you?
