JSON tokenizer memory optimizations #16978

Open
wants to merge 8 commits into
base: branch-24.12

Conversation

shrshi
Contributor

@shrshi shrshi commented Oct 2, 2024

Description

Both the full pushdown automaton that tokenizes the input JSON string and the bracket-brace FST over-estimate the total buffer size required for the translated output and indices. This PR splits the transduce call for each FST into two invocations. The first invocation estimates the size of the translated buffer and the translated indices, and the second invocation performs the DFA run and writes into buffers of that size.
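
To illustrate the count-then-write pattern described above, here is a minimal sketch in plain host-side C++. It is not the libcudf FST API: `toy_transduce`, `two_pass_transduce`, and the `transduced` struct are hypothetical names, and the toy transducer (which emits brackets and braces found outside quoted strings, plus their input offsets) only stands in for the real GPU transducers. The first call passes null outputs and only counts; the second call writes into buffers sized from that count, rather than buffers sized to the full input length.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical toy transducer (an assumption for illustration, not libcudf code).
// Emits every bracket or brace that appears outside a quoted string, together
// with its offset in the input. If an output pointer is null, nothing is
// written through it, so a call with both pointers null only counts.
std::size_t toy_transduce(const std::string& input,
                          char* out_symbols,
                          std::uint32_t* out_indices)
{
  std::size_t count = 0;
  bool in_string    = false;
  bool escaped      = false;
  for (std::size_t i = 0; i < input.size(); ++i) {
    const char c = input[i];
    if (in_string) {
      if (escaped) {
        escaped = false;  // character after a backslash is consumed as-is
      } else if (c == '\\') {
        escaped = true;
      } else if (c == '"') {
        in_string = false;
      }
      continue;
    }
    if (c == '"') {
      in_string = true;
      continue;
    }
    if (c == '{' || c == '}' || c == '[' || c == ']') {
      if (out_symbols != nullptr) { out_symbols[count] = c; }
      if (out_indices != nullptr) { out_indices[count] = static_cast<std::uint32_t>(i); }
      ++count;
    }
  }
  return count;
}

// Hypothetical result holder for the translated symbols and their indices.
struct transduced {
  std::vector<char> symbols;
  std::vector<std::uint32_t> indices;
};

transduced two_pass_transduce(const std::string& input)
{
  // Pass 1: count the emitted symbols without writing anything.
  const std::size_t n = toy_transduce(input, nullptr, nullptr);

  // Allocate exactly n elements instead of input.size() elements.
  transduced out{std::vector<char>(n), std::vector<std::uint32_t>(n)};

  // Pass 2: run the transducer again, this time writing the output.
  toy_transduce(input, out.symbols.data(), out.indices.data());
  return out;
}
```

The trade-off in this sketch is that the input is scanned twice in exchange for output buffers that match the actual output size. The benchmark figures later in this thread suggest the same trade pays off for the real tokenizer.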

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Oct 2, 2024
@shrshi shrshi added cuIO cuIO issue Performance Performance related issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Oct 5, 2024
@shrshi
Contributor Author

shrshi commented Oct 5, 2024

Benchmark used for profiling study: https://github.com/karthikeyann/cudf/blob/enh-profile_memusage_json/wm_benchmark.py

Profiles before and after optimization:
[Two profiler screenshots, before and after the optimization; images not reproduced here]

Peak memory usage drops from 20.8 GiB to 10.3 GiB (roughly half), and the runtime of get_token_stream drops from 1.028 s to 825.578 ms (about 20% lower).

@shrshi shrshi marked this pull request as ready for review October 5, 2024 01:12
@shrshi shrshi requested a review from a team as a code owner October 5, 2024 01:12
Labels
cuIO cuIO issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Performance Performance related issue
2 participants