Freeze/bad performance on escaped quotation in f-string #4520
The hang happens inside tokenization, at this line. At that point ...

Why I think it catastrophically backtracks: since the regex is PCRE2 compatible, I can use regex101's debugger. It looks like the issue is that the first repeated group ... That regex got into ... At that point ...
Notably, this is commented with ... It looks like single-quoted f-strings don't add their version of the regex to ...
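The specific pattern from Black's tokenizer is not quoted in this thread, but the failure mode described above can be sketched with a stand-in regex (an assumption for illustration, not the actual tokenizer pattern): when the alternatives inside a repeated group overlap on escaped quotes, every `\"` can be consumed in two different ways, and a failing match forces the engine to try all combinations.

```python
import re

# Stand-in pattern, assumed for illustration only -- NOT the regex from
# Black's tokenizer. Inside the repeated group, a backslash can be matched
# either by [^'\n] (one character at a time) or by \\. (as an escape pair),
# so an input with n escaped quotes has roughly 2**n possible splits.
PATTERN = re.compile(r"f'(?:[^'\n]|\\.)*'")

# An unterminated f-string body full of escaped quotes: the match must fail,
# but only after the engine has backtracked through every split. Each extra
# \" roughly doubles the running time.
text = "f'" + '\\"' * 24
print(PATTERN.match(text))  # None, after a noticeable pause
```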
Thanks for the analysis @MeGaGiGaGon! Also cc @tusharsadhwani.
Thanks for the analysis! I'm guessing it is a backtracking bug as well. I'm hoping to reproduce the slowness in regex101, and look for a workaround that doesn't break anything.
Reproduced: https://regex101.com/r/tmjlX1/1

Adding any more ...
https://regex101.com/r/H8yqIO/1/debugger

And this debugger view shows what's going on: for every combination of the number of ...
I don't know why regex101 is weird about it, but if you put ...
@JelleZijlstra while I'm debugging this, I do believe the tokenizer might be in need of a rewrite without using any regex, either in Python itself (if you think that won't be too slow) or in another language (which could also bring a nice perf boost).

I have already written a tokenizer in Zig that is much faster, passes all of Black's tests, and can be turned into a Python package. Would you consider adopting it into Black? Or if not, would porting the exact same logic (it iterates over each character with almost no backtracking) into Python be too slow?
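For a sense of what "iterating over each character with almost no backtracking" means in practice, here is a minimal hand-rolled sketch of scanning a quoted string body (an illustration only, not the Zig tokenizer's actual logic):

```python
def scan_string(source: str, start: int) -> int:
    """Return the index just past the closing quote of the string literal
    beginning at source[start]. Every character is visited exactly once,
    so escaped quotes cost O(n) instead of triggering backtracking."""
    quote = source[start]
    i = start + 1
    while i < len(source):
        ch = source[i]
        if ch == "\\":        # an escape always consumes the next character too
            i += 2
        elif ch == quote:     # found the closing quote
            return i + 1
        else:
            i += 1
    raise SyntaxError("unterminated string literal")

# Thousands of escaped quotes are handled in linear time:
body = "'" + '\\"' * 10_000 + "'"
assert scan_string(body, 0) == len(body)
```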
Thanks for looking into this! If it works well, I'd consider switching to your tokenizer, but dependencies that need compilation can make it harder for users to use Black. In the past, that was the motivation for us to switch from ...

I'm also not convinced this issue is a reason to throw out the entire current tokenizer. It seems it should be possible to write a local fix for this issue.
Correct. I'm doing that today hopefully.
That's correct. An alternative would be to do a naive Python port of the logic, making it much simpler but without using any regex at all. Have you looked into that at all? Do you think it'll be a major performance hit?
Understood the backtracking problem. Compare these two: https://regex101.com/r/SPaNxh/1 vs. https://regex101.com/r/o24Ekt/1

The first one is linear time, the second one is exponential in the number of ...
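The two patterns behind those regex101 links aren't quoted in this thread; the comparison can be reproduced locally with stand-in patterns of the same general shape (assumed for illustration): the linear one keeps the alternatives disjoint, the exponential one lets them overlap on the backslash.

```python
import re
import time

# Assumed stand-in patterns, not the exact ones from the regex101 links.
# Both try to match a quoted body containing escapes.
linear = re.compile(r"'(?:[^'\\\n]|\\.)*'")     # backslash only matched by \\.
exponential = re.compile(r"'(?:[^'\n]|\\.)*'")  # backslash matched two ways

# Deliberately unterminated, so both matches fail; only the overlapping
# pattern has to backtrack through every possible split.
text = "'" + '\\"' * 22

for name, pattern in (("linear", linear), ("exponential", exponential)):
    t0 = time.perf_counter()
    pattern.match(text)
    print(f"{name}: {time.perf_counter() - t0:.4f}s")
```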
I haven't tried that. I don't have a good intuition as to what the performance impact would be.
Alright. I'll do a simple port and benchmark it over the weekend.
The plan for the fix right now is: ...
Okay, I tried this, and the states are too complicated and hard to work with right now.

In other news, my Python 3.12 tokenizer rewrite is currently passing >99.5% of Black's test suite, so that's probably good? I tried running my tokenizer on every single Python file in the dependency tree of one of my projects, and it accurately tokenized 98% of the files (3849 / 3902). And this will probably be zero dependencies and only some 400 lines of pure Python code.
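As a rough idea of how such a check over a dependency tree can be run, here is a hypothetical harness; the stdlib tokenize module stands in for the experimental tokenizer, and the `.venv/lib` path is an assumption:

```python
import pathlib
import tokenize
from io import StringIO

# Hypothetical harness: walk every .py file in an environment, attempt a full
# tokenization of each, and count successes. A real comparison would swap the
# stdlib call for the new tokenizer and diff the token streams.
ok = failed = 0
for path in pathlib.Path(".venv/lib").rglob("*.py"):
    source = path.read_text(encoding="utf-8", errors="replace")
    try:
        list(tokenize.generate_tokens(StringIO(source).readline))
        ok += 1
    except (tokenize.TokenError, SyntaxError):
        failed += 1

print(f"tokenized {ok} / {ok + failed} files")
```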
I have put up a rough version here: https://github.com/tusharsadhwani/pytokens

This is a quick port from Zig to Python, and I may have made small errors. I'll check tomorrow and try to throw it into Black.
Describe the bug
black hangs (or takes an extremely long time) when processing a file with a multiline f-string containing a bunch of escaped quotation marks.
To Reproduce
Run black against this file:
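The attached reproduction file isn't included in this extract; based on the description, an input of roughly this shape (an assumed stand-in, not the original attachment) is the kind of file being reported: a multiline f-string whose body contains many escaped quotation marks.

```python
# repro.py -- assumed stand-in for the original attachment: a multiline
# f-string containing a large number of escaped quotation marks.
text = f"""
result = {1 + 1}
\"alpha\" \"beta\" \"gamma\" \"alpha\" \"beta\" \"gamma\" \"alpha\" \"beta\"
\"alpha\" \"beta\" \"gamma\" \"alpha\" \"beta\" \"gamma\" \"alpha\" \"beta\"
\"alpha\" \"beta\" \"gamma\" \"alpha\" \"beta\" \"gamma\" \"alpha\" \"beta\"
"""
```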
black will just hang without producing any output.
Expected behavior
Should run to completion within at most 1 second.
Environment
Additional context
Extracting the quotes into a variable seems to circumvent the issue:
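The workaround snippet itself isn't shown above; a sketch of the idea (assumed shape, not the reporter's exact code) is to build the run of escaped quotes in a plain string first and interpolate it into the f-string:

```python
# Assumed sketch of the workaround: the escaped quotes live in an ordinary
# string variable, so the f-string body itself stays free of them.
quotes = '\"alpha\" \"beta\" \"gamma\" ' * 8
text = f"""
result = {1 + 1}
{quotes}
"""
```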