Regression in tokenizer handling of \r #128233

Open
tusharsadhwani opened this issue Dec 25, 2024 · 1 comment
Labels
topic-parser type-bug An unexpected behavior, bug, or error

Comments


tusharsadhwani commented Dec 25, 2024

Bug report

Bug description:

From Python 3.12 onwards, tokenizing a file that contains just '{\r}' produces a single odd '\r}' OP token, where Python 3.11 produced a separate ERRORTOKEN for the '\r':

$ printf '{\r}' | python3.11 -m tokenize
1,0-1,1:            OP             '{'            
1,1-1,2:            ERRORTOKEN     '\r'           
1,2-1,3:            OP             '}'            
1,3-1,4:            NEWLINE        ''             
2,0-2,0:            ENDMARKER      ''             

$ printf '{\r}' | python3.12 -m tokenize
1,0-1,1:            OP             '{'            
1,1-1,3:            OP             '\r}'          
1,3-1,4:            NEWLINE        ''             
2,0-2,0:            ENDMARKER      ''   
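The same comparison can be run from Python rather than the shell. This is a minimal sketch using the standard `tokenize.generate_tokens` API; the helper name `token_summary` is made up for illustration, and the output depends on the interpreter version, so run it under each version you want to compare.

```python
import io
import tokenize


def token_summary(source: str) -> list[tuple[str, str]]:
    """Return (token type name, token string) pairs for a source string."""
    # io.StringIO defaults to newline="\n", so a lone '\r' is neither
    # translated nor treated as a line terminator by readline().
    readline = io.StringIO(source).readline
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(readline)
    ]


# Print the tokens for the input from the report; results vary by version.
for name, string in token_summary("{\r}"):
    print(f"{name:<12} {string!r}")
```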

Weirdly, AST generation passes just fine in both cases:

$ printf '{\r}' | python3.11 -m ast
Module(
   body=[
      Expr(
         value=Dict(keys=[], values=[]))],
   type_ignores=[])

$ printf '{\r}' | python3.12 -m ast
Module(
   body=[
      Expr(
         value=Dict(keys=[], values=[]))],
   type_ignores=[])
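The AST check can also be scripted. The sketch below assumes that passing the raw bytes to `ast.parse` is a fair stand-in for piping them into `python -m ast`; the helper name `parse_first_expr` is hypothetical.

```python
import ast


def parse_first_expr(source: bytes) -> ast.expr:
    """Parse source and return the value of its first expression statement."""
    tree = ast.parse(source)
    stmt = tree.body[0]
    if not isinstance(stmt, ast.Expr):
        raise TypeError(f"expected an expression statement, got {type(stmt).__name__}")
    return stmt.value


# Same input as in the report: an opening brace, a lone '\r', a closing brace.
node = parse_first_expr(b"{\r}")
print(ast.dump(node))
```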

Expected behaviour

I'd expect the \r to yield an NL token instead, followed by a separate } OP token.

CPython versions tested on:

3.11, 3.12, 3.13, 3.14

Operating systems tested on:

macOS

@tusharsadhwani
Contributor Author

There's one more interesting case, where the tokenizer seems to think that '\r ' is a non-whitespace token:

$ printf 'foo\n\r \nbar' | python3.11 -m tokenize
1,0-1,3:            NAME           'foo'          
1,3-1,4:            NEWLINE        '\n'           
2,0-2,3:            NL             '\r \n'        
3,0-3,3:            NAME           'bar'          
3,3-3,4:            NEWLINE        ''             
4,0-4,0:            ENDMARKER      ''             

$ printf 'foo\n\r \nbar' | python3.12 -m tokenize                  
1,0-1,3:            NAME           'foo'          
1,3-1,4:            NEWLINE        '\n'           
2,0-2,2:            OP             '\r '          
2,2-2,3:            NEWLINE        '\n'           
3,0-3,3:            NAME           'bar'          
3,3-3,4:            NEWLINE        ''             
4,0-4,0:            ENDMARKER      ''    

I would have expected the 2,2-2,3: NEWLINE '\n' token in Python 3.12 to be NL instead, as that newline carries no semantic meaning. Python 3.11 categorizes it correctly as NL.
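To check how a given interpreter classifies the tokens on that second line, something like the following could be used. It is only a sketch: `tokens_on_row` is a hypothetical helper, and it reports whatever the running version produces rather than asserting a particular result.

```python
import io
import tokenize


def tokens_on_row(source: str, row: int) -> list[tuple[str, str]]:
    """Return (type name, string) for every token that starts on the given row."""
    readline = io.StringIO(source).readline
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(readline)
        if tok.start[0] == row
    ]


# Row 2 is the line made up of '\r', a space, and '\n'.
for name, string in tokens_on_row("foo\n\r \nbar", 2):
    print(f"{name:<12} {string!r}")
```

If the last token printed is NEWLINE rather than NL, the interpreter shows the behaviour described above.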
