Regression in tokenizer handling of `\r` #128233

tusharsadhwani · 2024-12-25T00:48:05Z

Bug report

Bug description:

From Python 3.12 onwards, tokenizing a file that contains only '{\r}' produces a single, unexpected '\r}' OP token:

$ printf '{\r}' | python3.11 -m tokenize
1,0-1,1:            OP             '{'            
1,1-1,2:            ERRORTOKEN     '\r'           
1,2-1,3:            OP             '}'            
1,3-1,4:            NEWLINE        ''             
2,0-2,0:            ENDMARKER      ''             

$ printf '{\r}' | python3.12 -m tokenize
1,0-1,1:            OP             '{'            
1,1-1,3:            OP             '\r}'          
1,3-1,4:            NEWLINE        ''             
2,0-2,0:            ENDMARKER      ''
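The same comparison can be run from inside Python rather than through the shell. This is only a minimal sketch: `token_summary` is a hypothetical helper name, and it assumes that `tokenize.generate_tokens` over an `io.StringIO` mirrors what `python -m tokenize` does when reading stdin. The printed tokens depend on the interpreter version, as shown above.

```python
import io
import tokenize
from token import tok_name


def token_summary(source):
    """Return (token type name, token string) pairs for `source`."""
    # StringIO's default newline handling splits lines on '\n' only,
    # so a lone '\r' reaches the tokenizer untranslated.
    readline = io.StringIO(source).readline
    return [(tok_name[tok.type], tok.string)
            for tok in tokenize.generate_tokens(readline)]


# Run under different interpreters to compare the token streams.
for name, string in token_summary("{\r}"):
    print(f"{name:<12}{string!r}")
```

Running this under both 3.11 and 3.12 should make the difference in how the `\r` is grouped easy to diff.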

Weirdly, AST generation passes just fine in both cases:

$ printf '{\r}' | python3.11 -m ast
Module(
   body=[
      Expr(
         value=Dict(keys=[], values=[]))],
   type_ignores=[])

$ printf '{\r}' | python3.12 -m ast
Module(
   body=[
      Expr(
         value=Dict(keys=[], values=[]))],
   type_ignores=[])
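For completeness, the AST check can also be scripted. A small sketch, assuming that `ast.parse` on a `str` behaves the same as piping the bytes into `python -m ast` (the commands above use stdin, not a string):

```python
import ast

# Parse the same source as in the shell commands above.
tree = ast.parse("{\r}")

# The module body should hold a single expression statement wrapping a Dict.
print(ast.dump(tree, indent=3))
```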

Expected behaviour

I'd expect the \r to yield an NL token, followed by a separate } OP token.

CPython versions tested on:

3.11, 3.12, 3.13, 3.14

Operating systems tested on:

macOS

tusharsadhwani · 2024-12-25T15:36:57Z

There's one more interesting case, where the tokenizer seems to treat '\r ' as a non-whitespace token:

$ printf 'foo\n\r \nbar' | python3.11 -m tokenize
1,0-1,3:            NAME           'foo'          
1,3-1,4:            NEWLINE        '\n'           
2,0-2,3:            NL             '\r \n'        
3,0-3,3:            NAME           'bar'          
3,3-3,4:            NEWLINE        ''             
4,0-4,0:            ENDMARKER      ''             

$ printf 'foo\n\r \nbar' | python3.12 -m tokenize
1,0-1,3:            NAME           'foo'          
1,3-1,4:            NEWLINE        '\n'           
2,0-2,2:            OP             '\r '          
2,2-2,3:            NEWLINE        '\n'           
3,0-3,3:            NAME           'bar'          
3,3-3,4:            NEWLINE        ''             
4,0-4,0:            ENDMARKER      ''

I would have expected the 2,2-2,3: NEWLINE '\n' token in Python 3.12 to be NL instead, since that newline ends a whitespace-only line and carries no semantic meaning. Python 3.11 correctly categorizes it as NL.
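To look at just the second physical line in isolation, something like the following could be used. It is a sketch only: `tokens_on_line` is a made-up helper, and it again assumes `tokenize.generate_tokens` matches the `-m tokenize` behaviour shown above.

```python
import io
import tokenize
from token import tok_name


def tokens_on_line(source, lineno):
    """Return (type name, string) pairs for tokens starting on `lineno`."""
    readline = io.StringIO(source).readline
    return [(tok_name[tok.type], tok.string)
            for tok in tokenize.generate_tokens(readline)
            if tok.start[0] == lineno]


# Line 2 of the input is '\r \n'; the token types reported for it
# are what differs between interpreter versions.
print(tokens_on_line("foo\n\r \nbar", 2))
```

For an ordinary blank line, such as line 2 of 'foo\n\nbar\n', the helper reports a single NL token, which is the classification expected here as well.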

tusharsadhwani added the type-bug label Dec 25, 2024

tusharsadhwani mentioned this issue Dec 25, 2024

Black crashes on files containing \r, from e.g. old MacOS psf/black#3700

Open

picnixz added the topic-parser label Dec 25, 2024
