Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: support python3.12 f-string tokenizing #213

Merged
merged 8 commits into from
Jan 9, 2024

Conversation

mbhall88
Copy link
Member

@mbhall88 mbhall88 commented Jan 7, 2024

Python 3.12 now formalises f-strings by parsing them as their own entities, rather than plain strings. This creates some problems for us as where a whole f-string would produce a single tokenize.STRING, it now breaks f-strings up into bits and adds the new tokens FSTRING_START, FSTRING_MIDDLE and FSTRING_END.

The fixes in this PR relate to issues people have already come across. I suspect there will be more. My solution is somewhat hacky in places but it was the best I could do given the tokens the tokenize library is giving us. #207 is particularly problematic as the tokenizer is the one that removes the extra braces, hence why I am manually adding them back. Fingers crossed this doesn't raise further issues.

I have also added python 3.12 to the CI so we are testing across <3.12 and >=3.12

Closes #207 and closes #210

@mbhall88 mbhall88 requested a review from bricoletc January 7, 2024 12:12
Copy link

codecov bot commented Jan 7, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (a80fa26) 97.73% compared to head (62bca59) 97.76%.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #213      +/-   ##
==========================================
+ Coverage   97.73%   97.76%   +0.02%     
==========================================
  Files          12       12              
  Lines        1017     1028      +11     
  Branches      223      228       +5     
==========================================
+ Hits          994     1005      +11     
  Misses         12       12              
  Partials       11       11              
Flag Coverage Δ
unittests 97.66% <90.90%> (-0.08%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
snakefmt/parser/syntax.py 98.91% <100.00%> (+0.04%) ⬆️

Copy link
Collaborator

@bricoletc bricoletc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing these v. important bugs @mbhall88
A bit unfortunate for us that python 3.12 is changing f-string tokenisation

I'm just going to try to summarise in words what I understand from this PR but also what I don't understand from snakefmt (:cry:) and what I think we might need to do specifically here:

  • You're adding 'space addition' through function add_token_space by registering 'FSTRING_START' to the spacing_triggers dict; why can't we run f-strings through black again?
  • If we can run f-strings through black, can't we try and run process_token from encountering 'FSTRING_START' to encountering 'FSTRING_END' as per the new f-string token sequence ? I'm thinking of something like lines 394 and 396, where if we're in_brackets or in_lambda we keep parsing, instead of triggering snakefmt parameter resolution. Here we'd similarly ask if we're in_fstring
    BUT it seems that you're having to introduce parse 'FSTRING_MID's individually (lines 335-341) and add braces because they get lost?

So in fact IIUC, tokenizer gets rid of '{' and '}' inside f-strings, so we have to re-add them manually? And thus, we can't just try and 'parse through' an f-string from 'FSTRING_START' to 'FSTRING_END' because that way we lose '{' and '}' in f-strings?

snakefmt/parser/syntax.py Show resolved Hide resolved
snakefmt/parser/syntax.py Show resolved Hide resolved
snakefmt/parser/syntax.py Show resolved Hide resolved
@bricoletc
Copy link
Collaborator

Second comment: echoing my review of #211, do you want to have a chat about a formal lexer/parser for snakefmt?

@mbhall88 mbhall88 mentioned this pull request Jan 9, 2024
@mbhall88
Copy link
Member Author

mbhall88 commented Jan 9, 2024

Thanks for fixing these v. important bugs @mbhall88 A bit unfortunate for us that python 3.12 is changing f-string tokenisation

I'm just going to try to summarise in words what I understand from this PR but also what I don't understand from snakefmt (:cry:) and what I think we might need to do specifically here:

* You're adding 'space addition' through function `add_token_space` by registering 'FSTRING_START' to the `spacing_triggers` dict; why can't we run f-strings through `black` again?

We can run them through black, but they get mangled by snakefmt's parser now before they go into black sadly.

* If we **can** run f-strings through black, can't we try and run `process_token` from encountering 'FSTRING_START' to encountering 'FSTRING_END' as per the new f-string [token sequence](https://peps.python.org/pep-0701/#new-tokens) ?  I'm thinking of something like lines 394 and 396, where if we're `in_brackets` or `in_lambda` we keep parsing, instead of triggering snakefmt parameter resolution. Here we'd similarly ask if we're `in_fstring`
  BUT it seems that you're having to introduce parse 'FSTRING_MID's individually (lines 335-341) and add braces because they get lost?

That is an interesting idea. Not sure I completely understand how we would do this though as I think this section of the code was your creation and I'm not as familiar with it. Feel free to have a go at adding an in_fstring variant.

Yes, I had to introduce them as tokenize automatically removes them, so we only get a single brace when doubles are used. They get returned in their surrounding strings too, not even as a single token which made it slightly more tricky.

So in fact IIUC, tokenizer gets rid of '{' and '}' inside f-strings, so we have to re-add them manually? And thus, we can't just try and 'parse through' an f-string from 'FSTRING_START' to 'FSTRING_END' because that way we lose '{' and '}' in f-strings?

Essentially yes, this is the problem I ran into. Another option I had thought of was once we hit FSTRING_START we greedily take tokens until FSTRING_END and we use the positions of those tokens in the original string (I assume we get that?) to pull out the whole f-string and make a STRING token out of it. However, f-strings have added functionality in python3.12 now that is compatible with 3.11 - though I'm not sure if this would be a problem with passing the string to black or not...

@mbhall88
Copy link
Member Author

mbhall88 commented Jan 9, 2024

Second comment: echoing my review of #211, do you want to have a chat about a formal lexer/parser for snakefmt?

See my message on slack

@bricoletc
Copy link
Collaborator

Replying to your reply #213 (comment), I think the solution you've implemented is currently the best - it's actually v. succinct
And I'm not a fan, in general, of extracting chunks of the original text using pointers stored in the tokens - we've avoided doing this so far IIRC
And my idea of trying to parse through the whole f-string and passing it to black will not work because of the loss of the double curly brackets; incidentally, why do they now do that?? It's a pain for us

So this is good to merge, thanks @mbhall88 !

snakefmt/parser/syntax.py Show resolved Hide resolved
@mbhall88 mbhall88 merged commit 4932076 into snakemake:master Jan 9, 2024
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants