Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix flow-remove-types for input with multi-byte characters #7781

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

mourner
Copy link

@mourner mourner commented Jun 3, 2019

Closes #7779 by implementing the approach described by @mroch in #7779 (comment).

Remaining work:

  • Try eliminating all string manipulation and buffer.write calls in favor of purely byte-based operations for performance. Not worth it — bottleneck is in parsing.
  • Benchmark the change to make sure performance didn't regress.
  • Address any review suggestions.
  • Sign the CLA.

@facebook-github-bot
Copy link
Contributor

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please sign up at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need the corporate CLA signed.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@facebook-github-bot
Copy link
Contributor

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

@mourner mourner marked this pull request as ready for review June 3, 2019 20:45
@mourner
Copy link
Author

mourner commented Jun 3, 2019

@mroch this is ready for review. While the performance didn't regress, there's no point in optimizing the code much because almost all time is spent in flow-parser. Preliminary benchmarks show that it's about 2.5–4 times slower than the previous flow-remove-type version based on babylon. 😭

@motiz88 motiz88 self-requested a review June 19, 2019 10:59
Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@motiz88 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link
Contributor

@motiz88 motiz88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mourner for catching this and working on a fix. I have a concern regarding source maps and some other minor comments, but this looks good in principle.

@@ -397,6 +407,7 @@ function getTrailingLineNode(context, node) {
// Creates a zero-width "node" with a value to splice at that position.
// WARNING: This is only safe to use when source maps are off!
function getSpliceNodeAtPos(context, pos, loc, value) {
context.bytesAdded += value.length;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is technically mixing JS string length (UTF-16 code units) and UTF-8 bytes. Not a problem yet because the only call to getSpliceNodeAtPos uses a plain ASCII string (' =>'), but maybe we should future-proof this?

return (result += source.slice(lastPos));
offset += sourceBuffer.copy(buf, offset, lastPos, sourceBuffer.length);

return buf.toString('utf8', 0, offset);
},
generateMap: function() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this issue translate to source maps? I'm suddenly worried that the source map spec is silent on what "columns" mean, but I'd imagine it's meant to be interpreted more like "code points" than "bytes". So we might need to do some Unicode bookkeeping in generateSourceMappings as well.

@@ -1,6 +1,8 @@
/* @flow */
// @nolint

// multi-byte chars: Гарного дня, котики!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add some non-BMP (multi UTF-16 code unit) characters to the test? e.g. emoji: '🐈'.length == 2, Buffer.from('🐈').length == 4

Also, I'd add a test case with non-ASCII characters in the actual code that we process, e.g.

var lambda: λ = (α: number): number => α;

@goodmind goodmind added the Stalled Issues and PRs that are stalled. label Jul 27, 2019
@thehogfather
Copy link

@mourner have you still got time to look into this and address the comments?

@mourner
Copy link
Author

mourner commented Aug 23, 2019

@thehogfather apologies — can't get to it at the moment but hopefully I'll find some time in a week or two. The only comment left unaddressed is this one:

How does this issue translate to source maps? I'm suddenly worried that the source map spec is silent on what "columns" mean, but I'd imagine it's meant to be interpreted more like "code points" than "bytes". So we might need to do some Unicode bookkeeping in generateSourceMappings as well.

If anyone's up to helping out on the source maps part, I'd appreciate it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed Stalled Issues and PRs that are stalled.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

flow-remove-types produces invalid JS after non-latin characters
6 participants