Move digit separators to accepted. #4105

Merged on Sep 26, 2024 (1 commit)
199 changes: 199 additions & 0 deletions accepted/future-releases/digit-separators/feature-specification.md
@@ -0,0 +1,199 @@
# Digit Separators

Authors: Lasse Nielsen, Sam Rawlins

Status: In-progress

Version 1.0

## Motivation

To make long number literals more readable, allow authors to inject [digit
group separators][] inside numbers. Examples with different possible separators:

```none
100 000 000 000 000 000 000 // space
100,000,000,000,000,000,000 // comma
100.000.000.000.000.000.000 // period
100'000'000'000'000'000'000 // apostrophe (C++)
100_000_000_000_000_000_000 // underscore (many programming languages).
```

## Proposal

### Digit separators in number literals

Allow one or more `_`s between any two otherwise adjacent _digits_ of a NUMBER
or HEX\_NUMBER token. The following are not digits: The leading `0x` or `0X` in
HEX\_NUMBER, and any `.`, `e`, `E`, `+` or `-` in NUMBER.

That means only allowing `_`s between two `0-9` digits in NUMBER and between
two `0-9`,`a-f`,`A-F` digits in HEX\_NUMBER.

The grammar changes `<DIGIT>+` to `<DIGITS>`, which is a sequence of `<DIGIT>`s
with optional `_`s between them, and likewise for hex digits:

```bnf
<NUMBER> ::= <DIGITS> (`.' <DIGITS>)? <EXPONENT>?
\alt `.' <DIGITS> <EXPONENT>?

<EXPONENT> ::= (`e' | `E') (`+' | `-')? <DIGITS>

<DIGITS> ::= <DIGIT> (`_'* <DIGIT>)*

<HEX\_NUMBER> ::= `0x' <HEX\_DIGITS>
\alt `0X' <HEX\_DIGITS>

<HEX\_DIGIT> ::= `a' .. `f'
\alt `A' .. `F'
\alt <DIGIT>

<HEX\_DIGITS> ::= <HEX\_DIGIT> (`_'* <HEX\_DIGIT>)*
```
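
The grammar above can be exercised with a small regular-expression sketch. This
is an illustration in Python, not part of the Dart implementation; the names
`NUMBER`, `HEX_NUMBER` and `is_number_literal` are hypothetical and simply
mirror the grammar rules:

```python
import re

# <DIGITS> ::= <DIGIT> ('_'* <DIGIT>)*
DIGITS = r"[0-9](?:_*[0-9])*"
# <HEX_DIGITS> ::= <HEX_DIGIT> ('_'* <HEX_DIGIT>)*
HEX_DIGITS = r"[0-9a-fA-F](?:_*[0-9a-fA-F])*"
# <EXPONENT> ::= ('e' | 'E') ('+' | '-')? <DIGITS>
EXPONENT = rf"[eE][+-]?{DIGITS}"

# <NUMBER> ::= <DIGITS> ('.' <DIGITS>)? <EXPONENT>?  |  '.' <DIGITS> <EXPONENT>?
NUMBER = re.compile(rf"(?:{DIGITS}(?:\.{DIGITS})?|\.{DIGITS})(?:{EXPONENT})?")
# <HEX_NUMBER> ::= '0x' <HEX_DIGITS>  |  '0X' <HEX_DIGITS>
HEX_NUMBER = re.compile(rf"0[xX]{HEX_DIGITS}")


def is_number_literal(text: str) -> bool:
    """Return True if the whole text is a NUMBER or HEX_NUMBER token."""
    return bool(NUMBER.fullmatch(text) or HEX_NUMBER.fullmatch(text))
```

Because every `_` run must sit between two digits, a trailing separator such as
`100_` or a separator next to `0x`, `.`, or `e` fails to match.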

### Examples

```none
100__000_000__000_000__000_000 // one hundred million million millions!
0x4000_0000_0000_0000
0.000_000_000_01
0x00_14_22_01_23_45 // MAC address
555_123_4567 // US Phone number
```

**Invalid** literals:

```none
100_
0x_00_14_22_01_23_45
0._000_000_000_1
100_.1
1.2e_3
```

`_100` remains a valid identifier, and `_100._100` remains a valid member
access. If users quickly learn the "separator only between digits" rule, this
will likely not be an issue.

### Why choose underscores

The syntax must work even with just a single separator, so it can't be anything
that can already validly separate two expressions (excludes all infix operators
and comma) and should not already be part of a number literal (excludes decimal
point).

So, the comma and decimal point are probably never going to work, even if they
are already the standard "thousands separator" in text in different parts of
the world.

Space separation is dangerous because it's hard to see whether it's just space,
or it's an accidental tab character. If we allow spacing, should we allow
arbitrary whitespace, including line terminators? If so, then this suddenly
becomes quite dangerous. Forget a comma at the end of a line in a multiline
list, and two adjacent integers are automatically combined (we already have
that problem with strings). So, probably not a good choice, even if it is the
preferred formatting for print text.

The apostrophe is also the string single-quote character. We don't currently
allow adjacent numbers and strings, but if we ever do, then this syntax becomes
ambiguous. It's still possible (we disambiguate by assuming it's a digit
separator). It is currently used by C++14 as a digit group separator, so it is
definitely possible.

That leaves underscore, which could be the start of an identifier. Currently
`100_000` would be tokenized as "integer literal 100" followed by "identifier
`_000`". However, users would never write an identifier adjacent to another
token that contains identifier-valid characters (unlike strings, which have
clear delimiters that do not occur anywhere else), so this is unlikely to happen
in practice. Underscore is already used by a large number of programming
languages including Java, Swift, and Python.
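
The tokenization change described above can be illustrated with a toy scanner.
This is a Python sketch under simplifying assumptions (integer literals and
identifiers only); `tokenize` and its `digit_separators` flag are hypothetical
names, not the Dart scanner's API:

```python
import re

OLD_NUMBER = r"[0-9]+"               # integer literal before the feature
NEW_NUMBER = r"[0-9](?:_*[0-9])*"    # integer literal with digit separators
IDENTIFIER = r"[A-Za-z_$][A-Za-z0-9_$]*"


def tokenize(source: str, digit_separators: bool) -> list:
    """Split source into (kind, text) pairs using a simplified scanner."""
    number = NEW_NUMBER if digit_separators else OLD_NUMBER
    scanner = re.compile(
        rf"(?P<NUMBER>{number})|(?P<IDENTIFIER>{IDENTIFIER})|(?P<OTHER>\S)"
    )
    # Whitespace matches none of the alternatives and is skipped.
    return [(m.lastgroup, m.group()) for m in scanner.finditer(source)]
```

With `digit_separators=False`, `100_000` scans as the number `100` followed by
the identifier `_000`; with `digit_separators=True` it is a single number
token, while `_100` is still scanned as an identifier in both modes.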

We also want to allow multiple separators for higher-level grouping, e.g.:

```none
100__000_000_000__000_000_000
```

For this purpose, the underscore extends gracefully. So does space, but it has
the disadvantage that it collapses when inserted into HTML, whereas `''` looks
odd.

### Related work

* [Java digit separators](https://docs.oracle.com/javase/8/docs/technotes/guides/language/underscores-literals.html)
* [Python PEP 515 - underscores in numeric literals](https://peps.python.org/pep-0515/)

### Possible new lint rules

There are some possible new lint rule considerations, but none of these are
considered vital to the usability or general success of the feature.

The feature is designed to help the readability of long numbers. But a
developer can still make a mistake about where to place separators. For example:

```dart
var one = 1_000_000;
var two = 2_000_000;
var three = 3_000_000;
var four = 4_0000_000; // Whoops!
```

If a developer uses the Dart formatter to format their code, they cannot try to
vertically align the numbers with whitespace (extra space characters are
removed by the formatter). So we could offer a lint rule to only place
separators every three digits of a decimal number. Also possibly a similar rule
for hexadecimal numbers. If a developer ever uses digit separators for a
different purpose (as in separating the digits of a phone number), the rule may
not prove useful.
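
A "separators every three digits" check could look roughly like the following
Python sketch. It assumes the input is a decimal integer literal that already
satisfies the grammar, and it accepts runs of several `_`s as one separator;
`separators_every_three_digits` is a hypothetical name, not an existing lint:

```python
import re

# One to three leading digits, then groups of exactly three digits.
THOUSANDS = re.compile(r"[0-9]{1,3}(?:_+[0-9]{3})*")


def separators_every_three_digits(literal: str) -> bool:
    """Return True if the literal has no separators or groups by thousands."""
    if "_" not in literal:
        return True
    return bool(THOUSANDS.fullmatch(literal))
```

Under this sketch `4_0000_000` is flagged, but so is `555_123_4567`, which
shows why such a rule may not suit literals grouped for other purposes.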

A separate lint rule could encourage _consistent_ digit separators, triggering
if the digit groups do not have the same size (except the most significant one,
which can be shorter). If there are any `__` separators, the number of
`_`-separated groups between them should also be the same, and so on
repeatedly for higher numbers of `_`s.
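
One possible reading of the consistency rule, sketched in Python: at each
level, groups are delimited by runs of at least that many `_`s, and every group
except the most significant must contain the same number of units. The sketch
assumes a decimal integer literal that already satisfies the grammar (a caller
would strip any `0x` prefix first); `consistent_separators` is a hypothetical
name:

```python
import re


def consistent_separators(literal: str) -> bool:
    """Return True if grouping is uniform at every separator level."""
    runs = re.findall(r"_+", literal)
    max_run = max((len(run) for run in runs), default=0)
    for level in range(1, max_run + 1):
        # Groups at this level are delimited by runs of >= level underscores.
        groups = re.split("_{%d,}" % level, literal)
        if level == 1:
            # Units are the digits themselves.
            counts = [len(group) for group in groups]
        else:
            # Units are the pieces delimited by the next lower level.
            counts = [
                len(re.split("_{%d,}" % (level - 1), group)) for group in groups
            ]
        leading, rest = counts[0], counts[1:]
        # All non-leading groups must agree; the leading one may be shorter.
        if len(set(rest)) > 1 or (rest and leading > rest[0]):
            return False
    return True
```

For example, `100__000_000__000_000` passes (three-digit groups, two groups
between each `__`), while `100__000_000__000` does not.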

### Possible new quick fixes

There are some possible new automated fix ("quick fix") considerations, but
none of these are considered vital to the usability or general success of the
feature.

#### Unexpected underscores

With the digit-separators feature, separators can be added between _digits_ of
a number literal, but nowhere else. In most error cases, the unexpected
underscore can be detected as such, and we can offer quick fixes to remove
unexpected underscores (for example, `100_`, `100_e1.2`, `100._00`). In a few
cases, the intention is not as straightforward, such as `100._100`, where
`_100` can be a legal name of an extension member (though the presence of such
a private extension member can be detected).
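
The core of such a fix can be sketched as a text rewrite that drops every `_`
run not enclosed by digits on both sides. This Python sketch assumes the text
has already been identified as a malformed number literal; it does not attempt
the extension-member check mentioned above, so it would also rewrite
`100._100`. `remove_stray_underscores` is a hypothetical name:

```python
import re


def remove_stray_underscores(literal: str) -> str:
    """Remove underscore runs that are not between two digits."""
    if literal[:2] in ("0x", "0X"):
        prefix, body, digit = literal[:2], literal[2:], "0-9a-fA-F"
    else:
        prefix, body, digit = "", literal, "0-9"
    # A run is stray if it is not preceded by a digit or not followed by one.
    # Including '_' in both classes keeps a valid run such as '__' intact.
    stray = re.compile(rf"(?<![{digit}_])_+|_+(?![{digit}_])")
    return prefix + stray.sub("", body)
```

Valid separators are left untouched, so applying the rewrite to an already
valid literal returns it unchanged.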

#### Unexpected commas

The only legal digit separator that is introduced with this feature is the
underscore character. If a developer attempts to use another character, for
example commas, as a separator, we may be able to detect this, and offer a
quick fix to convert the commas to underscores.
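
As a rough illustration, a candidate span could be tested against a
"comma-grouped thousands" pattern before a fix is suggested. This Python sketch
only looks at the text of the span; a real tool would also need the surrounding
syntax, since `1,000` can be two legitimate list elements or arguments.
`suggest_underscores` is a hypothetical name:

```python
import re

# One to three digits, then one or more ',ddd' groups, with no whitespace.
COMMA_GROUPED = re.compile(r"[0-9]{1,3}(?:,[0-9]{3})+")


def suggest_underscores(text: str):
    """Return the underscore form of a comma-grouped number, else None."""
    if COMMA_GROUPED.fullmatch(text):
        return text.replace(",", "_")
    return None
```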

### Non-breaking change

This change is strictly non-breaking. The feature can be thought of as a single
change from previous Dart syntax: some syntax which was previously illegal
(producing compile-time errors) becomes legal.

(The feature is still introduced with a [Dart language version][], so that
packages that start using the feature declare that they require some new lower
bound of the Dart SDK.)

### Formatting

As any number literal remains a single token, there are no formatting
considerations.

## Changelog

### 1.0

- Initial version

[digit group separators]: https://en.wikipedia.org/wiki/Decimal_separator#Digit_grouping
[Dart language version]: https://github.com/dart-lang/language/blob/main/accepted/2.8/language-versioning/feature-specification.md