Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Excluding LF in 'allowed-comment-char' is confusing #995

Open
arp242 opened this issue Sep 30, 2023 · 11 comments · May be fixed by #996
Open

Excluding LF in 'allowed-comment-char' is confusing #995

arp242 opened this issue Sep 30, 2023 · 11 comments · May be fixed by #996

Comments

@arp242
Copy link
Contributor

arp242 commented Sep 30, 2023

I'm adding a test to toml-test for "Relax comment parsing; most control characters are again permitted.", and implementing it in my parser, and I'm rather confused what 0x0a (LF) is excluded. You need to support this if you want CRLF, no?

There's a long thread in #924 about this, but to be honest I stopped reading after a few comments – people shouldn't need to read entire threads to understand the specification.

I think the intent is that "CR" and "CRLF" are allowed, but "LF" isn't? That is what it does, I think, but it's really not very obvious IMHO.

At the very least there should be a comment in the ABNF, and the phrasing in toml.md ("that could cause problems during editing or processing: U+0000, and U+000A to U+000D") could also be improved. Now it's too easy to be misinterpreted as "only CR newlines are allowed and LF is forbidden, full stop", which isn't what was intended.

@arp242
Copy link
Contributor Author

arp242 commented Sep 30, 2023

Actually 0x0d (CR) is also excluded ... this is even more confusing. What's the point of excluding that?

@arp242
Copy link
Contributor Author

arp242 commented Sep 30, 2023

And a whole bunch of tests fail because now the allowed character set for strings and comments differ, and there was always the assumption those were the same.

I really wish we'd have "fix toml-test" as a requirement for any change.

None of the hypotheticals are even solved, because you still need to parse-by-character and exclude 0x00, 0x0b, 0x0c, you can still invalidate documents with a control character.

All for a problem no one actually reported. IMHO it's a completely pointless change that doesn't change anything for anyone, doesn't solve any practical problem, and is just ivory tower wankery and argueing for the point of argueing.

Literally no one using TOML will ever notice this change was made.

But I'm the one mucking about with toml-test and dealing with the churn (16 failing tests...)

And all the other implementations will need to be updated too.

So great, thanks.

arp242 added a commit to arp242/toml that referenced this issue Oct 1, 2023
This reverts commit ab74958.

I'm a simple guy. Someone reports a problem, I fix it. No one reports a problem? There is nothing to fix so I go drink beer.

No one really reported this as a problem, so there isn't anything to fix. But it *does* introduce entirely needless churn for all TOML implementations. Do we need to forbid *anything* in comments? Probably not. In strings we probably only need to forbid \x00. But at least before it was consistent with strings, and more importantly, what everyone wrote code for, which is tested, and already works.

And [none of the hypotheticals](toml-lang#567 (comment)) on why this is "needed" are practical issues people reported, and most aren't even fixed: a comment can still invalidate the file, you must still parse each character in a comment as some are still forbidden, the performance benefits are very close to zero they might as well be zero, and you still can't "dump whatever you like" in comments.

So it doesn't *actually* change anything, it just changes "disallow this set of control characters" to ... another (smaller) set. That's not really a substantial change. The only (minor) real-world issue that was reported (from the person doing the Java implementation) was that "it's substantially more complicated to parse out control characters in comments and raise an error, and this kind of strictness provides no real advantage to users". And that's not addressed at all with this.

---

And while I'm at it, let me have a complaint about how this was merged:

1. Two people, both of whom actually maintain implementations, say they don't like this change.
2. This is basically ignored.
3. Three people continue written a fairly large number of extensive comments, so anyone who wasn't already interested in this change unsubscribes and/or goes 🤷
4. "Consensus".

Sometimes I feel TOML attracts people who like to argue things from a mile-high ivory tower with abstract arguments that have only superficial bearing to actual pragmatic reality.

Fixes toml-lang#995
@ChristianSi
Copy link
Contributor

ChristianSi commented Oct 1, 2023

I think the intent is that "CR" and "CRLF" are allowed, but "LF" isn't?

No, CR and CRLF are not allowed in a comment, they end the line and hence the comment.

As for why LF, though not a valid line terminator, is not allowed in comments, i think this comment explains it best: #924 (comment).

arp242 added a commit to arp242/toml that referenced this issue Oct 1, 2023
This reverts commit ab74958.

I'm a simple guy. Someone reports a problem, I drink coffee and fix it. No one reports a problem? There is nothing to fix and I go drink beer.

No one really reported this as a problem, but it *does* introduce needless churn for all TOML implementations and the test suite. Do we need to forbid *anything* in comments? Probably not, and in strings we probably only need to forbid \x00. But at least before it was consistent with strings, and more importantly, what everyone wrote code for, which is tested, and already works.

[None of the hypotheticals](toml-lang#567 (comment)) on why this is "needed" are practical issues people reported, and most aren't even fixed: a comment can still invalidate the file, you must still parse each character in a comment as some are still forbidden, the performance benefits are very close to zero they might as well be zero, and you still can't "dump whatever you like" in comments.

So it doesn't *actually* change anything, it just changes "disallow this set of control characters" to ... "disallow this set of control characters" (but for a different set). That's not really a substantial or meaningful change. The only (minor) real-world issue that was reported (from the person doing the Java implementation) was that "it's substantially more complicated to parse out control characters in comments and raise an error, and this kind of strictness provides no real advantage to users". And that's not addressed at all with this, so...

---

And while I'm at it, let me have a complaint about how this was merged:

1. Two people, both of whom actually maintain implementations, say they don't like this change.
2. This is basically ignored.
3. Three people continue written a fairly large number of large comments, so anyone who wasn't already interested in this change unsubscribes and/or goes 🤷
4. "Consensus".

Sometimes I feel TOML attracts people who like to argue things from a mile-high ivory tower with abstract arguments that have only passing familiarity with any actual pragmatic reality.

Fixes toml-lang#995
@ChristianSi
Copy link
Contributor

ChristianSi commented Oct 1, 2023

As for being as permissive as possible in comments and not erroring out on users because of harmless control characters in comments, I think that's a good thing and certainly not something we should apologize for.

@arp242
Copy link
Contributor Author

arp242 commented Oct 1, 2023

Yeah, but the way it's phrased now is confusing, IMHO. Or at least: I was confused by it, but maybe that's just on my 🙃

Anyway, I just so happened to submit a PR to revert this; while this is entirely a fixable issue as such, I don't see the point in spending any time on this as I don't see how it fixes anything.

@ChristianSi
Copy link
Contributor

Hmm, if the language is confusing, suggestions to improve it are certainly welcome!

@ChristianSi
Copy link
Contributor

ChristianSi commented Oct 1, 2023

As for the "doesn't fix anything", I don't think that's true? Consider the case where somebody includes a harmless control char (say an ASCII bell) in a comment, and the TOML parser balk out on them with "invalid document, can't process!" That's unexpected and not particularly user-friendly, I'd say, considering that comments are supposed to be ignored anyway.

So I'd doing away with such error messages as much as possible is a net win, isn't it?

LF is probably the only control char that's not "harmless", since it could possibly lead to the user seeing a different document than the parser, which is why the special treatment for it makes sense (and was adapted).

@arp242
Copy link
Contributor Author

arp242 commented Oct 1, 2023

LF was already forbidden, so this PR didn't change anything about that:

Control characters other than tab (U+0000 to U+0008, U+000A to U+001F, U+007F) are not permitted in comments.

@ChristianSi
Copy link
Contributor

ChristianSi commented Oct 1, 2023

It didn't change that, and quite wisely so, I'd say, but it did allow the other control characters in, thus making TOML more robust and user-friendly.

@arp242
Copy link
Contributor Author

arp242 commented Oct 1, 2023

thus making TOML more robust and user-friendly.

Who is running in to problems with this? Have people reported problems?

@pradyunsg pradyunsg added this to the 1.1.0-rc0 milestone Apr 7, 2024
@ccuser44
Copy link

I believe LF should always be excluded from comments. What if a file has mixed line encodings, allowing line encodings of one type in a comment would make parsing mixed line encodings impossible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants