Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parsed hclsyntax.Body has slightly different range depending on what type of comment is at the top #702

Open
rcjsuen opened this issue Oct 10, 2024 · 4 comments

Comments

@rcjsuen
Copy link

rcjsuen commented Oct 10, 2024

I would expect this to print the same thing six times but the /**/ seems to behave differently if it's on the first line.

false
true
true
true
true
true
package main

import (
	"fmt"

	"github.com/hashicorp/hcl/v2"
	"github.com/hashicorp/hcl/v2/hclsyntax"
)

func containsInitialPos(content string) bool {
	file, _ := hclsyntax.ParseConfig([]byte(content), "", hcl.InitialPos)
	body, _ := file.Body.(*hclsyntax.Body)
	return body.Range().ContainsPos(hcl.InitialPos)
}

func main() {
	fmt.Println(containsInitialPos("/**/\ntarget{}"))
	fmt.Println(containsInitialPos("#\ntarget{}"))
	fmt.Println(containsInitialPos("//\ntarget{}"))
	fmt.Println(containsInitialPos("\n/**/\ntarget{}"))
	fmt.Println(containsInitialPos("\n#\ntarget{}"))
	fmt.Println(containsInitialPos("\n//\ntarget{}"))
}
@rcjsuen
Copy link
Author

rcjsuen commented Oct 10, 2024

I guess this is the same thing as #551...? 🤔

Noticed it while trying to do LSP stuff just like hashicorp/terraform-ls#1052.

@apparentlymart
Copy link
Contributor

The "peeker" component is, when doing a normal parse, configured to filter out all comments (the boolean argument here is includeComments):

peeker := newPeeker(tokens, false)

This then interacts with a quirk of the grammar: the two single-line comment styles absorb their trailing newline because that's the marker that indicates the token has ended, but the multi-line comment style ends with */ and so anything that comes after it -- including a newline -- is not part of the comment token. (Notice that in the following only the first to productions actually include EndOfLine at the end.)

Comment = (
# The :>> operator in these is a "finish-guarded concatenation",
# which terminates the sequence on its left when it completes
# the sequence on its right.
# In the single-line comment cases this is allowing us to make
# the trailing EndOfLine optional while still having the overall
# pattern terminate. In the multi-line case it ensures that
# the first comment in the file ends at the first */, rather than
# gobbling up all of the "any*" until the _final_ */ in the file.
("#" (any - EndOfLine)* :>> EndOfLine?) |
("//" (any - EndOfLine)* :>> EndOfLine?) |
("/*" any* :>> "*/")
);

However, because newlines are sometimes significant in parts of the grammar (any time we're parsing the series of items in a body), the peeker has a special case where instead of dropping the single-line tokens entirely it quietly replaces them with TokenNewline:

hcl/hclsyntax/peeker.go

Lines 85 to 102 in 3883feb

if p.includingNewlines() {
if len(tok.Bytes) > 0 && tok.Bytes[len(tok.Bytes)-1] == '\n' {
fakeNewline := Token{
Type: TokenNewline,
Bytes: tok.Bytes[len(tok.Bytes)-1 : len(tok.Bytes)],
// We use the whole token range as the newline
// range, even though that's a little... weird,
// because otherwise we'd need to go count
// characters again in order to figure out the
// column of the newline, and that complexity
// isn't justified when ranges of newlines are
// so rarely printed anyway.
Range: tok.Range,
}
return fakeNewline, i + 1
}
}

With all of that in mind, if we remove the comment tokens in the same way the peeker would before entry into the parser, the six different inputs as observed by the parser are:

  1. "\ntarget{}"
  2. "\ntarget{}"
  3. "\ntarget{}"
  4. "\n\ntarget{}"
  5. "\n\ntarget{}"
  6. "\n\ntarget{}"

This is a somewhat-confusing outcome because it suggests that examples 1, 2, and 3 should be treated the same. However, I suspect that that the trick is in the comment in one of the code snippets I included above:

hcl/hclsyntax/peeker.go

Lines 91 to 98 in 3883feb

// We use the whole token range as the newline
// range, even though that's a little... weird,
// because otherwise we'd need to go count
// characters again in order to figure out the
// column of the newline, and that complexity
// isn't justified when ranges of newlines are
// so rarely printed anyway.
Range: tok.Range,

In examples 2 and 3, the initial TokenNewline has the byte sequence \n, but it has a source range covering the entire original comment, starting at byte 0. However, in example 1 the initial TokenNewline is a real \n rather than a weird trick, and so its source range is actually accurate and presumably starts at byte 5. Examples 4-6 all start with a "real" newline, regardless of any comments, so their first token also presumably has Byte: 0.

hcl.InitialPos is hcl.Pos{Line: 1, Column: 1, Byte: 0}, and hcl.Range.ContainsPos is really just asking whether the Byte of the given position is >= the Byte of the range's Start and < the Byte of the range's End. Therefore the first example (whose first parser-visible token starts at Byte: 5) does not contain hcl.InitialPos, while the others do.


Assuming all of the above is correct -- I've only read the code and imagined how it would behave with each input, not tried to validate this by executing it -- one possible way to make this work would be to decide that, despite what I wrote in that comment above, the complexity is justified to work to calculate the true location of the \n byte at the end of a single-line comment. That would then make 1-3 all return false and then 4-6 all return true, which is at least consistent, but not really what hashicorp/terraform-ls#1052 seems to want.

I guess the need for hashicorp/terraform-ls#1052 is to ensure that the root body resulting from a hclsyntax.ParseConfig always contains the given start position, even when the root body starts with a comment.

A different answer then is to change the rules for peeker.PrevRange in the special case where no tokens have been consumed yet (which is how the parser ultimately ends up deciding the "start" of the topmost body):

hcl/hclsyntax/peeker.go

Lines 65 to 71 in 3883feb

func (p *peeker) PrevRange() hcl.Range {
if p.NextIndex == 0 {
return p.NextRange()
}
return p.Tokens[p.NextIndex-1].Range
}

The if statement here is simply avoiding trying to access p.Tokens[-1] in the special case where no tokens have been consumed yet. The current rule is to just take the position of the first token instead, which is a fine answer unless the first token is a multi-line TokenComment that the peeker skips, because then that token gets excluded from the root body's range.

Instead then, perhaps the peeker should have a new field where it explicitly stores the starting position given to hclsyntax.ParseConfig (or any of the other similar hclsyntax.Parse* functions), and then peeker.PrevRange would just return that verbatim if no tokens have been consumed yet. That would allow the parser to keep the promise that the root body always starts at the initial parsing position, regardless of what the first token actually is.

This implies changing the signature of newPeeker to take the starting position as an argument, but all five of the functions that call it have a start hcl.Pos argument so that shouldn't be a problem.

I hope that's all useful!

@rcjsuen
Copy link
Author

rcjsuen commented Oct 11, 2024

The "peeker" component is, when doing a normal parse, configured to filter out all comments (the boolean argument here is includeComments):

peeker := newPeeker(tokens, false)

Thank you for your detailed analysis, @apparentlymart. I did notice this tidbit when stepping into the code. I wondered if the public API offered a Parse* function that called newPeeker(tokens, true) instead would that be a valid workaround...? 🤔

@apparentlymart
Copy link
Contributor

The parser is not designed to accept comment tokens in arbitrary places, so I don't think that simply pushing this problem down into the parser is a viable solution, but I've not tried it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants