Parsed `hclsyntax.Body` has slightly different range depending on what type of comment is at the top #702

rcjsuen · 2024-10-10T22:24:12Z

I would expect this to print the same thing six times but the /**/ seems to behave differently if it's on the first line.

false
true
true
true
true
true

package main

import (
	"fmt"

	"github.com/hashicorp/hcl/v2"
	"github.com/hashicorp/hcl/v2/hclsyntax"
)

func containsInitialPos(content string) bool {
	file, _ := hclsyntax.ParseConfig([]byte(content), "", hcl.InitialPos)
	body, _ := file.Body.(*hclsyntax.Body)
	return body.Range().ContainsPos(hcl.InitialPos)
}

func main() {
	fmt.Println(containsInitialPos("/**/\ntarget{}"))
	fmt.Println(containsInitialPos("#\ntarget{}"))
	fmt.Println(containsInitialPos("//\ntarget{}"))
	fmt.Println(containsInitialPos("\n/**/\ntarget{}"))
	fmt.Println(containsInitialPos("\n#\ntarget{}"))
	fmt.Println(containsInitialPos("\n//\ntarget{}"))
}

The text was updated successfully, but these errors were encountered:

rcjsuen · 2024-10-10T22:52:01Z

I guess this is the same thing as #551...? 🤔

Noticed it while trying to do LSP stuff just like hashicorp/terraform-ls#1052.

apparentlymart · 2024-10-11T00:53:56Z

The "peeker" component is, when doing a normal parse, configured to filter out all comments (the boolean argument here is includeComments):

hcl/hclsyntax/public.go

Line 22 in 3883feb

peeker := newPeeker(tokens, false)

This then interacts with a quirk of the grammar: the two single-line comment styles absorb their trailing newline because that's the marker that indicates the token has ended, but the multi-line comment style ends with */ and so anything that comes after it -- including a newline -- is not part of the comment token. (Notice that in the following only the first to productions actually include EndOfLine at the end.)

hcl/hclsyntax/scan_tokens.rl

Lines 68 to 80 in 3883feb

    
           Comment = ( 
        
               # The :>> operator in these is a "finish-guarded concatenation", 
        
               # which terminates the sequence on its left when it completes 
        
               # the sequence on its right. 
        
               # In the single-line comment cases this is allowing us to make 
        
               # the trailing EndOfLine optional while still having the overall 
        
               # pattern terminate. In the multi-line case it ensures that 
        
               # the first comment in the file ends at the first */, rather than 
        
               # gobbling up all of the "any*" until the _final_ */ in the file. 
        
               ("#" (any - EndOfLine)* :>> EndOfLine?) | 
        
               ("//" (any - EndOfLine)* :>> EndOfLine?) | 
        
               ("/*" any* :>> "*/") 
        
           );

However, because newlines are sometimes significant in parts of the grammar (any time we're parsing the series of items in a body), the peeker has a special case where instead of dropping the single-line tokens entirely it quietly replaces them with TokenNewline:

hcl/hclsyntax/peeker.go

Lines 85 to 102 in 3883feb

    
           if p.includingNewlines() { 
        
           	if len(tok.Bytes) > 0 && tok.Bytes[len(tok.Bytes)-1] == '\n' { 
        
           		fakeNewline := Token{ 
        
           			Type:  TokenNewline, 
        
           			Bytes: tok.Bytes[len(tok.Bytes)-1 : len(tok.Bytes)], 
        
           			// We use the whole token range as the newline 
        
           			// range, even though that's a little... weird, 
        
           			// because otherwise we'd need to go count 
        
           			// characters again in order to figure out the 
        
           			// column of the newline, and that complexity 
        
           			// isn't justified when ranges of newlines are 
        
           			// so rarely printed anyway. 
        
           			Range: tok.Range, 
        
           		} 
        
           		return fakeNewline, i + 1 
        
           	} 
        
           }

With all of that in mind, if we remove the comment tokens in the same way the peeker would before entry into the parser, the six different inputs as observed by the parser are:

"\ntarget{}"
"\ntarget{}"
"\ntarget{}"
"\n\ntarget{}"
"\n\ntarget{}"
"\n\ntarget{}"

This is a somewhat-confusing outcome because it suggests that examples 1, 2, and 3 should be treated the same. However, I suspect that that the trick is in the comment in one of the code snippets I included above:

hcl/hclsyntax/peeker.go

Lines 91 to 98 in 3883feb

    
           // We use the whole token range as the newline 
        
           // range, even though that's a little... weird, 
        
           // because otherwise we'd need to go count 
        
           // characters again in order to figure out the 
        
           // column of the newline, and that complexity 
        
           // isn't justified when ranges of newlines are 
        
           // so rarely printed anyway. 
        
           Range: tok.Range,

In examples 2 and 3, the initial TokenNewline has the byte sequence \n, but it has a source range covering the entire original comment, starting at byte 0. However, in example 1 the initial TokenNewline is a real \n rather than a weird trick, and so its source range is actually accurate and presumably starts at byte 5. Examples 4-6 all start with a "real" newline, regardless of any comments, so their first token also presumably has Byte: 0.

hcl.InitialPos is hcl.Pos{Line: 1, Column: 1, Byte: 0}, and hcl.Range.ContainsPos is really just asking whether the Byte of the given position is >= the Byte of the range's Start and < the Byte of the range's End. Therefore the first example (whose first parser-visible token starts at Byte: 5) does not contain hcl.InitialPos, while the others do.

Assuming all of the above is correct -- I've only read the code and imagined how it would behave with each input, not tried to validate this by executing it -- one possible way to make this work would be to decide that, despite what I wrote in that comment above, the complexity is justified to work to calculate the true location of the \n byte at the end of a single-line comment. That would then make 1-3 all return false and then 4-6 all return true, which is at least consistent, but not really what hashicorp/terraform-ls#1052 seems to want.

I guess the need for hashicorp/terraform-ls#1052 is to ensure that the root body resulting from a hclsyntax.ParseConfig always contains the given start position, even when the root body starts with a comment.

A different answer then is to change the rules for peeker.PrevRange in the special case where no tokens have been consumed yet (which is how the parser ultimately ends up deciding the "start" of the topmost body):

hcl/hclsyntax/peeker.go

Lines 65 to 71 in 3883feb

    
           func (p *peeker) PrevRange() hcl.Range { 
        
           	if p.NextIndex == 0 { 
        
           		return p.NextRange() 
        
           	} 
        
           	return p.Tokens[p.NextIndex-1].Range 
        
           }

The if statement here is simply avoiding trying to access p.Tokens[-1] in the special case where no tokens have been consumed yet. The current rule is to just take the position of the first token instead, which is a fine answer unless the first token is a multi-line TokenComment that the peeker skips, because then that token gets excluded from the root body's range.

Instead then, perhaps the peeker should have a new field where it explicitly stores the starting position given to hclsyntax.ParseConfig (or any of the other similar hclsyntax.Parse* functions), and then peeker.PrevRange would just return that verbatim if no tokens have been consumed yet. That would allow the parser to keep the promise that the root body always starts at the initial parsing position, regardless of what the first token actually is.

This implies changing the signature of newPeeker to take the starting position as an argument, but all five of the functions that call it have a start hcl.Pos argument so that shouldn't be a problem.

I hope that's all useful!

rcjsuen · 2024-10-11T14:11:19Z

The "peeker" component is, when doing a normal parse, configured to filter out all comments (the boolean argument here is includeComments):

hcl/hclsyntax/public.go

Line 22 in 3883feb

peeker := newPeeker(tokens, false)

Thank you for your detailed analysis, @apparentlymart. I did notice this tidbit when stepping into the code. I wondered if the public API offered a Parse* function that called newPeeker(tokens, true) instead would that be a valid workaround...? 🤔

apparentlymart · 2024-10-11T14:15:07Z

The parser is not designed to accept comment tokens in arbitrary places, so I don't think that simply pushing this problem down into the parser is a viable solution, but I've not tried it.

apparentlymart mentioned this issue Nov 6, 2024

Block's OpenBraceRange.Start differs if a proper block precedes it or not #708

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parsed `hclsyntax.Body` has slightly different range depending on what type of comment is at the top #702

Parsed `hclsyntax.Body` has slightly different range depending on what type of comment is at the top #702

rcjsuen commented Oct 10, 2024

rcjsuen commented Oct 10, 2024

apparentlymart commented Oct 11, 2024

rcjsuen commented Oct 11, 2024

apparentlymart commented Oct 11, 2024

Parsed hclsyntax.Body has slightly different range depending on what type of comment is at the top #702

Parsed hclsyntax.Body has slightly different range depending on what type of comment is at the top #702

Comments

rcjsuen commented Oct 10, 2024

rcjsuen commented Oct 10, 2024

apparentlymart commented Oct 11, 2024

rcjsuen commented Oct 11, 2024

apparentlymart commented Oct 11, 2024

Parsed `hclsyntax.Body` has slightly different range depending on what type of comment is at the top #702

Parsed `hclsyntax.Body` has slightly different range depending on what type of comment is at the top #702