Lexer produces wrong tokens when more input is provided #265
Comments
May also be related to #160 |
@jeertmans Thank you for making this project active again 👍. It would be great if this bug got some attention 🙂, because it can be a really big surprise when encountered (it cost me a good amount of debugging and a workaround in my parser for my bachelor thesis) 🤯. It was initially reported against logos 0.12; I updated the reproduction code to 0.14 and verified it is still a problem. It also doesn't seem to matter whether any priorities are set explicitly. |
Indeed, I think this is related to #160, and probably caused by two patterns matching the same input, with one pattern allowing a longer match. This should be handled by the priorities, but it seems that fails here. It would be interesting to analyse and inspect the code generated by the logos derive macro. Did you already do that? |
I guess you mean #160 right (maybe misspelled)? |
I don't have much time right now, but I am posting the expanded macro code here:
#![feature(prelude_import)]
#[prelude_import]
use std::prelude::rust_2021::*;
#[macro_use]
extern crate std;
use logos::Logos;
pub enum SyntaxKind {
#[regex(r"[ \t]+", priority = 1)]
TK_WHITESPACE = 0,
#[regex(r"[a-zA-Z][a-zA-Z0-9]*", priority = 1)]
TK_WORD,
#[token("not", priority = 50)]
TK_NOT,
#[token("not in", priority = 60)]
TK_NOT_IN,
}
impl<'s> ::logos::Logos<'s> for SyntaxKind {
type Error = ();
type Extras = ();
type Source = str;
fn lex(lex: &mut ::logos::Lexer<'s, Self>) {
use ::logos::internal::{LexerInternal, CallbackResult};
type Lexer<'s> = ::logos::Lexer<'s, SyntaxKind>;
fn _end<'s>(lex: &mut Lexer<'s>) {
lex.end()
}
fn _error<'s>(lex: &mut Lexer<'s>) {
lex.bump_unchecked(1);
lex.error();
}
static COMPACT_TABLE_0: [u8; 256] = [
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
];
#[inline]
fn goto4_ctx4_x<'s>(lex: &mut Lexer<'s>) {
lex.set(Ok(SyntaxKind::TK_WORD));
}
#[inline]
fn pattern0(byte: u8) -> bool {
COMPACT_TABLE_0[byte as usize] & 1 > 0
}
#[inline]
fn goto5_ctx4_x<'s>(lex: &mut Lexer<'s>) {
while let Some(arr) = lex.read::<&[u8; 16]>() {
if pattern0(arr[0]) {
if pattern0(arr[1]) {
if pattern0(arr[2]) {
if pattern0(arr[3]) {
if pattern0(arr[4]) {
if pattern0(arr[5]) {
if pattern0(arr[6]) {
if pattern0(arr[7]) {
if pattern0(arr[8]) {
if pattern0(arr[9]) {
if pattern0(arr[10]) {
if pattern0(arr[11]) {
if pattern0(arr[12]) {
if pattern0(arr[13]) {
if pattern0(arr[14]) {
if pattern0(arr[15]) {
lex.bump_unchecked(
16,
);
continue;
}
lex.bump_unchecked(15);
return goto4_ctx4_x(
lex,
);
}
lex.bump_unchecked(14);
return goto4_ctx4_x(lex);
}
lex.bump_unchecked(13);
return goto4_ctx4_x(lex);
}
lex.bump_unchecked(12);
return goto4_ctx4_x(lex);
}
lex.bump_unchecked(11);
return goto4_ctx4_x(lex);
}
lex.bump_unchecked(10);
return goto4_ctx4_x(lex);
}
lex.bump_unchecked(9);
return goto4_ctx4_x(lex);
}
lex.bump_unchecked(8);
return goto4_ctx4_x(lex);
}
lex.bump_unchecked(7);
return goto4_ctx4_x(lex);
}
lex.bump_unchecked(6);
return goto4_ctx4_x(lex);
}
lex.bump_unchecked(5);
return goto4_ctx4_x(lex);
}
lex.bump_unchecked(4);
return goto4_ctx4_x(lex);
}
lex.bump_unchecked(3);
return goto4_ctx4_x(lex);
}
lex.bump_unchecked(2);
return goto4_ctx4_x(lex);
}
lex.bump_unchecked(1);
return goto4_ctx4_x(lex);
}
return goto4_ctx4_x(lex);
}
while lex.test(pattern0) {
lex.bump_unchecked(1);
}
goto4_ctx4_x(lex);
}
#[inline]
fn goto7_ctx5_x<'s>(lex: &mut Lexer<'s>) {
lex.set(Ok(SyntaxKind::TK_NOT));
}
#[inline]
fn goto8_ctx5_x<'s>(lex: &mut Lexer<'s>) {
lex.set(Ok(SyntaxKind::TK_NOT_IN));
}
#[inline]
fn goto16_at1_ctx5_x<'s>(lex: &mut Lexer<'s>) {
match lex.read_at::<&[u8; 2usize]>(1usize) {
Some(b"in") => {
lex.bump_unchecked(3usize);
goto8_ctx5_x(lex)
}
_ => goto5_ctx4_x(lex),
}
}
#[inline]
fn goto15_ctx5_x<'s>(lex: &mut Lexer<'s>) {
let byte = match lex.read::<u8>() {
Some(byte) => byte,
None => return goto7_ctx5_x(lex),
};
match byte {
byte if pattern0(byte) => {
lex.bump_unchecked(1usize);
goto5_ctx4_x(lex)
}
32u8 => goto16_at1_ctx5_x(lex),
_ => goto7_ctx5_x(lex),
}
}
#[inline]
fn goto13_ctx5_x<'s>(lex: &mut Lexer<'s>) {
match lex.read::<&[u8; 2usize]>() {
Some(b"ot") => {
lex.bump_unchecked(2usize);
goto15_ctx5_x(lex)
}
_ => goto5_ctx4_x(lex),
}
}
#[inline]
fn goto1_ctx1_x<'s>(lex: &mut Lexer<'s>) {
lex.set(Ok(SyntaxKind::TK_WHITESPACE));
}
#[inline]
fn pattern1(byte: u8) -> bool {
match byte {
9u8 | 32u8 => true,
_ => false,
}
}
#[inline]
fn goto2_ctx1_x<'s>(lex: &mut Lexer<'s>) {
while let Some(arr) = lex.read::<&[u8; 16]>() {
if pattern1(arr[0]) {
if pattern1(arr[1]) {
if pattern1(arr[2]) {
if pattern1(arr[3]) {
if pattern1(arr[4]) {
if pattern1(arr[5]) {
if pattern1(arr[6]) {
if pattern1(arr[7]) {
if pattern1(arr[8]) {
if pattern1(arr[9]) {
if pattern1(arr[10]) {
if pattern1(arr[11]) {
if pattern1(arr[12]) {
if pattern1(arr[13]) {
if pattern1(arr[14]) {
if pattern1(arr[15]) {
lex.bump_unchecked(
16,
);
continue;
}
lex.bump_unchecked(15);
return goto1_ctx1_x(
lex,
);
}
lex.bump_unchecked(14);
return goto1_ctx1_x(lex);
}
lex.bump_unchecked(13);
return goto1_ctx1_x(lex);
}
lex.bump_unchecked(12);
return goto1_ctx1_x(lex);
}
lex.bump_unchecked(11);
return goto1_ctx1_x(lex);
}
lex.bump_unchecked(10);
return goto1_ctx1_x(lex);
}
lex.bump_unchecked(9);
return goto1_ctx1_x(lex);
}
lex.bump_unchecked(8);
return goto1_ctx1_x(lex);
}
lex.bump_unchecked(7);
return goto1_ctx1_x(lex);
}
lex.bump_unchecked(6);
return goto1_ctx1_x(lex);
}
lex.bump_unchecked(5);
return goto1_ctx1_x(lex);
}
lex.bump_unchecked(4);
return goto1_ctx1_x(lex);
}
lex.bump_unchecked(3);
return goto1_ctx1_x(lex);
}
lex.bump_unchecked(2);
return goto1_ctx1_x(lex);
}
lex.bump_unchecked(1);
return goto1_ctx1_x(lex);
}
return goto1_ctx1_x(lex);
}
while lex.test(pattern1) {
lex.bump_unchecked(1);
}
goto1_ctx1_x(lex);
}
#[inline]
fn goto17<'s>(lex: &mut Lexer<'s>) {
enum Jump {
__,
J5,
J13,
J2,
}
const LUT: [Jump; 256] = {
use Jump::*;
[
__, __, __, __, __, __, __, __, __, J2, __, __, __, __, __, __, __, __, __, __,
__, __, __, __, __, __, __, __, __, __, __, __, J2, __, __, __, __, __, __, __,
__, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __,
__, __, __, __, __, J5, J5, J5, J5, J5, J5, J5, J5, J5, J5, J5, J5, J5, J5, J5,
J5, J5, J5, J5, J5, J5, J5, J5, J5, J5, J5, __, __, __, __, __, __, J5, J5, J5,
J5, J5, J5, J5, J5, J5, J5, J5, J5, J5, J13, J5, J5, J5, J5, J5, J5, J5, J5,
J5, J5, J5, J5, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __,
__, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __,
__, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __,
__, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __,
__, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __,
__, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __,
__, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __,
]
};
let byte = match lex.read::<u8>() {
Some(byte) => byte,
None => return _end(lex),
};
match LUT[byte as usize] {
Jump::J5 => {
lex.bump_unchecked(1usize);
goto5_ctx4_x(lex)
}
Jump::J13 => {
lex.bump_unchecked(1usize);
goto13_ctx5_x(lex)
}
Jump::J2 => {
lex.bump_unchecked(1usize);
goto2_ctx1_x(lex)
}
Jump::__ => _error(lex),
}
}
goto17(lex)
}
} |
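The branch structure of the expanded code above can be modeled by hand to see where the wrong token comes from. The following is only a simplified sketch, not the code logos generates: `Kind`, `lex_one`, `word_loop` and the `miss_after_space` parameter are hypothetical names introduced for illustration. It mirrors `goto17`, `goto13_ctx5_x`, `goto15_ctx5_x` and `goto16_at1_ctx5_x`: after consuming `not` and seeing a space, the generated code looks ahead for `in`, and on a miss it jumps to the word loop (`goto5_ctx4_x`) instead of the `TK_NOT` leaf.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Whitespace,
    Word,
    Not,
    NotIn,
    Error,
}

fn is_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric()
}

/// Advance past word bytes starting at `pos` and return the end offset.
fn word_loop(src: &[u8], mut pos: usize) -> usize {
    while pos < src.len() && is_word_byte(src[pos]) {
        pos += 1;
    }
    pos
}

/// Lex one token starting at `start`; returns (kind, end offset).
/// `miss_after_space` is the hypothetical knob: the token produced when
/// "not " is NOT followed by "in". The expanded code behaves like `Kind::Word`.
fn lex_one(src: &[u8], start: usize, miss_after_space: Kind) -> Option<(Kind, usize)> {
    let first = *src.get(start)?;
    let pos = start + 1;
    match first {
        b' ' | b'\t' => {
            let mut end = pos;
            while end < src.len() && matches!(src[end], b' ' | b'\t') {
                end += 1;
            }
            Some((Kind::Whitespace, end))
        }
        b'n' => {
            // goto13: expect "ot", otherwise it is an ordinary word.
            if src.get(pos..pos + 2) != Some(&b"ot"[..]) {
                return Some((Kind::Word, word_loop(src, pos)));
            }
            let after_not = pos + 2;
            // goto15: decide what follows "not".
            match src.get(after_not) {
                Some(&b) if is_word_byte(b) => Some((Kind::Word, word_loop(src, after_not + 1))),
                Some(&b' ') => {
                    // goto16_at1: look ahead for "in" without consuming the space.
                    if src.get(after_not + 1..after_not + 3) == Some(&b"in"[..]) {
                        Some((Kind::NotIn, after_not + 3))
                    } else {
                        Some((miss_after_space, after_not))
                    }
                }
                _ => Some((Kind::Not, after_not)),
            }
        }
        b if b.is_ascii_alphabetic() => Some((Kind::Word, word_loop(src, pos))),
        _ => Some((Kind::Error, pos)),
    }
}
```

Calling `lex_one(b"not x", 0, Kind::Word)` reproduces the reported behaviour (a word spanning `not`), while passing `Kind::Not` gives what the priorities suggest should happen. Whether that is the right fix inside logos is a separate question; the sketch only localizes the miss branch.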
It might also be worth trying a debugger, like |
I haven't figured out debugging of the logos generated source yet (somehow RustRover doesn't seem to recognise the macro code). But I was able to reduce the reproduction code a bit further (maybe makes it easier to grasp the generated code):
use logos::Logos;
#[derive(Debug, Clone, Copy, PartialEq, Logos)]
#[allow(non_camel_case_types)]
pub enum SyntaxKind {
#[regex(r"[a-zA-Z][a-zA-Z0-9]*", priority = 2)]
TK_WORD,
#[token("not", priority = 10)]
TK_NOT,
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn single_not_works() {
let mut lexer = SyntaxKind::lexer("not");
assert_eq!(lexer.next(), Some(Ok(SyntaxKind::TK_NOT)));
}
#[test]
fn single_word_works() {
let mut lexer = SyntaxKind::lexer("word");
assert_eq!(lexer.next(), Some(Ok(SyntaxKind::TK_WORD)));
}
#[test]
fn but_this_does_not_work() {
let mut lexer = SyntaxKind::lexer("notword");
// FAILED because
// Left: Some(Ok(TK_WORD))
// Right: Some(Ok(TK_NOT))
assert_eq!(lexer.next(), Some(Ok(SyntaxKind::TK_NOT)));
assert_eq!(lexer.next(), Some(Ok(SyntaxKind::TK_WORD)));
}
}
Logos macro expansion:
use logos::Logos;
#[allow(non_camel_case_types)]
pub enum SyntaxKind {
#[regex(r"[a-zA-Z][a-zA-Z0-9]*", priority = 2)]
TK_WORD,
#[token("not", priority = 10)]
TK_NOT,
}
impl<'s> ::logos::Logos<'s> for SyntaxKind {
type Error = ();
type Extras = ();
type Source = str;
fn lex(lex: &mut ::logos::Lexer<'s, Self>) {
use ::logos::internal::{CallbackResult, LexerInternal};
type Lexer<'s> = ::logos::Lexer<'s, SyntaxKind>;
fn _end<'s>(lex: &mut Lexer<'s>) {
lex.end()
}
fn _error<'s>(lex: &mut Lexer<'s>) {
lex.bump_unchecked(1);
lex.error();
}
static COMPACT_TABLE_0: [u8; 256] = [
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
];
#[inline]
fn goto1_ctx1_x<'s>(lex: &mut Lexer<'s>) {
lex.set(Ok(SyntaxKind::TK_WORD));
}
#[inline]
fn pattern0(byte: u8) -> bool {
COMPACT_TABLE_0[byte as usize] & 1 > 0
}
#[inline]
fn goto2_ctx1_x<'s>(lex: &mut Lexer<'s>) {
while let Some(arr) = lex.read::<&[u8; 16]>() {
if pattern0(arr[0]) {
if pattern0(arr[1]) {
if pattern0(arr[2]) {
if pattern0(arr[3]) {
if pattern0(arr[4]) {
if pattern0(arr[5]) {
if pattern0(arr[6]) {
if pattern0(arr[7]) {
if pattern0(arr[8]) {
if pattern0(arr[9]) {
if pattern0(arr[10]) {
if pattern0(arr[11]) {
if pattern0(arr[12]) {
if pattern0(arr[13]) {
if pattern0(arr[14]) {
if pattern0(arr[15]) {
lex.bump_unchecked(
16,
);
continue;
}
lex.bump_unchecked(15);
return goto1_ctx1_x(
lex,
);
}
lex.bump_unchecked(14);
return goto1_ctx1_x(lex);
}
lex.bump_unchecked(13);
return goto1_ctx1_x(lex);
}
lex.bump_unchecked(12);
return goto1_ctx1_x(lex);
}
lex.bump_unchecked(11);
return goto1_ctx1_x(lex);
}
lex.bump_unchecked(10);
return goto1_ctx1_x(lex);
}
lex.bump_unchecked(9);
return goto1_ctx1_x(lex);
}
lex.bump_unchecked(8);
return goto1_ctx1_x(lex);
}
lex.bump_unchecked(7);
return goto1_ctx1_x(lex);
}
lex.bump_unchecked(6);
return goto1_ctx1_x(lex);
}
lex.bump_unchecked(5);
return goto1_ctx1_x(lex);
}
lex.bump_unchecked(4);
return goto1_ctx1_x(lex);
}
lex.bump_unchecked(3);
return goto1_ctx1_x(lex);
}
lex.bump_unchecked(2);
return goto1_ctx1_x(lex);
}
lex.bump_unchecked(1);
return goto1_ctx1_x(lex);
}
return goto1_ctx1_x(lex);
}
while lex.test(pattern0) {
lex.bump_unchecked(1);
}
goto1_ctx1_x(lex);
}
#[inline]
fn goto4_ctx2_x<'s>(lex: &mut Lexer<'s>) {
lex.set(Ok(SyntaxKind::TK_NOT));
}
#[inline]
fn goto7_ctx2_x<'s>(lex: &mut Lexer<'s>) {
let byte = match lex.read::<u8>() {
Some(byte) => byte,
None => return goto4_ctx2_x(lex),
};
match byte {
byte if pattern0(byte) => {
lex.bump_unchecked(1usize);
goto2_ctx1_x(lex)
}
_ => goto4_ctx2_x(lex),
}
}
#[inline]
fn goto6_ctx2_x<'s>(lex: &mut Lexer<'s>) {
match lex.read::<&[u8; 2usize]>() {
Some(b"ot") => {
lex.bump_unchecked(2usize);
goto7_ctx2_x(lex)
}
_ => goto2_ctx1_x(lex),
}
}
#[inline]
fn pattern1(byte: u8) -> bool {
const LUT: u64 = 576390375103528958u64;
match 1u64.checked_shl(byte.wrapping_sub(64u8) as u32) {
Some(shift) => LUT & shift != 0,
None => false,
}
}
#[inline]
fn goto8<'s>(lex: &mut Lexer<'s>) {
let byte = match lex.read::<u8>() {
Some(byte) => byte,
None => return _end(lex),
};
match byte {
b'n' => {
lex.bump_unchecked(1usize);
goto6_ctx2_x(lex)
}
byte if pattern1(byte) => {
lex.bump_unchecked(1usize);
goto2_ctx1_x(lex)
}
_ => _error(lex),
}
}
goto8(lex)
}
} |
Maybe it's somehow related to the greedy matching of the regex, but I wouldn't expect that if there is another token definition with higher priority 🤔. And as the other tests run just fine it is only a problem when there are more bytes after the |
I had been running into what seemed like a similar issue, where a longer match would fail and it would backtrack to the wrong token. What was also strange was that the
The code to reproduce my issue is a lot simpler as I was able to reduce it down to not actually perform any regex matching at all, although it still needed a
use logos::Logos;
#[derive(Debug, PartialEq, Eq, Logos)]
pub enum Token<'s> {
#[token("Foo")]
Foo(&'s str),
#[token("FooBar")]
FooBar(&'s str),
#[regex("FooBarQux")]
FooBarQux(&'s str),
}
#[cfg(test)]
mod tests {
use itertools::Itertools;
use logos::Logos;
use super::*;
#[test]
fn test() {
let lexer = Token::lexer("FooBarQ");
let expected: Vec<Result<_, ()>> = vec![Ok(Token::FooBar("FooBar"))];
let results = lexer
.zip_longest(expected)
.map(|x| x.left_and_right())
.unzip::<_, _, Vec<_>, Vec<_>>();
// left: [Some(Ok(Foo("FooBar")))]
// right: [Some(Ok(FooBar("FooBar")))]
assert_eq!(results.0, results.1);
}
}
Potential Fix?
I took some time today to look into it and the issue appeared to be in the I was able to resolve my issue by adding The change to the I'm not familiar enough with the logos internals to say if this is the correct fix though, as it doesn't fix the issue with @MalteJanz's minimal case, although perhaps that's actually a separate bug. The case originally posted by @MalteJanz does seem very similar to what I was doing, where I was matching a separator token ( |
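For reference, the backtracking behaviour expected in the `Foo` / `FooBar` / `FooBarQux` case can be sketched without logos at all. This is a minimal illustration using a hypothetical helper `lex_literal`, not the logos implementation: while trying to extend the match, it remembers the last position at which some literal fully matched, and falls back to that token and that length when the longer literal fails.

```rust
/// Return the longest literal that is a prefix of `input`, together with
/// its length, by walking the input byte by byte and remembering the most
/// recent accepting position.
fn lex_literal<'a>(input: &str, literals: &[&'a str]) -> Option<(&'a str, usize)> {
    let bytes = input.as_bytes();
    let mut last_accept = None;
    for len in 1..=bytes.len() {
        let prefix = &bytes[..len];
        // Stop as soon as no literal can still match this prefix.
        if !literals.iter().any(|l| l.as_bytes().starts_with(prefix)) {
            break;
        }
        // Record an accept: both the token and the length come from the
        // same point in the walk, so they cannot get out of sync.
        if let Some(l) = literals.iter().find(|l| l.as_bytes() == prefix) {
            last_accept = Some((*l, len));
        }
    }
    last_accept
}
```

With the literals from the test above and the input `FooBarQ`, the walk accepts `Foo` at length 3, then `FooBar` at length 6, fails to complete `FooBarQux`, and returns `FooBar` with length 6. The failure reported in the test corresponds to the token and the span coming from two different accept points.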
I looked into this one a bit, and I do think that it's a separate issue from the original one. The following code causes logos to generate the following graph.
pub enum SyntaxKind {
#[regex(r"[a-z]+", priority = 0)]
TK_WORD,
#[token("not", priority = 100)]
TK_NOT,
}
Following the path for
Sidenote: For anyone else that wants to check out the graph that logos is generating for their code, it turns out logos already has a nice
It does indeed seem that the regex is too greedy. I previously had an issue with a greedy regex where I had a token, like
let mut tokens = Vec::<Token>::new();
let mut lexer = Token::lexer(input);
while let Some(token) = lexer.next() {
if let Ok(token) = token {
tokens.push(token);
} else {
let consumed = lexer
.remainder()
.find(|c| CHARS_TEXT_END.contains(c))
.unwrap_or(lexer.remainder().len());
tokens.push(Token::Text(
&input[lexer.span().start..lexer.span().end + consumed],
));
lexer.bump(consumed);
}
}
For a case that does not require separators between tokens, the loop could probably be modified to collect the spans of errors, and then when the next matched token is found, those spans could be collapsed into a text token and inserted before the matched token. It seems like using greedy regexes with logos can cause some unexpected matching behaviours. In the |
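The error-collapsing variant described above could look roughly like the following. This is a sketch under assumptions, not code from the thread: `Piece` and `collapse_errors` are hypothetical names, the error type is taken to be `()`, and error spans are assumed to be contiguous. It works on any iterator of `(Result<T, ()>, Range<usize>)` pairs, which is the shape that `Lexer::spanned()` yields in logos.

```rust
use std::ops::Range;

#[derive(Debug, PartialEq)]
enum Piece<'s, T> {
    Token(T),
    Text(&'s str),
}

/// Merge runs of consecutive lexing errors into a single `Text` piece that
/// is emitted just before the next successfully matched token.
fn collapse_errors<'s, T>(
    input: &'s str,
    items: impl IntoIterator<Item = (Result<T, ()>, Range<usize>)>,
) -> Vec<Piece<'s, T>> {
    let mut out = Vec::new();
    // Span covering the current run of errors, if any.
    let mut pending: Option<Range<usize>> = None;
    for (result, span) in items {
        match result {
            Ok(token) => {
                if let Some(p) = pending.take() {
                    out.push(Piece::Text(&input[p]));
                }
                out.push(Piece::Token(token));
            }
            Err(()) => {
                // Extend the pending run, or start a new one.
                pending = Some(match pending.take() {
                    Some(p) => p.start..span.end,
                    None => span,
                });
            }
        }
    }
    // Flush a trailing run of errors at end of input.
    if let Some(p) = pending {
        out.push(Piece::Text(&input[p]));
    }
    out
}
```

Compared with the loop above, this version does not need `CHARS_TEXT_END` or `lexer.bump`, at the cost of letting the lexer report each unmatched byte as its own error first.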
After staring at the code some more, I believe the proper fix may actually be to just remove the Sidenote: After removing the From what I can see
For each of these cases I would expect I tested this change against my own library, the original code for this issue, and #160 and they are all passing. I'd be interested to hear your thoughts on this @jeertmans (and @maciejhirsz's if available). |
I browsed through some more of the GitHub issues and found that the fix mentioned above also fixes #279. |
Hey @jameshurst, thank you for your very deep analysis on this issue, and this might well solve a very long-running issue! A few comments / suggestions / questions.
Aah, I didn't know myself! I really should document this in the book :-)
Me neither, actually (the internals were mostly written by one person, who is not me, haha), but if we write enough tests (e.g., based on previous issues) and they now pass, I think it's safe to call it a fix!
Well, if that indeed fixes many issues, that is great news!
I'd be more than happy to help you with this and review your PR(s). Thanks! |
My thoughts are that you likely have better understanding of what's going on there than I do currently since it's been a long while. If it fixes even one bug and all tests are still passing, I say ship it. |
@maciejhirsz I'll send @jameshurst an email, in the hope to have some better details about this fix :-) |
@jameshurst You're our savior! In my case with a |
While attempting to prepare this fix for a PR I had noticed that the change actually breaks the |
@elenakrittik thanks for your commit link! I tested locally (on #378) and it passed all tests, except one: |
Oh, my bad, I didn't even notice that the tests were a separate package. I only did |
Hi @jameshurst, thanks for your contribution! It has solved the issue that was blocking me for several days. However, I noticed that the PR still does not address the following case:
pub enum Token {
#[regex(r"-?0+", priority = 10)]
A,
#[regex(r"(0|#)+")]
B,
}
The PR works well when the leading
In my scenario, I assume that 'priority' should take precedence over 'longest match'. I'm curious about the design principle of this library: does 'longest match' or 'priority' prevail? If priority is supposed to prevail, I could find some time to refine the code and submit a PR. |
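The two policies in question give different answers for this pair of patterns, which a small hand-written model makes concrete. This is only an illustrative sketch and makes no claim about which rule logos implements: `match_a` and `match_b` are hand-coded stand-ins for `-?0+` and `(0|#)+`, and the priority `1` for `B` is an assumed placeholder, since the original gives `B` no explicit priority.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Token {
    A,
    B,
}

/// Length of the match of `-?0+` at the start of `s` (0 means no match).
fn match_a(s: &[u8]) -> usize {
    let start = usize::from(s.first() == Some(&b'-'));
    let zeros = s[start..].iter().take_while(|&&b| b == b'0').count();
    if zeros == 0 { 0 } else { start + zeros }
}

/// Length of the match of `(0|#)+` at the start of `s` (0 means no match).
fn match_b(s: &[u8]) -> usize {
    s.iter().take_while(|&&b| b == b'0' || b == b'#').count()
}

/// All patterns that match, as (token, match length, priority).
fn candidates(s: &[u8]) -> Vec<(Token, usize, u32)> {
    vec![(Token::A, match_a(s), 10), (Token::B, match_b(s), 1)]
        .into_iter()
        .filter(|&(_, len, _)| len > 0)
        .collect()
}

/// Policy 1: the longest match wins; priority only breaks ties.
fn longest_first(s: &[u8]) -> Option<(Token, usize)> {
    candidates(s)
        .into_iter()
        .max_by_key(|&(_, len, prio)| (len, prio))
        .map(|(t, len, _)| (t, len))
}

/// Policy 2: the highest priority wins; length only breaks ties.
fn priority_first(s: &[u8]) -> Option<(Token, usize)> {
    candidates(s)
        .into_iter()
        .max_by_key(|&(_, len, prio)| (prio, len))
        .map(|(t, len, _)| (t, len))
}
```

For the input `00#`, `longest_first` picks `B` with length 3, whereas `priority_first` picks `A` with length 2 and leaves `#` for the next token. For `-00` only `A` matches, so both policies agree.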
Edit: This is not how it should work. I've realized I don't know what code should be generated from
enum Token {
#[token("a")]
A,
#[regex(r"ab*ac")]
Abac,
}
Original post: I'm very confused about this. I tried to make a very simple version of the failing test from #377 at https://github.com/MoisesPotato/logos/blob/9193a8a1764f000fbc81afb8bbd4d3b264877667/tests/tests/issue_265.rs#L160-L171. It's led me to believe that merging graph nodes is not working how it's supposed to. I think that the test at https://github.com/MoisesPotato/logos/blob/d4068920489d5d6ab3d16040b983ab604dbbcf1f/logos-codegen/src/graph/mod.rs#L567-L590 should pass:
#[test]
fn issue_265_merge() {
let mut graph = Graph::new();
let leaf1 = graph.push(Node::Leaf("::a"));
let leaf2 = graph.push(Node::Leaf("::abc"));
let mut to_merge = Rope::new("abc", leaf2).into_fork(&mut graph);
to_merge.merge(graph.fork_off(leaf1), &mut graph);
let mut branches = to_merge.branches();
let (_, new_rope) = branches.next().unwrap();
assert_eq!(to_merge, Fork::new().branch(b'a', new_rope).miss(leaf1));
let middle_node = graph.get(new_rope).unwrap();
match middle_node {
Node::Fork(_) | Node::Leaf(_) => panic!("Should be a rope"),
// It fails here: the node that is created fails instead of going to leaf1 on miss.
Node::Rope(rope) => assert_eq!(rope, &Rope::new("bc", leaf2).miss_any(leaf1)),
}
}
Let me try to explain. I would love to know if what I'm saying makes any sense at all, since I'm 40% sure I've understood what is going on. One end of the graph matches
The output doesn't match |
My logos lexer implementation somehow does not match the TK_NOT token when there is more input (like a whitespace) after it. Instead it matches the TK_WORD token in that case, which should be wrong when it has a lower priority.
Reproducible example with tests:
I know that the situation with TK_NOT and TK_NOT_IN is maybe not ideal (if I remove the latter it works again). But for my parser it would be way better to have these tokens rather than two separate TK_NOT and TK_IN tokens. I would be thankful for any suggestions that don't require me to remove either of TK_WORD or TK_NOT_IN to make the test but_this_does_not_work run.