Readline-based parser of markdown #3233

techee · 2021-12-22T17:46:15Z

We have a rather poor markdown parser in Geany, and when looking at uctags I realized it only has a regex-based one. From past experience, regex-based parsers tend to be rather slow, so for us it would be better to have a hand-written parser.

I created a simple readline-based parser based on the asciidoc parser and tried to preserve all the features of the regex-based parser (all kinds, full scope, sectionMarker field, running subparsers for code). Would such a parser be interesting for uctags or is the regex-based one the preferred solution?

This parser is based on the asciidoc parser and tries to preserve all features of the regex-based parser (all kinds, full scope, sectionMarker field, running subparsers for code).
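For readers unfamiliar with the line-oriented approach, the following is a minimal sketch of how heading lines can be recognized one line at a time without regular expressions. It is not code from this pull request: the helper names `atxHeadingLevel` and `setextUnderlineLevel` are hypothetical, lines are assumed to arrive with the trailing newline already stripped, and the rules are simplified (for example, leading indentation is not handled).

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not taken from parsers/markdown.c.
 * Returns the ATX heading level (1-6) of `line`, or 0 if the line is
 * not an ATX heading. */
static int atxHeadingLevel (const char *line)
{
	int level = 0;

	assert (line != NULL);
	while (line[level] == '#')
		level++;
	/* Require 1-6 markers followed by a space or the end of the line. */
	if (level < 1 || level > 6)
		return 0;
	if (line[level] != ' ' && line[level] != '\0')
		return 0;
	return level;
}

/* Hypothetical helper. Returns 1 for a "===" underline, 2 for a "---"
 * underline, and 0 otherwise. An underline turns the previous line
 * into a two-line (setext) heading. */
static int setextUnderlineLevel (const char *line)
{
	size_t n;
	char c;

	assert (line != NULL);
	c = line[0];
	if (c != '=' && c != '-')
		return 0;
	n = strspn (line, c == '=' ? "=" : "-");
	/* Only trailing spaces may follow the run of markers. */
	if (line[n + strspn (line + n, " ")] != '\0')
		return 0;
	return c == '=' ? 1 : 2;
}
```

A real parser also has to check that the line before a `---` underline is ordinary text, since the same characters can start a list item or a horizontal rule.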

codecov · 2021-12-22T19:23:53Z

Codecov Report

Merging #3233 (48f431f) into master (05e6ab4) will increase coverage by 0.27%.
The diff coverage is 95.12%.

@@            Coverage Diff             @@
##           master    #3233      +/-   ##
==========================================
+ Coverage   85.01%   85.28%   +0.27%     
==========================================
  Files         206      206              
  Lines       49127    49084      -43     
==========================================
+ Hits        41765    41862      +97     
+ Misses       7362     7222     -140

Impacted Files	Coverage Δ
parsers/markdown.c	`95.12% <95.12%> (ø)`
main/lregex.c	`81.94% <0.00%> (-1.14%)`	⬇️
main/field.c	`92.73% <0.00%> (-0.29%)`	⬇️
optlib/markdown.c

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

masatake · 2021-12-23T02:18:15Z

Thank you. Of course, I will merge this C implementation.

Let's fill the end: fields.

diff --git a/parsers/markdown.c b/parsers/markdown.c
index 173130e06..89358d89a 100644
--- a/parsers/markdown.c
+++ b/parsers/markdown.c
@@ -69,17 +69,22 @@ static NestingLevels *nestingLevels = NULL;
 *   FUNCTION DEFINITIONS
 */
 
-static NestingLevel *getNestingLevel(const int kind)
+static NestingLevel *getNestingLevel(const int kind, int adjustment_when_pop)
 {
        NestingLevel *nl;
        tagEntryInfo *e;
+       unsigned long line = getInputLineNumber();
 
        while (1)
        {
                nl = nestingLevelsGetCurrent(nestingLevels);
                e = getEntryOfNestingLevel (nl);
                if ((nl && (e == NULL)) || (e && (e->kindIndex >= kind)))
+               {
+                       if (e && line > adjustment_when_pop)
+                               e->extensionFields.endLine = line - adjustment_when_pop;
                        nestingLevelsPop(nestingLevels);
+               }
                else
                        break;
        }
@@ -88,7 +93,7 @@ static NestingLevel *getNestingLevel(const int kind)
 
 static int makeMarkdownTag (const vString* const name, const int kind, const bool two_line)
 {
-       const NestingLevel *const nl = getNestingLevel(kind);
+       const NestingLevel *const nl = getNestingLevel(kind, two_line? 2: 1);
        int r = CORK_NIL;
 
        if (vStringLength (name) > 0)

Could you include this?
My change does not seem to be perfect yet, so I will work on it more.

masatake · 2021-12-23T02:27:07Z

Instead of nestingLevelsNew, let's use nestingLevelsNewFull.

We can add a callback function that is called when a level is popped.

We can fill the end: fields in that callback function.

deleteBlockData is an example of such a callback function.

I will work on this topic tonight, JST.

masatake · 2021-12-23T02:43:35Z

The nesting level API must be extended to pass the line adjustment data.

diff --git a/main/nestlevel.c b/main/nestlevel.c
index d3403f78f..3ce87df09 100644
--- a/main/nestlevel.c
+++ b/main/nestlevel.c
@@ -29,7 +29,7 @@
 */
 
 extern NestingLevels *nestingLevelsNewFull(size_t userDataSize,
-										   void (* deleteUserData)(NestingLevel *))
+										   void (* deleteUserData)(NestingLevel *, void *))
 {
 	NestingLevels *nls = xCalloc (1, NestingLevels);
 	nls->userDataSize = userDataSize;
@@ -42,7 +42,7 @@ extern NestingLevels *nestingLevelsNew(size_t userDataSize)
 	return nestingLevelsNewFull (userDataSize, NULL);
 }
 
-extern void nestingLevelsFree(NestingLevels *nls)
+extern void nestingLevelsFreeFull(NestingLevels *nls, void *ctxData)
 {
 	int i;
 	NestingLevel *nl;
@@ -51,7 +51,7 @@ extern void nestingLevelsFree(NestingLevels *nls)
 	{
 		nl = NL_NTH(nls, i);
 		if (nls->deleteUserData)
-			nls->deleteUserData (nl);
+			nls->deleteUserData (nl, ctxData);
 		nl->corkIndex = CORK_NIL;
 	}
 	if (nls->levels) eFree(nls->levels);
@@ -89,13 +89,13 @@ extern NestingLevel *nestingLevelsTruncate(NestingLevels *nls, int depth, int co
 }
 
 
-extern void nestingLevelsPop(NestingLevels *nls)
+extern void nestingLevelsPopFull(NestingLevels *nls, void *ctxData)
 {
 	NestingLevel *nl = nestingLevelsGetCurrent(nls);
 
 	Assert (nl != NULL);
 	if (nls->deleteUserData)
-		nls->deleteUserData (nl);
+		nls->deleteUserData (nl, ctxData);
 	nl->corkIndex = CORK_NIL;
 	nls->n--;
 }
diff --git a/main/nestlevel.h b/main/nestlevel.h
index 18ac9927e..3154ae833 100644
--- a/main/nestlevel.h
+++ b/main/nestlevel.h
@@ -35,7 +35,7 @@ struct NestingLevels
 	int n;					/* number of levels in use */
 	int allocated;
 	size_t userDataSize;
-	void (* deleteUserData) (NestingLevel *);
+	void (* deleteUserData) (NestingLevel *, void *);
 };
 
 /*
@@ -43,11 +43,13 @@ struct NestingLevels
 */
 extern NestingLevels *nestingLevelsNew(size_t userDataSize);
 extern NestingLevels *nestingLevelsNewFull(size_t userDataSize,
-										   void (* deleteUserData)(NestingLevel *));
-extern void nestingLevelsFree(NestingLevels *nls);
+										   void (* deleteUserData)(NestingLevel *, void *));
+#define nestingLevelsFree(NLS) nestingLevelsFreeFull(NLS, NULL)
+extern void nestingLevelsFreeFull(NestingLevels *nls, void *ctxData);
 extern NestingLevel *nestingLevelsPush(NestingLevels *nls, int corkIndex);
 extern NestingLevel * nestingLevelsTruncate(NestingLevels *nls, int depth, int corkIndex);
-extern void nestingLevelsPop(NestingLevels *nls);
+#define nestingLevelsPop(NLS) nestingLevelsPopFull(NLS, NULL)
+extern void nestingLevelsPopFull(NestingLevels *nls, void *ctxData);
 #define nestingLevelsGetCurrent(NLS) nestingLevelsGetNthParent((NLS), 0)
 extern NestingLevel *nestingLevelsGetNthFromRoot(const NestingLevels *nls, int n);
 extern NestingLevel *nestingLevelsGetNthParent(const NestingLevels *nls, int n);
diff --git a/parsers/ruby.c b/parsers/ruby.c
index 2aab8d94d..2b2fd2594 100644
--- a/parsers/ruby.c
+++ b/parsers/ruby.c
@@ -695,7 +695,7 @@ static void attachMixinField (int corkIndex, stringList *mixinSpec)
 								  vStringValue (mixinField));
 }
 
-static void deleteBlockData (NestingLevel *nl)
+static void deleteBlockData (NestingLevel *nl, void *data CTAGS_ATTR_UNUSED)
 {
 	struct blockData *bdata = nestingLevelGetUserData (nl);

This way we can pass two_line to the callback function.
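To illustrate the idea, here is a small self-contained sketch of a nesting stack whose pop operation forwards caller-supplied context data to a callback, which then records the end line. It does not use the ctags API: `Tag`, `Level`, `LevelStack`, `stackPopFull`, `fillEndLine`, and `openHeading` are hypothetical stand-ins, and the adjustment of 2 lines for two-line headings and 1 line otherwise simply mirrors the `two_line? 2: 1` expression in the earlier patch.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 16

/* Simplified stand-ins for a tag entry and a nesting level (hypothetical). */
typedef struct {
	const char *name;
	unsigned long endLine;	/* 0 until the scope is closed */
} Tag;

typedef struct {
	Tag *tag;
	int kind;		/* heading level; 1 is the outermost */
} Level;

typedef struct {
	Level levels[MAX_DEPTH];
	int n;
	void (*onPop) (Level *, void *ctxData);
} LevelStack;

static void stackPush (LevelStack *s, Tag *tag, int kind)
{
	assert (s->n < MAX_DEPTH);
	s->levels[s->n].tag = tag;
	s->levels[s->n].kind = kind;
	s->n++;
}

/* Pop the current level, passing ctxData through to the callback. */
static void stackPopFull (LevelStack *s, void *ctxData)
{
	assert (s->n > 0);
	if (s->onPop)
		s->onPop (&s->levels[s->n - 1], ctxData);
	s->n--;
}

/* Callback: ctxData points at the line on which the popped scope ends. */
static void fillEndLine (Level *l, void *ctxData)
{
	const unsigned long *endLine = ctxData;

	if (endLine)
		l->tag->endLine = *endLine;
}

/* Called when a heading of `kind` is recognized while reading currentLine.
 * A two-line heading is recognized on its underline, so the previous
 * scope ended two lines earlier instead of one. */
static void openHeading (LevelStack *s, Tag *tag, int kind,
			 unsigned long currentLine, int twoLine)
{
	const unsigned long adjustment = twoLine ? 2 : 1;
	unsigned long endLine = currentLine > adjustment
		? currentLine - adjustment : 0;

	/* Close every level that is at the same depth or deeper. */
	while (s->n > 0 && s->levels[s->n - 1].kind >= kind)
		stackPopFull (s, endLine ? &endLine : NULL);
	stackPush (s, tag, kind);
}
```

With this shape, the code that opens a heading decides what the end line is, and the callback only stores it, so the stack itself needs no knowledge of one-line versus two-line headings.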

masatake · 2021-12-23T07:30:41Z

@techee, let me take over this pull request.

techee · 2021-12-23T08:26:24Z

> @techee, let me take over this pull request.

Sure, no problem, less work for me :-).

techee · 2021-12-23T08:28:27Z

What's the end: by the way? I thought it was something regex-specific.

masatake · 2021-12-23T14:31:03Z

end: is the name of a field. line: records where the tag is defined; end: records where the scope established by the tag ends.

$ cat -n /tmp/foo.c
     1	struct point 
     2	{
     3	  int x;
     4	  int y;
     5	};
     6	
     7	int
     8	main(void)
     9	{
    10	  return 0;
    11 }
    12

$ ctags -o - --fields=+ne /tmp/foo.c
main	/tmp/foo.c	/^main(void)$/;"	f	line:8	typeref:typename:int	end:11
point	/tmp/foo.c	/^struct point $/;"	s	line:1	file:	end:5
x	/tmp/foo.c	/^  int x;$/;"	m	line:3	struct:point	typeref:typename:int	file:	end:3
y	/tmp/foo.c	/^  int y;$/;"	m	line:4	struct:point	typeref:typename:int	file:	end:4
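A client such as an editor could read these fields back from a tags line roughly as follows. This is only a sketch: `tagNumericField` is a hypothetical helper, and it assumes the tab-separated layout shown in the output above, with the extension fields following the `;"` that terminates the pattern. A source line that itself contains `;"` would confuse this simplified lookup.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: look up a numeric extension field such as "line"
 * or "end" in one line of tags output. Returns 0 if the field is absent. */
static unsigned long tagNumericField (const char *tagLine, const char *field)
{
	size_t len;
	const char *p;

	assert (tagLine != NULL && field != NULL);
	len = strlen (field);

	/* Extension fields follow the `;"` that terminates the pattern. */
	p = strstr (tagLine, ";\"");
	if (p == NULL)
		return 0;

	while ((p = strchr (p, '\t')) != NULL)
	{
		p++;	/* skip the tab; p now points at the start of a field */
		if (strncmp (p, field, len) == 0 && p[len] == ':')
			return strtoul (p + len + 1, NULL, 10);
	}
	return 0;
}
```

For the `main` entry above, asking for `line` and `end` yields the first and last line of the function, which is enough to tell whether a given cursor line falls inside that scope.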

techee · 2021-12-23T14:42:10Z

Nice, I can imagine this information could be interesting for Geany too. I assume this is currently available only for some parsers, not all parsers reporting scope, right?

masatake · 2021-12-23T14:47:46Z

> Nice, I can imagine this information could be interesting for Geany too. I assume this is currently available only for some parsers, not all parsers reporting scope, right?

No, not all parsers. `grep endLine parsers/*.c` should tell you what you want to know :-).

masatake · 2021-12-23T15:42:55Z

See #3235.

masatake · 2021-12-23T15:53:16Z

parsers/markdown.c

+
+static int makeMarkdownTag (const vString* const name, const int kind, const bool two_line)
+{
+	const NestingLevel *const nl = getNestingLevel(kind);


In #3235, I moved the getNestingLevel() call to...

masatake · 2021-12-23T15:54:44Z

parsers/markdown.c

+	int r = CORK_NIL;
+
+	if (vStringLength (name) > 0)
+	{


...here. The nesting level should be popped when making a tag for the name.

masatake · 2022-01-02T13:20:35Z

The changes were merged via #3236.

techee added 3 commits December 22, 2021 18:43

Readline-based parser of markdown

7053ee6

This parser is based on the asciidoc parser and tries to preserve all features of the regex-based parser (all kinds, full scope, sectionMarker field, running subparsers for code).

Fix memory leak

1ea15d6

Fix vcxproj project files

48f431f

masatake mentioned this pull request Dec 23, 2021

Markdown: rewrite in C (based on #3233) #3235

Closed

masatake reviewed Dec 23, 2021

techee mentioned this pull request Dec 24, 2021

Markdown: improved version #3236

Merged

masatake closed this Jan 2, 2022
