Markdown: improved version #3236

techee · 2021-12-24T19:07:44Z

I fixed the problem with underlined headings from #3233 and in addition cleaned up the code a bit. Also, when running unit tests on the fixed code, there popped up a problem with code blocks using just indenting I fixed, more in the commit message (I had no clue one can do that, I haven't seen this documented anywhere but even github markdown works this way).

Signed-off-by: Masatake YAMATO <[email protected]>

@masatake

This parser is based on the asciidoc parser and tries to preserve all features of the regex-based parser (all kinds, full scope, sectionMarker field, running subparsers for code). Comparing with the regex based implementation, the rewritten parser is more than 10 times faster: [yamato@dev64]~/var/codebase% CTAGS_EXE=/home/yamato/var/ctags-github/ctags-Regex ./codebase ctags Markdown version: 5e6ab4b6 features: +wildcards +regex +gnulib_regex +iconv +option-directory +xpath +json +interactive +sandbox +packcc +optscript +pcre2 log: results/5e6ab4b6,Markdown............,..........,time......,default...,2021-12-24-00:27:22.log tagsoutput: /dev/null cmdline: + /home/yamato/var/ctags-github/ctags-Regex --quiet --options=NONE --sort=no --options=profile.d/maps --totals=yes --languages=Markdown -o - -R code/markdown-corpus code/progit code/selinux-handbook 686 files, 322854 lines (24506 kB) scanned in 2.0 seconds (12462 kB/s) 19246 tags added to tag file real 0m2.014s user 0m1.967s sys 0m0.035s + set +x [yamato@dev64]~/var/codebase% CTAGS_EXE=/home/yamato/var/ctags-github/ctags-C ./codebase ctags Markdown version: 258b9cbd features: +wildcards +regex +gnulib_regex +iconv +option-directory +xpath +json +interactive +sandbox +packcc +optscript +pcre2 log: results/258b9cbd,Markdown............,..........,time......,default...,2021-12-24-00:27:34.log tagsoutput: /dev/null cmdline: + /home/yamato/var/ctags-github/ctags-C --quiet --options=NONE --sort=no --options=profile.d/maps --totals=yes --languages=Markdown -o - -R code/markdown-corpus code/progit code/selinux-handbook 686 files, 322854 lines (24506 kB) scanned in 0.1 seconds (164033 kB/s) 19112 tags added to tag file real 0m0.190s user 0m0.172s sys 0m0.015s + set +x @masatake editted the commit log, deleted old implementation, and implemented code for filling "end:" fields.

codecov · 2021-12-24T19:20:50Z

Codecov Report

Merging #3236 (78320f2) into master (05e6ab4) will increase coverage by 0.25%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #3236      +/-   ##
==========================================
+ Coverage   85.01%   85.27%   +0.25%     
==========================================
  Files         206      207       +1     
  Lines       49127    49304     +177     
==========================================
+ Hits        41765    42043     +278     
+ Misses       7362     7261     -101

Impacted Files	Coverage Δ
main/nestlevel.c	`88.23% <100.00%> (ø)`
parsers/markdown.c	`100.00% <100.00%> (ø)`
parsers/ruby.c	`98.22% <100.00%> (+0.39%)`	⬆️
main/lregex.c	`82.07% <0.00%> (-1.01%)`	⬇️
parsers/tex.c	`97.47% <0.00%> (-0.32%)`	⬇️
main/field.c	`92.73% <0.00%> (-0.29%)`	⬇️
optlib/man.c	`0.00% <0.00%> (ø)`
parsers/haxe.c	`100.00% <0.00%> (ø)`
parsers/gemspec.c	`98.71% <0.00%> (ø)`
parsers/asm.c	`99.32% <0.00%> (+0.04%)`	⬆️
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 05e6ab4...78320f2. Read the comment docs.

masatake · 2021-12-25T05:21:54Z

Thank you for updating.

[yamato@dev64]~/var/ctags-github% cat input.md                                 
# A

- [a](#a)
  - [b](#b)
- [c](#c)

## B
[yamato@dev64]~/var/ctags-github% ./ctags --options=NONE --sort=no -o - input.md
ctags: Notice: No options will be read from files or environment
A	input.md	/^# A$/;"	c
- [b](#b)	input.md	/^  - [b](#b)$/;"	s	chapter:A
B	input.md	/^## B$/;"	s	chapter:A

I think tagging - [b](#b) is not expected. Could you fix this?

techee · 2021-12-25T09:53:37Z

I think tagging - b is not expected. Could you fix this?

Oh yeah, forgot about this possibility. Thanks for noticing, should be fixed now.

masatake · 2021-12-26T16:25:55Z

Thank you for updating.
With using the codebase (universal-ctags/codebase), I compared the difference between the old regex based parser and C parser.

I found 3 bug patterns. Could you try to fix them?

input0.md:

# LANGUAGE
## Java
## C++
## C\#

output:

LANGUAGE	/tmp/input0.md	/^# LANGUAGE$/;"	c
Java	/tmp/input0.md	/^## Java$/;"	s	chapter:LANGUAGE
C++	/tmp/input0.md	/^## C++$/;"	s	chapter:LANGUAGE
C\\	/tmp/input0.md	/^## C\\#$/;"	s	chapter:LANGUAGE

expectation:

LANGUAGE	/tmp/input0.md	/^# LANGUAGE$/;"	c
Java	/tmp/input0.md	/^## Java$/;"	s	chapter:LANGUAGE
C++	/tmp/input0.md	/^## C++$/;"	s	chapter:LANGUAGE
C\\#	/tmp/input0.md	/^## C\\#$/;"	s	chapter:LANGUAGE

input1.md:

# H1

```STR```

# H2

output:

H1	/tmp/input1.md	/^# H1$/;"	c

expectation:

H1	/tmp/input1.md	/^# H1$/;"	c
H2	/tmp/input1.md	/^# H2$/;"	c

The "frontmatter" area should be skipped.
#3031
a679933
#3027 (comment)
#3032

(@masatake must improve Units/parser-markdown.r/frontmatter.d/input.md to be able to detect the bug I will report here.")

input2.md:

---
title: Awesome Geek Podcasts
permalink: /
layout: default
---

The new parser makes a tag for layout: default unexpectedly.
It is part of the frontmatter.

techee · 2021-12-26T19:06:45Z

Hmm, I just found

https://github.github.com/gfm/

when looking for some formal specification because there are many ambiguities and there are a lot more things to handle. For instance, it's not clear to me how to handle things like

```sh```

I think this isn't a fenced code block

https://github.github.com/gfm/#fenced-code-blocks

but rather a code span

https://github.github.com/gfm/#code-spans

But it's not clear to me how in such a case I could distinguish these when I first encounter the opening ``` and how should I know the sh isn't an info string (i.e. the string representing the parser used as a subparser):

https://github.github.com/gfm/#info-string

Also, is there any formal specification for how the "frontmatter" should look like? This isn't anything markdown-related, I guess it's just used by some tools using markdown - is this always delimited by ---?

techee · 2021-12-26T19:20:00Z

Also, the markdown standard seems to suport empty headings - should I generate tags for these too? In addition, markdown seems to support multiline headings like

one
two
===

but I think I'll just give up on these.

masatake · 2021-12-26T19:39:18Z

Also, the markdown standard seems to suport empty headings - should I generate tags for these too?

No, at least this pull request.

If the old regex parser cannot do something, it is o.k. the new parser cannot do them.
However, if the old parser does something well, I would like to expect the new parser does them as well.
I myself will inspect the 3 patterns when I find a time.

techee · 2021-12-26T19:57:11Z

I myself will inspect the 3 patterns when I find a time.

I'll try to do something myself, it's not a problem to fix the issues you mentioned somehow, I was just wondering if you have some ideas as this "something" may not be the completely correct thing to do.

techee · 2021-12-26T20:09:20Z

However, if the old parser does something well, I would like to expect the new parser does them as well.

The question is what should happen in situations like ```foo``` - right now I think it's something where it is unspecified what should happen exactly. The fact that the previous parser did something that appeared like a correct output doesn't mean it did it well.

techee · 2021-12-26T23:59:18Z

@masatake I hope I fixed the problems. With the ``` problem it's more or less a workaround as handling code span and fenced code blocks (and distinguishing which one is which) would be quite complicated. I'll leave this to future generations of ctags coders :-).

techee · 2021-12-27T13:34:37Z

@masatake I just made minor formatting changes in the code to follow the style used by uctags more closely.

masatake · 2021-12-27T23:38:39Z

Could you split d37fea4 into smaller ones?

When you made a commit (commit C) changing the behavior of the parser, I would like you to include a test case for the new behavior to the commit C, or made a new commit for testing the new behavior just after the commit C.
It greatly helps me to understand your changes.
git push --force is welcome.

#3241 is an example following the above practice.

Headings defined using underline by either '=' or '-' may contain an arbitrary amount of '=' or '-' characters which don't have to be equal to the number of characters on the previous line (which was what the previous code did). Special attention has to be paid to code blocks using indentation (either by 4 or more spaces or a \t character). For instance abcd --- this is a code block followed by underline, while abcd --- is a 'abcd' heading.

The indent can be composed of combination of tabs and spaces and the previous code didn't take into account combinations like " \t". Fix that.

We have to check if a sequence of '-'s is followed by a visible character or not to distinguish a - b which is a list containing 'b' and a - which is a heading.

techee · 2021-12-29T21:03:00Z

@masatake Done.

masatake · 2021-12-31T16:15:23Z

I have tested the update changes.

In the following input:

  `O`
# A

(Two spaces before `O`)

the parser cannot extract A.
I found the root cause in 9e602a6.

diff --git a/parsers/markdown.c b/parsers/markdown.c
index 38e6122ef..6c9ee3d8f 100644
--- a/parsers/markdown.c
+++ b/parsers/markdown.c
@@ -219,7 +219,7 @@ static void findMarkdownTags (void)
 			char c = line[pos];
 			char otherC = c == '`' ? '~' : '`';
 			int nSame;
-			for (nSame = pos + 1; line[nSame] == line[pos]; ++nSame);
+			for (nSame = 1; line[pos + nSame] == line[pos]; ++nSame);
 
 			if (inCodeChar != otherC && nSame >= 3)
 			{

If you are o.k., could you include the above change to 9e602a6?

Could you do mkdir parser-markdown.r/backquote.d?
The directory includes

input.md:

  `O`
# A

args.ctags:

--sort=no

(though there is only one tag entry.)

expected.tags:

A	input.md	/^# A$/;"	c

Could you add this new test case to 9e602a6, too?

masatake · 2021-12-31T16:20:43Z

The next one is an extra request.
The original regex parser doesn't implement the request. So this is not a must item for merging.

Is it possible to skip  area?

<!--
# SKIP ME 1
-->

<!-- # SKIP ME 2 -->

# EXTRACT ME 1

techee · 2022-01-01T15:00:51Z

I found the root cause in 9e602a6.
If you are o.k., could you include the above change to 9e602a6?

Oh, yeah, totally, I didn't pay enough attention when updating the code, thanks for noticing, will fix.

Is it possible to skip area?

It should be no problem for the simple cases you provided but it would be much harder for something like

<!-- # SKIP ME 2 --> # INCLUDE ME <!--
# SKIP ME 3 --> # INCLUDE ME

I'll provide a patch for the simple cases only (it will basically be what is done for ``` right now).

…#,`,~,-,=

This isn't quite correct check as there may be multiple code spans on a single line and they can go over multiple lines but at least it fixes the case the previous regex parser handled.

…hen underlined

- use space between function name and ( - use camel case instead of underline in names

masatake · 2022-01-01T21:16:37Z

Thank you. I tried the new changes.

It seems that we can expect  is not.

With the following change, tags output became better:

diff --git a/parsers/markdown.c b/parsers/markdown.c
index effd72601..f8b664a3a 100644
--- a/parsers/markdown.c
+++ b/parsers/markdown.c
@@ -256,8 +256,7 @@ static void findMarkdownTags (void)
                        lineProcessed = true;
                }
                /* XML comment end */
-               else if (lineLen >= pos + 3 && line[pos] == '-' && line[pos + 1] == '-' &&
-                       line[pos + 2] == '>')
+               else if (inComment && strstr ((const char *)(line + pos), "-->"))
                {
                        inComment = false;
                        lineProcessed = true;

Could you include this change to 69be461?

techee · 2022-01-02T10:38:31Z

@masatake Yep, makes sense, done.

masatake · 2022-01-02T13:19:52Z

Thank you.

masatake and others added 2 commits December 23, 2021 23:54

main: pass an extra arg to CB func called when a nesting-level is popped

2e3d0c3

Signed-off-by: Masatake YAMATO <[email protected]>

techee mentioned this pull request Dec 24, 2021

Markdown: rewrite in C (based on #3233) #3235

Closed

masatake self-assigned this Dec 27, 2021

masatake mentioned this pull request Dec 28, 2021

Tex: parse some more latex macros #3239

Merged

techee added 4 commits December 29, 2021 21:57

Markdown: Improve indentation detection

aae0618

The indent can be composed of combination of tabs and spaces and the previous code didn't take into account combinations like " \t". Fix that.

Markdown: Fix incorrectly detected underlines

87d0d57

We have to check if a sequence of '-'s is followed by a visible character or not to distinguish a - b which is a list containing 'b' and a - which is a heading.

Markdown: Ignore leading whitespaces when getting heading and detect \#

bd8d8c4

techee force-pushed the markdown3 branch from 1b01b05 to 3bf393f Compare December 29, 2021 21:02

masatake self-requested a review December 30, 2021 09:05

techee added 3 commits January 1, 2022 17:56

Markdown: Allow up to 3 whitespaces in front of first character like …

d0d780e

…#,`,~,-,=

Markdown: Ignore preambule at file start between --- lines

2460571

Markdown: Detect ending ``` on the same line

35c6034

This isn't quite correct check as there may be multiple code spans on a single line and they can go over multiple lines but at least it fixes the case the previous regex parser handled.

techee added 2 commits January 1, 2022 18:19

Markdown: Make sure quotes (starting with >) aren't used as headers w…

3ec4209

…hen underlined

Markdown: Follow uctags formatting style

b1bd013

- use space between function name and ( - use camel case instead of underline in names

techee force-pushed the markdown3 branch from 3bf393f to 69be461 Compare January 1, 2022 19:46

Markdown: handle XML/HTML comments

78320f2

techee force-pushed the markdown3 branch from 69be461 to 78320f2 Compare January 2, 2022 10:38

masatake merged commit e7f2c05 into universal-ctags:master Jan 2, 2022

masatake mentioned this pull request Jan 2, 2022

Readline-based parser of markdown #3233

Closed

techee mentioned this pull request Apr 12, 2022

Use the upstream Markdown parser geany/geany#3161

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Markdown: improved version #3236

Markdown: improved version #3236

techee commented Dec 24, 2021

codecov bot commented Dec 24, 2021 •

edited

Loading

masatake commented Dec 25, 2021 •

edited

Loading

techee commented Dec 25, 2021

masatake commented Dec 26, 2021

techee commented Dec 26, 2021

techee commented Dec 26, 2021

masatake commented Dec 26, 2021

techee commented Dec 26, 2021

techee commented Dec 26, 2021

techee commented Dec 26, 2021

techee commented Dec 27, 2021

masatake commented Dec 27, 2021 •

edited

Loading

techee commented Dec 29, 2021

masatake commented Dec 31, 2021

masatake commented Dec 31, 2021

techee commented Jan 1, 2022

masatake commented Jan 1, 2022

techee commented Jan 2, 2022

masatake commented Jan 2, 2022

Markdown: improved version #3236

Markdown: improved version #3236

Conversation

techee commented Dec 24, 2021

codecov bot commented Dec 24, 2021 • edited Loading

Codecov Report

masatake commented Dec 25, 2021 • edited Loading

techee commented Dec 25, 2021

masatake commented Dec 26, 2021

techee commented Dec 26, 2021

techee commented Dec 26, 2021

masatake commented Dec 26, 2021

techee commented Dec 26, 2021

techee commented Dec 26, 2021

techee commented Dec 26, 2021

techee commented Dec 27, 2021

masatake commented Dec 27, 2021 • edited Loading

techee commented Dec 29, 2021

masatake commented Dec 31, 2021

masatake commented Dec 31, 2021

techee commented Jan 1, 2022

masatake commented Jan 1, 2022

techee commented Jan 2, 2022

masatake commented Jan 2, 2022

codecov bot commented Dec 24, 2021 •

edited

Loading

masatake commented Dec 25, 2021 •

edited

Loading

masatake commented Dec 27, 2021 •

edited

Loading