Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Markdown: improved version #3236

Merged
merged 12 commits into from
Jan 2, 2022
Merged

Conversation

techee
Copy link
Contributor

@techee techee commented Dec 24, 2021

I fixed the problem with underlined headings from #3233 and in addition cleaned up the code a bit. Also, when running unit tests on the fixed code, there popped up a problem with code blocks using just indenting I fixed, more in the commit message (I had no clue one can do that, I haven't seen this documented anywhere but even github markdown works this way).

masatake and others added 2 commits December 23, 2021 23:54
This parser is based on the asciidoc parser and tries to preserve all
features of the regex-based parser (all kinds, full scope,
sectionMarker field, running subparsers for code).

Comparing with the regex based implementation, the rewritten parser
is more than 10 times faster:

  [yamato@dev64]~/var/codebase% CTAGS_EXE=/home/yamato/var/ctags-github/ctags-Regex ./codebase ctags Markdown
   version: 5e6ab4b6
  features: +wildcards +regex +gnulib_regex +iconv +option-directory +xpath +json +interactive +sandbox +packcc +optscript +pcre2
  log: results/5e6ab4b6,Markdown............,..........,time......,default...,2021-12-24-00:27:22.log
  tagsoutput: /dev/null
  cmdline: + /home/yamato/var/ctags-github/ctags-Regex --quiet --options=NONE --sort=no --options=profile.d/maps --totals=yes --languages=Markdown -o - -R code/markdown-corpus code/progit code/selinux-handbook
  686 files, 322854 lines (24506 kB) scanned in 2.0 seconds (12462 kB/s)
  19246 tags added to tag file

  real	0m2.014s
  user	0m1.967s
  sys	0m0.035s
  + set +x
  [yamato@dev64]~/var/codebase% CTAGS_EXE=/home/yamato/var/ctags-github/ctags-C ./codebase ctags Markdown
  version: 258b9cbd
  features: +wildcards +regex +gnulib_regex +iconv +option-directory +xpath +json +interactive +sandbox +packcc +optscript +pcre2
  log: results/258b9cbd,Markdown............,..........,time......,default...,2021-12-24-00:27:34.log
  tagsoutput: /dev/null
  cmdline: + /home/yamato/var/ctags-github/ctags-C --quiet --options=NONE --sort=no --options=profile.d/maps --totals=yes --languages=Markdown -o - -R code/markdown-corpus code/progit code/selinux-handbook
  686 files, 322854 lines (24506 kB) scanned in 0.1 seconds (164033 kB/s)
  19112 tags added to tag file

  real	0m0.190s
  user	0m0.172s
  sys	0m0.015s
  + set +x

@masatake editted the commit log, deleted old implementation, and
implemented code for filling "end:" fields.
@codecov
Copy link

codecov bot commented Dec 24, 2021

Codecov Report

Merging #3236 (78320f2) into master (05e6ab4) will increase coverage by 0.25%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3236      +/-   ##
==========================================
+ Coverage   85.01%   85.27%   +0.25%     
==========================================
  Files         206      207       +1     
  Lines       49127    49304     +177     
==========================================
+ Hits        41765    42043     +278     
+ Misses       7362     7261     -101     
Impacted Files Coverage Δ
main/nestlevel.c 88.23% <100.00%> (ø)
parsers/markdown.c 100.00% <100.00%> (ø)
parsers/ruby.c 98.22% <100.00%> (+0.39%) ⬆️
main/lregex.c 82.07% <0.00%> (-1.01%) ⬇️
parsers/tex.c 97.47% <0.00%> (-0.32%) ⬇️
main/field.c 92.73% <0.00%> (-0.29%) ⬇️
optlib/man.c 0.00% <0.00%> (ø)
parsers/haxe.c 100.00% <0.00%> (ø)
parsers/gemspec.c 98.71% <0.00%> (ø)
parsers/asm.c 99.32% <0.00%> (+0.04%) ⬆️
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 05e6ab4...78320f2. Read the comment docs.

@masatake
Copy link
Member

masatake commented Dec 25, 2021

Thank you for updating.

[yamato@dev64]~/var/ctags-github% cat input.md                                 
# A

- [a](#a)
  - [b](#b)
- [c](#c)

## B
[yamato@dev64]~/var/ctags-github% ./ctags --options=NONE --sort=no -o - input.md
ctags: Notice: No options will be read from files or environment
A	input.md	/^# A$/;"	c
- [b](#b)	input.md	/^  - [b](#b)$/;"	s	chapter:A
B	input.md	/^## B$/;"	s	chapter:A

I think tagging - [b](#b) is not expected. Could you fix this?

@techee
Copy link
Contributor Author

techee commented Dec 25, 2021

I think tagging - b is not expected. Could you fix this?

Oh yeah, forgot about this possibility. Thanks for noticing, should be fixed now.

@masatake
Copy link
Member

Thank you for updating.
With using the codebase (universal-ctags/codebase), I compared the difference between the old regex based parser and C parser.

I found 3 bug patterns. Could you try to fix them?


input0.md:

# LANGUAGE
## Java
## C++
## C\#

output:

LANGUAGE	/tmp/input0.md	/^# LANGUAGE$/;"	c
Java	/tmp/input0.md	/^## Java$/;"	s	chapter:LANGUAGE
C++	/tmp/input0.md	/^## C++$/;"	s	chapter:LANGUAGE
C\\	/tmp/input0.md	/^## C\\#$/;"	s	chapter:LANGUAGE

expectation:

LANGUAGE	/tmp/input0.md	/^# LANGUAGE$/;"	c
Java	/tmp/input0.md	/^## Java$/;"	s	chapter:LANGUAGE
C++	/tmp/input0.md	/^## C++$/;"	s	chapter:LANGUAGE
C\\#	/tmp/input0.md	/^## C\\#$/;"	s	chapter:LANGUAGE

input1.md:

# H1

```STR```

# H2

output:

H1	/tmp/input1.md	/^# H1$/;"	c

expectation:

H1	/tmp/input1.md	/^# H1$/;"	c
H2	/tmp/input1.md	/^# H2$/;"	c

The "frontmatter" area should be skipped.
#3031
a679933
#3027 (comment)
#3032

(@masatake must improve Units/parser-markdown.r/frontmatter.d/input.md to be able to detect the bug I will report here.")

input2.md:

---
title: Awesome Geek Podcasts
permalink: /
layout: default
---

The new parser makes a tag for layout: default unexpectedly.
It is part of the frontmatter.

@techee
Copy link
Contributor Author

techee commented Dec 26, 2021

Hmm, I just found

https://github.github.com/gfm/

when looking for some formal specification because there are many ambiguities and there are a lot more things to handle. For instance, it's not clear to me how to handle things like

```sh```

I think this isn't a fenced code block

https://github.github.com/gfm/#fenced-code-blocks

but rather a code span

https://github.github.com/gfm/#code-spans

But it's not clear to me how in such a case I could distinguish these when I first encounter the opening ``` and how should I know the sh isn't an info string (i.e. the string representing the parser used as a subparser):

https://github.github.com/gfm/#info-string

Also, is there any formal specification for how the "frontmatter" should look like? This isn't anything markdown-related, I guess it's just used by some tools using markdown - is this always delimited by ---?

@techee
Copy link
Contributor Author

techee commented Dec 26, 2021

Also, the markdown standard seems to suport empty headings - should I generate tags for these too? In addition, markdown seems to support multiline headings like

one
two
===

but I think I'll just give up on these.

@masatake
Copy link
Member

Also, the markdown standard seems to suport empty headings - should I generate tags for these too?

No, at least this pull request.

If the old regex parser cannot do something, it is o.k. the new parser cannot do them.
However, if the old parser does something well, I would like to expect the new parser does them as well.
I myself will inspect the 3 patterns when I find a time.

@techee
Copy link
Contributor Author

techee commented Dec 26, 2021

I myself will inspect the 3 patterns when I find a time.

I'll try to do something myself, it's not a problem to fix the issues you mentioned somehow, I was just wondering if you have some ideas as this "something" may not be the completely correct thing to do.

@techee
Copy link
Contributor Author

techee commented Dec 26, 2021

However, if the old parser does something well, I would like to expect the new parser does them as well.

The question is what should happen in situations like ```foo``` - right now I think it's something where it is unspecified what should happen exactly. The fact that the previous parser did something that appeared like a correct output doesn't mean it did it well.

@techee
Copy link
Contributor Author

techee commented Dec 26, 2021

@masatake I hope I fixed the problems. With the ``` problem it's more or less a workaround as handling code span and fenced code blocks (and distinguishing which one is which) would be quite complicated. I'll leave this to future generations of ctags coders :-).

@masatake masatake self-assigned this Dec 27, 2021
@techee
Copy link
Contributor Author

techee commented Dec 27, 2021

@masatake I just made minor formatting changes in the code to follow the style used by uctags more closely.

@masatake
Copy link
Member

masatake commented Dec 27, 2021

Could you split d37fea4 into smaller ones?

When you made a commit (commit C) changing the behavior of the parser, I would like you to include a test case for the new behavior to the commit C, or made a new commit for testing the new behavior just after the commit C.
It greatly helps me to understand your changes.
git push --force is welcome.

#3241 is an example following the above practice.

Headings defined using underline by either '=' or '-' may contain an
arbitrary amount of '=' or '-' characters which don't have to be equal
to the number of characters on the previous line (which was what the
previous code did).

Special attention has to be paid to code blocks using indentation
(either by 4 or more spaces or a \t character). For instance

    abcd
---

this is a code block followed by underline, while

  abcd
---

is a 'abcd' heading.
The indent can be composed of combination of tabs and spaces and
the previous code didn't take into account combinations like " \t".
Fix that.
We have to check if a sequence of '-'s is followed by a visible character
or not to distinguish

a
- b

which is a list containing 'b' and

a
-

which is a heading.
@techee
Copy link
Contributor Author

techee commented Dec 29, 2021

@masatake Done.

@masatake masatake self-requested a review December 30, 2021 09:05
@masatake
Copy link
Member

I have tested the update changes.

In the following input:

  `O`
# A

(Two spaces before `O`)

the parser cannot extract A.
I found the root cause in 9e602a6.

diff --git a/parsers/markdown.c b/parsers/markdown.c
index 38e6122ef..6c9ee3d8f 100644
--- a/parsers/markdown.c
+++ b/parsers/markdown.c
@@ -219,7 +219,7 @@ static void findMarkdownTags (void)
 			char c = line[pos];
 			char otherC = c == '`' ? '~' : '`';
 			int nSame;
-			for (nSame = pos + 1; line[nSame] == line[pos]; ++nSame);
+			for (nSame = 1; line[pos + nSame] == line[pos]; ++nSame);
 
 			if (inCodeChar != otherC && nSame >= 3)
 			{

If you are o.k., could you include the above change to 9e602a6?

Could you do mkdir parser-markdown.r/backquote.d?
The directory includes

input.md:

  `O`
# A

args.ctags:

--sort=no

(though there is only one tag entry.)

expected.tags:

A	input.md	/^# A$/;"	c

Could you add this new test case to 9e602a6, too?

@masatake
Copy link
Member

The next one is an extra request.
The original regex parser doesn't implement the request. So this is not a must item for merging.

Is it possible to skip <!-- ... --> area?

<!--
# SKIP ME 1
-->

<!-- # SKIP ME 2 -->

# EXTRACT ME 1

@techee
Copy link
Contributor Author

techee commented Jan 1, 2022

I found the root cause in 9e602a6.
If you are o.k., could you include the above change to 9e602a6?

Oh, yeah, totally, I didn't pay enough attention when updating the code, thanks for noticing, will fix.

Is it possible to skip area?

It should be no problem for the simple cases you provided but it would be much harder for something like

<!-- # SKIP ME 2 --> # INCLUDE ME <!--
# SKIP ME 3 --> # INCLUDE ME

I'll provide a patch for the simple cases only (it will basically be what is done for ``` right now).

techee added 3 commits January 1, 2022 17:56
This isn't quite correct check as there may be multiple code spans
on a single line and they can go over multiple lines but at least it
fixes the case the previous regex parser handled.
techee added 2 commits January 1, 2022 18:19
- use space between function name and (
- use camel case instead of underline in names
@masatake
Copy link
Member

masatake commented Jan 1, 2022

Thank you. I tried the new changes.

It seems that we can expect <!-- is at the beginning of lines.
However, we cannot expect --> is not.

With the following change, tags output became better:

diff --git a/parsers/markdown.c b/parsers/markdown.c
index effd72601..f8b664a3a 100644
--- a/parsers/markdown.c
+++ b/parsers/markdown.c
@@ -256,8 +256,7 @@ static void findMarkdownTags (void)
                        lineProcessed = true;
                }
                /* XML comment end */
-               else if (lineLen >= pos + 3 && line[pos] == '-' && line[pos + 1] == '-' &&
-                       line[pos + 2] == '>')
+               else if (inComment && strstr ((const char *)(line + pos), "-->"))
                {
                        inComment = false;
                        lineProcessed = true;

Could you include this change to 69be461?

@techee
Copy link
Contributor Author

techee commented Jan 2, 2022

@masatake Yep, makes sense, done.

@masatake masatake merged commit e7f2c05 into universal-ctags:master Jan 2, 2022
@masatake
Copy link
Member

masatake commented Jan 2, 2022

Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants