Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide access to a parse tree as an alternative to HTML output #1086

Closed
derhuerst opened this issue Feb 19, 2016 · 94 comments · Fixed by #2404
Closed

Provide access to a parse tree as an alternative to HTML output #1086

derhuerst opened this issue Feb 19, 2016 · 94 comments · Fixed by #2404
Assignees
Labels
enhancement An enhancement or new feature parser
Milestone

Comments

@derhuerst
Copy link

I've just had a quick look at hightlight.js and didn't any discussion on this. So forgive me if this is not the right place or has already been discussed.

I'd like to propose to make highlight.js more generally usable, or at least pull out the core highlighting logic to make it independent from the output format. Just like the famous Pygments has formatters, the HTML generation would be done by a separate module/package.

highlight.js seems to be the de factor standard from syntax highlighting, even among porject not related to HTML/browsers. See the slap editor and hicat for examples: Both use it and later on manually walk/manipulate the HTML generated by highlight.js.

In the world of lightweight do-one-thing NPM packages, it would make sense to split highlight.js into two parts, one generating an treeish object representation (possibly streaming), and the second generating HTML from it.

@isagalaev
Copy link
Member

This idea is hanging in the air for a while. It's never been a priority as it lacked real use-cases, but now as you mentioned those two it makes very real sense. Another problem is that in one or two attempts at doing this it was done as a part of bigger (and controversial) refactorings of the whole core library, so it haven't got merged. I will probably get to it myself eventually, but if you'd fancy giving it a try, I could give some guidance.

@derhuerst
Copy link
Author

I will probably get to it myself eventually, but if you'd fancy giving it a try, I could give some guidance.

I'm having difficulties in understanding the inner workings of highlight.js, both because there is not a lot of documentation/comments and because the code is written in a really imperative jump-around way.

Also, it seems like HTML generation is really baked into the core of highlight.js.

@isagalaev
Copy link
Member

Yes, it was deliberately done so in the past to squeeze the code size as much as possible. Currently this "optimization target" is much less desirable and that's why I'm all for refactoring this.

I'll do a short writeup about the architecture of the core to make it more comprehensible.

@derhuerst
Copy link
Author

I'll do a short writeup about the architecture of the core to make it more comprehensible.

This sounds great, especially since there seems to be no render-agnostic snytax highlighter in JavaScript land right now.

@isagalaev
Copy link
Member

I'll do a short writeup about the architecture of the core to make it more comprehensible.

A status update on this… Before writing anything I decided to first do a pretty significant upgrade to the parser that I was planning to do since a long time ago anyway but a few bugs came up lately bumped the importance of this, and incidentally I've finally came up with an actual plan on how to implement it. And there'd be no point in documenting the current core as it will change immediately after that.

(If interested, the change in question is one I described in this blog post, under the "Complex modes" section.)

@dbkaplun
Copy link
Contributor

dbkaplun commented Jul 6, 2016

Author of slap text editor here. I was going to open a new ticket for this but looks like @derhuerst phrased it very well.

highlight.js has many other uses than highlighting in the browser. It would be great if parsing was decoupled from HTML generation. See also slap-editor/editor-widget#134. We are considering porting our current highlighting implementation using highlight.js, in combination with HTML manipulation with cheerio, to another option that would not require any HTML manipulation such as lowlight (which itself depends on highlight.js but does not seem to perform any HTML manipulation).

Glad to help out in any way if this feature request moves forward!

@okonet
Copy link

okonet commented Mar 5, 2017

I'm using https://github.com/wooorm/lowlight and writing my RTF formatter ATM so I'd be interested in support for AST out of the box. I think this would allow adding different formatters as pygment does.

@joshgoebel joshgoebel added the enhancement An enhancement or new feature label Oct 7, 2019
@joshgoebel joshgoebel self-assigned this Oct 7, 2019
@joshgoebel
Copy link
Member

Just like the famous Pygments has formatters, the HTML generation would be done by a separate module/package.

I think our HTML stuff has some unique benefits, but if there were two pieces people could always choice to use the one we bundled or replace it with a different once.

I've been digging into the parser a lot, so I'll keep this in mind.

@derhuerst Any idea what that parsed but not HTML format might look like?

@derhuerst
Copy link
Author

@derhuerst Any idea what that parsed but not HTML format might look like?

Usually you annotate Abstract Syntax Trees (ASTs) generated by a parser with additional fields, e.g. highlight.js: {color: '#123123', bold: true}.

The specific mechanism to add this info of course depends on the specific AST format being used. The most well-known one is ESTree used by Mozilla and the Babel AST format (an extension of ESTree). There's also the unist AST format, which tries to support generic trees of markup, e.g. for HTML and for Markdown.

All of these AST formats have dozens or even hundreds of libraries built around them, supporting all kinds of use cases, from parsing to transformations to formatting.

@joshgoebel
Copy link
Member

If we add this support to highlight.js natively, do we kill the whole lowlight project?

@wooorm
Copy link

wooorm commented Oct 15, 2019

Yes, probably!
Although, highlight.js main focus is to create an HTML string, whereas lowlights main focus is the syntax tree.
So I imagine highlight’s API to still be mostly the same, with new functions added on top, whereas lowlights internals will change but API stay the same?

@joshgoebel
Copy link
Member

Well sure, we wouldn't change our API. We'd likely add a new method or two to return some sort of node tree instead of HTML... and maybe some method to turn a node tree into HTML... then we'd just wrap ourselves by glueing the two together with the old api.

@joshgoebel
Copy link
Member

@wooorm Before building your own parser that I assume is derived from ours did you try to first add the functionality to our source? If so was there a reason you gave up on that approach and went a different way?

@wooorm
Copy link

wooorm commented Oct 20, 2019

I needed hast, and hast was new (my invention). If I recall correctly, people weren't interested in ASTs as much back then so I rolled my own instead of raising an issue requesting my own format to be supported

@joshgoebel
Copy link
Member

Well I meant the core functionality. Once you had a list of parsed tokens you could turn it 8nto any format pretty easily I’d think.

@wooorm
Copy link

wooorm commented Oct 20, 2019

The core functionality is the same for highlight.js and lowlight, except in code style, and in one creating HTML and the other a syntax tree. I think my above comment clarifies that no, I did not raise an issue, and gives reasoning for why not? Here is a brief history for lowlight.

@joshgoebel
Copy link
Member

joshgoebel commented Oct 20, 2019

The core functionality is the same for highlight.js and lowlight, except in code style,

Except it's not:

#2209
#2179
#2135

All 3 of these will likely bite you eventually unless you are watching closely and porting the same changes to your rewrite... if I were you I wouldn't want to keep up with things like that... that's one of the disadvantages of rewriting vs reusing the core engine.

@wooorm
Copy link

wooorm commented Oct 21, 2019

Yes, there are differences, and for the last almost-four-years I’ve worked on porting them over.

You assume that I want to maintain lowlight like this for years to come. I’d prefer for highlight to support syntax trees. You already asked this and I responded:

Yes, probably! [...] So I imagine highlight’s API to still be mostly the same, with new functions added on top, whereas lowlights internals will change but API stay the same?

@joshgoebel
Copy link
Member

Just making sure you were aware. :-) I might play around with this idea just to see what the hook in points are... I don' think this would be a difficult thing to do, just need to figure out how to do it nicely. Our existing system isn't very modular. :)

@joshgoebel
Copy link
Member

joshgoebel commented May 3, 2020

And again, all this is considered private API and might stay so... it's mostly to make advanced integrations with stuff like lowlight possible and MUCH simpler. Instead of forking our whole library like you used to now you should be able to do the same thing with 50-100 lines of code.

If we made an API public public we might start with the tiny builder API... so we'd support:

var builder = new PDFBuilder();
result.emitter.walk(builder)

Now you only implement (or whatever we named them):

  • open
  • close
  • text

Done. :-) And our internal Tree data structure handles all the messy stack details/close all nodes, etc...

@wooorm
Copy link

wooorm commented May 3, 2020

And again, all this is considered private API and might stay so...

This issue is about creating a public API for that. My suggestions aren‘t for how things work internally, but for what is exposed

@joshgoebel
Copy link
Member

So question to you: Can you not build lowlight on the builder API alone?

@joshgoebel
Copy link
Member

joshgoebel commented May 3, 2020

The layering is about how low-level you want to be. The fastest way to get output is a Builder, that would be what I'd steer most non-advanced implementers to, and I think it makes sense as a simple 3 method public API... the most complex and customizable way is an Emitter. Builder is trivially simple, Emitter requires more effort on the part of the implementer.

The way you're talking it sounds like you really want to write a Builder not an Emitter because you don't want the emitter responsibilities.

Builders are dumb, emitters have to have brains and be able to have an intelligent dialog with the parser. Such as "hey, add a keyword to the output for me".

The parser itself WANTS to deal with an emitter, not a builder. So that it doesn't concern itself with the output concerns.

@wooorm
Copy link

wooorm commented May 3, 2020

I was able to get up to a point, but there are/were other issues which I raised in other issues.
I don’t want to update to a private API though. My above comments are feedback on API design that would solve this issue.

If we made an API public public we might start with the tiny builder API... so we'd support:

You may do that! My suggestion is to support the bare minimum that we need now, in a low-level way that should be powerful enough to do most things. I posted HTML and hast making above, I can see how emphasize could also be made to work for ANSI coloring. I don’t know about PDFs so I can’t make suggestions for that.

Re adapter / builder / emitter: I didn’t study CS; I don’t understand the subtleties you intend between these terms.

@joshgoebel
Copy link
Member

Re adapter / builder / emitter: I didn’t study CS; I don’t understand the subtleties you intend between these terms.

No subtly there, those terms are mostly made up (they may have some meanings I dunno, I did try to give them meaningful names) - but it's just what I decided to call them. The separation of concerns and different layers of abstraction are the real things to take away from what I'm talking about.

I was able to get up to a point, but there are/were other issues which I raised in other issues.

Registration of aliases and getting a new instance are solvable - no objections there. :-) Though I worry about regular users making instances they don't need, but I guess that's a documentation issue?

I don’t want to update to a private API though.

I'd think a semi-private API is far better than forking the whole project. You update to the semi-private API and then pin to the latest compatible version of HLJS. Even if it remains private there are no huge plans to rejigger this stuff all the time. And now updating 90% of the time is just a matter of bumping the version and running your test suite. No more porting over deep internal parser bug fixes/changes.

That sounds worlds better than what you are doing now - treating the whole codebase as a HUGE private api. :-)

@joshgoebel
Copy link
Member

Now we have a chicken in egg problem because I don't feel these things are best designed/used in the dark. I'd want a few people to actually use the system before we ever declared it a public API and locked ourselves into it for at least a major revision cycle and perhaps MUCH longer.

So I think the cycle has to start with advanced integrators actual using the new stuff in the real world - even if it's not blessed for general public usage and not 100% API stable (from one version to the next). That's what NPM version pinning is more IMHO.

@joshgoebel
Copy link
Member

But even if no one ever uses it I still think the internals of the parser have been much improved by these changes... EVEN if we dropped the token tree in the future and went back to pure strings as of right I'd still use the same API for the emitter interface. (I use interface here in the TypeScript/API sense).

@joshgoebel
Copy link
Member

joshgoebel commented May 3, 2020

So question to you: Can you not build lowlight on the builder/walk API alone? My guess it would be a lot less code and a lot closer to what you're asking for.

@joshgoebel
Copy link
Member

Also there are stability/debugging issues as well related to how these concerns are separated. The plain reading of this issues title (I didn't go back and read the whole thread) was requesting a tree output alternative to string output.

That's what Emitter#walk provides. It does so fully separately from the parsing task. It's impossible for a broken Builder to break Highlight.js itself. Say you had a huge spike of memory or a freeze... it's very easy to draw the line and say:

  • Highlight.js is broken
  • Your builder is broken

Vs an emitter, which is running in parallel with the parsing process. It'd be very easy to turn the parse tree into JSON and pass it around, render it later, etc... so the idea of writing an Emitter (that the parser is talking to real-time) vs writing a Builder (that the Tree walker can use after the fact to generate another form of presentation) are different conceptually also - not just in their complexity of implementation.

@wooorm
Copy link

wooorm commented May 3, 2020

hljs now has an __emitter option. I assumed that was the way to go. I did that, and it mostly works. You asked for my thoughts, I gave them based on my existing experience buildings parsers and maintaining hundreds of open source projects.

For context: 53% of hljs downloads on npm come from lowlight, it’s used a lot, I don’t want to provide a potentially bad user experience by depending on private APIs. And I think if lowlight would implement this api, “I'd want a few people to actually use the system before we ever declared” is pretty much solved.

You are the maintainer of hljs and you are free to implement these features the way you want!

@joshgoebel
Copy link
Member

If I've offended I have not meant to. I'm just discussing and explaining my thinking.

“I'd want a few people to actually use the system before we ever declared” is pretty much solved.

I meant use/integrate with the API - I don't mean end users, I mean integrators. I assume before and after the end user experience for your library will remain 100% unchanged - assuming the new functionality on our end actually works, so those really aren't useful data points. It's not the end user experience of your library users we're talking about, it's the integrator experience - ie, your experience.

And at this point you and I are merely two individual data points as having implemented working Emitters. :-) If you can I still urge you to try writing a builder and see how that feels since you're working at a much higher level. It would be helpful to have feedback on that.

hljs now has an __emitter option. I assumed that was the way to go. I did that, and it mostly works.

This is the problem with features without docs - and truly sorry if I steered ya in the wrong direction originally. :-) It's a fine way to go, but I just don't think it was clear that you're working at the lowest-possible level available (short of forking the library). So the API is indeed a bit more complex at that level. The builder API is a higher level API and therefore much simpler.

If I'd had the hindsight of this whole discussion I may have suggested the builder from the get-go, but hindsight is 20/20. :-)

@joshgoebel
Copy link
Member

joshgoebel commented May 3, 2020

I don’t want to provide a potentially bad user experience by depending on private APIs.

Do you publish on NPM? Can you not pin your library against a "known good" version of ours? I guess I fail to see how forking the whole library is better/easier than pinning to a specific version of HLJS that provides a specific "known good" API - whether private or not.

Again, isn't forking a whole project kind of the "worse case" scenario - in that case you're treating the whole project as a giant private API... if we change a ton of stuff, you're "stuck" on the prior version until you manually update.

How is that any different than using a semi-private API - other than the fact that the amount of code you have to maintain is now reduced by an order of magnitude? If we change a ton of stuff in general, you now get all those changes for free. If we change a ton of stuff specifically related to the integration API... you're "stuck" on the prior version until you manually update.

Seems like option 2 is better than option 1, or at least no worse... no? Am I missing something? What is the big advantage of "vendoring it" (having a static/forked copy to integrate with) vs having a static versioned copy via npm?

I'm not sure the goal here was ever to make a 100% public API that we have to support to the same level of stability that we support our other public APIs. My goal was to simplify the parser, separate the concerns better, and while doing so make it so that integrators (like yourself) wouldn't necessarily have to maintain a whole custom copy of the library just to get between the parser <=> render process.

Even if you insist on maintaining a fork now (for whatever reason), I'm guessing that fork should be 95% easier to maintain. Hooking in your custom code is now a 5 line patch to the core library (assuming we go ahead and add registerAlias and newInstance) vs... well, whatever it was before. Even without that it's probably more like a 50 line patch (swap out emitter, add registerAlias, add newInstance)... plus your lowlight code itself of course. At least that's my understanding of things.


based on my existing experience buildings parsers and maintaining hundreds of open source projects.

Where do you find the time? :-) Then surely you understand my hesitation to add and document a public and stable API for a tiny, tiny subset or our users (advanced integrators) and then be locked into that API for a year (or longer). A year is the pace I currently imagine for major releases where we allow ourselves to truly "break" things.

@joshgoebel
Copy link
Member

joshgoebel commented May 3, 2020

It's entirely possible you know more about parsers than me if you maintain a lot of them - I'm just doing the best I can with the knowledge and experience I have. :-) I have been programming for professionally for 25+ years, so hopefully that's worth something... :-)

My suggestions aren‘t for how things work internally, but for what is exposed

So perhaps the confusion is over "what do we expose". I see emitter as a lower-level implementation concern. It's the interface the parser needs so that it can focus on parsing and not presentation. I made it swappable because I feel that's good design (clean separation) plus at the time I imagined we might have to ALSO ship a pure text emitter in additional to the tree emitter... until I did a few tests and the memory/CPU didn't seem to be all that different with the tree version.

A side benefit of that (for someone like yourself) is that now instead of maintaining a port of the whole project - you can just track the emitter/builder API - even if the internals aren't 100% API stable. So when we release a new version you now rewrite 100 lines instead of potentially having to rewrite the whole library. :-)

I'm not interested in changing this API all the time for no good reason. I expect it to be pretty stable, but that's different than making a promise to everyone that it's 100% stable.


Internally, you’re now building a token tree which you’re not using, that’s unneeded memory!

FYI: I do agree with the thinking here - and that was originally my concern also, but it didn't seem to matter enough to outweigh what I see as a big benefit of having the tree preserved as a potential output. And if someone is using this to constantly to parse millions of lines of code and finding memory/CPU with the new implementation a real pain point, then it's only 50 LOC or so for them to entirely replace the emitter - and problem solved.

And I'd perhaps even be fine with that emitter living in the core library (then we could debate the merits of each and which was a better default) but then we'd also need a way to run all the tests against both emitters, and that wasn't a high priority of mine to figure out that part.

@wooorm
Copy link

wooorm commented May 3, 2020

If I've offended I have not meant to. I'm just discussing and explaining my thinking.

You do communicate a lot, which is mostly a great thing, but it’s many messages with many edits, and for me it’s hard to keep up! I feel like I’m out of depth (I don’t know, or necessarily want to know, much about highlightjs or how it is maintained); I don’t want my opinion to hold too much weight. My way of dealing with many messages is going in reverse and sending short responses back to keep the conversation manageable for me.


Do you publish on NPM? Can you not pin your library against a "known good" version of ours? I guess I fail to see how forking the whole library is better/easier than pinning to a specific version of HLJS that provides a specific "known good" API - whether private or not.

The fork is from 4+ years ago. We’re already pinning hljs to feature releases, because over the last 4 years it hasn’t adhered to semver, where breaking changes occurred in feature released. I’m personally strongly against lockfiles and completely locking versions.
Yes, lowlight works with the new emitter. I want to release that. Or with (a combination of) my suggestions. I don’t care. But I do want it to be public and good API. What public and good means is up to you.


How is that any different than using a semi-private API - other than the fact that the amount of code you have to maintain is now reduced by an order of magnitude or two? If we change a ton of stuff in general, you now get all those changes for free. If we change a ton of stuff specifically related to the integration API... you're "stuck" on the prior version until you manually update.

Other that initially building lowlight, I have probably spent on 20 hours over 4 years on the 18 releases and it’s been very stable. Pretty sure I’m about to hit those hours discussing this issue. I want a working version that follows a contract where hljs provides something that lowlight can safely work with: it could look like anything, such as how it currently is (but public) or maybe a bit more simple


I'm not sure the goal here was ever to make a 100% public API that we have to support to the same level of stability that we support our other public APIs. My goal was to simplify the parser, separate the concerns better, and while doing so make it so that integrators (like yourself) wouldn't necessarily have to maintain a whole custom copy of the library just to get between the parser <=> render process.

ANYTHING you expose, is public. This emitter is public. The .emitter.rootNode is public. hljs 10 broke lowlight, which was working on public APIs for years, and is now public! Public doesn’t mean it never changes.

With the rest of the paragraph: I understand! And I’m 👍 with whatever you decide to implement.


You are free to create a tree! That’s what was originally asked, and what you delivered. But the original question was a tree instead of HTML. We now have both. And there are 4 ways to integrate. For a tree, you could go with something that is well defined and highly used. Such as hast, but it was decided to go with a tree. That means that anyone now has to adapt the custom hljs tree to what they want. Be it HTML or hast or ANSI or PDF. And you’re doing both a tree and HTML!

I think a simpler solution would be: return whatever is returned by that __emitter. Make the emitter powerful enough for hljs HTML, lowlight hast, or ANSI, but support only 4 methods in the API. And use the emitter internally by building that HTML so it’s 100% tested.

What I suggest is less public api surface than currently in v10.

P.S. lowlight (and thus hljs) is used in many static site generators, such as Gatsby or Next.js, and it’s running on some blogs with thousands of articles: performance is really important, bazillions of lines are being highlighted

@joshgoebel
Copy link
Member

You do communicate a lot, which is mostly a great thing, but it’s many messages with many edits, and for me it’s hard to keep up!

Blessing and a curse. :) One of your messages just sounded a bit frustrated so I was trying to defuse that if possible.

I don’t want my opinion to hold too much weight.

No worries. I try not to let any single voice hold too much weight. :)

The fork is from 4+ years ago. We’re already pinning hljs to feature releases, because over the last 4 years it hasn’t adhered to semver, where breaking changes occurred in feature released.

If we do that now, let me know, but we're trying not to. Though technically it might depend on what you consider a "breaking change". When you fork the library and have to hand or hand merge things back in I'd imagine far too many things are "breaking changes". On the other hand technically no releases actually break you since you're a fork... :-)

I’m personally strongly against lockfiles and completely locking versions.

I can understand that I suppose. I was suggested it as a way to "hedge your bets".

Other that initially building lowlight, I have probably spent on 20 hours over 4 years on the 18 releases and it’s been very stable.

Interesting and impressive, I'd have guessed more.

Yes, lowlight works with the new emitter. I want to release that. Or with (a combination of) my suggestions. I don’t care. But I do want it to be public and good API. What public and good means is up to you.

Well if it's up to me then I declare it "good" already, and we're left discussion "public". :-) I am considering simpler method names for the builder though...

I want a working version that follows a contract where hljs provides something that lowlight can safely work with

That's the hope. :-)

ANYTHING you expose, is public. This emitter is public. The .emitter.rootNode is public. hljs 10 broke lowlight, which was working on public APIs for years, and is now public! Public doesn’t mean it never changes.

I don't mean "public" in the purely technical sense. Way too much in JS is public in that sense (although I'm aware there are ways to make things more private)... I'm instating referring to "stable" and "documented" vs "beta" and "undocumented"... sometimes I hate technical talk and if the confusion here is my fault, I'm sorry.

For example:

function highlight(languageName, code, ignore_illegals, continuation) {

Someone calling this from v9 to v10 should get almost exactly (or exactly) the same output keys (well we added a few) - other than grammar improvements, bug fixes, or actual regressions (which we'd fix), etc. That API is documented and very stable. I plan to deprecate it by v11, but it'll probably still continue to work past until v12... this despite the fact that probably over 50% of the internals have been rejiggered from v9 to v10.

That's the type of documented stability/public that I'm talking about when I say stable/public. A guarantee that sites can upgrade from v10 to v10.1 to v10.2 to v10.283 and everything should "just work" with 0 changes by end users.

If you merely want a technically "public" API - defined as "not hidden inside a closure" then you've already got that as of v10 and I have no plans to take it away. But I also don't consider it stable, yet. v10.3 might break it, if it was necessary or useful for us to do so.

You are free to create a tree! That’s what was originally asked, and what you delivered. But the original question was a tree instead of HTML. We now have both.

Yes, use whichever you prefer. I think for many use cases this does no harm. If you can show otherwise concretely I'm all ears.

And there are 4 ways to integrate.

LOL. I was speaking to you very technically as someone whose been forking the project for YEARS of what you could do INSTEAD of that. If someone with no context asked me the same question "how do I use the tree" I'd say the easy/stable way was "write a builder", not give them those 4 choices I gave you. :-)

For a tree, you could go with something that is well defined and highly used. Such as hast, but it was decided to go with a tree.

Yes, it was an intentional choice - I wanted the parser and emitter isolated from the presentation layer. Hast (at a quick glance) is awful close to HTML semantics, that seems to be it's key intention:

hast is a specification for representing HTML (and embedded SVG or MathML) as an abstract syntax tree.

It also looks a fair bit more complex than the simpler low-level solution I implemented.

That means that anyone now has to adapt the custom hljs tree to what they want. Be it HTML or hast or ANSI or PDF.

Well no, you get HTML for free... :-) but... Yes, again this is mostly intentional. I'm trying to build foundational layers, allow easy access for plugins and extendability - not solve (and then have to maintain!) everything directly. Previously getting any type of tree from the parser was nigh on impossible because of the tight coupling - you decided the easiest way to do it was to fork the whole project and maintain that fork. (I wonder if just parsing the HTML back into a tree would have been simpler.)

  • Now you can implement Hast with ~50 LOC (as an emitter it seems). (that's pretty much what lowlight is doing, right?)
  • The HTML renderer is ~50 LOC (and could possibly be simplified).
  • My whole tree class + default emitter is ~80 LOC.
  • Building an HTML emitter and skipping the intermediate tree would likely be ~50 LOC.
  • Other simple tree-like emitters (JSON, XML, YAML) would probably have similar requirements.

And you’re doing both a tree and HTML!

Again, this could change with time - IF it proves to be an issue in reality.


I think a simpler solution would be: return whatever is returned by that __emitter. Make the emitter powerful enough for hljs HTML, lowlight hast, or ANSI, but support only 4 methods in the API. And use the emitter internally by building that HTML so it’s 100% tested.

I agree completely - you are suggesting a simpler solution, I just think that is at odds with the fact this is a low-level API. I designed it purposely for integration with the parser. That's it's primary mission - help simplify and make the parser happy. If that means you have to write an addKeyword method that is just a more elegant abstraction around calling 3 other methods, I think that's a reasonable trade-off...

And if we provided a subclass to extend from then (and assuming addKeyword is really always open/text/close, an assumption I question) you'd have exactly the 4 method API you want.
You'd just extend the subclass and then not worry about those 3 extra methods if you didn't need them. Some types of output DO require them. (like HTML, which needs to be able to close all it's tags)

What I suggest is less public api surface than currently in v10.

Again, try writing a builder (3 methods only) and then tell me your thoughts. I'm not saying that's the best approach for Lowlight, but it is a simple API and for many people it'd work just fine. Many use the library to highlight a small amount of code on a website page/article. If you want the maximum speed and ability to "bypass" default HLJS behavior, then you are naturally working at a lower-level and I believe for a low-level API Emitter is already pretty darn simple/elegant. (for your use case: 5 required methods, and 3 NOOPs).

P.S. lowlight (and thus hljs) is used in many static site generators, such as Gatsby or Next.js, and it’s running on some blogs with thousands of articles: performance is really important, bazillions of lines are being highlighted

If you had a benchmark suite or server setup to measure this stuff more precisely that'd be nice. Or even if you had a large project that you could build with say the built-in rendered plus a custom text only emitter and show numbers from both, that could prove useful to know. Right now we just use the test suite itself as a rough benchmark - since I run it all the time and would get annoyed if it were any slower.

@joshgoebel
Copy link
Member

Again: (pretty sure I said it earlier) If the tree based emitter is shown to be significantly worse/slower (a significant regression compared to a text only emitter) I think we'd definitely accept a PR to make a text only emitter the default option - if it solved the problem of how to run the test suite against both emitters.

But so far I'm unconvinced the low-level API needs any real changes - other than perhaps a small subclass to prevent writing the NOOPs by hand. (I will think about the naming... text vs addText)... that's something I could possibly imagine changing - though it's also possible we just have different ideas on what good naming is. :)

@joshgoebel
Copy link
Member

joshgoebel commented May 3, 2020

And give me a few days to think/dream on things too... sometimes I do come around a few days later... :-)

Though you do have me debating the necessity of toHTML() vs a more generic output() or output getter... the assumption then being the the emitter output() returns whatever eventually ends up in the result object as the value attribute. I think that is one of the things you mentioned above. :-)

That might remove one method from the interface IF you were happy with exposing your raw data object as an attribute on your emitter vs a method or getter.

IE:

+++ b/src/highlight.js
-      result = emitter.toHTML();
+      result = emitter.output;
+++ b/src/lib/token_tree.js
-  toHTML() {
+  get output() {

I'll give it more thought. Again that whole method being there bothers me since it tightly couples the emitter to the renderer, which isn't good IMHO. This would be far better:

result = new options.__renderer({tree: emitter.output }).render()

The tree emitter knows nothing about rendering and it's output is merely a tree which HTMLRenderer knows how to render.

But then that means you need both a __renderer key in additional to an __emitter key...

@joshgoebel
Copy link
Member

I think at some point you said you expected a different API for trees vs HTML, and I don't completely disagree but I think that's something we'd solve in the future with a better API overall for highlight (if we decide to make emitters a more public facing feature):

var result = highlight(code, {language: "pascal", emitter: Lowlight.Emitter })
// result.value is HAST tree generated by lowlight

@joshgoebel
Copy link
Member

joshgoebel commented May 3, 2020

Three example cases for addKeyword (other than it being required by the parser):

  1. A binary output format of some type... a keyword might merely be say 0x00, 'K',"function",0x00... no open tag, no close tag.
  2. A text stream, not sure WHY LOL, but open/close wouldn't be needed at all here. Technically it might not hurt to call the NOOPs, but that's just busy work.
  3. An emitter that was capturing the list of emitted commands for some other purpose (perhaps to replay them later, rewind them, re-order them, etc)... "add a keyword" is a very different concept than "new node", "add text", "close node".... and hence the emitter can preserve that fidelity if it chooses to.

I think your making some assumptions about the output stream (that it's HTML or an AST of some kind) and what's important to preserve that I simply am not making, and hence the addKeyword being a required method - because a keyword is not the same thing as a "named node with text".

Actually at a later point if all you had were nodes there would be no trivial way to distinguish a keyword from a named node...

@vladshcherbin
Copy link

It would be nice to have a function which returns just the tree which highlight function uses inside, this separates tree creation and html generation for cases when it's not needed. Something like:

// returns just the tree
function buildTree(language, code) {}

// uses this function inside and producess html
function highlight(language, code) {
  const tree = buildTree(language, code)

  return { value: produceHtml(tree) }
}

I slightly read the above conversation and I believe this was one of wishes. I need same thing, just the tree without html stuff, a public function for doing just that.

@joshgoebel
Copy link
Member

I need same thing, just the tree without html stuff, a public function for doing just that.

The current behavior is intended more for advanced integrators - so I'd love to hear more about your actual use case - how and why you're using Highlight.js, etc... Often more is learned by hearing the exact problem itself than the desired solution - because often there are much better solutions to be found.

I'd also love to see some evidence that the HTML generation itself is actual a measurable computational expense. I'd wager the parsing/tree building is where MOST of the expense is.

If you ONLY need the tree always you could have it with a one line patch to TokenTreeEmitter's prototype - simply stub/mock out toHTML to do no work. Or better yet drop your own toWhatever function that does whatever you'd like - preferable for maximum stability it'd use walk - not access the tree data directly. Ie, write a Renderer (20-50 LOC for simple cases).

One other choice (if we prove that there is a real cost here) would just be to make value a lazy calculation instead of doing it in advance. If you don't want the HTML, don't access value and you won't get it.

Simply providing your own emitter is another choice for only 30-60 LOC for simple cases.

@joshgoebel
Copy link
Member

Until I talk to more people who are using this or trying to I'm hesitant to lock down an official public API and what the "tree" is exactly...

@joshgoebel
Copy link
Member

joshgoebel commented May 19, 2020

I think it's been written before but if not my driving motivation here was to clean up the Highlight.js codebase and make it more modular - separating the output concern from the parsing concern. Previously to do anything like this you needed to fork the whole project and then maintain that fork across time, etc. What a pain. Now you can build a custom emitter with only 50-60 LOC or less - using the standard distribution via NPM, because it exposes __emitter as a beta/undocumented API.

The end goal was never necessarily to build a highlightAndReturnTree function. (that may have been the request, but it was never the goal). If someone wanted to publish a 3rd party highlightTree() plugin on their own, they could trivially do so (compared to before). @wooorm's project does that - and quite a bit more I think.

Related, my thread on building a plug-in culture rather than doing everything in the core library:
#2225

The fact that we ship an internal parse tree was perhaps [or not] my getting a bit carried away. Though I did want to make sure the API I built could easily support a tree based system. In hindsight there is some question of whether we should just ship a simpler string emitter... And we may still end up doing that again in the future.

The major winning point is that now advanced integrations are FAR easier - and the code is more modular. And that will remain no matter how the data is represented internally.

@vladshcherbin
Copy link

The current behavior is intended more for advanced integrators - so I'd love to hear more about your actual use case - how and why you're using Highlight.js, etc...

My use case is actually simple, I have a sql query and want to highlight it to show in console with colors. For console I don't need any html generation, just parsed tree to iterate and add colors.

@joshgoebel
Copy link
Member

joshgoebel commented May 19, 2020

Sounds like a perfect opportunity to me - 3rd party ANSI highlighter plugin. :-) That might actually be super helpful for a CLI use of HLJS.

@wooorm
Copy link

wooorm commented May 19, 2020

@vladshcherbin if you're in need of something right now, than maybe https://github.com/wooorm/emphasize can help

@joshgoebel
Copy link
Member

@wooorm Figures that you already went and did it. 🥇

@vladshcherbin
Copy link

@wooorm yeah, I found it from your discussion above and already tried (it works). Thought that since there's tree generation in highlight itself now, you can probably try to drop lowlight dependency. However, I don't know the details so it's just a thought 🤔

Thank you

@joshgoebel
Copy link
Member

joshgoebel commented May 19, 2020

Lowlight wants a very specific kind of tree. :-) But that can be good thing for those who want to integrate with it. I think once Lowlight switches to the new API (hopefully?) it might be the perfect place to point anyone who is asking "just give me the tree" as you were originally.

@wooorm Did you not create a separate issue for getting a new instance of HLJS, I thought you had - but I can't find it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement An enhancement or new feature parser
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants