Add tree-sitter based highlighter #5099

Draft
wants to merge 3 commits into
base: master

pjungkamp
Contributor

Description

This introduces a tree-sitter highlighter that maps tree-sitter captures to kakoune faces.
This allows kakoune to highlight some recursive grammars that regions-based highlighting is not able to parse (e.g. Nix, Shell, Python).

  • The default mapping from query captures to kakoune's faces is based on nvim-treesitter.
  • A subset of language injections is supported; these can be added as tree-sitter-injection sub-highlighters.
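
To illustrate what a capture-to-face mapping can look like, here is a minimal sketch. It is not the PR's code: the table contents, the name face_for_capture, and the strategy of dropping trailing dotted segments until a match is found are all assumptions made for illustration. Only the face names on the right-hand side are standard kakoune faces.

```cpp
#include <cassert>
#include <optional>
#include <string_view>
#include <unordered_map>

// Hypothetical capture -> face table; the PR's actual defaults may differ.
static const std::unordered_map<std::string_view, std::string_view> capture_faces = {
    {"comment", "comment"},
    {"string", "string"},
    {"keyword", "keyword"},
    {"function", "function"},
    {"function.builtin", "builtin"},
    {"type", "type"},
    {"variable", "variable"},
    {"constant", "value"},
};

// Resolve a capture such as "keyword.control.return" by trying the full name
// first, then dropping trailing ".segment" parts until an entry is found.
std::optional<std::string_view> face_for_capture(std::string_view capture)
{
    // Expects the bare capture name, without the '@' used in query files.
    assert(capture.empty() || capture.front() != '@');
    while (!capture.empty())
    {
        if (auto it = capture_faces.find(capture); it != capture_faces.end())
            return it->second;
        const auto dot = capture.rfind('.');
        if (dot == std::string_view::npos)
            break;
        capture = capture.substr(0, dot);
    }
    return std::nullopt; // no face: leave the text unstyled
}
```

With this fallback a specific capture like function.builtin can get its own face, while captures the table does not list (function.method, say) still resolve to the face of their parent.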

Example - Nix

This is an example from a Nix codebase I recently visited.

The problem here is that a Nix string can contain interpolated Nix expressions, which in turn can have nested strings with the same delimiter (e.g. "outer string ${ let var = "inner string"; in var }").

  • Regions based highlighting:
    Screenshot from 2024-02-06 14-08-50

  • tree-sitter based highlighting:
    Screenshot from 2024-02-06 14-08-16
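
Why this needs recursion can be sketched with a tiny scanner. This is a deliberately simplified model of Nix strings (double-quoted strings, backslash escapes, ${ ... } interpolation and nested braces only), not tree-sitter's algorithm and not code from this PR; skip_string and skip_interpolation are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

constexpr std::size_t npos = std::string_view::npos;

std::size_t skip_string(std::string_view src, std::size_t pos);

// `pos` is just past a "${" (or a nested '{'); returns the index just past
// the matching '}', or npos if the input ends first.
std::size_t skip_interpolation(std::string_view src, std::size_t pos)
{
    while (pos < src.size()) // also stops when a nested call returned npos
    {
        const char c = src[pos];
        if (c == '}')
            return pos + 1;
        if (c == '"')
            pos = skip_string(src, pos);            // nested string
        else if (c == '{')
            pos = skip_interpolation(src, pos + 1); // nested braces
        else
            ++pos;
    }
    return npos;
}

// `pos` is at an opening '"'; returns the index just past the closing '"',
// or npos if the string is unterminated.
std::size_t skip_string(std::string_view src, std::size_t pos)
{
    assert(pos < src.size() && src[pos] == '"');
    ++pos;
    while (pos < src.size())
    {
        const char c = src[pos];
        if (c == '\\')
            pos += 2; // escaped character
        else if (c == '"')
            return pos + 1;
        else if (c == '$' && pos + 1 < src.size() && src[pos + 1] == '{')
            pos = skip_interpolation(src, pos + 2);
        else
            ++pos;
    }
    return npos;
}
```

Called on the example above, skip_string only stops at the last quote, because the inner string is consumed by the nested call. A matcher that simply looks for the next unescaped delimiter would stop at the quote that opens "inner string".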

Building

This highlighter is optional and can be excluded by passing tree_sitter=no to make.

Related Issues

TODO

  • proper tree edits with cached trees (currently reparses the whole file instead of adjusting trees incrementally)
  • lazily compile grammars (if precompiled grammars are part of the configuration tree, a config couldn't be shared across architectures)
  • decide on a file path for grammars and queries (currently %val{runtime}/grammars)
  • support more injections (especially combined injections; this would allow us to e.g. highlight the embedded Bash in the example above)

@tototest99

Hello,
are you aware of kak-tree-sitter and the merge initiative around redundant LSP features?

{ create_tree_sitter_highlighter, &tree_sitter_desc } });
registry.insert({
"tree-sitter-injection",
{ create_tree_sitter_injection_highlighter, &tree_sitter_injection_desc } });
Contributor

Personally, I don't care about highlighting but I'm interested in commands to select syntax tree nodes etc.
LSP provides some of that but it's not optimized for it.

In general, integrations with external tools live in scripts in rc/.
I wonder if that could work for this feature too?
We can add any missing generic highlighter types like the InjectionHighlighterApplier to support these cases.

I think there is great value in having an obvious boundary between C++ core and scripts. It keeps us honest.
With the shared library approach, tree sitter can do something that other integrations cannot.

I wonder what's the difference to https://github.com/phaazon/kak-tree-sitter ?
As a user, I think it would be great if we concentrate most effort on one approach.

In either case, I think tree sitter integration is highly valuable and I'd probably follow whichever approach gains traction.

Thanks

@hadronized
Contributor

Hello,

I think it’s an interesting approach, but at the same time, it bothers me a bit. Not because I’m the author of kak-tree-sitter and I spent a lot of time working on it. But because of the fact that I love the philosophy behind Kakoune, and I think you’re bending it. The reason for that is that Kakoune ships with zero dependencies. What makes it use external tools is basically just calling-out to external programs via %sh{} or kak -p. You would argue that you have :git commands, but look closer: those are just Kakoune commands wrapping around the git, awk, perl etc. commands.

However, I do think that what I made with KTS is super complex and that Kakoune needs more tooling to make it easier to integrate (and I think @krobelus would also benefit from that for kak-lsp). For instance, in KTS, I have to start a daemonized server to handle a parsed representation of buffers for a set of sessions. That means that Kakoune must stream its buffers’ content to KTS (which can be expensive). I optimized that by writing directly to a FIFO opened by KTS (i.e. there is no shell creation to do so, it’s just a basic, low-level write to the FIFO). Even with that, I think we could do better, because you need to do that for all « integrations » (KTS, kak-lsp, kak-whatnot, etc.).

@mawww
Owner

mawww commented Feb 10, 2024

This is indeed an impressive PR, but I do not intend to merge it for a few reasons.

First I do not want to have multiple, competing, built-in highlighting to maintain, and I do not want to solely rely on tree-sitter for highlighting. Having both will likely lead to one bitrotting with time.

Second (and most importantly), as noted by @phaazon, this goes against Kakoune's design principles: it introduces a dependency directly in core for functionality that could be implemented externally. I do agree that there are some limitations at the moment. I have not looked at kak-tree-sitter, but I suspect it is a far more complex codebase than what you did in that PR, and I hope we can find a way to simplify how external plugins that rely on buffer content work.

@pjungkamp
Contributor Author

Motivation

I just spent some time with typst, a language that suffers extraordinarily badly from the limitations of the regions highlighter. I tried to write kakoune highlighters that worked both for the markdown and code regions in typst, but I ran into so many unfixable highlighting errors that I looked into other types of highlighting.

kak-tree-sitter

The most promising approach that I saw for accurate highlighting is probably tree-sitter. I did check out kak-tree-sitter. I'm using it and love it! But it feels very much less responsive than what I'd expect from kakoune, especially when using a power-saving governor and platform profile on my laptop...

The inherent problem there is that it can't use the ts_tree_edit API efficiently to make reparsing of large buffers cheap; instead it reparses the entire buffer sent over a pipe, and it has to use a kak -p process to actually report the highlighted ranges back to kakoune.

I do have some ideas for the second problem, e.g. a kak -P <session> flag that in contrast to kak -p <session> keeps running and allows multiple commands to be passed to kakoune.

The first problem, though, led me to write the code in this PR. I checked out kak-tree-sitter's code and the tree-sitter project itself. This draft is just a POC of integrating tree-sitter with kakoune's internal structures. My main goal is to see where I'd have to introduce IPC interfaces to expose the necessary information to drive a more efficient tree-sitter plugin without introducing the tree-sitter dependency.

I wouldn't want to have this tree-sitter highlighter PR merged into master either. It does not fit kakoune's goals. You can't even compile statically because of the dynamic loading of parsers.

Some thoughts

My goal is to supply the ts_tree_edit API with the changes in a buffer. The code in this PR can't do that, it reparses the entire buffer.

The tree-sitter library takes all positions on edits as both point (row & column) based and byte offset based coordinates. The kakoune struct Buffer is optimized for line-based editing and only provides point-based views of changes. See Buffer::changes_since. I don't quite see a way to add efficient annotations of byte offsets to struct Buffer.
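
To make the gap concrete, here is a minimal sketch of the straightforward conversion from a point to a byte offset. It does not use kakoune's actual Buffer type: the buffer is assumed to be a plain vector of lines that keep their trailing newline, columns are assumed to be counted in bytes, and Point, line_starts and byte_offset are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Point
{
    std::size_t line;
    std::size_t column; // counted in bytes
};

// Byte offset at which each line starts. Rebuilding this table walks every
// line of the buffer.
std::vector<std::size_t> line_starts(const std::vector<std::string>& lines)
{
    std::vector<std::size_t> starts;
    starts.reserve(lines.size());
    std::size_t offset = 0;
    for (const auto& line : lines)
    {
        starts.push_back(offset);
        offset += line.size();
    }
    return starts;
}

// With the table in hand, converting a point is a single lookup.
std::size_t byte_offset(const std::vector<std::size_t>& starts, Point p)
{
    assert(p.line < starts.size());
    return starts[p.line] + p.column;
}
```

The lookup is cheap, but any edit shifts the start of every later line, so a naive table has to be rebuilt (or patched from the edited line onward) after each change. That per-edit cost is the part that is hard to fit into a line-oriented buffer.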


TLDR: I opened this PR because I was tinkering around and wanted to see who's interested and active on this topic.

@hadronized
Contributor

About the partial updates / edit in place, I have this still open about that topic. It’s not something I have started working on because I want to stabilize the performance and features already (and I think it’s more important to have semantic text-objects first before going full optimizations), but clearly yes, it can have a negative impact on “how fast you see highlighting”. Also, the speed at which kak -p applies and blocks the editor is probably something that could be worked on completely independently of KTS or anything else.

@krobelus
Contributor

> The inherent problem there is that it can't use the ts_tree_edit API efficiently to make reparsing of large buffers cheap; instead it reparses the entire buffer sent over a pipe

If the slow part is parsing, you can probably work around it by computing a diff so you can use the incremental API.
It would probably be more elegant if Kakoune provided the changes in some diff format in a DidChangeIdle hook, something like %val{history} but since the previous timestamp. But it's hard to tell if that actually makes a difference. Probably a test case would help.
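
One cheap way to compute such a diff, sketched here as an assumption rather than as what kak-tree-sitter does, is to strip the common prefix and suffix of the old and new buffer contents. The three fields mirror the byte fields of tree-sitter's TSInputEdit; ByteEdit and diff_as_single_edit are hypothetical names, and a real edit would also need the matching row/column points.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string_view>

// Mirrors the byte-offset fields of tree-sitter's TSInputEdit.
struct ByteEdit
{
    std::size_t start_byte;
    std::size_t old_end_byte;
    std::size_t new_end_byte;
};

// Describe the change from `before` to `after` as one replaced byte range by
// stripping the longest common prefix and suffix.
ByteEdit diff_as_single_edit(std::string_view before, std::string_view after)
{
    const std::size_t limit = std::min(before.size(), after.size());

    std::size_t prefix = 0;
    while (prefix < limit && before[prefix] == after[prefix])
        ++prefix;

    // The suffix must not overlap the prefix in either string.
    std::size_t suffix = 0;
    while (suffix < limit - prefix
           && before[before.size() - 1 - suffix] == after[after.size() - 1 - suffix])
        ++suffix;

    const ByteEdit edit{prefix, before.size() - suffix, after.size() - suffix};
    assert(edit.start_byte <= edit.old_end_byte);
    assert(edit.start_byte <= edit.new_end_byte);
    return edit;
}
```

For example, changing "x = 1;" to "x = 12;" yields an empty old range at byte 5 and a one-byte new range. Several scattered changes collapse into a single edit spanning all of them, so this is only a win when edits are local, which is why a test case would help decide whether it matters.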

> it has to use a kak -p process to actually report the highlighted ranges back to kakoune.

not necessarily; you can write to Kakoune's socket directly, see https://github.com/tomKPZ/pykak

> The tree-sitter library takes all positions on edits as both point (row & column) based and byte offset based coordinates. The kakoune struct Buffer is optimized for line-based editing and only provides point-based views of changes. See Buffer::changes_since. I don't quite see a way to add efficient annotations of byte offsets to struct Buffer.

That's an interesting problem indeed.
I think until Kakoune provides buffer diffs, addressing this won't make much of a difference.

@Song-Tianxiang

Can we have something like vim's text-properties?

@DarkArc

DarkArc commented Jan 24, 2025

So, I wanted to share some thoughts as a relatively new kakoune user that finds the kakoune design very interesting.

> I think it’s an interesting approach, but at the same time, it bothers me a bit. Not because I’m the author of kak-tree-sitter and I spent a lot of time working on it. But because of the fact that I love the philosophy behind Kakoune, and I think you’re bending it. The reason for that is that Kakoune ships with zero dependencies.

I'm not sure this is entirely fair. tree-sitter as it's used here is not an external binary or tool. It is a third-party dependency, it is beyond the standard C & C++ libraries, as well as the POSIX APIs ... but it's a third-party dependency that's very well regarded for solving a very hard problem that kakoune already tries to solve as part of its core offering.

This seems like less of breaking the kakoune ideology (as I understand it) and more outsourcing a hard problem to the broader community (which arguably makes kakoune easier to maintain in the long term). The tree-sitter dependency isn't particularly onerous either as it itself is dependent only on the standard C library. So, it's not a language or platform level standard library/API but it's increasingly the standard library for this domain.

This also seems like an area where pragmatism may be necessary. There are great command line tools for finding files, listing files, source control, etc. However, that sort of tool just does not exist at an edit level.

Also, as someone that works in a codebase with large C++ source files, I would be concerned about the performance implications of marshaling syntax highlighting annotations in a way that doesn't have a lot of overhead or have the potential for a litany of escape-sequence related bugs with arbitrary input.

A daemon is potentially viable (and that seems to be the community's taken approach), however a tree-sitter daemon is (while impressive!) particularly exotic (AFAIK). This project is also written in Rust, which is a significant increase in the complexity (in terms of the size of the dependency chain) required to get this functionality vs directly integrating tree-sitter.

I've tried to look more into exactly how the kak-tree-sitter project works while writing this up, but unfortunately the project's website seems to be having issues that make browsing the source very difficult (so I've worked off the old GitHub sources).

> First I do not want to have multiple, competing, built-in highlighting to maintain, and I do not want to solely rely on tree-sitter for highlighting. Having both will likely lead to one bitrotting with time.

I found this old comment when investigating the state of tree-sitter as it relates to kakoune: #50 (comment)

It really seems like tree-sitter may be the answer to that research problem; I think it would be unfortunate to not integrate tree-sitter since the CST parser and parse rules are handled by a third party and the existing functionality provided via regex highlighters is upgraded/outclassed.

My thought on this would be (even if not via this specific PR's implementation):

  • take tree-sitter as an optional (but recommended) dependency as the new standard syntax highlighting implementation
  • drop the builtin regex highlighters registration rules from the mainline
    • make them a "community maintained best effort" for folks that can't use tree-sitter (... which really should be no-one given the low technical and legal requirements of tree-sitter)
    • folks that need these can just copy and paste the rules into their personal configuration files with no effort as the highlighting engine will still be there (those regex highlighters are still super useful for personal highlighting rules -- so I wouldn't expect the "core" code to really have much if any bitrot here)

That, to me, seems like it would provide a nice compromise between ideology and pragmatism. Kakoune would still stay far, far away from the "kitchen sink" that's emacs, but would pick up an easy to use, mature, and highly competent syntax highlighting engine.

Thanks for reading; I hope these are some helpful thoughts.

@arrufat
Contributor

arrufat commented Jan 24, 2025

My unsolicited two cents:

I honestly don't see the point in tree-sitter as more and more LSP servers start adding semantic highlighting.
Whenever I try editors that use tree-sitter, they always get the syntax highlighting wrong in common places, albeit less often than regexes.

A classic example that I think is unsolvable: in Zig you can do:

const debug = @import("std").debug;
const print = @import("std").debug.print;

Where debug is a struct/namespace and print is a function. Only kakoune-lsp with zls gets this right; there's no way a regex can know that, and AFAIK, the same goes for tree-sitter. Maybe the main advantage is that we could all reuse the same tree-sitter definitions instead of each editor defining their own?

However, as more and more LSPs are adding semantic highlighting, what's the point of getting wrong highlighting, no matter how fast you go? I am surprised this is not brought up more often, but maybe I am missing something here.

@DarkArc

DarkArc commented Jan 24, 2025

@arrufat

Thanks for the read! So just to respond to this... I think language servers are really interesting, but they're ultimately different. Using a language server is the ideal because it understands the semantics of the language and can do "the best" that can be done.

However, language servers have a lot of overhead associated with them (both technical and human) and there's wide variance in the quality of various language server projects. They can be hard to correctly configure in projects with abnormal build graphs.

Personally, my attempts to use clangd (attempted in emacs, Zed, and Helix) on the code base I work on professionally have been very frustrating. When I worked on clang itself I did get it working but the clangd overhead was just absurd (I forget specifics as it's been a few years) since clang is such a large code base.

The CSTs used by tree-sitter are pretty much as close as you can get without having a true compiler running for semantic-aware highlighting (and as someone that works on compilers professionally ... those can be expensive). So, I think from a sort of "technology" standpoint tree-sitter is "the" solution for something that can actually be reasonably integrated into the binary.

It's a good baseline balance of crowdsourced language parsing, performance, competency, and relatively low dependency overhead. It's a big step up from just using regex.

@arrufat
Contributor

arrufat commented Jan 24, 2025

@DarkArc Yes, I entirely agree with you on clangd: whenever I work on large C++ projects, all my CPUs go to 100% when I modify heavily templated code. ZLS is really fast and lightweight, though. Also, gopls and js/ts language servers work fine with semantic highlighting with no noticeable overhead. It's a shame Python doesn't have an open-source LSP with semantic highlighting, though.
In those cases, tree-sitter might be the better approach.

Development

Successfully merging this pull request may close these issues.

Regions are matched greedily, not recursively
8 participants