Tree Sitter is amazing. The parsing is fast enough to run on every keystroke. The parse tree is extremely concise and readable. It resembles an AST more than a parse tree (i.e. no 11 levels of binary op precedence rules in the tree). The parse tree contains specific ERROR nodes, so you can get a semi-functional tree even with broken syntax.
I can't wait for the tools to get built with this. Paredit for TypeScript. Syntax-tree based highlighting (vs regex highlighting). A command to "add an arg to current function" which works across languages. A command to add a CSS class to the nearest JSX node, or to walk up the tree at the className="| ..." position, adding a new className if it doesn't exist.
There's a nicely documented Emacs package for this [1]. The documentation is at [2]. The parse trees work great. There's syntax highlighting support and tree-walking APIs. There's a bit of confusion about TSX vs typescript langs but it's fixable with some config change [3].
Worth calling out that the syntax highlighting support is used to highlight several languages in github.com. (Linguist is still used for the long tail of languages, but we plan to migrate more and more over to tree-sitter-based highlighting over time.)
The query language is also what's used to drive the fuzzy/ctags-like Code Navigation feature. Both of those are powered by tree-sitter query files defined in each language's repo, like these for Go: https://github.com/tree-sitter/tree-sitter-go/tree/master/qu...
Awesome to hear that amazing tech like tree-sitter lives on even though Atom, the product it was built for, is pretty much on life support at this point.
Curious if there are any efforts to bring tree-sitter to VSCode? Exposing tree-sitter to extensions could open up so many possibilities like OP mentioned.
Can someone point to some examples of what `paredit` for other languages provides? I do various lisp programming occasionally but have not used `paredit` yet.
Looks like it's mainly tree/code manipulation. Typing code on the keyboard is probably the least taxing thing when it comes to software development. But I guess it will be nice once it has become a "reflex" rather than a conscious key-combo.
It's not so much about reducing the amount of characters typed, and instead moving the way you think about code from the character level to a more structural level.
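To make the "structural level" idea concrete, here is a minimal sketch of one paredit-style command, "slurp forward", using nested Python lists as a stand-in for a syntax tree. The function name and the list representation are made up for illustration; this is not how paredit or tree-sitter implement it.

```python
def slurp_forward(parent, index):
    """Move the sibling right after parent[index] to the end of that child list.

    `parent` is a list of nodes and `parent[index]` must itself be a list.
    Returns a new parent list and leaves the input untouched.
    """
    child = parent[index]
    if not isinstance(child, list):
        raise TypeError("can only slurp into a list node")
    if index + 1 >= len(parent):
        return list(parent)  # no following sibling, nothing to slurp
    grown = child + [parent[index + 1]]
    return parent[:index] + [grown] + parent[index + 2:]


# (foo (bar) baz qux)  becomes  (foo (bar baz) qux)
tree = ["foo", ["bar"], "baz", "qux"]
result = slurp_forward(tree, 1)
```

The edit is expressed as "grow this node by one sibling" rather than "delete a paren here, retype it there", which is why the brackets can't end up unbalanced.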
Calling it a "reflex" is an interesting phrase! Tools like magit let me encode complicated processes into muscle memory, in a way where retrieval doesn't have to go through remembering and typing a string. Structural editing is similar.
It's about typing code, as opposed to typing text, with all the structural, highlighting, auto-formatting, auto-completion, error-detection, etc advantages this brings.
I only started using it a few months ago. It's such a natural way to edit code, it only took me about a day for it to become reflexive.
Now it just feels vaguely annoying to work without it. It's fine, it's just one of those ergonomic changes that nags at you a bit. Kind of like the opposite of that feeling of taking off uncomfortable business clothes at the end of the day. Or what I imagine people who are better at vim than me keep talking about.
It's not just saving keystrokes. It eliminates a whole class of errors. I recently did ~4-500 lines of Clojure in CodeMirror and wanted to kill myself by the end of it.
- indentation may be fine for a final doc, but not always while editing. Especially for new lines starting new code-blocks.
- adding new syntax not already known by tree-sitter requires upstreaming to at least 2 repos before we can use it in a released version of our package. This can feel less hands-on and slower than working in a single repo where you have full control.
Neovim nightly already has some tools available as plugins. I'm using tree-sitter for syntax highlighting, text objects, and folding right now. Pretty satisfied so far.
The official release of built-in treesitter comes with neovim 0.5. Which looks like it'll be out pretty soon. I've been watching a fairly steady march toward release here: https://github.com/neovim/neovim/milestone/19
I'm so excited for this to become built-in in more places! I think once non-lisp users can experience the Power of Structural Editing they'll say, "Hey, I understand now why you all feel so passionate about your parentheses!"
And I can stop feeling like my fingers have all lost a knuckle when I'm writing Typescript :)
I know plenty of "old people" who don't talk like this. In fact, the "old people" that I respect tend to be a lot more open-minded.
It's one thing to joke about it a little, but this is just arrogance on display with obvious derision for us children who find that traditional syntax highlighting is beneficial.
I recall reading once that Vim tabs were a crutch for people who "can't remember what they're working on". It's the same kind of arrogance and presumptiveness.
Don't have a list, but Dark is doing some really cool things with an editor that depends pretty much exclusively on structural editing (i.e. you can't even make a syntax error if you tried): https://darklang.com/
I'm an engineer on the code intelligence team at Sourcegraph.
We've been busy building out true precise code intelligence/navigation support, but we also have a mode for zero-configuration code navigation based on text search, universal-ctags, and hand-rolled regular expressions (which works surprisingly well!). Tree-sitter would definitely give better results than our current ctags-based approach. It's been catching our attention more and more lately, and we have plans to use it to upgrade our out-of-the-box, instant code navigation experience.
It's not the exact right fit for our primary goals though, since it's designed around being extremely fast while editing and robust against errors. Sourcegraph is only used for navigating committed code, so we're leveraging formats like LSIF to generate complete semantic graphs of codebases and their entire dependency tree. That'll enable a lot of features that are out of reach for tree-sitter, but is a lot harder to get working out of the box and it's a much bigger technical investment.
It's very interesting to see the topological space that houses these solutions fill out. Every tool has its own set of unique trade-offs and falls somewhere on these spectrums:
- fast vs slow
- precise vs imprecise
- zero-configuration vs configuration required
We've visited a few islands in this space but are still very curious to see what other islands can be discovered. We're especially excited about tools and formats like tree-sitter and LSIF around which a large and supportive community can grow so that all the products we love and rely on as developers can make forward progress.
What are those features out of reach of tree-sitter? I can see that you theoretically want something that's optimized for parsing well-formed code all at once, rather than potentially malformed code incrementally, but what trade-offs does tree-sitter make in practice that limit its potential for your use case? On the face of it, it seems to me like tree-sitter could serve as a perfectly fine building block for generating LSIF or whatever from a code file.
I wish there was a more universal format for parsers, but I just don't think there are enough people who know their stuff.
Take PHP, a language that a lot of people use: the tree-sitter-php extension doesn't support features added in 2019, let alone features added towards the end of 2020.
If you want an up-to-date PHP parser, there's really only one open-source parser[0] that's accurate enough to be used on PHP codebases old and new, and it's written in PHP. Then if you want to parse in a robust fashion you have to adopt a number of hacks to get everything working.
I hadn't encountered LSIF before – can GitHub be configured to use those maps?
We've looked at LSIF before, and decided against it for a few reasons, mostly around COGS, operational overhead, and indexing latency. I gave a talk at last year's FOSDEM [1] going into some of the details. (Caveat that that talk was from when we were using a different open-source library, Semantic, to power fuzzy Code Nav. It's much easier to support new languages using the now-current tree-sitter query approach!)
I tried to use this to ease the front end work load of students in a compiler project (building a C compiler) for a University course, so that the project could be focused on the more interesting middle and back end parts of the compiler.
However, reported bugs in the C grammar that saw no activity at all [1] made this impossible. From this small sample of experiences, I was left with the impression that Tree Sitter is great for things like syntax highlighting, where wrong results are annoying but not dramatic, but not so suitable for tools that need a really correct syntax tree.
Hi there! You're right that the C grammar in particular is one that could use some love. C is not one of the languages that we're syntax highlighting with tree-sitter yet, nor is it one of the languages that we support Code Navigation for. That means that my team has had to prioritize their work in other places, and no community members have stepped up to take over or help out with maintenance of the C grammar. Not a satisfying answer, I realize, but an honest one.
There's been some recent discussion as to whether tree-sitter grammars can be used to parse markdown with some hacks or not (currently it's being done by working around all the tree-sitter machinery, resulting in a lot of problems), with no consensus among plugin authors:
I’ve been using tree-sitter via FFI from Common Lisp, but what I’d really like would be a way to write my own code generator so that the generated parser could be “native” lisp code. Otherwise, it’s an amazing tool: my only other complaint would be the lack of a grammar for objective-c which would be useful for a lisp/objective-c bridge I’ve been working on.
I think that it'd be pretty easy to generate parser code in other languages besides C, but it would be a lot of work to do to port the core library itself[1] to those other languages.
There's an architecture for compilers that I've been wanting for years where a keystroke change to the sourcecode results in an incremental change to the AST, and then the compiler can consume that AST delta to generate a binary patch to the compiled executable.
Would tree-sitter be able to be used for that? (What I want is to feed tree-sitter a stream of keystroke changes and get out a stream of minimal AST changes as a result).
You don't get the AST _diff_ as the result (you get a new tree whose structure is shared with the old tree), but tree-sitter is specifically designed to support this kind of incremental edit use case: https://tree-sitter.github.io/tree-sitter/using-parsers#edit...
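A rough sketch of what "a new tree whose structure is shared with the old tree" means, and how a delta could be recovered from that sharing. The `Node` class and both helper functions are hypothetical stand-ins written for this example, not tree-sitter's API.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Node:
    kind: str
    text: str = ""
    children: Tuple["Node", ...] = ()


def replace_at(node, path, new_leaf):
    """Return a new tree with the node at `path` replaced, reusing untouched subtrees."""
    if not path:
        return new_leaf
    kids = list(node.children)
    kids[path[0]] = replace_at(kids[path[0]], path[1:], new_leaf)
    return Node(node.kind, node.text, tuple(kids))


def changed_nodes(old, new):
    """List the nodes of `new` that are not shared (by identity) with `old`."""
    if old is new:
        return []
    changed = [new]
    if len(old.children) == len(new.children):
        for o, n in zip(old.children, new.children):
            changed.extend(changed_nodes(o, n))
    return changed


old = Node("program", children=(
    Node("stmt", children=(Node("number", "1"),)),
    Node("stmt", children=(Node("number", "2"),)),
))
# Simulate an edit that changes the literal in the second statement.
new = replace_at(old, [1, 0], Node("number", "3"))
delta = [n.kind for n in changed_nodes(old, new)]
```

Only the spine from the root down to the edited leaf is rebuilt; the first statement is the very same object in both trees, so a consumer can skip it by comparing identities.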
I've done two grammars for my own use in the last few months (well, one isn't quite complete yet) and it's been quite an enjoyable (learning) experience. Thanks for sharing this tool!
When I played around with tree sitter a bit I noticed there were situations where AST elements didn't exactly contain what I'd expect them to. For example: comments are represented in the AST but unfortunately they don't have the contents of the comment parsed out following the language's conventions.
I was wondering if this is a case I could open an issue about? Is this for the main tree sitter repo or should I open one language-by-language?
I was looking into automating some stuff across all languages with tree-sitter but handling all of the languages' comment syntaxes made it very hard.
Most tree-sitter grammars just parse comments as a single token. Can you give an example of what you mean when you say "contents of the comment parsed out"?
Are you talking about conventions like JSDoc, for putting structured data inside of comments? On GitHub, we handle that by parsing JSDoc comments in a separate pass, using a separate parser. We do it this way because JSDoc isn't really part of the JavaScript language, not all projects use JSDoc, and not all applications are interested in parsing the text inside of comments.
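As an illustration of the "separate pass" idea, here is a small sketch that takes the text of a single comment token and pulls out JSDoc-style tags with a regular expression. The function name and the returned tuple shape are assumptions for this example; it is not the parser GitHub uses and it covers only the simplest tag forms.

```python
import re

# @tag, optionally followed by {type}, then the rest of the line.
TAG_RE = re.compile(r"@(\w+)(?:\s+\{([^}]*)\})?\s*(.*)")


def parse_doc_comment(comment_text):
    """Extract (tag, type, rest) triples from the text of a /** ... */ comment."""
    tags = []
    for line in comment_text.splitlines():
        # Drop the leading "/**", " * " and trailing "*/" decoration.
        line = line.strip().lstrip("/*").strip()
        match = TAG_RE.match(line)
        if match:
            tag, type_name, rest = match.groups()
            tags.append((tag, type_name, rest))
    return tags


comment = """/**
 * Add two numbers.
 * @param {number} a first operand
 * @returns {number}
 */"""
tags = parse_doc_comment(comment)
```

The main parse stays cheap because the comment is one opaque token, and only tools that care about the documentation pay for this second step.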
I don't think you can do this without recompiling, since the grammars get translated into C code before use. But the built-in command line tools (‘tree-sitter parse’, etc) all support a mode where they will detect local changes to a checked-out grammar definition, and recompile on the fly if needed. (This happens each time the CLI program is started up; it doesn't happen during a long-running process.)
The obvious answer is to embed TCC or another C compiler and either generate a dynamic library or generate wasm and load it directly into the process.
exec_wasm(generate_wasm(generate_c(grammar)))
Now if you can make that whole fn chain incremental, then a delta_grammar -> delta_c -> delta_wasm -> delta_recomputed_wasm_call stack, this will propagate deltas down to exec_wasm and you could dynamically execute the generated code as the grammar changes.
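A toy sketch of that delta-propagating chain, assuming (unrealistically) that each grammar rule can be processed independently. `Stage`, `generate_c`, `generate_wasm` and `build` are stand-ins that only transform strings; real parser generation builds tables across the whole grammar, so treat this as an illustration of the caching idea only.

```python
import hashlib


class Stage:
    """Wraps a pure function and caches its results by a hash of the input text."""

    def __init__(self, fn):
        self.fn = fn
        self.cache = {}
        self.runs = 0  # how many times the wrapped function really executed

    def __call__(self, text):
        key = hashlib.sha256(text.encode()).hexdigest()
        if key not in self.cache:
            self.runs += 1
            self.cache[key] = self.fn(text)
        return self.cache[key]


# Hypothetical stand-ins for the real steps.
generate_c = Stage(lambda rule: f"/* C for */ {rule}")
generate_wasm = Stage(lambda c_src: f"(wasm {c_src})")


def build(grammar):
    """Run every rule through the pipeline, reusing cached per-rule output."""
    return {name: generate_wasm(generate_c(body)) for name, body in grammar.items()}


grammar = {"expr": "expr -> term '+' term", "term": "term -> NUMBER"}
build(grammar)
first_runs = generate_c.runs                 # both rules generated
grammar["term"] = "term -> NUMBER | IDENT"   # edit a single rule
build(grammar)
second_runs = generate_c.runs - first_runs   # only the edited rule is redone
```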
One day, I would love to generalize the web-based playground so that you could edit the grammars. But it's complicated, because we use C as our output language, so you would always need to recompile the C after changing the grammar.
So, I would say that it's not on our near-term roadmap.
I'm curious if tree-sitter can handle C/C++. I think it's super difficult with metaprogramming. Without the preprocessor, I think it is not possible to parse C++ correctly.
We do have C and C++ grammars [1,2] but they need some love. You're right that these two languages are among the hardest to support. You could get a tree-sitter external scanner to mimic the preprocessor without too much difficulty, but you'd still run into the problem that your macro definitions might appear in another file. Parsing in general is much easier to implement and reason about if the parse result depends only on the content of the single file that you're looking at.
Thanks for building this. I had not heard of it before, but it looks great. Are there more tutorials elsewhere on the Internet you would recommend, besides what is in the documentation?
In the near future, we'll create some more GitHub-specific documentation that walks you through how to add advanced language support for any programming language on GitHub, by writing a Tree-sitter grammar, and then by writing the tree queries that are used for syntax highlighting, simple code navigation, and someday soon... precise code navigation.
To me, the most impressive use of tree-sitter was an iOS text editor that uses it to parse huge JSON files / mixed language files and highlight them in a very robust way. [0][1] I’m hoping tree-sitter becomes more common like LSP and Emacs can get exact highlighting and other tools with it…
I find it absolutely amazing that a grammar for something as complicated as Ruby can be so concise. Less than a thousand lines. The corresponding Bison grammar is 13k lines. And I think the tree-sitter one is scannerless so also includes the lexer?! How do they do it?
Not a ruby developer here: that sounds terrifying! Does it make it harder to have a proper mental model of the language (note: not the libraries) or is this mainly because of flexibility (too many ways to skin one cat)?
I don't write Ruby regularly either, but I wouldn't say that syntactic complexity is necessarily equivalent to semantic complexity. And the syntax is the only part that's relevant to Tree-sitter: it's not an interpreter/compiler.
Note also that (as I alluded to above) the parsing technique that Tree-sitter uses, "LR parsing", makes some things more difficult to parse than they'd be with another kind of parser. This is a deliberate trade-off, because LR parsing makes certain features of Tree-sitter, like fast re-parsing in response to input changes, much much easier.
So, a syntactic tree is a list of elements, grouped by their ordering, which are to be parsed from their arguments, as they appeared in the input. Or a grammar tree, which is a set of elements.
There's many things we can do to make Tree-sitter simpler to read and write. Perhaps, like in Perl, there are syntactic categories of types that make it much easier to find things like nodes in a tree, since they're the ones that come in the input. Or I'd be willing to say that maybe, like in Haskell, certain aspects of the language, are syntactic categories, like the parser. So some things that might not be obvious in code, like what the syntax for a class of names is, might be obvious in theory, too. Or, at least they might be obvious in a particular way. Or some aspects of the compiler are really special, and we can infer those in terms of what the compiler does.
Or, of course, we can do all these other things, too. We can rewrite the parser, or the compiler, to try to do more or less anything that the parser does. Or maybe we can make Tree-sitter a lot simpler in general. Which I think is probably what you've been thinking about.
Flexibility. “Too many” is debatable: most organizations wind up settling on a subset of the idioms that Ruby provides, and some of the more esoteric constructs see infrequent use anywhere.
There has been, however, discussion about the need to clean up some of the lesser-used language features, but obviously doing so carries risks.
> Not a ruby developer here: that sounds terrifying! Does it make it harder to have a proper mental model of the language
It is a little terrifying in the sense that I'd not want to write language level tools (eg: syntax highlighter).
But if you have scheme on one end and natural language on the other, ruby leans a bit towards natural language - but in a good way. In some ways ruby isn't that different from Smalltalk - but it has a lot of (sometimes I think too many, sometimes not) conveniences.
Parentheses and brackets are largely optional "where it makes sense". Conditionals support postfix, e.g. these are equivalent:
  if should_send?()
    send_mail({to: 'u@x.com'})
  end

  send_mail to: 'u@x.com' if should_send?
My mental model of Ruby is one of the simplest of any of the languages I've worked with, but it's also the hardest to put into any words. JS actually does beat it out, and then Scala and Python come after.
Everything is kind-of-but-not-really an object, a reference, and a function, all at the same time - which sounds complicated but in my head... turns out to be pretty simple. Everything's just kind of different flavors of the same thing. `attr_accessor` is a good place to see this in action.
The flexibility comes more from the variety of available core language options (procs, blocks, and lambdas) and core libraries (map/each/collect, for example), not from a variety of underlying concepts.
It's mostly there to make things less surprising to the programmer, AFAIR. Probably most of the complexity comes from having to differentiate local variables from methods depending on whether the symbol had an assignment earlier in the scope.
This is more a function of Ruby than of tree-sitter. The tree-sitter grammars for other languages are hopefully less inscrutable. For Ruby, we basically just ported whitequark's parser [1] over to tree-sitter's grammar DSL and scanner API.
I didn't mean the tree-sitter grammar was not understandable - it's very understandable - I just can't work out how you managed to find such a concise way to express grammars. Even compared to Whitequark it's 1/3 the size. What's the unique thing you do that makes it so concise?
It also seems somehow to be completely declarative? How have you managed to transform Ruby parsing to be context-free? For example where's the set of what's currently a local variable so you can distinguish from method calls?
But for example how do you parse the difference between `x = 14; x` and `y = 14; x`? In the latter case `x` is a method call, and in the former it's a local variable read. I can't see where the parser maintains a set of local variables and where it queries this set. Is it somehow done declaratively? If so that's a huge achievement I don't think that's really been done before in a parser generator.
I really want to try tree-sitter for using in an actual Ruby implementation because it's so beautiful!
In both cases the bit after the semicolon just parses as (identifier).
For some use cases (e.g. syntax highlighting, depending on your colorization rules) it doesn't matter, and so we don't want to pay the cost. If it does matter (like in an actual implementation), then you'd have to implement this yourself and drive it by the parse tree you get from tree-sitter.
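A minimal sketch of such a follow-up pass, for the `x = 14; x` versus `y = 14; x` example. The flat list of tuples is a stand-in for walking a real parse tree in source order, and the function name is invented; it also ignores scopes, blocks and method definitions, which a real implementation would have to handle.

```python
def resolve_identifiers(statements):
    """Label each bare identifier as a local variable read or a method call.

    `statements` is a list of ("assignment", name) or ("identifier", name)
    tuples in source order.
    """
    assigned = set()
    resolved = []
    for kind, name in statements:
        if kind == "assignment":
            assigned.add(name)
            resolved.append(("assignment", name))
        elif name in assigned:
            resolved.append(("local_variable", name))
        else:
            resolved.append(("method_call", name))
    return resolved


# x = 14; x
first = resolve_identifiers([("assignment", "x"), ("identifier", "x")])
# y = 14; x
second = resolve_identifiers([("assignment", "y"), ("identifier", "x")])
```

The parser stays context-free and declarative, and the set of known locals lives entirely in this separate pass.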
Right you could just have a phase to fix-it-up after parsing. Much better than trying to shoe-horn an imperative action into a nice more-pure parser. Great idea!
The code is obviously much simpler than its syntax - most importantly, its syntactical simplicity makes it way easier to deal with. So when you write the code to parse it you don't have to try to parse it in one fell swoop like you do in Whitequark.
So you can't read anything from a method call!
I can make it so, if you're doing a class method (of any kind) you have to invoke the constructor, as described in "What is a method?" There's also a few new techniques like "new_class_method", which requires creating an object (of some kind) for that class... but what about that?
It's not "I've just fixed Tree-sitter's problem"; it's that Tree-sitter hasn't yet resolved the problem yet - there are other parsing problems besides Tree-sitter in Ruby itself like those of classes (and classes are not part of Tree-sitter) and things that are known as "type-traits" and so on - so as it's not quite enough it can be done by other things. The reason for using LR grammar is that when it comes to this - what do I want from that grammar?
The point I'm making here is that LR doesn't give a reason for what you're doing. As a programmer you are trying to write code that is portable because - if it works in a domain you don't understand (such as Ruby) - then you don't know what you're doing is wrong. There can be a domain (as in any language) that's a lot more complex than this - but since we've got that, how can I be sure it won't mess up the code I'm writing?
Hey thanks! I'm one of the primary developers of this grammar along with @maxbrunsfeld. It was the driving force for supporting an external scanner and while there are still some Ruby edge cases, I'm pretty happy with how it came out. I will say we spent a lot of time on this and I read both the bison Ruby grammar and whitequark's ruby parser (which is excellent) in great detail to understand how to deal with certain parts of the language.
One thing I love about tree-sitter is how both the grammar and the resulting ASTs are so readable. I can come back to this project after months of not contributing and pick up right where I left off.
The trickiest (and most verbose) parts of the external scanner have to do with heredocs and the various ways to declare literals (strings, symbols, regexes, etc).
I recently used this to put together a unified PL classification model. It's nice because any language treesitter grows to support we'll support pretty effortlessly and treesitter captures more than enough nuance per language to derive high quality classifications.
It's fair to say we can classify a snippet of code based on either single or multiple AST paths produced by treesitter. Right now only doing the programming language but extending it to function classification or description etc isn't out of the question we just don't need it right now.
I'm curious to see if Tree-sitter can be used to provide fast and rich code navigation. I was able to implement simple goto definition/references [1], not sure if it can be used for more advanced navigation features in a language-agnostic way.
If you're interested, GitHub is already using it [2] for that purpose and Sourcegraph is experimenting with it [3].
At GitHub, we're in the process of building a more precise code navigation system on top of Tree-sitter, that models language-specific name-resolution rules in detail.
Our currently-available code navigation system also uses Tree-sitter, but it is pretty simple; it just matches up references and definitions by their name.
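For a sense of what "matches up references and definitions by their name" amounts to, here is a small sketch. The function, the tuple format and the file names are all invented for illustration; in practice the definition and reference lists would come from running queries over the parse trees.

```python
from collections import defaultdict


def link_by_name(definitions, references):
    """Map each reference location to every definition sharing its name."""
    by_name = defaultdict(list)
    for name, location in definitions:
        by_name[name].append(location)
    # .get avoids creating empty entries for names that were never defined.
    return {ref_loc: by_name.get(name, []) for name, ref_loc in references}


definitions = [
    ("parse", "parser.go:10"),
    ("parse", "json.go:42"),
    ("lex", "lexer.go:5"),
]
references = [("parse", "main.go:7"), ("emit", "main.go:9")]
links = link_by_name(definitions, references)
```

The reference to `parse` gets two candidates because nothing here knows about imports or scopes, which is the imprecision a name-resolution-aware system is meant to remove.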
So far it's an amazing tool and we are happy to use it in our projects. The only two complaints I have are the dependency on JavaScript[1] and the missing Rust runtime option[2].
We also have several of the language grammars published as crates: https://crates.io/search?q=tree-sitter (And doing the same for other grammars is a fairly painless process.)
So if you're writing a tool for a single language (like a language server), it should be as easy as adding tree-sitter and tree-sitter-blah to your cargo manifest.
Awesome! Though my thinking was that it would have an especially large impact for languages that aren't popular enough to have their own LSP yet; you no longer have to be an expert in writing interactive compilers to set up a respectable LSP for a niche language, or even a home-grown one
Yes! This is a great point. It's similar to what I mentioned over on this thread [1] about how we're working on a more precise version of Code Navigation based on tree-sitter. The tl;dr is that you'd write something like tree-sitter queries [2], just like you do for the current fuzzy Code Nav, but the query DSL would be a bit more sophisticated, allowing you to specify the actual name resolution rules of your language. One of the things we're using to test this is an LSP shim that lets us test our rules in VS Code (or any other LSP-compliant editor).
That's the current plan! In particular, because we want to allow language communities to implement support for their own languages, and not have to be blocked on my team finding the time to do it. (Just like they can do now with the parser and syntax highlighting / fuzzy code nav rules.) Linguist is our role model here — it currently includes language detection and (regex-based) syntax highlighting rules for 500+ languages. Most of those are contributed by the community. There's no way that my team can migrate all of those in any reasonable amount of time, especially while having to balance that with other feature development and operational responsibilities.
Wrote tree-sitter-svelte. Was a good experience. I am also writing a programming language of my own similar to TypeScript and I am using tree-sitter for the same. It's a delight to work with. Removes a lot of the worries.
Tree sitter will basically always generate a parse tree, even for malformed input, in which case it will add ERROR nodes for the bits it doesn't like (it will also inform you that there were problems with the parse by setting a boolean attribute). So you have some information you can use to construct a useful error message yourself, but some parser generators will handle this better (although it has to be said that the difficulty of obtaining good error messages from a parser generator is still one of the main reasons production parsers are mostly written by hand).
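A sketch of that first pass at error reporting: walk the tree, collect the ERROR nodes, and turn their positions into messages. The `Node` class is a hypothetical stand-in with just a type, a start position and children; it is not the binding's real node type, and the file name is made up.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    type: str
    start: Tuple[int, int]  # (row, column), zero-based
    children: List["Node"] = field(default_factory=list)


def find_errors(node):
    """Yield every ERROR node in the tree, depth-first."""
    if node.type == "ERROR":
        yield node
    for child in node.children:
        yield from find_errors(child)


def error_messages(root, filename):
    """Format one compiler-style line per ERROR node, using one-based positions."""
    return [
        f"{filename}:{n.start[0] + 1}:{n.start[1] + 1}: syntax error"
        for n in find_errors(root)
    ]


tree = Node("program", (0, 0), [
    Node("function_definition", (0, 0), [Node("identifier", (0, 4))]),
    Node("ERROR", (2, 8)),
])
messages = error_messages(tree, "example.py")
```

This only says where the parse went wrong; saying what was expected there would need extra, language-specific logic on top.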
Why would it not be appropriate? The only annoyance I see is that currently you will have to generate a good error message from it yourself, but a first pass at the problem shouldn't be too onerous.
We are also using this to power a lot of the program analysis features on github.com. We use it to generate the symbol list for Code Navigation, as an example, and are starting to look at extracting more semantic information about some languages using tree-sitter parse trees as intermediaries.
While we're in this discussion: Say I want to implement "SQL" for my app (if you've used Jira, I want to make my own JQL). Is this the tool for that? I'm looking for something much simpler than ANTLR.
This is really cool, I 100% agree that as programmers we’re editing and thinking in terms of ASTs. It just happens that text is a high density way to represent those ASTs.
I’m going to play with this and see if I can make a generic language server for vscode that works across languages. Unless someone has already done that.
What would be really cool is if tree-sitter (or a sister package) provided incremental formatting primitives across languages.
The closest language-agnostic formatter that comes to mind is prettier.js with its extensions.
incremental parser -> language server -> formatter across languages would be super rad.
I half-wrote a tree-sitter grammar for a niche DSL (the PRISM probabilistic model checking language). It was a very nice experience. It's part of another half-written side project to create a language server for PRISM; I still haven't gotten around to making the whole end-to-end pipeline work.
With its syntax tree query frontend I wonder whether tree-sitter would make a good interpreter frontend for some niche languages, or whether you'd need something more powerful.
They are! tree-sitter itself is open-source [1], as are all of the language parsers we've listed on the homepage [2]. The syntax highlighting support is documented here [3].
I tried looking through the docs, and couldn't find any mention of which algorithm you are using. It seems like some LR grammar, but which kind? LALR? GLR? It seems like a very important bit of information that's suspiciously missing.
My team is only writing tree-sitter parsers as part of working on GitHub developer productivity features like Code Navigation. So the short version is that we (i.e., my team at GitHub) haven't written a tree-sitter parser for SQL because we haven't targeted SQL for Code Nav support yet.
That said, this is exactly why we've released tree-sitter as an open-source project. That way there's no need for anyone to be blocked on my team finding the time to work on an SQL parser. Most extant tree-sitter parsers [1] have been developed by external language communities, and not by the core tree-sitter maintainers.
(Also note that SQL is a particularly wrinkly language, since there are so many different dialects. Are you looking for an ANSI SQL parser? A MySQL SQL parser? One that covers all of them to some degree?)
Hi, can you consider adding Kotlin to the list of supported languages?
Since the feature launched there is now a Kotlin tree sitter implementation https://github.com/fwcd/tree-sitter-kotlin
(but maybe it needs some improvements)
[1]: https://github.com/ubolonton/emacs-tree-sitter [2]: https://ubolonton.github.io/emacs-tree-sitter/ [3]: https://github.com/ubolonton/emacs-tree-sitter/issues/66#iss...