Tbsp – treesitter-based source processing language (peppe.rs)
209 points by hiyer 5 months ago | 42 comments



This is great, and a step in the right direction. I wish tree-sitter had an official higher level API that allowed processing and pattern matching for use cases other than those required for text editors.

I’m currently using tree-sitter at work to build AST-based tools, as performance is amazing, even with huge codebases, but I’m finding it slightly frustrating to have to manually write recursive descent processors keyed by strings, with no compile time guarantees on the structure of the grammar.

This is compounded by the fact that grammars themselves don’t really follow any standard structure: some have named fields (presumably the ones created after GitHub contributed this feature), while others require hierarchical pattern matching.

I wish there existed a tool to consume a grammar and output a Rust ADT that we can simply match on. This would at least save me from redundant error handling. I’d build one myself, but I’m not that good at Rust yet.
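
To make the wish above concrete, here is a minimal sketch of the idea, written in Python rather than Rust and not using tree-sitter itself. `RawNode`, `Number`, `BinaryOp`, and `lower` are hypothetical names invented for illustration. The untyped, string-keyed node is validated once at the boundary, and everything downstream works on typed values without repeating the error handling.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class RawNode:
    """Stand-in for an untyped parse-tree node (hypothetical, not tree-sitter's API)."""
    kind: str
    text: str = ""
    children: List["RawNode"] = field(default_factory=list)


@dataclass
class Number:
    value: int


@dataclass
class BinaryOp:
    op: str
    left: "Expr"
    right: "Expr"


Expr = Union[Number, BinaryOp]


def lower(node: RawNode) -> Expr:
    """Check the string-keyed node once and return a typed value."""
    if node.kind == "number":
        return Number(int(node.text))
    if node.kind == "binary_expression":
        if len(node.children) != 3:
            raise ValueError("binary_expression needs left, operator, right")
        left, op, right = node.children
        return BinaryOp(op.text, lower(left), lower(right))
    raise ValueError(f"unexpected node kind: {node.kind!r}")


def evaluate(expr: Expr) -> int:
    """Downstream code only sees typed values, so no kind strings appear here."""
    if isinstance(expr, Number):
        return expr.value
    ops = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}
    return ops[expr.op](evaluate(expr.left), evaluate(expr.right))


raw = RawNode("binary_expression", children=[
    RawNode("number", "2"),
    RawNode("+", "+"),
    RawNode("binary_expression", children=[
        RawNode("number", "3"),
        RawNode("*", "*"),
        RawNode("number", "4"),
    ]),
])
typed = lower(raw)
result = evaluate(typed)
```

A generator that consumed a grammar could, in principle, emit the typed classes and the `lower` function automatically; here they are written by hand.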


You may already be aware of it, but in case not - it sounds like tree-sitter-graph could be something you'd be interested in: https://docs.rs/tree-sitter-graph/latest/tree_sitter_graph/r...

I haven't gotten into it yet but it looks pretty neat, and it's an official tool.


> I wish tree-sitter had an official higher level API that allowed processing and pattern matching for use cases other than those required for text editors.

Is the pattern matching API not sufficiently high level? In my experience, it's a huge improvement over implementing visitors for everything.

https://tree-sitter.github.io/tree-sitter/using-parsers#patt...


I’ve also encountered this problem using various tree-sitter grammars. I would love a data set that showed various implementations for different languages, along with some kind of consistent test coverage for each language that shows compatibility versus the compiler’s parser. And, of course, links to precompiled wasm modules. Basically, a tree-sitter package manager.


So an awk but that knows how to walk structures instead of just lines. Excellent!

I'm a big fan of semgrep letting me query ASTs, this feels like something in a similar space. Down with lines, up with everything being trees!


Have you checked ast-grep and gritql?


Are these alternatives to semgrep?


More or less, yes. CLI, offline, no need for a cloud account. Used ast-grep successfully to locate bad code blocks (dynamic typing, don't even get me started) and also to replace them with others. Highly recommended.


Semgrep is also a CLI that can run offline and without a cloud account.

At work, we use it for enforcing a bunch of custom lint rules configured as a yaml file committed directly to our repo, entirely cloud-free.

(I may be overreading your comment as suggesting that these were reasons to use ast-grep over semgrep.)


ast-grep is based on treesitter. I found Semgrep great for simple things but impossible due to edge cases for complicated things. ast-grep is more difficult for simple cases but all the information you need is there for complex cases.


Semgrep is also based on tree sitter


As the other sibling commenter said, both `ast-grep` and `gritql` are based on Treesitter, which means that you can in fact just look for a certain function call and it will be found no matter how it's formatted. That is something plain grep can't do, and I am not sure semgrep always can.

I have used `ast-grep` to devise my own linters with crushing success.
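
The formatting-independence point can be illustrated with a small sketch. It uses Python's standard `ast` module as a stand-in, not ast-grep, gritql, or tree-sitter; the function name `compute` and the sample source are made up for the example. The tree-based search finds the call split across lines and ignores the look-alike string, while the line-based regex does the opposite.

```python
import ast
import re

# Sample input: one call on a single line, one call spread over several
# lines, and one string literal that merely looks like a call.
SOURCE = '''
result = compute(1, 2)
other = compute(
    1,
    2,
)
label = "compute(1, 2)"
'''


def find_calls(source: str, name: str) -> list:
    """Return the starting line numbers of every call to the function `name`."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == name):
            lines.append(node.lineno)
    return sorted(lines)


# Syntax-tree search: matches on structure, not on layout.
tree_hits = find_calls(SOURCE, "compute")

# Line-based search: matches on text, so layout and string contents matter.
regex_hits = [
    number
    for number, line in enumerate(SOURCE.splitlines(), start=1)
    if re.search(r"compute\(1, 2\)", line)
]
```

Here `tree_hits` contains the two real calls (lines 2 and 3), while `regex_hits` misses the multi-line call and reports the string literal on line 7 instead.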


This is so cool.

Question (caveat: first exposure to treesitter and tools like this): Is there a reason the example demonstrates the use of depth as a variable instead of it being built in?

Nesting level of a particular "type" is general enough that it might be included OOTB. What you want to do with this might be generalizable - for example instead of

```

    enter section {
        depth += 1;
    }
    leave section {
        depth -= 1;
    }

    enter atx_heading {
        print("<h");
        print(depth);
        print(">");
    }
    leave atx_heading {
        print("</h");
        print(depth);
        print(">\n");
    }
```

It could simply be:

```

    enter atx_heading {
        print("<h");
        print(depth);
        print(">");
    }
    leave atx_heading {
        print("</h");
        print(depth);
        print(">\n");
    }
```

So depth would always be the nesting level among nodes of the same type, but available out of the box. For markdown, headings, sections and lists come to mind, but I might be wrong.

In any event, this looks really well thought-out, and now to check out the other tools mentioned in the comments.


The depth here can be context dependent. For example if you had a bunch of brackets and parens in your grammar, you might only care about paren depth. Or if your language had brackets and parens and function definitions, your "expression depth" might ignore function definitions (or even reset at a function definition boundary if you have inner functions!)
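
One way to reconcile the two views is a walker that tracks nesting per node kind and lets each handler pick the counter it cares about. The sketch below is plain Python, not tbsp; `Node`, `walk`, and the handler signatures are hypothetical names for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    kind: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def walk(node, on_enter, on_leave, depth=None):
    """Depth-first walk that keeps a count of open ancestors per node kind."""
    depth = Counter() if depth is None else depth
    depth[node.kind] += 1
    handler = on_enter.get(node.kind)
    if handler:
        handler(node, depth)
    for child in node.children:
        walk(child, on_enter, on_leave, depth)
    handler = on_leave.get(node.kind)
    if handler:
        handler(node, depth)
    depth[node.kind] -= 1


out = []
# The heading handlers read the *section* counter, not their own kind's:
# which depth matters is the handler's decision, not the walker's.
on_enter = {"atx_heading": lambda n, d: out.append(f"<h{d['section']}>{n.text}")}
on_leave = {"atx_heading": lambda n, d: out.append(f"</h{d['section']}>\n")}

doc = Node("document", children=[
    Node("section", children=[
        Node("atx_heading", "Intro"),
        Node("section", children=[Node("atx_heading", "Details")]),
    ]),
])
walk(doc, on_enter, on_leave)
html = "".join(out)
```

With this layout the walker needs no `enter section` / `leave section` bookkeeping, yet a handler that only cares about paren depth or expression depth can simply read a different key.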


For those that want to explore the grammars listed at https://github.com/tree-sitter/tree-sitter/wiki/List-of-pars... in a more friendly railroad diagram format, I made https://mingodad.github.io/plgh/json2ebnf.html that reads the "src/grammar.json" and tries its best to generate an EBNF understood by (IPV6) https://www.bottlecaps.de/rr/ui or (IPV4) https://rr.red-dove.com/ui where we get a nice navigable railroad diagram (see https://github.com/GuntherRademacher/rr for offline usage).
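
For readers curious what such a conversion involves, here is a rough sketch in Python. It is not the linked tool's implementation. It assumes the usual shape of a generated "src/grammar.json" (rule objects with a `type` such as `SEQ`, `CHOICE`, `REPEAT`, `SYMBOL`, `STRING`, `PATTERN`, `BLANK`), uses a made-up toy grammar, and emits an EBNF-like notation in which regex patterns are left as `/.../` placeholders that a real converter would have to translate.

```python
import json

# Toy grammar in the assumed grammar.json shape (invented for this example).
GRAMMAR_JSON = '''
{
  "name": "toy",
  "rules": {
    "list": {
      "type": "SEQ",
      "members": [
        {"type": "STRING", "value": "["},
        {"type": "CHOICE", "members": [
          {"type": "SYMBOL", "name": "items"},
          {"type": "BLANK"}
        ]},
        {"type": "STRING", "value": "]"}
      ]
    },
    "items": {
      "type": "SEQ",
      "members": [
        {"type": "SYMBOL", "name": "number"},
        {"type": "REPEAT", "content": {
          "type": "SEQ",
          "members": [
            {"type": "STRING", "value": ","},
            {"type": "SYMBOL", "name": "number"}
          ]
        }}
      ]
    },
    "number": {"type": "PATTERN", "value": "[0-9]+"}
  }
}
'''


def to_ebnf(rule: dict) -> str:
    """Render one rule object as an EBNF-like expression."""
    kind = rule["type"]
    if kind == "STRING":
        return json.dumps(rule["value"])  # quoted literal
    if kind == "PATTERN":
        return "/" + rule["value"] + "/"  # placeholder, not real EBNF
    if kind == "SYMBOL":
        return rule["name"]
    if kind == "BLANK":
        return ""
    if kind == "SEQ":
        return " ".join(part for part in map(to_ebnf, rule["members"]) if part)
    if kind == "CHOICE":
        members = rule["members"]
        options = [to_ebnf(m) for m in members if m["type"] != "BLANK"]
        body = "(" + " | ".join(options) + ")"
        # A BLANK alternative means the whole choice is optional.
        return body + "?" if len(options) < len(members) else body
    if kind in ("REPEAT", "REPEAT1"):
        suffix = "*" if kind == "REPEAT" else "+"
        return "(" + to_ebnf(rule["content"]) + ")" + suffix
    if "content" in rule:
        # Wrapper rules (fields, precedence, tokens, aliases): keep the inner rule.
        return to_ebnf(rule["content"])
    raise ValueError(f"unsupported rule type: {kind}")


grammar = json.loads(GRAMMAR_JSON)
ebnf = "\n".join(
    f"{name} ::= {to_ebnf(rule)}" for name, rule in grammar["rules"].items()
)
```

Anything handled by an external scanner, as well as `extras`, is invisible to a converter like this, since only the `rules` object is read.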


Impressive! The grammar.json file is just a little bit too underspecced to automate some things. Not to mention it's self-referential. How did you deal with extras and other 'specialisms' that are secretly hidden away in the C-level scanner and so on?

I ask because I wrote Combobulate [1], a structured editing and movement tool for Emacs using TS.

1: https://github.com/mickeynp/combobulate


Also, there were several requests to create a more formal grammar to describe the grammars, but the tree-sitter developers don't like the idea and rejected them.

But some people made nice attempts, like https://github.com/eatkins/tree-sitter-ebnf-generator which I also adapted and exposed here https://mingodad.github.io/lua-wasm-playground/ so you can play with it online (select "Tree-sitter-ebnf-generator" from examples then click "Run" to see a "grammar.js" generated from the content in "Input Text (arg[1])").


I've added more non-trivial grammars (JavaScript, Java, Kotlin, PHP, C, CPP, Rust, Ruby, CSS, HTML, Python) using a quickjs script to convert "src/grammar.json" to an EBNF understood by https://mingodad.github.io/lua-wasm-playground/ (the script is here https://github.com/mingodad/plgh/blob/main/json2ebnf-lua.js).


I simply ignore them, as right now they don't seem relevant in most grammars for generating a usable railroad diagram.


Hi, in case you're not already aware of the name clash, there's already a `rr` in the programming world. It's "record and replay": https://rr-project.org/.

Very different, but a very fine tool too.


It doesn’t seem like the rr that GP linked to is their own project, just something they’ve found useful.

In any case, in the non-software world, “RR” stands for railroad, as it does in the name of that tool. You can’t own a common two-letter abbreviation.


Awesome!

Just yesterday I started some experiments in that direction, to visualize grammars, but now I can rather do something else ..


As someone writing a neovim plugin using treesitter, thank you! Languages like this help leverage treesitter in more interesting ways, whereas current APIs are still a bit low-level.


What neovim plugin are you writing?


The md-to-html demo is a good one, but worth mentioning that the Markdown parser[1] being used may not be suitable for more complex documents. From the README:

> "...it is not recommended to use this parser where correctness is important. The main goal for this parser is to provide syntactical information for syntax highlighting..."

There are also separate block-level and inline parsers; I'm not sure how `tbsp` handles nested or multi-stage parsing.

[1]: https://github.com/tree-sitter-grammars/tree-sitter-markdown


Even worse, the README implies tree-sitter is just not going to work for markdown at all[1], this is not a matter of a little polish and bugfixing:

> These stem from restricting a complex format such as markdown to the quite restricting tree-sitter parsing rules.

[1]: Outside of something like tree-sitter v2 with much more complex grammar support. And frankly I personally don't think making more complex grammars in JavaScript+C is a good way forward.


Adding a way to query the path at the current node would let you skip out on doing stuff like keeping track of `in_section`.
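
As a rough illustration of that suggestion (plain Python, not tbsp; `Node`, `walk_with_path`, and the sample tree are hypothetical), a walker can hand each visitor the kinds of the node's ancestors, so a check like "am I inside a section?" becomes a lookup instead of a flag maintained by hand.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    kind: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def walk_with_path(node, visit, path=()):
    """Call visit(node, path), where path holds the kinds of all ancestors."""
    visit(node, path)
    for child in node.children:
        walk_with_path(child, visit, path + (node.kind,))


found = []


def visit(node, path):
    # No in_section variable: the ancestor path answers the question directly.
    if node.kind == "text" and "section" in path:
        found.append(node.text)


tree = Node("document", children=[
    Node("paragraph", children=[Node("text", "outside")]),
    Node("section", children=[
        Node("paragraph", children=[Node("text", "inside")]),
    ]),
])
walk_with_path(tree, visit)
```

After the walk, `found` holds only the text that sits under a `section` ancestor.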

I wonder if the `enter|exit ...` syntax might be too limiting but for a lot of stuff it seems nice and easy to reason about. Easier than tree-sitter's own queries.

I think if you really wanted performance and whatnot, you might end up compiling the queries to another target and just reuse them.

I could see myself writing a lua DSL around compiling these kinds of queries `enter/exit` stanzas or an SQL one too.


Not a technical comment (as cool as this is), but I love the name.

We always say naming things is one of the hard parts of programming. They avoided the default option of something like tawk.


Though, being the abbreviation for tablespoon makes searching for this a fair bit harder. As long as code files using this language don't get called recipes...


trawk (tree awk) was one of the initial names for this (not author, but know him personally)


I mean I'll be calling (pronouncing) it Tablespoon, that's a great name:)


Always kudos towards taking a self-hosted-forge approach



This is really cool! I have a lot of short projects that are essentially “parse out 2 or 3 tags of HTML and convert that to CSV.” This will be perfect for that; in the past I’ve done it by hand with vim. Next time I’ll give this a shot.




Is it formerly peppe.rs ?

Here is the new account and doc for tbsp:

https://oppi.li/posts/introducing_tablespoon/


The git is still hosted at peppe.rs.


Very interesting paradigm of programming. For inspiration, I would recommend checking out https://rosettacode.org/wiki/Category:Bracmat and https://www.egison.org/

They define themselves as non-linear pattern matching, a pretty niche and unique way to program, and I enjoyed playing with their code.

Thanks for posting, very nice.


Awesome! I'd love to see this flourish.


That's a lot of work to write lisp without parentheses /j

I joke, really interesting project, props to the team


tablespoon - of course....



