Hacker News new | past | comments | ask | show | jobs | submit login

This is more a function of Ruby than of tree-sitter. The tree-sitter grammars for other languages are hopefully less inscrutable. For Ruby, we basically just ported whitequark's parser [1] over to tree-sitter's grammar DSL and scanner API.

[1] https://github.com/whitequark/parser




I didn't mean the tree-sitter grammar was not understandable - it's very understandable - I just can't work out how to managed to find such a concise way to express grammars. Even compared to Whitequark it's 1/3 the size. What's the unique thing you do that makes it so concise?

It also seems somehow to be completely declarative? How have you managed to transform Ruby parsing to be context-free? For example where's the set of what's currently a local variable so you can distinguish from method calls?


Ahh my mistake! :-)

To be fair, we're cheating a little bit because the Ruby grammar relies so heavily on an external scannar, which is just under 1,000 lines of C++: https://github.com/tree-sitter/tree-sitter-ruby/blob/master/...


But for example how do you parse the difference between `x = 14; x` and `y = 14; x`? In the latter case `x` is a method call, and in the former it's a local variable read. I can't see where the parser maintains a set of local variables and where it queries this set. Is it somehow done declaratively? If so that's a huge achievement I don't think that's really been done before in a parser generator.

I really want to try tree-sitter for using in an actual Ruby implementation because it's so beautiful!


[EDITED to make the example actually line up with OP's test]

There's no symbol table in the parser, so at parse time, we don't distinguish those cases:

  $ cat test.rb
  module Test
    def test1
      x = 14; x
    end

    def test2
      y = 14; x
    end
  end
  $ tree-sitter parse test.rb
  (program [0, 0] - [9, 0]
    (module [0, 0] - [8, 3]
      name: (constant [0, 7] - [0, 11])
      (method [1, 2] - [3, 5]
        name: (identifier [1, 6] - [1, 11])
        (assignment [2, 4] - [2, 10]
          left: (identifier [2, 4] - [2, 5])
          right: (integer [2, 8] - [2, 10]))
        (identifier [2, 12] - [2, 13]))
      (method [5, 2] - [7, 5]
        name: (identifier [5, 6] - [5, 11])
        (assignment [6, 4] - [6, 10]
          left: (identifier [6, 4] - [6, 5])
          right: (integer [6, 8] - [6, 10]))
        (identifier [6, 12] - [6, 13]))))
In both cases the bit after the semicolon just parses as (identifier).

For some use cases (e.g. syntax highlighting, depending on your colorization rules) it doesn't matter, and so we don't want to pay the cost. If it does matter (like in an actual implementation), then you'd have to implement this yourself and drive it by the parse tree you get from tree-sitter.


Right you could just have a phase to fix-it-up after parsing. Much better than trying to shoe-horn an imperative action into a nice more-pure parser. Great idea!


So what's cool is that while we don't handle that during parsing, you can use another set of tree-sitter features to do tree queries to achieve this. Here's the query for detecting Ruby locals: https://github.com/tree-sitter/tree-sitter-ruby/blob/32cd5a0... and here's some better documentation for how the query language works: https://tree-sitter.github.io/tree-sitter/syntax-highlightin....


The code is obviously much simpler than its syntax - most importantly, its syntactical simplicity makes it way easier to deal with. So when you write the code to parse it you don't have to try to parse it in one fell swoop like you do in Whitequark.

So you can't read anything from a method call! I can make it so, if you're doing a class method (of any kind) you have to invoke the constructor, as described in "What is a method?" There's also a few new techniques like "new_class_method", which requires creating an object (of some kind) for that class... but what about that? It's not "I've just fixed Tree-sitter's problem"; it's that Tree-sitter hasn't yet resolved the problem yet - there are other parsing problems besides Tree-sitter in Ruby itself like those of classes (and classes are not part of Tree-sitter) and things that are known as "type-traits" and so on - so as it's not quite enough it can be done by other things. The reason for using LR grammar is that when it comes to this - what do I want from that grammar?

The point I'm making here is that LR doesn't give a reason for what you're doing. As a programmer you are trying to write code that is portable because - if it works in a domain you don't understand (such as Ruby) - then you don't know what you're doing is wrong. There can be a domain (as in any language) that's a lot more complex than this - but since we've got that, how can I be sure it won't mess up the code I'm writing?


> The code is obviously much simpler than its syntax

What code? The parser? How can it be simpler than its syntax? It has syntax and semantics, which is strictly more than the syntax.

> The point I'm making here is that LR doesn't give a reason for what you're doing.

What do you mean 'what you're doing'?




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: