Postgres Language Server: Implementing the Parser (supabase.com)
162 points by todsacerdoti on Dec 8, 2023 | 46 comments



(I'm on the Supabase team)

This is an update from the launch here:

https://news.ycombinator.com/item?id=37020610

The focus for the past few months has been getting the Parser right. It has been a lot of work. I'll ping Philipp and get him to join the discussion if there are any questions.

We'd also love more contributors, if this sort of thing is up your alley.


> ...getting the Parser right. It has been a lot of work.

True.

I have an ANTLR4 grammar for SQL. Am noob. Getting things "correct" was some effort. I have the quixotic goal of accepting multiple dialects. It's a lot of time referencing docs, playing with SQL fiddles, and learning how other grammars work.

Now working on performance. Just as tough for noob me. TIL that means removing any potential ambiguities at runtime. ANTLR4's ALL(*) algorithm does some runtime magic whenever the happy path fails, which tanks performance. Great for making "good enough" grammars. Not so great for chewing through a corpus of tests.

SQLite has a terrific test suite, including generated 'random' machine queries. My naive grammar took > 1000ms on these monsters, when it worked. After a few, it'd just ABEND with out-of-memory errors. Whoops!

By removing ambiguities, those monsters now take < 100ms. I really shouldn't make any claims until all the tests pass. (I'm still trying to figure out if I can handle both T-SQL style bracketed [names] and Postgres style name[index] array references, in the same grammar, without semantic predicates.)

Anyway. This experience has piqued my interest in PEG. I'm worried my LL(k) grammar is too brittle, making it hard for anyone to maintain and adapt. As you well know, for sure. (For a taste of this challenge, peek at some of the other available ANTLR SQL grammars.) Oh well; that's a concern for later.

I'm looking forward to checking out Supabase's grammar. I'll share mine (Show HN) when I think it's good enough.


Where is the SQLite test suite, please? I'd be very interested.

There are already SQL grammars; check https://github.com/antlr/grammars-v4 and specifically, I think, https://github.com/antlr/grammars-v4/tree/master/sql. I contributed to one of them, and I wrote my own for some personal work. Be warned: it's very involved, very complex, and MSSQL is rather ill-defined.

Names (bracketed identifiers) in SQL are bloody awful. Sometimes square brackets are even compulsory, and while you can usually replace [...] with the SQL standard "...", not always! Trust me, it gets worse.

I don't find ANTLR grammars to be brittle, and while they can lose on performance (by how much I don't know; perhaps quite considerably), they are very easy to maintain and I am very fortunate to have ANTLR to work with.

Edit: unless you are very familiar with SQL and also a masochist, please don't try to write your own grammar. For someone with my background it was very time-consuming; without that background it will be a nightmare. Friendly warning.


> Where is the SQLite test suite

Here's the scrapped tests:

https://github.com/bkiers/sqlite-parser/tree/master/src/test...

I didn't find a scrapper. I was cobbling together my own scrapper when I stumbled onto these files. My scrapper is turrible.

This SQLite grammar is included in grammars-v4. Looks like a copy vs a fork. I don't know why these tests weren't copied too.

> please don't try and write your own grammar

I must be a masochist.

Aside: Ages ago, during a job interview, when asked to talk about some of my prior work, I mentioned my SQL grammar. The self-described "bar raiser" cut me off and dismissed my effort as trivial, because the grammar generator (ANTLR) does all the work. I had no response.


Ah, thanks for these tests! That will be really useful for my work.

A couple of thoughts: what do you mean by a 'scrapper'? At first I thought you meant scraper, but I don't think it's that. I haven't come across the term before in this context; can you explain, please?

I'm a bit concerned about the times reported for your parsing. It's unfortunately well known that the Python output of ANTLR is agonisingly slow – what is your output language? (I don't think that's your problem, but if it is, then switching to Java output will fix this immediately.)


Oof. Yes, I meant scraper.

> ...concerned about the times reported

Agreed.

FWIW...

My wall time for this beast (randtest1) is ~55ms.

    SELECT coalesce((select max(11- -19-f-t1.b+a) from t1 where exists(select 1 from t1 where 11-~(d)-c*a*~t1.a-t1.e-t1.e+coalesce((select coalesce((select t1.c from t1 where case (c) when d then e else 11 end=t1.f),t1.d) from t1 where (t1.d)>b), -11) | f+t1.f not in (((c)),b,13))),f) FROM t1 WHERE case when 19+c>=t1.a then t1.c when not case when not exists(select 1 from t1 where +f | b*b*19+19*13-a | case when t1.e not in (t1.f,t1.c,b) then 11 when 17>t1.c then a else e end<>e) then b when 17=t1.e then b else e end<>t1.b then a else d end-t1.b=(13)
Whereas simple statements like these are < 20ms each.

    select ((select 1) union (select 1));

    select a between b and c and d;
I need to add profiling metrics to the build, to (proactively) catch performance regressions.

It's fairly easy for ambiguity to sneak in. Mostly because I'm a noob, but also because I'm stubbornly using left-recursion and I'm still navigating ANTLR's magical rule rewriting.

> ...python output of antlr is agonisingly slow

Still just using Java.

Edit: Now I'm curious. I'll try ANTLR's Python runtime asap.

My tool is a language translation widget: SQL in, templates out (vs something like an ORM).

In the future, I'd like to support dynamic languages (stacks) like Python and nodejs/deno. Partly just to be a good citizen ("when in Rome..."), but also to try some half-baked notions for supporting dynamic SQL at runtime, which isn't (yet?) feasible with a compile-time stack like Java.

Thanks for your interest. I put my email in my HN profile. I can send you updates on my work. And once I get to a "beta" release (usable by other people), I intend to have weekly "office hours" on Zoom (or some such).


If we're all being Postgres-specific, isn't there presumably a grammar somewhere in Postgres's code that is used to parse the query SQL? Wouldn't you all want to use that same file as a source, so it's 100% identical to what Postgres is actually doing?

Genuine question. But maybe the tooling is incompatible.

Also, have you looked at Azure Data Studio? I'm wondering how much you can reuse from the work they did.


I'll check out Azure Data Studio, thank you.

For a bunch of reasons, probably all bad, I decided to roll my own, using ANTLR.

Being just a simple bear, LALR (yacc, bison) is beyond me.

With ANTLR4 the parse tree can be pretty close to the desired abstract syntax tree. Which is very nice.

Also, ANTLR4 has automagic left-recursion, so grammars can be very similar to the (reference) EBNF.

Lastly, ANTLR4 allows grammar inheritance. My hope is to have a subgrammar for each dialect which extends the one amazing core supergrammar. Looking at all the example SQL grammars, they're all pretty different, so it's not immediately clear how to generalize. Which is a bummer.



Not necessarily. It might be a recursive descent parser (to generate better error messages).


that's a huge task to take on; looking forward to going through it! compared to you, we have gone the "easy" way and use the actual parser from the Postgres server, so no grammar definition and the like. our work was mainly around adapting libpg_query (which is built to parse executable SQL) for our use case.


> use the actual parser from the Postgres server. so no grammar definition and the like.

Doesn't Postgres use Lex/Yacc to describe the grammar?


It actually uses Flex/Bison, and does take advantage of the extended features they provide.


Thanks for the correction. But either way, doesn't this mean that a formal lexicon and grammar are available for the language server to draw upon?


Only tangentially related, but if it's of any interest I'm about 90% done with a PEG grammar for the PostgREST query DSL for similar reasons. Happy to share it if it'd be useful.


Do you have a link to the DSL? Curious to learn more


Yep, posted it in response to sibling comment.


it would definitely be interesting. if it's private and you can't share a link, my email is in my profile


Sure, still a WIP but it's getting there [1]

[1] https://github.com/superscribe-io/superscribe/blob/developme...


thanks for sharing - we'll check it out


hey, author here. Thanks for posting it!

A bit of background: a few months ago we announced a Postgres language server[0]. A language server adds features like syntax error diagnostics and autocomplete to your editor (vscode, neovim, etc). We have iterated a lot on the parser over the past few months and want to share an update today.

the parser is a core piece of any language server: it constructs syntax trees from the raw input string. Usually first an untyped concrete syntax tree (cst) that represents the syntactic structure of the input, and subsequently a typed abstract syntax tree (ast) that captures the meaning of the source.
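
to make the distinction concrete, a rough sketch of the two shapes (illustrative Rust types only, not our actual data model):

    // a cst node is untyped: it records how the text looks,
    // including trivia like whitespace and comments
    struct CstNode {
        kind: String,          // e.g. "SelectStmt", "Whitespace"
        range: (usize, usize), // byte offsets into the source
        children: Vec<CstNode>,
    }

    // an ast node is typed: it records what the text means
    enum Expr {
        ColumnRef(String),
        IntegerLiteral(i64),
        BinaryOp { op: String, lhs: Box<Expr>, rhs: Box<Expr> },
    }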

In our implementation, we leverage the actual Postgres parser to do the heavy lifting. However, that parser is designed to parse executable SQL — not to provide language intelligence. For example, it does not handle incomplete inputs, and it outputs just the ast, not the cst. To use it for a language server, we had to work around these limitations as well as possible.
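
for reference, a minimal sketch of what calling it from Rust via the pganalyze pg_query crate looks like (field names from the crate's protobuf output; check its docs):

    fn main() -> Result<(), pg_query::Error> {
        // parse() wraps the real Postgres parser and returns the ast;
        // it fails outright on incomplete input such as "SELECT id FROM"
        let result = pg_query::parse("SELECT id FROM users WHERE active")?;
        println!("{:#?}", result.protobuf.stmts);
        Ok(())
    }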

While we leverage procedural macros in Rust to generate a lot of the repetitive parser code, there remains a portion that requires a bit of manual work. But the groundwork is complete, and we can finally start working on the data model and the actual server next. Our aim is to bring this to a usable state as swiftly as possible.

Huge shout-out to pganalyze for creating and maintaining libpg_query[1], without which this project would not be possible!

[0] https://news.ycombinator.com/item?id=37020610

[1] https://github.com/pganalyze/libpg_query


Thank you for doing this!

Every year I get frustrated with a PostgreSQL formatter, look out into the webs for hope, and begrudgingly return to my sub-par editing experience.

Please take your time to do a good job, and thank you, again!


Have you considered using something like a tree-sitter grammar? It could solve editor-specific use cases like highlighting and even linting, as it creates syntax trees that are more amenable to a language server implementation.


At Splitgraph we compiled tree-sitter to wasm for Postgres autocomplete. Indeed, one major reason we opted for it is that it has error handling so it can work with incomplete syntax (which is what your code is 90% of the time you're editing it).

https://www.splitgraph.com/blog/parsing-pgsql-with-tree-sitt...


thanks for the link, very interesting read! and you are right, libpg_query has its limitations.

the idea is to first implement the parser with libpg_query and work around its limitations as well as possible. since the scan API also returns all tokens for invalid SQL, the language server will then have basic features and syntax error diagnostics for invalid statements, and advanced features for valid ones. once the server itself is done, we want to go back to the parser and replace the libpg_query-based parser with a more resilient alternative, statement by statement. ultimately, the libpg_query-based parser should just be the fallback.
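
for illustration, a sketch of that two-tier behavior with the pg_query crate (assuming its parse/scan bindings; simplified):

    fn demo() -> Result<(), pg_query::Error> {
        let source = "select id frm users"; // typo: "frm"
        // the full parse fails on the syntax error...
        assert!(pg_query::parse(source).is_err());
        // ...but scan() still tokenizes the statement, which is
        // enough for basic features like syntax error diagnostics
        let scanned = pg_query::scan(source)?;
        for token in scanned.tokens {
            println!("token {} at {}..{}", token.token, token.start, token.end);
        }
        Ok(())
    }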

that being said, very excited that there is so much development in postgres dx.


Yes, I think the optimal solution here is a combination of tree-sitter for real time (as-you-type) with a fallback to libpg_query. I mean it's technically the other way around, since libpg_query is preferred when it parses correctly. But yeah I think you inevitably need a combination. Pretty sure TS does similar things in VSCode.
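
Something like this, perhaps (a sketch with the pg_query and tree-sitter crates; you'd configure the tree_sitter::Parser with an SQL grammar beforehand, and the real thing would also need incremental reparsing):

    enum Parsed {
        Exact(pg_query::protobuf::ParseResult), // authoritative when it parses
        Recovered(tree_sitter::Tree),           // error-tolerant while typing
    }

    fn parse_with_fallback(source: &str, ts: &mut tree_sitter::Parser) -> Option<Parsed> {
        match pg_query::parse(source) {
            Ok(result) => Some(Parsed::Exact(result.protobuf)),
            // on a syntax error, fall back to tree-sitter, which still
            // yields a tree with ERROR nodes marking the broken spans;
            // `ts` must already have an SQL grammar configured
            Err(_) => ts.parse(source, None).map(Parsed::Recovered),
        }
    }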


Have you investigated generating your parser from postgres' gram.y instead of basically basing it on the bison output?


that's a very interesting idea!

for now, our goal is to take the "easy" route with libpg_query and build a language server that provides basic lsp features for invalid sql, and advanced lsp features for valid sql, as fast as possible. we then want to go back to the parser and replace the libpg_query-based approach with a more resilient alternative. as of now, the plan is to implement a handwritten recursive-descent parser, statement by statement. will definitely do research into what extent we could leverage gram.y there, especially to potentially fast-track it.


If some annotations or such would make that easier, it might be possible to do that upstream. Postgres currently has hand-generated code for tab-completion in psql and that's uh, not great (it has gotten large enough that msvc has trouble compiling the file).


I can't wait for this. Just getting out a great AST would be amazing for things like formatters, but I want to build things like a library that knows everywhere you use a table in a particular way or reference a column, or linting that can help you identify problems with the way you do a join or where you forgot to apply the right filter for multi-tenant queries.


that is definitely the goal, both a formatter and a linter. we want to add something like squawk[0] and plpgsql_check[1] directly to the language server, so you get eslint-like dx. with both the ast and the database schema at hand, you can basically add any rule you like.

and this is far out, but eventually we may even be able to combine the language server with declarative schema management and have go-to-definition etc. working.

[0] https://github.com/sbdchd/squawk/tree/master

[1] https://github.com/okbob/plpgsql_check


For a formatter you only need tokens and rules. ASTs are unnecessary unless you want to do transformations at the same time.
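
For instance, here's a toy rule over pg_query's raw token stream (a sketch; assumes the pganalyze pg_query crate, where keyword_kind 0 means "not a keyword"):

    // uppercase every keyword, leave identifiers, literals and
    // the original whitespace untouched
    fn uppercase_keywords(source: &str) -> Result<String, pg_query::Error> {
        let scanned = pg_query::scan(source)?;
        let mut out = String::new();
        let mut pos = 0;
        for t in &scanned.tokens {
            let (start, end) = (t.start as usize, t.end as usize);
            out.push_str(&source[pos..start]); // whitespace between tokens
            let text = &source[start..end];
            if t.keyword_kind != 0 {
                out.push_str(&text.to_uppercase());
            } else {
                out.push_str(text);
            }
            pos = end;
        }
        out.push_str(&source[pos..]);
        Ok(out)
    }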


Out of curiosity, why are parser combinators not the approach of choice for something like this? Is it mainly a performance thing?

PS— this is super exciting!


in our specific case, we needed something handwritten for the statement-level parser anyway. and the requirements for the LL parser are very simple: extract individual sql statements from a source input. we just compare the next n tokens with a list of tokens with which any statement starts, which is straightforward to implement with a handwritten LL parser.
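
roughly like this (a simplified sketch of the idea in Rust, not our actual code):

    // each entry: the token sequence a statement can start with
    const STATEMENT_STARTS: &[(&[&str], &str)] = &[
        (&["SELECT"], "SelectStmt"),
        (&["CREATE", "TABLE"], "CreateStmt"),
        (&["CREATE", "INDEX"], "IndexStmt"),
        (&["ALTER", "TABLE"], "AlterTableStmt"),
    ];

    // compare the next n tokens against the table; the longest match
    // wins, so "CREATE TABLE" is preferred over a shorter prefix
    fn statement_at(tokens: &[&str]) -> Option<&'static str> {
        STATEMENT_STARTS
            .iter()
            .filter(|(start, _)| {
                tokens.len() >= start.len()
                    && start.iter().zip(tokens).all(|(a, b)| a.eq_ignore_ascii_case(b))
            })
            .max_by_key(|(start, _)| start.len())
            .map(|(_, name)| *name)
    }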

so after all, I would say it's a decision based on the special requirements we have working around the limitations of libpg_query. I think it's also the fastest option, but this was not the main reason.


Thanks for your response!

I'm a little confused by it though... Maybe I am misunderstanding what you're telling me.

I would definitely consider a parser implemented with parser combinators to be "handwritten". The main difference for me is that parser combinators allow you to express a language grammar in a more declarative way, which makes refactoring and correctness verification much simpler. I think what you describe as "straightforward" without combinators would be "completely trivial" with combinators.

I know from reading recent comp sci papers (particularly in the Journal of Functional Programming) that "the parser problem" is not fully solved yet (!) and that there are various theoretical limits to all currently known formal approaches. I suspect that performance could be a limiting factor for certain types of syntaxes, although I'm not sure if Postgres' would be particularly problematic in this regard.


sorry, I think I misunderstood your question!

after all, we did not implement a "real" parser. we just use libpg_query, the actual Postgres parser, and work around its limitations as well as possible. the implementation thereby required maximum flexibility: we never define any grammar other than "a select statement starts with a SELECT keyword".


Parser combinators are very powerful! For example, your function whitespace_tokens could be something like:

    whitespace_tokens = many1(choice(' ', '\t', '\n')).map(m -> token(m))
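
In Rust, e.g. with the nom crate, roughly the same thing could look like this (a sketch):

    use nom::{branch::alt, character::complete::char, multi::many1, IResult};

    // one or more whitespace characters, collected into a Vec<char>;
    // mapping them into an actual token type is omitted for brevity
    fn whitespace_tokens(input: &str) -> IResult<&str, Vec<char>> {
        many1(alt((char(' '), char('\t'), char('\n'))))(input)
    }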


I really like this!


When will PG finally start using lightweight tasks instead of heavyweight per-connection processes? When will it automatically precompile/hash statements to avoid reparsing? And when will materialized views be automatically updated? Where is an alternative to SQL Server Always On Availability Groups, released >10 years ago? PostgreSQL is the only database where people recommend configuring a separate connection pooler, because the database is unable to handle even a mid-sized number of connections.


it's a great list. No database is perfect, and this is a good set of features that are missing from Postgres

that said, the flexibility/extensibility of Postgres still makes it a great choice. You could easily point at other databases and produce a list similar to (or longer than) this.


Seems like a rant rather than a comment to the post?

Timescale has continuous aggregates that might be interesting, writing a table built by a trigger isn't hard, and some might say having a separate connection pooler is a feature rather than a bug... Idk enough about SQL Server for the availability group part.


When will you contribute?


Given the way Postgres is worshiped as The Holy Grail here on HN, some reality check on where it is lacking should be OK to rant about, just to balance the discussion, with no expectation of fixing it. One can also choose a different tool.

In this thread, it’s not exactly on topic though.


I'll take vertical sharding until no longer possible and then dropping all further requests over touching Always On Availability Groups ever again


vertical scaling*


lol, salty



