Show HN: ClangQL – Query C++ codebases using SQLite

pkhuong · on May 22, 2021

For something less PoCy, https://www.sourcetrail.com/ 's internal representation of the reference graph is a sqlite db file with pretty much a triple store schema.
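To make the "triple store schema" idea concrete, here is a minimal sketch of a reference graph stored as (source, kind, target) rows in SQLite. The table names, column names, and sample data are hypothetical and chosen for illustration; this is not Sourcetrail's actual schema.

```python
import sqlite3

# Hypothetical layout: NOT Sourcetrail's real tables, just a triple-store-like sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE edge (
        source_id INTEGER NOT NULL REFERENCES node(id),
        kind      TEXT    NOT NULL,   -- e.g. 'calls', 'inherits'
        target_id INTEGER NOT NULL REFERENCES node(id)
    );
""")
conn.executemany("INSERT INTO node (id, name) VALUES (?, ?)",
                 [(1, "main"), (2, "parse"), (3, "emit")])
conn.executemany("INSERT INTO edge VALUES (?, ?, ?)",
                 [(1, "calls", 2), (1, "calls", 3), (2, "calls", 3)])

def callers_of(name):
    """Return the names of all nodes with a 'calls' edge pointing at `name`."""
    rows = conn.execute("""
        SELECT src.name FROM edge
        JOIN node AS src ON src.id = edge.source_id
        JOIN node AS dst ON dst.id = edge.target_id
        WHERE edge.kind = 'calls' AND dst.name = ?
        ORDER BY src.name
    """, (name,)).fetchall()
    return [r[0] for r in rows]

print(callers_of("emit"))
```

Every relationship is just another row in `edge`, so new reference kinds need no schema change.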

frabert · on May 22, 2021

The idea for this experiment actually came about because I tried time and time again to produce a Sourcetrail graph for the LLVM codebase but always failed for one reason or another. Then I discovered that LLVM provides a gRPC interface to their clangd index, and I came up with this.

EDIT: also, doing things this way you don't need to reimplement C++ indexing because you can leverage the existing clang features

thechao · on May 22, 2021

Man. I love SQLite, but the current virtual table extension is a performance dumpster fire. SQLite doesn’t understand multiple-column indexes on virtual tables, and the secret sauce to make it pick the “best” index is found only in the Necronomicon. Inevitably, with even fairly trivial joins, SQLite bails out to a polynomial sequential scan.

frabert · on May 22, 2021

I'd be very interested in this SQLite Necronomicon you're talking about :)

thechao · on May 23, 2021

I don't know if you're serious, but the `xBestIndex` function is supposed to be a literal description of the size & efficiency of the various tables. Let's say you have four tables: A, B, C are indexed (log access); table D is linear scan. Then, the `xBestIndex` function should return log(num-rows(X)) for A, B, and C; it should return num-rows(X) for D.

The issue is that SQLite considers the entire table when doing query plans, rather than the specific query that's about to be performed; this means if D is especially short, then it'll choose D as the "driver" table, and then linearly scan A, B, and C. This is not the behavior it uses for its own internal tables. Instead, internal tables are log-scanned based off the "best" table.
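A rough back-of-the-envelope version of the cost description above, as a sketch only: it is a toy model, not SQLite's actual planner, and it assumes each probe into an inner table matches exactly one row. The function names and row counts are made up for illustration.

```python
import math

def estimated_cost(num_rows, indexed):
    """Per-lookup cost as described above: log(n) if indexed, n for a linear scan."""
    return math.log2(num_rows) if indexed else float(num_rows)

def join_cost(driver_rows, inner_tables):
    """Toy nested-loop estimate: visit each driver row once and, for every
    driver row, probe each inner table once (assumes one match per probe)."""
    per_row = sum(estimated_cost(rows, indexed) for rows, indexed in inner_tables)
    return driver_rows * (1 + per_row)

# Hypothetical sizes: D is tiny, while A, B, and C hold a million rows each.
d_rows, big = 10, 1_000_000

# D drives, and A, B, C are probed through their indexes.
with_index = join_cost(d_rows, [(big, True)] * 3)
# D drives, but A, B, C are scanned linearly for every row of D.
with_scan = join_cost(d_rows, [(big, False)] * 3)

print(f"indexed probes: {with_index:,.0f}")
print(f"linear scans:   {with_scan:,.0f}")
```

Under these assumptions, picking the short table as the driver is harmless only if the inner tables are then reached through their indexes; the gap between the two numbers is what the complaint is about.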

I suppose what I'd really like is a strong guarantee that the plans SQLite compiles always use the index. I understand that there are N! possible plan orders for a join, so we can't consider all orders, but whatever mechanism is exposed through `xBestIndex` is just bonkers bad.
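For ordinary (non-virtual) tables, one way to check and constrain a plan is `EXPLAIN QUERY PLAN` together with `CROSS JOIN`, which SQLite documents as preventing the planner from reordering the two joined tables. The sketch below uses hypothetical tables `small_d` and `big_a`; it only illustrates the mechanism on regular tables and makes no claim about how a given virtual table implementation will behave.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE big_a   (id INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE small_d (a_id INTEGER, note TEXT);
""")

def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    # Each result row is (id, parent, notused, detail).
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# CROSS JOIN pins the join order: small_d is the outer loop, big_a the inner one.
forced = plan("""
    SELECT big_a.payload, small_d.note
    FROM small_d CROSS JOIN big_a
    WHERE big_a.id = small_d.a_id
""")

for line in forced:
    print(line)
```

The exact wording of the plan lines varies between SQLite versions, but the first should describe a scan of `small_d` and the second a primary-key search on `big_a`, which is how you can confirm the index is actually being used.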

frabert · on May 23, 2021

Thanks, I was serious and any insight from anyone who has more experience than me in this stuff is appreciated!

gaze · on May 23, 2021

I tried doing this using prolog last summer to extract some features from a codebase. I loved it. Being able to query a codebase like a database is extremely useful.

marco_craveiro · on May 23, 2021

Pretty cool! One question though: if this was based on LSP in general, it could be generalised to any language, right? I wonder why they wired it to clangd specifically.

frabert · on May 23, 2021

I wired it to clangd specifically for two main reasons:

- LLVM provides a remote interface to their index accessible through a gRPC connection

- The clangd protocol is very simple, and the bindings can be generated automatically

I don't know enough about LSP to say whether it would be a suitable protocol to use for this purpose

cryptonector · on May 22, 2021

Cscope as a SQL DB? I would really like that. But it's important not to stop at parsed code, especially in C (for obvious reasons).

laymonage · on May 23, 2021

This is cool! You can do a lot of things with SQLite these days.