Show HN: ClangQL – Query C++ codebases using SQLite (github.com/frabert)
73 points by frabert on May 22, 2021 | 11 comments



For something less PoCy, https://www.sourcetrail.com/ 's internal representation of the reference graph is a sqlite db file with pretty much a triple store schema.
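To make "pretty much a triple store schema" concrete, here is a minimal sketch of a reference graph stored in SQLite as nodes plus (source, kind, target) edges. The table names, column names, and sample symbols are invented for illustration; this is not Sourcetrail's actual schema.

```python
import sqlite3

# Hypothetical schema: NOT Sourcetrail's real one, just the general shape
# of a triple store (source, kind, target) over a table of named nodes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE edge (
        source INTEGER NOT NULL REFERENCES node(id),
        kind   TEXT    NOT NULL,          -- e.g. 'calls', 'inherits'
        target INTEGER NOT NULL REFERENCES node(id)
    );
    CREATE INDEX edge_by_target ON edge (target, kind);
""")

# Made-up sample data: main calls parse and lex, parse calls lex.
conn.executemany("INSERT INTO node (id, name) VALUES (?, ?)",
                 [(1, "main"), (2, "parse"), (3, "lex")])
conn.executemany("INSERT INTO edge (source, kind, target) VALUES (?, ?, ?)",
                 [(1, "calls", 2), (2, "calls", 3), (1, "calls", 3)])

def callers_of(name):
    """Return the sorted names of all nodes with a 'calls' edge to `name`."""
    rows = conn.execute("""
        SELECT src.name
        FROM edge
        JOIN node AS src ON src.id = edge.source
        JOIN node AS dst ON dst.id = edge.target
        WHERE edge.kind = 'calls' AND dst.name = ?
        ORDER BY src.name
    """, (name,)).fetchall()
    return [r[0] for r in rows]

print(callers_of("lex"))
```

Every kind of relationship lives in the same `edge` table, so adding a new relation type means adding rows rather than changing the schema.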


The idea for this experiment actually came from trying time and time again to produce a Sourcetrail graph for the LLVM codebase and always failing for one reason or another. Then I discovered that LLVM provides a gRPC interface to their clangd index, and I came up with this.

EDIT: also, doing things this way you don't need to reimplement C++ indexing, because you can leverage the existing clang features.


Man. I love SQLite, but the current virtual table extension is a performance dumpster fire. SQLite doesn’t understand multiple-column indexes on virtual tables, and the secret sauce to make it pick the “best” index is found only in the Necronomicon. Inevitably, with even fairly trivial joins, SQLite bails out to a polynomial sequential scan.


I'd be very interested in this SQLite Necronomicon you're talking about :)


I don't know if you're serious, but the `xBestIndex` function is supposed to be a literal description of the size & efficiency of the various tables. Let's say you have four tables: A, B, and C are indexed (log access); table D is a linear scan. Then the `xBestIndex` function should report a cost of log(num-rows(X)) for A, B, and C, and a cost of num-rows(X) for D.

The issue is that SQLite considers the entire table when doing query plans, rather than the specific query that's about to be performed; this means that if D is especially short, it'll choose D as the "driver" table and then linearly scan A, B, and C. This is not the behavior it uses for its own internal tables. Instead, internal tables are log-scanned based off the "best" table.
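The gap between those two plans can be put into rough numbers. The sketch below uses a toy nested-loop cost model: scan the driver table once and, for each of its rows, do one lookup in every other table. The row counts are invented, each lookup is assumed to match a single row, and this is not how SQLite's planner actually computes costs.

```python
import math

# Invented row counts: A, B, C are large and indexed, D is tiny and unindexed.
ROWS = {"A": 1_000_000, "B": 1_000_000, "C": 1_000_000, "D": 10}
INDEXED = {"A", "B", "C"}

def lookup_cost(table, use_index):
    """Cost of finding the matching row in `table` for one driver row."""
    n = ROWS[table]
    if use_index and table in INDEXED:
        return math.log2(n)  # indexed probe, logarithmic in table size
    return n                 # full linear scan of the table

def nested_loop_cost(driver, use_index):
    """Toy model: rows in the driver times the per-row cost of probing the rest."""
    per_row = sum(lookup_cost(t, use_index) for t in ROWS if t != driver)
    return ROWS[driver] * per_row

# D drives the join in both cases; only the access path to A, B, C differs.
linear = nested_loop_cost("D", use_index=False)
indexed = nested_loop_cost("D", use_index=True)
print(f"D drives, A/B/C scanned linearly: {linear:,.0f}")
print(f"D drives, A/B/C probed by index:  {indexed:,.0f}")
```

With these made-up sizes the linear plan touches 30 million rows, while the indexed plan needs only a few hundred steps, which is why losing the index on the inner tables hurts so much.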

I suppose what I'd really like is a strong guarantee that the plans SQLite compiles always use the index. I understand that there are N! possible plan orders for a join, so we can't consider all orders, but whatever mechanism is exposed through `xBestIndex` is just bonkers bad.
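Short of a guarantee, the plan SQLite picked can at least be inspected with `EXPLAIN QUERY PLAN`. Below is a minimal sketch on ordinary (non-virtual) tables with made-up names. The exact wording of the plan text varies between SQLite versions, so the check only looks for the `SEARCH` keyword, which marks an index or rowid lookup as opposed to a full `SCAN`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, payload TEXT);  -- lookups by rowid
    CREATE TABLE d (a_id INTEGER, note TEXT);               -- no index at all
""")

def plan(sql):
    """Return the human-readable detail column of each EXPLAIN QUERY PLAN row."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

steps = plan("SELECT a.payload, d.note FROM d JOIN a ON a.id = d.a_id")
for step in steps:
    print(step)

# True when at least one table is reached through a lookup, not a full scan.
uses_lookup = any("SEARCH" in step for step in steps)
print("lookup used:", uses_lookup)
```

Running the same check against a query over virtual tables is one way to see whether a join has fallen back to scanning.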


Thanks, I was serious and any insight from anyone who has more experience than me in this stuff is appreciated!


I tried doing this using Prolog last summer to extract some features from a codebase. I loved it. Being able to query a codebase like a database is extremely useful.


Pretty cool! One question though: if this was based on LSP in general, it could be generalised to any language, right? I wonder why they wired it to clangd specifically.


I wired it to clangd specifically because of two main reasons:

- LLVM provides a remote interface to their index accessible through a gRPC connection

- The clangd protocol is very simple, and the bindings can be generated automatically

I don't know enough about LSP to say whether it would be a suitable protocol to use for this purpose.


Cscope as a SQL DB? I would really like that. But it's important not to stop at parsed code, especially in C (for obvious reasons).


This is cool! You can do a lot of things with SQLite these days.



