I was going over some of the code in the core folder for concurrency, threading, and compression, and what surprised me is that there are absolutely no comments whatsoever. I agree that unless there's excellent documentation, open-source maintenance might be challenging.
Having said that, this definitely does look to be an impressive feat of engineering!
This looks very impressive!
As another commenter echoed, the code base is ~5 million lines of C++, but with almost no comments at all. Unless the documentation is excellent, maintenance and open-source work are going to be difficult.
P.S. I wonder if LLMs could be used to generate docs and comments for big hairy codebases. It seems the current generation of LLMs lacks the context to do it, but maybe it's "just one or two more papers down the line"®...
While transforming the plan into vectors is interesting, I wish they'd gone into more detail about how the ML model prunes and filters candidate plans to pick the best one. It's also not clear which attributes of a plan the corresponding vector encodes.
I don't know much about Databloom, but it looks like this "Learning-Based Query Optimizer" is built for specific use cases in a data engineering/analytics setting (like the K-means example cited in the article). It might not be a replacement for the optimizers in traditional databases.
> not clear what attributes of a plan the corresponding vector encodes
Fig 5, page 4 from [1]:
> Topology Features
> Operator Features
> Data Movement Features
> Dataset Features
That's for a single logical plan, meaning the feature vector will vary in length for another query. (Which is the part I don't get: do you learn a new model per query? Can you train with a variable feature length?)
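One common workaround for variable-length plan features (this is a guess on my part, not necessarily what the paper does) is to encode each operator as a small fixed-size vector and then pool element-wise over all operators, so plans with different operator counts still map to vectors of the same length. A minimal sketch, with a made-up operator vocabulary and cardinality feature:

```python
# Hypothetical sketch: fixed-length encoding of a variable-size plan.
# Each operator gets a fixed-size feature vector (one-hot type + estimated
# rows); the plan-level vector concatenates sum- and max-pools over all
# operators, so its length is independent of the number of operators.

OPERATOR_TYPES = ["scan", "filter", "join", "aggregate"]  # assumed vocabulary

def operator_features(op_type: str, est_rows: float) -> list[float]:
    """One-hot operator type plus one numeric feature (estimated cardinality)."""
    one_hot = [1.0 if t == op_type else 0.0 for t in OPERATOR_TYPES]
    return one_hot + [est_rows]

def encode_plan(operators: list[tuple[str, float]]) -> list[float]:
    """Pool per-operator vectors (sum and max) into one fixed-length vector."""
    vecs = [operator_features(t, r) for t, r in operators]
    dim = len(vecs[0])
    pooled_sum = [sum(v[i] for v in vecs) for i in range(dim)]
    pooled_max = [max(v[i] for v in vecs) for i in range(dim)]
    return pooled_sum + pooled_max  # length fixed regardless of plan size

short_plan = [("scan", 1e6), ("filter", 2e5)]
long_plan = [("scan", 1e6), ("scan", 5e5), ("join", 3e5), ("aggregate", 1e3)]
assert len(encode_plan(short_plan)) == len(encode_plan(long_plan))
```

With something like this you'd train one model across queries; whether the paper does pooling, padding, or a per-query model is exactly what isn't clear to me.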