
I’ve been doing a lot of work on semantic data architecture that better supports LLM analytics. Did you use any framework or methodology to decide how exactly to present the data/metadata in the LLM context so it can make decisions?



A pre-processing phase does a lot of the heavy lifting: we stuff the table and column comments, additional metadata, and some hand-tuned heuristics into a graph-like structure. We're basically using LLMs themselves to preprocess the schema metadata.
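A minimal sketch of what such a graph-like structure could look like (all names here are hypothetical, not the poster's actual code): each table and column becomes a node holding its comment plus extra metadata, with edges linking columns to their table.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                                   # "table" or "column"
    name: str
    comment: str                                # RAG-searchable description
    metadata: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)   # ids of linked nodes

def build_schema_graph(tables: dict) -> dict:
    """tables: {table_name: {"comment": str, "columns": {col_name: comment}}}"""
    graph = {}
    for tname, tinfo in tables.items():
        graph[tname] = Node("table", tname, tinfo["comment"])
        for cname, ccomment in tinfo["columns"].items():
            key = f"{tname}.{cname}"
            # column node points back at its table; table points at its columns
            graph[key] = Node("column", cname, ccomment, edges=[tname])
            graph[tname].edges.append(key)
    return graph

g = build_schema_graph({
    "orders": {"comment": "Customer orders",
               "columns": {"id": "Primary key", "total": "Order total in EUR"}}
})
print(g["orders"].edges)  # ['orders.id', 'orders.total']
```

In a real pipeline the comments and example queries would come from an LLM pass over the raw schema; the graph itself is just plain data.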

Everything is very boring tech-wise: vanilla postgres/pgvector and a few hundred lines of Python. Every RAG-searchable text field (mostly column descriptions and a list of LLM-generated example queries) is linked to nodes holding metadata, at most 2 hops out. The tool is available to 10,000 users, but load is only a few queries per minute at peak... so performance-wise it's fine.
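The retrieval step described above can be sketched roughly like this (hypothetical names; plain-Python cosine similarity stands in for the pgvector search): find the best-matching embedded descriptions, then pull in linked metadata nodes at most 2 hops out.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def search(query_vec, docs, k=2):
    """docs: {node_id: (embedding, text)} -> top-k node ids by similarity."""
    return sorted(docs, key=lambda d: cosine(query_vec, docs[d][0]), reverse=True)[:k]

def expand(node_ids, edges, hops=2):
    """Collect all metadata nodes reachable within `hops` edges of the hits."""
    seen = set(node_ids)
    frontier = set(node_ids)
    for _ in range(hops):
        frontier = {n for f in frontier for n in edges.get(f, [])} - seen
        seen |= frontier
    return seen

docs = {"orders.total": ([1.0, 0.0], "Order total in EUR"),
        "users.name":   ([0.0, 1.0], "Customer name")}
edges = {"orders.total": ["orders"],
         "orders": ["orders.id", "orders.total"]}

hits = search([0.9, 0.1], docs, k=1)
print(expand(hits, edges))  # {'orders.total', 'orders', 'orders.id'}
```

With pgvector, `search` would be a single `ORDER BY embedding <=> query LIMIT k` query, and the 2-hop expansion a join or two over an edge table.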


Enhancing the comments on the existing data model seems to be the most common approach for sure. I'm implementing this as a data architecture at several clients, and I've found that creating a whole new logical structure designed for the LLM is really effective. Not being bound by the original data model lets you solve several problems: the "n-hops" question, the need for comments in the first place, and the semantics of how data engineers define columns. Some more details here [1], though you can of course implement it entirely by hand.

[1] (https://github.com/eloquentanalytics/pyeloquent/blob/main/RE...)




