Hi HN community. We are excited to open source Dataherald’s natural-language-to-SQL engine today (
https://github.com/Dataherald/dataherald). This engine allows you to set up an API from your structured database that can answer questions in plain English.
GPT-4 class LLMs have gotten remarkably good at writing SQL. However, out-of-the-box LLMs and existing frameworks could not answer questions over our own structured data at the quality level we needed. For example, given the question “what was the average rent in Los Angeles in May 2023?” a reasonable human would either assume the question is about Los Angeles, CA or would confirm the state with the question asker in a follow-up. However, an LLM translates this to:
select price from rent_prices where city='Los Angeles' AND month='05' AND year='2023'
This pulls data for Los Angeles, CA and Los Angeles, TX without getting columns to differentiate between the two. You can read more about the challenges of enterprise-level text-to-SQL in this blog post I wrote on the topic: https://medium.com/dataherald/why-enterprise-natural-languag...
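To make the failure concrete, here is a minimal, self-contained sketch of the problem. The `rent_prices` table, its `state` column, and the sample rows are all invented for illustration; the point is that the generated query silently blends rows from two different states.

```python
import sqlite3

# Toy table with the ambiguity described above: two cities named
# "Los Angeles" in different states.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rent_prices (city TEXT, state TEXT, month TEXT, year TEXT, price REAL)"
)
conn.executemany(
    "INSERT INTO rent_prices VALUES (?, ?, ?, ?, ?)",
    [
        ("Los Angeles", "CA", "05", "2023", 2800.0),
        ("Los Angeles", "TX", "05", "2023", 1100.0),
    ],
)

# The LLM's query: no state filter, so both cities are averaged together.
naive = conn.execute(
    "SELECT AVG(price) FROM rent_prices "
    "WHERE city='Los Angeles' AND month='05' AND year='2023'"
).fetchone()[0]

# What a careful human would write after confirming the state.
careful = conn.execute(
    "SELECT AVG(price) FROM rent_prices "
    "WHERE city='Los Angeles' AND state='CA' AND month='05' AND year='2023'"
).fetchone()[0]

print(naive)    # 1950.0 -- CA and TX rows blended together
print(careful)  # 2800.0 -- CA only
```

The averaged result looks perfectly plausible on its own, which is exactly why this class of error is hard to catch downstream.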
Dataherald comes with “batteries included.” It has best-in-class implementations of the core components, including (but not limited to) a state-of-the-art NL-to-SQL agent and an LLM-based SQL-accuracy evaluator. The architecture is modular, allowing these components to be easily replaced, and the engine is easy to set up and use with major data warehouses.
There is a “Context Store” where information (NL-to-SQL examples, schemas, and table descriptions) is fed into the LLM prompts, so the engine gets better with usage. And we even made it fast!
This version allows you to easily connect to PG, Databricks, BigQuery or Snowflake and set up an API for semantic interactions with your structured data. You can then add business and data context that are used for few-shot prompting by the engine.
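As a rough sketch of what few-shot prompting from a context store looks like, here is a toy implementation. Everything in it — the `Example` class, the prompt format, and the sample schema — is hypothetical and invented for illustration, not Dataherald's actual API:

```python
from dataclasses import dataclass


@dataclass
class Example:
    """A verified natural-language question paired with its known-good SQL."""
    question: str
    sql: str


# A tiny in-memory stand-in for a context store of NL-to-SQL examples.
store = [
    Example(
        "What was the average rent in Los Angeles, CA in May 2023?",
        "SELECT AVG(price) FROM rent_prices "
        "WHERE city='Los Angeles' AND state='CA' AND month='05' AND year='2023'",
    ),
]


def build_prompt(question: str, schema: str, examples: list) -> str:
    """Assemble a few-shot prompt: schema, then prior examples, then the new question."""
    parts = [f"Schema:\n{schema}", ""]
    for ex in examples:
        parts += [f"Question: {ex.question}", f"SQL: {ex.sql}", ""]
    parts += [f"Question: {question}", "SQL:"]
    return "\n".join(parts)


prompt = build_prompt(
    "What was the average rent in San Diego, CA in May 2023?",
    "rent_prices(city, state, month, year, price)",
    store,
)
print(prompt)
```

The idea is that each verified question/SQL pair added to the store becomes an in-context example, steering the model toward the conventions of your particular schema.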
The NL-to-SQL agent in this open source release was developed by our own Mohammadreza Pourreza, whose DIN-SQL algorithm is currently top of the Spider (https://yale-lily.github.io/spider) and Bird (https://bird-bench.github.io/) NL-to-SQL benchmarks. This agent has outperformed the Langchain SQLAgent by anywhere from 12% to 250% (depending on the provided context) in our own internal benchmarking, while being only ~15s slower on average.
Needless to say, this is an early release and the codebase is under swift development. We would love for you to try it out and give us your feedback! And if you are interested in contributing, we’d love to hear from you!
There's just one thing I worry about: losing expertise in your data model and gaining organizational false confidence in bad data. Let's consider Bob. Bob is a Product Manager.
Bob always used to bother his software engineers to write SQL queries, but now he just uses this tool. Bob didn't write the tables or the data structures, so Bob doesn't know the nuances of the data model. Bob just types English and gets result sets back. Bob doesn't know that field order_status can also be in "pending_legal", and neither does the "sql compiler" know when it's appropriate to add or elide that field. Bob then presents his data to leadership to make changes to the Pending Order Logic, based on bad data.
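The comment's scenario can be made concrete with a toy example. The `orders` table, its statuses, and the sample rows are invented for illustration; the point is that a query which doesn't know about `pending_legal` silently undercounts:

```python
import sqlite3

# Invented schema: order_status can be 'pending', 'pending_legal', or 'shipped'.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, order_status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "pending"), (2, "pending"), (3, "pending_legal"), (4, "shipped")],
)

# What "how many pending orders do we have?" likely compiles to:
naive = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE order_status='pending'"
).fetchone()[0]

# What someone who knows the data model would write:
informed = conn.execute(
    "SELECT COUNT(*) FROM orders "
    "WHERE order_status IN ('pending', 'pending_legal')"
).fetchone()[0]

print(naive)     # 2 -- misses the order stuck in legal review
print(informed)  # 3
```

Neither result is self-evidently wrong, which is why Bob walks into the leadership meeting with the smaller number.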