Show HN: Dataherald AI – Natural Language to SQL Engine (github.com/dataherald)
215 points by aazo11 on Aug 23, 2023 | 103 comments
Hi HN community. We are excited to open source Dataherald’s natural-language-to-SQL engine today (https://github.com/Dataherald/dataherald). This engine allows you to set up an API from your structured database that can answer questions in plain English.

GPT-4 class LLMs have gotten remarkably good at writing SQL. However, out-of-the-box LLMs and existing frameworks did not work with our own structured data at the necessary quality level. For example, given the question “what was the average rent in Los Angeles in May 2023?” a reasonable human would either assume the question is about Los Angeles, CA or would confirm the state with the question asker in a follow-up. An LLM, however, translates this to:

select price from rent_prices where city='Los Angeles' AND month='05' AND year='2023'

This pulls data for both Los Angeles, CA and Los Angeles, TX, without selecting any columns that would differentiate between the two. You can read more about the challenges of enterprise-level text-to-SQL in this blog post I wrote on the topic: https://medium.com/dataherald/why-enterprise-natural-languag...

Dataherald comes with “batteries included.” It has best-in-class implementations of core components, including, but not limited to: a state-of-the-art NL-to-SQL agent and an LLM-based SQL-accuracy evaluator. The architecture is modular, allowing these components to be easily replaced. It’s easy to set up and use with major data warehouses.

There is a “Context Store” where information (NL2SQL examples, schemas and table descriptions) is used in the LLM prompts so the engine gets better with usage. And we even made it fast!

This version allows you to easily connect to PG, Databricks, BigQuery or Snowflake and set up an API for semantic interactions with your structured data. You can then add business and data context that is used for few-shot prompting by the engine.
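For concreteness, registering a database with the engine's HTTP API looks roughly like this (a minimal sketch; the endpoint path and payload field names here are assumptions, see the API docs for the real ones):

  # a minimal sketch; endpoint path and field names are assumptions,
  # not the documented API
  import requests

  requests.post(
      "http://localhost/api/v1/database",
      json={
          "db_alias": "rents",  # hypothetical alias for the connection
          "connection_uri": "postgresql://user:pass@host:5432/rents",
      },
  )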

The NL-to-SQL agent in this open source release was developed by our own Mohammadreza Pourreza, whose DIN-SQL algorithm is currently top of the Spider (https://yale-lily.github.io/spider) and Bird (https://bird-bench.github.io/) NL-to-SQL benchmarks. This agent has outperformed the Langchain SQLAgent by anywhere from 12% to 250% (depending on the provided context) in our own internal benchmarking, while being only ~15s slower on average.

Needless to say, this is an early release and the codebase is under swift development. We would love for you to try it out and give us your feedback! And if you are interested in contributing, we’d love to hear from you!




I want to preface this comment so that I don't detract from the idea. This is a really cool product, and obviously has value.

There's just one thing I worry about. It's losing expertise in your data model and gaining organizational false confidence in bad data. Let's consider Bob. Bob is a Product Manager.

Bob always used to bother his software engineers to write SQL queries, but now he just uses this tool. Bob didn't write the tables or the data structures, so Bob doesn't know the nuances of the data model. Bob just types English and gets result sets back. Bob doesn't know that field order_status can also be in "pending_legal", and neither does the "sql compiler" know when it's appropriate to add or elide that field. Bob then presents his data to leadership to make changes to the Pending Order Logic, based on bad data.


It's a risk for Bob; I think he should be the one judging whether it's worth it or not, since he'd probably get in trouble over this. What I take away from this is that it'd be helpful if the tool clearly warned him of this risk.

Other than that, I'm sceptical of using LLMs for lots of things, but this I find cool. In my consulting work (and even before as a CTO), I usually train PMs, CEOs, really anyone who didn't make it up a tree by the count of 3, in some basic code/tech literacy.

Imagine if Bob used this tool to learn about SQL and the data schema. Now he gets a glimpse of the inherent complexities, knows better when and how to ask for help, and can have more trust in and better conversations with those engineers. Sounds naive probably, but I've usually seen these results from my (manual) efforts.

Of course that only works if the tool exposes the generated SQL in some fashion, it looks like it can but I might be wrong. If it's like you say and it's just natural language in, data out, I'd say that's more likely to do harm than good :)

Edit: Took a minute to look at the code, and it looks like there's no API to generate SQL from a question, just the "question" API that generates SQL on the fly and returns the data. I really think there'd be a lot of potential in at least returning the actual SQL query along with the data. Hope I didn't miss something!


Hi -- the /question endpoint does return the generated SQL under the sql_query field of the JSON response: https://dataherald.readthedocs.io/en/latest/api.question.htm...

You can even see the entire chain of LLM calls in the intermediate_steps field to help debug the agent.
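For example (a minimal sketch -- the base URL and request body here are assumptions, but sql_query and intermediate_steps are the actual response fields):

  # a minimal sketch of calling /question; the request shape is an assumption
  import requests

  resp = requests.post(
      "http://localhost/api/v1/question",
      json={"question": "What was the average rent in Los Angeles in May 2023?"},
  ).json()

  print(resp["sql_query"])           # the SQL the agent generated
  print(resp["intermediate_steps"])  # the chain of LLM calls, for debugging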


Yes -- I think transparency on the limitations would be a useful step in mitigation. But I ultimately think that it will become a black box for most of those who use it, with a good deal of blind faith in the results. Sounds pessimistic, but grounded in reality, I think. That's not a deal breaker, though! Just a caveat about how much you can trust the output.


I think a tool that allows you to see the generated SQL and maybe edit it should help achieve what I wrote about! And looking at the code, it shouldn't even be a big change in this one. Sure, some people might not bother/think, there's no helping that. But it'd help those who do. I found non-technical people pretty curious and willing to learn once they get over this imagined barrier of "it's too hard".

I fix small stuff around my house. Does that make me a builder, or threaten builder jobs? Don't think so. I know when to call one and I know to trust them based on understanding the nature and struggles of their work a little bit better.


I agree that allowing a user, particularly a non-technical one, to see the auto-generated SQL (i.e. the SQL created from natural language) might actually encourage understanding of the data. It can create an easier on-ramp for non-tech folks to understand data and get deeper into it.


At the risk of generalizing too much, there's always a chance that layering on a new technology can limit a user's understanding by creating an additional degree of separation. More specifically though, if a PM has been using a BI tool (Tableau, Looker, etc) or a product analytics tool (e.g. Amplitude, Mixpanel) and clicking through dashboards without the ability to find their answer, then I don't think this makes them lose additional expertise in the data model.

And perhaps even more importantly, my experience has been that people ultimately wish to understand trends. In other words, an ad-hoc question that is posed and answered in natural language might ultimately be made into a dashboard that lives on the BI tool (so that trends become clearer).


Very nice! I added a similar feature to https://sitespeak.ai recently that allows my users to connect their virtual assistant to a SQLite database to retrieve data to answer visitor questions. Initially this was just a way around the difficulty of "chatting to a CSV", so the solution was to convert any CSV to a SQLite database and use that, but it's proved very useful.
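For anyone curious, the CSV-to-SQLite conversion is only a few lines with the standard library (a minimal sketch; file and table names are illustrative, and column typing is left to SQLite's dynamic typing):

  # a minimal sketch: load any CSV into a SQLite table so it can be queried with SQL
  import csv
  import sqlite3

  with open("data.csv", newline="") as f:
      rows = list(csv.reader(f))
  header, data = rows[0], rows[1:]

  conn = sqlite3.connect("data.db")
  cols = ", ".join(f'"{c}"' for c in header)
  conn.execute(f"CREATE TABLE IF NOT EXISTS uploaded ({cols})")
  marks = ", ".join("?" for _ in header)
  conn.executemany(f"INSERT INTO uploaded VALUES ({marks})", data)
  conn.commit()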

Will definitely look at using Dataherald for adding proper SQL support!


Thanks. Let us know if we can help out in any way. We just set up a Discord server you can join as well https://discord.gg/A59Uxyy2k9


Very cool! Is this mostly for unstructured data search? Any interest in adding tabular / structured data NL search?


Yes, most customers are using it for unstructured data, but a few do use it with structured data. For those I currently use Langchain's API and SQL chains.


Neat! How difficult would it be to extend this to SQL like query languages like Hasura and PostgREST (via Supabase)?

- https://hasura.io/docs/latest/api-reference/graphql-api/quer...

- https://postgrest.org/en/stable/references/api/tables_views....

^ The advantage would be, these tools handle authorization.


It shouldn't be difficult. If we provide many examples and the db/table schema, the agent can provide an accurate response.


Interesting. So far we have been focused on allowing non-technical users to self-serve from enterprise data warehouses. It might be possible to enable this scenario with a simple addition to the NL-2-SQL engine to translate the generated SQL to API calls. If that does not work, a custom implementation of the NL-2-SQL module would definitely solve this.

Do you have a specific use-case in mind? Would love to collaborate on adding it to the open-source library if you are open to it.


Mostly my concerns are around authorization (who can access what) and those two tools handle that before generating a database query. So maybe the approach would be to go from natural lang -> SQL (as you do now) then to Hasura's or PostgREST's specific query syntax (which are already pretty SQL-like). And specifically with Hasura, since it's GraphQL, you can get all the available types. But that might all be too far out of scope of what you're trying to do.


Re: security - assuming your RDBMS supports row-level security then you should be fine as-is - otherwise let users only query VIEW objects with security filters baked-in to them.
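On Postgres, for instance, the view approach is a handful of statements (a minimal sketch assuming psycopg2; the role, table, and session-setting names are illustrative):

  # a minimal sketch: restrict the querying role to a view with the
  # security filter baked in; all names are illustrative
  import psycopg2

  ddl = """
  CREATE ROLE nl_agent LOGIN PASSWORD 'secret';
  CREATE VIEW orders_visible AS
      SELECT * FROM orders
      WHERE tenant_id = current_setting('app.tenant_id')::int;
  GRANT SELECT ON orders_visible TO nl_agent;  -- and grant nothing else
  """

  with psycopg2.connect("dbname=mydb user=admin") as conn:
      conn.cursor().execute(ddl)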


From the prompt context:

DO NOT make any DML statements (INSERT, UPDATE, DELETE, DROP etc.) to the database.

I've found that spelling out exactly what I don't want an LLM to do dramatically increases the probability that it will do exactly that.


"Don't think of a pink elephant".


Yeah, I hope this is paired with something that actually stops those statements from running otherwise this is a pretty naive way to do this.


Yes, opening the connection to the DB read-only would also work. That's what we're planning on doing.
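On Postgres this can even be done per-connection, with no schema changes (a minimal sketch; the DSN details are illustrative):

  # a minimal sketch: every transaction on this connection is forced
  # read-only, so any DML the LLM sneaks in will error out
  import psycopg2

  conn = psycopg2.connect(
      "dbname=mydb user=nl_agent "
      "options='-c default_transaction_read_only=on'"
  )
  # e.g. this raises psycopg2.errors.ReadOnlySqlTransaction:
  # conn.cursor().execute("DELETE FROM orders")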


SQL injection meets prompt-injection…


NEVER CREATE CARTESIAN PRODUCTS

For example:

SELECT clicks.session_id, clicks.page_url, sessions.user_id FROM clicks, sessions;

(This context is guaranteed to generate lots of cartesian products.)


Couldn’t you solve this simply by executing as a user without these permissions?


Sure, but it's something that the customer has to do. I wouldn't rely on customers to configure everything correctly, especially if there is no way to discover the configuration error before the damage has been done.


Congrats on the initial launch!

Here a few thoughts, feedback, and questions:

* You do a good job in this post of describing why you need more than just ChatGPT to get acceptable quality, but much less of that is in the readme. I wouldn't be afraid to sell the project a bit harder, even in the early days

* Likewise, I think a small visual of the architecture would be helpful, just to make clear what the relationship is to ChatGPT, and the additional features of a context store, etc

* Between the notes here and your product pages, it seems the commercialization strategy is for this repo to be a lower-level tool, with your commercial service adding a UI, simplified integration, and a hosted version to make this easier for semi-technical teams? If that is the case, I wouldn't be afraid to make that more explicit. GitHub is more and more a place for discovery, even for semi-technical people, but to do that well, I think focusing on a readme that makes it clear who the open source is for is important

* How are you planning on solving the data access problem? Is this a full SaaS service that will need access to the customer data directly somehow? Do you deploy this in the customer's environment? In thinking beyond this open source release and to the commercial side, that would be what I would want to know

* This point has been hit by others in this thread, but I would be curious to know what your plans are to help protect against valid but incorrect queries? I am not as convinced that this problem is insurmountable, as it does seem like you could build features to remove ambiguity by asking questions, showing alternatives, etc. that most semi-technical people could reason through, but ultimately it seems like thinking about how to involve the teams that own the data might be an important part of the problem.

Anyways, congrats again! This is an area I am really excited to see how it evolves (and will be exploring more!). My email is in my profile if you are interested in chatting more :)


Thanks for the feedback regarding the Readme. We have some additional detail in the docs here https://dataherald.readthedocs.io/en/latest/index.html (including an architecture doc), but will definitely be incorporating some of the feedback as we iterate on the documentation.

You can connect the hosted version to your data warehouse using the SQLAlchemy connectors which are in the open source engine, and the connection info will be stored after being encrypted. An on-prem version of the enterprise product is also on the roadmap.

For protecting against errors the approach is to allow for a human to be in the loop as you alluded to. One of the core pieces of the engine is the Evaluation module which combines white box and black box techniques. You can block the response from going back to the question asker if the confidence level is below a certain threshold until a human verifies the response manually (or always put the human from the data team in the loop).
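The gating logic itself is conceptually simple (a minimal sketch of the idea; none of these names are Dataherald's actual API):

  # a minimal sketch of holding low-confidence answers for human review;
  # threshold and names are illustrative
  from dataclasses import dataclass

  CONFIDENCE_THRESHOLD = 0.8

  @dataclass
  class EvaluatedAnswer:
      sql: str
      confidence: float

  def release_or_hold(answer: EvaluatedAnswer) -> str:
      if answer.confidence < CONFIDENCE_THRESHOLD:
          return "Held for verification by the data team"
      return answer.sql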


Will shoot you an email to connect and chat further.


I want to double down on what AAzo11 replied. There is a human in the loop via an admin dashboard that allows quick approval / editing in the event of incorrect queries. This significantly decreases the time it takes for a member of the data team to field requests, since they can typically block out an hour to approve/edit these queries before going back to the question-asker.


When I look at all the LLM SQL tools, I think: what a cheap and accessible way to get the wrong answers.

SQL is easy. Knowledge management is hard. Does the LLM know that there was a bug in June that changed the data? Does it know that this one column is improperly named and confusing? Does it know that you recently released a mobile app the data from which is in a different table?

No, of course not, those things are never explicitly documented and so are invisible to an LLM.


> SQL is easy

If it was that easy you wouldn't have a tradition of devs trying for any alternative.

There's obviously a use case for this sort of product; objections appealing to the ease of use of any technical language or toolset are unlikely to be convincing to the majority who are not comfortable with it.


The history of programming is devs frustrated at arbitrary limitations of syntax or modelling and forming new ones, with their own arbitrary limitations of syntax or modelling.

When your only tool is a FOR loop hammer, every set based operation frustratingly looks less like a nail than a screw.


Love this! I don't think you could be more right about the practical challenges of implementing something like this. In my experience, this same problem is what makes it so challenging to onboard new data scientists/analysts.

It takes a lot of training to get a team member up to speed - with the same amount of training, do you think an LLM can compete?


SQL is easy, the problem is that some enterprise database schemas have gotten incredibly complex over time. I think LLMs might help the maintainers navigate such a complex landscape. Especially if comments are added to each table with clarifications. The only serious limit here is the LLM's context length...


Hit the nail on the head! Not only is the context length a limitation, but the speed of response gets impacted as well.

With a human in the loop, even providing a "mostly" correct SQL that takes a swing at the correct joins between relevant tables reduces the data practitioner's work significantly. Of course, as more questions are asked, the tool gets better at writing the SQL. Almost like a human in a Database Management and SQL class...


>Does the LLM know that there was a bug in June that changed the data?

No. Does "the SQL"? Also no.

That requirement falls on the person using the tool, doesn't it?


Is there a reason why an LLM can't learn these nuances the same as a human?


A less sarcastic take:

This only works if you have perfect data engineering. But if you have perfect data engineering, point and click tools like Looker will work too.


Sort of. Having perfect data engineering is a requirement if you want to connect an LLM straight to your data warehouse. For real-world scenarios, you need a way to add context over time (including examples of how to answer questions from messy data). The same way a new team member would need to be on-boarded, the tool needs to learn the context of the data and business logic, store it under supervision from an admin, and be able to retrieve it when generating the SQL.


Personally I think that's a great response. Continuous schema mapping, gotcha-patching, and formalization of undocumented knowledge is imperative, and this can be solved by engineers using an ongoing process. Kudos on the launch.

I did an NL-to-SQL startup, but I think now is a much better time to do this.


Curious as to why you pivoted away from the previous NL-to-SQL startup?


> select price from rent_prices where city='Los Angeles' AND month='05' AND year='2023'

I have to say the best implementations out there are already far ahead. Naturally there are some context limitations with large database schemas, but those can be handled:

- having an easy way to include/exclude tables ad hoc

- importing database schema in multiple different data sources that you can switch between depending on usage

And of course, if the AI isn't using the correct tables or columns, simply prime the prompt using an autosuggest feature containing the tables and columns.

Generating JOINs isn't really a hurdle if you have a good prompt template (a rough sketch of what I mean is at the end of this comment):

- https://aihelperbot.com/snippets/cllovmlat000kmj0fedssj8c5 (notice it even uses the correct "film" tables even though it was instructed to find "movies", and naturally the category and category bridge tables)

- https://aihelperbot.com/snippets/cllnztzky0002mj0fa78hw0o2 (it can also find lat/lng for Boston for AWS Athena; I was demonstrating this for a Boston company)

- https://aihelperbot.com/snippets/cllo1vwl6000wlb0fj5669zt4 (it can generate in Mandarin as well, but less capably than in English. Same goes for German, French, Spanish and so on)

You can also have AI generate advanced SWOT analysis based on your database schema: https://i.imgur.com/JTv4QmX.png
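Roughly the kind of template I mean (a minimal sketch; the wording is illustrative, not the product's actual prompt):

  # a minimal sketch of a schema-primed prompt template; wording illustrative
  PROMPT_TEMPLATE = """You are an expert {dialect} query writer.

  Database schema:
  {schema}

  Use only the tables and columns above, and prefer explicit JOINs
  on the declared foreign keys.

  Question: {question}
  SQL:"""

  prompt = PROMPT_TEMPLATE.format(
      dialect="PostgreSQL",
      schema="CREATE TABLE film (...); CREATE TABLE category (...);",
      question="Top 5 film categories by rental count",
  )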


We are experimenting in this area as a way to let community managers query their Discourse PG database using natural language.

Your product is very interesting. Maybe we can collaborate here, as Discourse is a standard Rails database schema on PostgreSQL and could be a good way to show your tool working in a real database.


Hi -- would love to connect and discuss more. Shoot me an email amir at dataherald.com.


Could you provide more details about your approach? Is the context store based on vector embeddings?


I'd be delighted to assist you with your inquiries. Indeed, the context store interacts with vector databases to retrieve samples based on vector embeddings and cosine similarity.


In your experience, in general when comparing vector embeddings, what layer of the net works best? Are the embeddings from the first hidden layer, the second, the last, etc?


Congrats on your launch!

I also built an NL2SQL solution (relying on OpenAI).

My first version was a direct NL2SQL version, named Olympe - https://github.com/BenderV/olympe I used it quite a while, but trying to plug it into a real database (100+ tables, unprepared) was unsuccessful.

I switched to chat version, named Ada - https://github.com/BenderV/ada

IMHO, it's the way to go. The AI explores the database, its connections, the data format & co. Plus, it helps with ambiguity and feels more "natural".


Nice! How's the switch to Ada been? Any unexpected hurdles?


Great since GPT-4. However, I need to add some optimization, because it takes time for the AI to explore databases / table schemas / structures.


Hey! This is awesome. I have been playing around with the dirty DIN-SQL repo for a couple of weeks, so this is a delight.

A couple of questions:

* How easy is it to swap out the LLM? (to Azure GPT-4)

* Do you have one end-to-end custom user scenario? It would really help to have one golden workflow (the fastest way to get things up and running) and one custom workflow (how to maximally configure the package to make it work the way you want) outlined in a demo/blog/tutorial.

Great work! I especially like how neat and transferrable the ideas in DIN-SQL are. Should naturally improve itself as new LLMs come out.


Hi -- thanks for the kind words. While right now the engine only works with OpenAI, swapping out other LLMs simply from the envars is on the roadmap. Since we use Langchain for LLM calls, this logic is already abstracted away, so it will be a small code change.
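Since the calls go through Langchain, the envar switch could look something like this (a minimal sketch; the variable names are assumptions, not the actual roadmap design):

  # a minimal sketch of picking the LLM from environment variables;
  # variable names are illustrative
  import os
  from langchain.chat_models import ChatAnthropic, ChatOpenAI

  def llm_from_env():
      provider = os.environ.get("LLM_PROVIDER", "openai")
      model = os.environ.get("LLM_MODEL", "gpt-4")
      if provider == "anthropic":
          return ChatAnthropic(model=model)
      return ChatOpenAI(model_name=model)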

We have some samples (including a quickstart in the README and docs https://dataherald.readthedocs.io/en/latest/quickstart.html) but will be adding more tutorials in the coming days.


Hello, and thank you for your positive feedback on our work with Dataherald.

I'm currently in the process of tidying up the DIN-SQL repository, and I apologize for any inconvenience this may cause :).

Regarding the usage of other Large Language Models (LLMs), we offer support for OpenAI, Anthropic, Google, and Cohere models. However, it's important to note that our testing and optimization efforts have primarily been concentrated on OpenAI models. If you intend to utilize other models, it's advisable to proceed with caution and be mindful of potential variations in performance and behavior.


This looks very interesting and timely for me, thanks for sharing!

I was looking at the project to spin it up with a BigQuery connection, but I couldn't find an example in the docs (expected it on this page: https://dataherald.readthedocs.io/en/latest/api.database.htm...), nor can I find it in the repo. Any hints on this? Otherwise I'm fine going source code diving. :)


Hi, you can find updated documentation on connecting to BigQuery here: https://dataherald.readthedocs.io/en/latest/api.database.htm.... We have also updated the ReadMe.


Thanks for the update, yeah that makes sense!

I think it's useful to add a note on how to add credentials when running inside Docker, since that mode is included right there in the readme (i.e. adding a volume with the credentials).

As an idea, for the "Adding string descriptions" section, having stuff like dbt docs (or even just the manifest it generates) could supercharge the context.

Very interesting in general, definitely trying it out for our database!


Nice job! We're building something relatively similar at Vanna AI: https://vanna.ai/


Steve from Hegel-AI had shown us Vanna. Looks super cool. Let us know if you think there is room to collaborate.


Awesome, we'll check it out!


Any solutions like this that work with self hosted models?


Depends what you mean by 'work.' You can definitely use them, but GPT-4 is the only model that generates good SQL without lots of training + fine-tuning (and it is very difficult/impossible to come up with NL <> SQL training sets). So GPT-4 is the only model that works well in terms of accuracy for now.

That being said, I am pretty sure the self hosted models will get there very soon.

Are you looking to deploy a solution with a self-hosted model?


What about datasets like https://huggingface.co/datasets/b-mc2/sql-create-context? An OSS model based on StarCoder was also recently published which is roughly between GPT-3.5 and GPT-4: https://github.com/defog-ai/sqlcoder


Yes, we have use cases that can't leverage GPT-4 for data privacy reasons. I was curious if any self hosted models would suffice and could be integrated.


GPT-4 is best right now for all the reasons you likely already know. We are working on self-hosted, but for right now it is GPT-4. We can let you know when we work on self-hosted, which will no doubt come in the future. Feel free to email me at anuj at dataherald dot com or sign up on our site (https://www.dataherald.com/contact) and I will keep you posted.

I should note that our (Dataherald's) product only sends the schema to the LLM, which is important.


Looks really cool. Congrats on the launch. Typo in the readme, btw, "enteprise-level" instead of "enterprise-level"


Thanks. Noted. We will fix.


This is very interesting. Just curious -- is the context store specific to the table structure of the target database, or is it a very generic one? I'm asking because the DIN-SQL algorithm seems to be a chain-of-thought kind of algorithm (with query decomposition). I'm wondering where you plug in the context store? In DIN-SQL the schema linking module uses random samples from Spider. Is that what's in the context store?


Thank you for your interest in our work. The schema linking approach employed in our agent significantly differs from the one described in my paper. In the paper, we utilized a method that involved breaking down questions and matching entities with the table schema. However, when dealing with large databases, the approach outlined in the DIN-SQL paper is no longer feasible, as we cannot provide all table names and columns within the prompt. As a result, we opted for an alternative method, leveraging table embeddings along with the NL question embedding to facilitate the schema linking process.

Regarding the context store you mentioned, it plays a vital role in our system. We use the context store to retrieve previously verified questions from earlier interactions. This functionality assists the agent in locating the correct tables and columns based on the knowledge gained from past interactions with our system.


Reminds me of https://www.parse.dev Used it and it was pretty cool!


Do you still use this? Thoughts?


This is interesting. I have an enterprise customer who's looking for exactly this. Any plans to support DB2?


Happy to connect and describe more to see whether we could collaborate. Please feel free to email me at anuj at dataherald dot com or sign up here and we will get in touch: https://www.dataherald.com/contact


How are you dealing with stored procedures?


TLDR -- We currently do not automatically load stored procedures as context for in-context learning/prompting, but it is on the roadmap.

Additional color: Once you configure a connection to a DB, you can trigger a 'scan' of the DB through the API which detects the various tables/columns, identifies low-cardinality (categorical) columns and their values, and also loads in VIEWs defined on tables. In our default NL2SQL agent, these are then used together with 'verified' NL2SQL pairs and with other metadata you can manually add to generate the SQL.
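The low-cardinality detection step can be pictured like this (a minimal sketch against SQLite, not our actual scanner; the threshold is illustrative):

  # a minimal sketch: flag low-cardinality (categorical) columns and
  # collect their values during a scan
  import sqlite3

  def categorical_columns(conn, table, threshold=20):
      cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
      found = {}
      for col in cols:
          n = conn.execute(
              f'SELECT COUNT(DISTINCT "{col}") FROM "{table}"'
          ).fetchone()[0]
          if n <= threshold:
              found[col] = [r[0] for r in conn.execute(
                  f'SELECT DISTINCT "{col}" FROM "{table}"'
              )]
      return found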

We are looking to automatically add context from Stored procedures and other sources like dbt and various data dictionaries as well.


I’ve seen impressive demos with just LLM direct to SQL. Can’t wait to see what people build with these new tools.


Hi Dylan -- new GPT-4 class LLMs have gotten good at writing SQL, but while the SQL they generate almost always executes, it often returns incorrect answers to the question. Some reasons for this can be that the business context or definition is not in the table schema, or that the correct query is complex and requires multiple joins.


Yeah, you really shouldn’t be building this (or really any sort of important or destructive interface) with LLMs unless you have an intermediate representation which users can inspect and understand. Even if the understanding of English were perfect, there are still ambiguities you’d need to sort out. You also need to be able to suggest something useful if the user asks a nonsense question (for example, data that doesn’t exist in your schema or its contents). When I worked on this a decade ago we were quite good at mapping unknown queries to various suggestions drawn from a canonical subset of English for which we had very explicit semantics. But obviously the space of unknowns was huge.

I’ve no doubt the AIs will reach human parity at some point (and therefore still make mistakes!) but I’d be terrified deploying any sort of one-shot text-to-results pipeline today.


It's also because of the inherent nature of LLMs and their hallucinations, which are likely impossible to remove completely -- always double-check the LLM's work.


Totally agree. In the hosted version built on top of the engine we block the answer from going to the question asker until an authorized user 'verifies' the answer. However these 'verified' answers are then stored in the context store and retrieved for few shot prompting.
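In prompt terms, the verified pairs end up as few-shot examples ahead of the new question (a minimal sketch; the formatting is illustrative):

  # a minimal sketch of few-shot prompting from verified NL2SQL pairs
  def build_prompt(verified_pairs, question):
      shots = "\n\n".join(
          f"Question: {q}\nSQL: {sql}" for q, sql in verified_pairs
      )
      return f"{shots}\n\nQuestion: {question}\nSQL:"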


How do you do the initial query -> possible schema matching? Also, how do you control token counts?


For schema matching, we leverage embeddings. We create embeddings for the tables within the database and generate one for the natural language question provided. We then calculate the cosine similarity between these embeddings to determine a relevance score. This score is then forwarded to the agent, which uses it to identify the pertinent tables.
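The scoring step is essentially this (a minimal sketch with numpy, assuming the embeddings are already computed):

  # a minimal sketch: rank tables by cosine similarity to the question embedding
  import numpy as np

  def cosine(a, b):
      return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

  def rank_tables(question_vec, table_vecs):
      # table_vecs: {table_name: embedding vector}
      scores = {t: cosine(question_vec, v) for t, v in table_vecs.items()}
      return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)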

To minimize the token count used by Large Language Models (LLMs), we've designed our tools to handle most of the computational work internally. Consequently, only a small number of tokens are transmitted to the agent. The agent focuses solely on decision-making, while the tools handle the heavy lifting and similarity assessments.


How is this approach different from Nalir? https://dl.acm.org/doi/10.1145/2588555.2594519


> In this demonstration, we show that NaLIR, while far from being able to pass the Turing test

It's from 2014, so it (and previous similar projects) is not going to be able to compete with contemporary LLMs.


Yes, exactly. Recent advancements in LLMs have changed the game entirely.


"So is enterprise conversational BI impossible in 2023? Will there be a few more years of academic papers and company AI hackathon projects before a solution can be deployed in production? We don’t think so." -- from the medium article.

Want to bring your attention to https://www.veezoo.com as well. A conversational self-service analytics solution that's been around since 2016, productively deployed in Fortune 500 companies, and used by thousands of users daily :)

Congrats on the launch - and also reaching the top of the Spider dataset. We're very familiar with that dataset and its difficulties :)!

Happy to have a chat as well!


Amazing! Feel free to shoot me a line at amir at dataherald


done :)! Looking forward


Very cool. I'm checking out veezoo.com now!


We were using OpenAI's Codex model with a bunch of few-shot examples for natural language queries over blockchain data, coupled with a fine-tuned completions model for plotting the results, when we were working on makerdojo.io.

Without the ability to fine-tune Codex, we had to employ a bunch of techniques to pick the right set of examples to provide in the context to get good SQL. We have come very far since then and it has been great to see projects like sqlcoder and this.

Pieces of the pipeline powering makerdojo eventually became LLMStack (https://github.com/trypromptly/LLMStack). We are looking to integrate one of these SQL generators as processors in LLMStack so we can build higher level applications.


We are looking to partner with other open source projects. Let's connect. Shoot me a line at amir at dataherald.com


Excellent! This blows writing SQL by hand out of the water.

In 10 years, programming languages like Java, SQL, etc will be dead.

I predict everyone will write in their native language (English, Chinese, etc) and an AI-based compiler will write the actual code for them.


Once an AI is capable of writing sensible code it will be telling humans what to do, not the other way around.


said like a true swe


A mix of magic and imagination.


the future indeed!


Not to dismiss the effort here, but wondering if this should go even further than bridging natural language to SQL. With recent work on LLMs showing promising capabilities in handling complex data structures, maybe the real play is to cut out the SQL altogether. Direct interaction with LLM-optimized data stores could yield more efficient, flexible, and fine-tuned results, eliminating the translational losses that might occur in the NL-to-SQL paradigm.

Like instead of shoehorning LLMs into old paradigms, why not conceive data storage systems inherently designed for LLM interfacing?

There's a latent inefficiency in using SQL as a bridge. It feels like trying to fit a new narrative into an old framework.

And yes, rethinking database architectures from the ground up is daunting and demands more background in storage systems but given where machine learning and data interaction are heading, maybe we should consider it.


This is a rather superficial take that seems to only consider the aspect of querying a database, but, nevertheless, I'll jot down a few quick counterpoints:

1) A middle layer, such as SQL, offers a vital introspection point to understand and fix LLM mistakes. Without it, we will end up with a "black box" that could lead to errors in data queries and manipulations, making debugging and correction much more challenging.

2) Mathematical properties and data integrity: Databases are built around principles like 3NF and BCNF that are pivotal for efficiency and data integrity. It is highly doubtful that these essential principles could be as effectively replicated in a pure LLM-based system without losing critical aspects of data management and efficient query planning & execution.

3) You mentioned "translational losses" without clarity on what these might be. SQL is a specialized declarative language with clear structure and syntax, while human language often leaves room for ambiguity. Relying solely on an LLM could lead to interpretations that are likely but not necessarily correct. This ambiguity might actually increase "translational losses" rather than reduce them.

I understand there's a lot of excitement around generative AI at the moment, but DBMS are an extremely complex topic and this feels like another case of "when you have a hammer everything looks like a nail".


Appreciate your insights. A few comments/responses

1. Agree on the introspection point, but it's worth noting that the future of LLMs might involve self-awareness capabilities, which could provide an introspective mechanism similar to SQL's transparency. The aim would be to build upon this middle layer, not blindly supplant it.

2. While 3NF, BCNF, and other normalization forms have served us well, they are essentially tools to manage imperfections in our storage and retrieval systems. LLMs can be trained to understand these nuances intrinsically. Additionally, database theories have evolved, with advancements like distributed databases, graph-based models, and NoSQL, so it's not entirely outside the realm of possibility that we can pivot and adapt to new paradigms.

3. The "translational losses" referred to the semantic disconnect between natural language queries and their SQL representations. While SQL is unambiguous, the leap from a user's intent to the SQL formulation often introduces errors or inefficiencies. LLMs can be trained to improve over time, constantly refining based on feedback loops.

Not arguing that SQL or databases as we know them are obsolete today, just advocating for a more imaginative exploration of where the tech is headed.


Your answer to question 2 tells me that you haven’t read Codd’s seminal paper.


I doubt any current LLM is up to the job of bypassing SQL, even for the simplest of queries.

The thing is, an SQL database is already a sort of AI - an old school algorithmic AI:

“The code generator, and especially the logic in where*.c and in select.c, is sometimes called the query planner. For any particular SQL statement, there might be hundreds, thousands, or millions of different algorithms to compute the answer. The query planner is an AI that strives to select the best algorithm from these millions of choices.”

https://www.sqlite.org/arch.html

Translating natural language into SQL is ironic, because SQL was intended to be a natural language that would be interpreted by an AI.

In the future maybe an AI could be trained to bypass high-level languages. Perhaps it could generate intermediate representation (IR) code - skipping the tokenising and parsing. Perhaps it could generate byte code directly. Or it could even skip the VM and generate machine code to work with the back-end API.


SQL is definitely an old technology. However it is still the main language for interfacing with structured data, even for newer tools like Clickhouse. Down the road it is conceivable that it will be replaced with something else, but the cost of migrating existing data to those will be so high I imagine no-one will take the step to proactively move.


SQL makes the most sense if humans are going to be programming deterministically, but if we're talking about natural language input I can't help but think that compiling natural language to SQL feels a little shortsighted as a long term strategy.

Surely it makes sense for a MVP, but I'm curious about how much we're potentially forfeiting in terms of innovation by anchoring so deeply to it with all the possibilities afforded by generative AI. In holding onto SQL, might we be forestalling the development or adoption of a more evolved language? With Dataherald's proposition, how do you envision maintaining that cutting edge, ensuring you're not just prolonging the lifespan of a fading tech, but genuinely pushing the envelope forward?

I suppose one intriguing possibility in my mind is that Dataherald could serve as a bridge or runtime. As you continue to develop and refine the tool, do you see a future where it might support more abstract querying languages or even emerging data interaction paradigms? This could potentially mitigate some transition pains when the tech ecosystem eventually shifts.


Do you mean that SQL will significantly fade away?


Curious what recent work you are referencing? Lots of new stuff, so I don't want to miss things. In my experimentation I basically wrote off LLMs from doing any math directly. That includes aggregations or precise calculations. I've been playing with SQL generation and allowing the DB to do its thing.



