Show HN: Translate English to SQL

BrentOzar · on June 5, 2019

Ah, this brings back fun memories. Over a decade ago, Microsoft SQL Server had English Query to do the same thing. Developers had to define all the words & relationships that would be used, and keep them up to date over time. Microsoft discontinued it around 2005.

I remember briefly working with it and realizing, “No, this is never going to work because users have a tough time spelling things accurately.”

johnsonjo · on June 6, 2019

If misspelling is one of the main problems than it would be interesting to have suggestions based on how similar a word was to another, so that you could have something similar to the rust compiler’s “<undefined variable name> does not exist, did you mean <defined variable name>”. I think it uses Levenshtein distance, and maybe some other algorithms/heuristics to pick it’s best guess.

Myrmornis · on June 5, 2019

I am not criticizing this project -- obviously there's a desire for software to understand natural language questions and certainly makes sense to target SQL.

However, it immediately reminds me of the main drawback of SQL: it was designed with the aim that it would be easy for non-programmers to use, at a point in history when that seemed to make sense to the designers. Hence SQL itself has a natural languagey feel, e.g. a keyword like "group by" has an embedded space despite "group" and "by" having no meaning individually. As many others have observed less superficially, SQL lacks composability. It feels less like a programming language and more like a natural language interface to a relational database. It's certainly great, and has proven the test of time, etc etc, but I agree that we can do better, e.g. the following article's discussion, if not their proposed solution.

https://edgedb.com/blog/we-can-do-better-than-sql/

drinane · on June 5, 2019

The reverse is way more interesting to me.... do a seemingly bunch of random aggregations and joins and send me an email of all the English versions and let me look at the results of those if I care to. The issue with these types of ideas is always that only "clean" or heavily analyzed data can generate this level of automation... and this type of tool is really only useful for when you are looking at virgin data or data that is crap.

mynegation · on June 5, 2019

This is a wrapper for a pretrained model. Probably a better github link would be https://github.com/naver/sqlova/

tiborsaas · on June 5, 2019

Since this is using NN under the hood, I would add "INSUFFICIENT DATA FOR MEANINGFUL ANSWER" for zero rows returned :)

How does it handle real life tables such as "employee_stock_option_pool_not_vested_IDS" for example?

mitchtbaum · on June 5, 2019

Dhruv Baldawa, a friend of mine, also has an interest in this and a good write-up: https://www.dhruvb.com/blog/posts/specialized-syntax-for-qui...

amolo · on June 5, 2019

Nice. Very interesting. But 4 gigs just to get I running. That's probably more than the DB I'm trying to query.

z3t4 · on June 5, 2019

There's nothing wrong with being resource efficient, but relatively SQL queries used to take two minutes, and the result printed on paper, and the whole machine cost two million.

sfoley · on June 5, 2019

Yeah and sending a message to someone across the sea used to take months and probably wouldn’t even get there half the time, that doesn’t meant it’s not ridiculous that I can’t run two instances of slack without my computer melting.

z3t4 · on June 6, 2019

the VR world where everything is HR enough that you can go to work inside the simulation. Like in various Sci-fi movies.

hadsed · on June 6, 2019

Without really digging into this, if it's a neural model like a sibling comment mentions perhaps you can do model compression that shrinks many multiples, if that is the majority of the image size.

jtbayly · on June 5, 2019

Yes, I don’t know anything about how/what this is running, but I immediately can’t use it based on requirements. Too bad.

libria · on June 5, 2019

Dockerfile says

    FROM ubuntu:18.04

It'd be interesting to see what the space savings to Alpine would be.

y4mi · on June 5, 2019

at 4g? not much... ubuntu base is <100MB. thats massive compared to alpine with <10MB, but a tiny difference if we're talking 4G.

i wanted to check with dive [1] but sadly couldn't pull the linked image. the mentioned name on the github readme doesnt seem to exist

[1] https://github.com/wagoodman/dive

c8g · on June 5, 2019

does it only work with a question that returns a single answer or it can be used to build something like https://www.thoughtspot.com/solutions/healthcare-life-scienc... ?

o10449366 · on June 5, 2019

Very interesting. I did a similar project for my machine learning class in grad school. I'm curious what kind of architecture/model framework you used for this.

____Sash---701_ · on June 5, 2019

Looks like PyTorch

____Sash---701_ · on June 5, 2019

Or Panther, who knows

newaccoutnas · on June 5, 2019

PyTorch accorging to https://github.com/paulfitz/sqlova#requirements

Also CoreNLP

viach · on June 5, 2019

Why do people fear programming that much, so that even a simple SQL query is tried to be abstracted?

Same thing with visual "no code" programming. Is that really simpler to endlessly move around colorful blocks than writing a few lines in, say, Python?

regular_person · on June 5, 2019

Yes? If you work with it every day you forget how obtuse and opaque programming can be. Most normal people barely understand basic consumer electronics.

TazeTSchnitzel · on June 5, 2019

Programming is like many things not an immensely difficult task, but it is one that requires knowledge and experience, so a true novice confronted with a SQL manual would not be able to quickly write the right code. Someone who's done programming before would have a much easier time.

Myrmornis · on June 5, 2019

It's not for programmers; think more of Siri/Alexa translating to SQL.

(I'm with you, I hate natural language interfaces too, but there we are.)

rgoulter · on June 5, 2019

The programmer's equivalent is "what's the StackOverflow answer for how to do X with Y?". (e.g., Microsoft demonstrated some of their tech with a Stack Overflow Bot https://github.com/microsoft/BotFramework-Samples/tree/maste... -- idk if anyone actually uses it though).

Sure, I should do it the "go fast by going slow" way and spend time to read the manual and go through the tutorial. But often I'll want to know something which many people will have asked.

michaelmior · on June 5, 2019

It's great to see a nice open source example of work in this area. It would be helpful if there were also some examples of where the model fails either by not being able to produce a query or producing an incorrect query.

rezeroed · on June 5, 2019

I might be too much of a control freak to give up the explicit precision of SQL.

Myrmornis · on June 5, 2019

It's not intended to replace your code :) It's intended to turn normal peoples' words into code.

ohadron · on June 5, 2019

I feel that SQL queries that people will need help with are not the SQL 101 examples shown here.

Looking at the queries I write as a product manager, I wouldn't have the slightest clue how to translate them to English. How do you LEFT JOIN on a relation and fetch NTiles on a calculated value from three different tables? Surely this will require some English-to-schema dictionary that someone who knows SQL will have to maintain.

gatherhunterer · on June 6, 2019

Humans think with language. Anything that can be understood can be written in language. The README acknowledges the need for further research and development but it's not hard to imagine how a more complex query could be written in English.

"How many bridges are there that are in the same state as any bridge designed by O.H. Amman and are shorter than the number of kilometers between the Earth and the Moon divided by the number of people living in the state and have a maximum load that is at least as great as the combined weight of all currently living Asian elephants?"

When someone walks over and asks, "What does this query do?" Do you just tell them that it cannot possibly be put into English? Of course not, knowing the answer and knowing how to say the answer in at least one human language are the same thing.

LunaSea · on June 5, 2019

The example is probably wrong or not working since the English questions uses "throgs neck" in lower case instead of the captialized "Throgs Neck" version.

Those are two different strings in SQL and shouldn't match exactly.

This already shows one difficulty of translating a natural language to a SQL query.

laurent123456 · on June 5, 2019

Aren't the queries case insensitive by default?

zainhoda · on June 5, 2019

It depends on the collation method used:

https://kendsnyder.com/utf8_bin-vs-utf8_general_ci/

utf8_bin: compare strings by the binary value of each character in the string

utf8_general_ci: compare strings using general language rules and using case-insensitive comparisons

utf8_general_cs: compare strings using general language rules and using case-sensitive comparisons

novaleaf · on June 6, 2019

can be, I think it depends on the implementation.

nothrabannosir · on June 5, 2019

Looks fascinating. I tried the docker example but I'm getting a 405 Method Not Allowed when I visit localhost:5050/ in the browser :/

herohamp · on June 5, 2019

Check the examples, you meant to upload the CSV and questions using curl

monkeydust · on June 5, 2019

From a product managers pov this looks very interesting - is there a demo environment available somewhere to play with this?

mritchie712 · on June 5, 2019

Just curious, what tool do you use to write SQL today and how much time do you spend doing it? Would you be interested in this to reduce the time you spend writing SQL?

sheeshkebab · on June 5, 2019

I think anyone who uses sql would find this tool (and a lot of similar abstractions) too imprecise, verbose, and error prone.

What's the longest bridge in mariland - or was it merryland (or maybe state_code=MD), is it in bridges table or bridge, or maybe bridge_specs joined on bridges by some id...I'll just write my sql in my favorite sql autocomplete tool.

mritchie712 · on June 5, 2019

I agree, but I was thinking it could give people that don't know much SQL a starting point. They could ask a natural language question > get SQL > tweak SQL. So in your example, they might get something back like:

    SELECT MAX(length) FROM bridges WHERE state = 'merryland'

And they could recognize the error of merryland vs. MD.

exdsq · on June 5, 2019

Really interesting. Where can I learn more about building something similar myself?

FiberBundle · on June 5, 2019

You can e.g. do this using semantic parsing, I remember some older papers but cannot find them right now. For a more recent paper see [1] (haven't read it though). You can also use Deep Learning for this. There are some interesting approaches using Reinforcement Learning [2]

[1] https://www.aclweb.org/anthology/P18-1034 [2] https://arxiv.org/pdf/1709.00103.pdf

mritchie712 · on June 5, 2019

Do you mean building the model on a different training set? If so, this would be a good place to start https://github.com/naver/sqlova/

duxup · on June 8, 2019

This might not be useful directly but this is a super cool project.

quizotic · on June 5, 2019

Does anyone remember Larry Harris and "Natural" back in the 1980s?