Hacker News new | past | comments | ask | show | jobs | submit login
Show HN: Translate English to SQL (github.com/paulfitz)
209 points by paulfitz on June 5, 2019 | hide | past | favorite | 45 comments



Ah, this brings back fun memories. Over a decade ago, Microsoft SQL Server had English Query to do the same thing. Developers had to define all the words & relationships that would be used, and keep them up to date over time. Microsoft discontinued it around 2005.

I remember briefly working with it and realizing, “No, this is never going to work because users have a tough time spelling things accurately.”


If misspelling is one of the main problems than it would be interesting to have suggestions based on how similar a word was to another, so that you could have something similar to the rust compiler’s “<undefined variable name> does not exist, did you mean <defined variable name>”. I think it uses Levenshtein distance, and maybe some other algorithms/heuristics to pick it’s best guess.


I am not criticizing this project -- obviously there's a desire for software to understand natural language questions and certainly makes sense to target SQL.

However, it immediately reminds me of the main drawback of SQL: it was designed with the aim that it would be easy for non-programmers to use, at a point in history when that seemed to make sense to the designers. Hence SQL itself has a natural languagey feel, e.g. a keyword like "group by" has an embedded space despite "group" and "by" having no meaning individually. As many others have observed less superficially, SQL lacks composability. It feels less like a programming language and more like a natural language interface to a relational database. It's certainly great, and has proven the test of time, etc etc, but I agree that we can do better, e.g. the following article's discussion, if not their proposed solution.

https://edgedb.com/blog/we-can-do-better-than-sql/


The reverse is way more interesting to me.... do a seemingly bunch of random aggregations and joins and send me an email of all the English versions and let me look at the results of those if I care to. The issue with these types of ideas is always that only "clean" or heavily analyzed data can generate this level of automation... and this type of tool is really only useful for when you are looking at virgin data or data that is crap.


This is a wrapper for a pretrained model. Probably a better github link would be https://github.com/naver/sqlova/


Since this is using NN under the hood, I would add "INSUFFICIENT DATA FOR MEANINGFUL ANSWER" for zero rows returned :)

How does it handle real life tables such as "employee_stock_option_pool_not_vested_IDS" for example?


Dhruv Baldawa, a friend of mine, also has an interest in this and a good write-up: https://www.dhruvb.com/blog/posts/specialized-syntax-for-qui...


Nice. Very interesting. But 4 gigs just to get I running. That's probably more than the DB I'm trying to query.


There's nothing wrong with being resource efficient, but relatively SQL queries used to take two minutes, and the result printed on paper, and the whole machine cost two million.


Yeah and sending a message to someone across the sea used to take months and probably wouldn’t even get there half the time, that doesn’t meant it’s not ridiculous that I can’t run two instances of slack without my computer melting.


the VR world where everything is HR enough that you can go to work inside the simulation. Like in various Sci-fi movies.


Without really digging into this, if it's a neural model like a sibling comment mentions perhaps you can do model compression that shrinks many multiples, if that is the majority of the image size.


Yes, I don’t know anything about how/what this is running, but I immediately can’t use it based on requirements. Too bad.


Dockerfile says

    FROM ubuntu:18.04
It'd be interesting to see what the space savings to Alpine would be.


at 4g? not much... ubuntu base is <100MB. thats massive compared to alpine with <10MB, but a tiny difference if we're talking 4G.

i wanted to check with dive [1] but sadly couldn't pull the linked image. the mentioned name on the github readme doesnt seem to exist

[1] https://github.com/wagoodman/dive


does it only work with a question that returns a single answer or it can be used to build something like https://www.thoughtspot.com/solutions/healthcare-life-scienc... ?


Very interesting. I did a similar project for my machine learning class in grad school. I'm curious what kind of architecture/model framework you used for this.


Looks like PyTorch


Or Panther, who knows


PyTorch accorging to https://github.com/paulfitz/sqlova#requirements

Also CoreNLP


Why do people fear programming that much, so that even a simple SQL query is tried to be abstracted?

Same thing with visual "no code" programming. Is that really simpler to endlessly move around colorful blocks than writing a few lines in, say, Python?


Yes? If you work with it every day you forget how obtuse and opaque programming can be. Most normal people barely understand basic consumer electronics.


Programming is like many things not an immensely difficult task, but it is one that requires knowledge and experience, so a true novice confronted with a SQL manual would not be able to quickly write the right code. Someone who's done programming before would have a much easier time.


It's not for programmers; think more of Siri/Alexa translating to SQL.

(I'm with you, I hate natural language interfaces too, but there we are.)


The programmer's equivalent is "what's the StackOverflow answer for how to do X with Y?". (e.g., Microsoft demonstrated some of their tech with a Stack Overflow Bot https://github.com/microsoft/BotFramework-Samples/tree/maste... -- idk if anyone actually uses it though).

Sure, I should do it the "go fast by going slow" way and spend time to read the manual and go through the tutorial. But often I'll want to know something which many people will have asked.


It's great to see a nice open source example of work in this area. It would be helpful if there were also some examples of where the model fails either by not being able to produce a query or producing an incorrect query.


I might be too much of a control freak to give up the explicit precision of SQL.


It's not intended to replace your code :) It's intended to turn normal peoples' words into code.


I feel that SQL queries that people will need help with are not the SQL 101 examples shown here.

Looking at the queries I write as a product manager, I wouldn't have the slightest clue how to translate them to English. How do you LEFT JOIN on a relation and fetch NTiles on a calculated value from three different tables? Surely this will require some English-to-schema dictionary that someone who knows SQL will have to maintain.


Humans think with language. Anything that can be understood can be written in language. The README acknowledges the need for further research and development but it's not hard to imagine how a more complex query could be written in English.

"How many bridges are there that are in the same state as any bridge designed by O.H. Amman and are shorter than the number of kilometers between the Earth and the Moon divided by the number of people living in the state and have a maximum load that is at least as great as the combined weight of all currently living Asian elephants?"

When someone walks over and asks, "What does this query do?" Do you just tell them that it cannot possibly be put into English? Of course not, knowing the answer and knowing how to say the answer in at least one human language are the same thing.


The example is probably wrong or not working since the English questions uses "throgs neck" in lower case instead of the captialized "Throgs Neck" version.

Those are two different strings in SQL and shouldn't match exactly.

This already shows one difficulty of translating a natural language to a SQL query.


Aren't the queries case insensitive by default?


It depends on the collation method used:

https://kendsnyder.com/utf8_bin-vs-utf8_general_ci/

utf8_bin: compare strings by the binary value of each character in the string

utf8_general_ci: compare strings using general language rules and using case-insensitive comparisons

utf8_general_cs: compare strings using general language rules and using case-sensitive comparisons


can be, I think it depends on the implementation.


Looks fascinating. I tried the docker example but I'm getting a 405 Method Not Allowed when I visit localhost:5050/ in the browser :/


Check the examples, you meant to upload the CSV and questions using curl


From a product managers pov this looks very interesting - is there a demo environment available somewhere to play with this?


Just curious, what tool do you use to write SQL today and how much time do you spend doing it? Would you be interested in this to reduce the time you spend writing SQL?


I think anyone who uses sql would find this tool (and a lot of similar abstractions) too imprecise, verbose, and error prone.

What's the longest bridge in mariland - or was it merryland (or maybe state_code=MD), is it in bridges table or bridge, or maybe bridge_specs joined on bridges by some id...I'll just write my sql in my favorite sql autocomplete tool.


I agree, but I was thinking it could give people that don't know much SQL a starting point. They could ask a natural language question > get SQL > tweak SQL. So in your example, they might get something back like:

    SELECT MAX(length) FROM bridges WHERE state = 'merryland'
And they could recognize the error of merryland vs. MD.


Really interesting. Where can I learn more about building something similar myself?


You can e.g. do this using semantic parsing, I remember some older papers but cannot find them right now. For a more recent paper see [1] (haven't read it though). You can also use Deep Learning for this. There are some interesting approaches using Reinforcement Learning [2]

[1] https://www.aclweb.org/anthology/P18-1034 [2] https://arxiv.org/pdf/1709.00103.pdf


Do you mean building the model on a different training set? If so, this would be a good place to start https://github.com/naver/sqlova/


This might not be useful directly but this is a super cool project.


Does anyone remember Larry Harris and "Natural" back in the 1980s?




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: