A sequel to SQL? An intro to Malloy (carlineng.com)
117 points by carlineng on Sept 6, 2022 | 117 comments



This is the example at https://github.com/looker-open-source/malloy

  query: table('malloy-data.faa.flights') -> {
    where: origin ? 'SFO'
    group_by: carrier
    aggregate:
      flight_count is count()
      average_flight_time is flight_time.avg()
  }


  SELECT
     carrier,
     COUNT(*) as flight_count,
     AVG(flight_time) as average_flight_time
  FROM `malloy-data.faa.flights`
  WHERE origin = 'SFO'
  GROUP BY carrier
  ORDER BY flight_count desc         -- malloy automatically orders by the first aggregate
I don't see much value in this. This is not aesthetically better than SQL, nor is it semantically better. This is just a different syntax that would parse to the same AST. And what's with the `?` for equality?!


Try writing this SQL. Malloy writes SQL you can't (I'm pretty sure of it).

  https://twitter.com/lloydtabb/status/1556287859601985536


> Malloy writes SQL you can't (I'm pretty sure of it).

Here you go. Malloy looks like a fine tool but it's an unnecessary mental burden for people who understand SQL and I think the below is more maintainable long-term.

  with
  aggregates as (
    select
        state,
        name,
        sum(sum(number)) over (partition by state, name) as state_name_births,
        sum(sum(number)) over (partition by state) as state_births,
        sum(sum(number)) over (partition by name) as name_births,
        sum(sum(number)) over () as total_births
    from `bigquery-public-data.usa_names.usa_1910_current`
    group by state, name
    qualify state_name_births > 1000
  ),
  
  scoring as (
    select
      state,
      name,
      (state_name_births / state_births) / (name_births / total_births) as popularity_score,
      rank() over (partition by state order by (state_name_births / state_births) / (name_births / total_births) desc) as rnk
    from aggregates
    qualify rnk <= 20
  )
  
  select
    state,
    string_agg(name order by rnk) as names
  from scoring
  group by state
  order by state
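The heart of that query is the "share within state vs. share overall" scoring. Here's a minimal runnable sketch of the same pattern using SQLite through Python, with tiny made-up sample data (the table, names, and numbers are hypothetical, not the real BigQuery dataset); SQLite doesn't allow the nested `sum(sum(...))` form, so the grouping is done in a CTE first:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (state TEXT, name TEXT, number INTEGER)")
conn.executemany(
    "INSERT INTO names VALUES (?, ?, ?)",
    [("CA", "Ann", 30), ("CA", "Bob", 10), ("NY", "Ann", 10), ("NY", "Bob", 50)],
)

# popularity = (name's share within its state) / (name's share overall)
rows = conn.execute("""
    WITH agg AS (
        SELECT state, name, SUM(number) AS births
        FROM names
        GROUP BY state, name
    )
    SELECT state, name,
           (births * 1.0 / SUM(births) OVER (PARTITION BY state)) /
           (SUM(births) OVER (PARTITION BY name) * 1.0 / SUM(births) OVER ())
             AS popularity_score
    FROM agg
    ORDER BY state, name
""").fetchall()

for state, name, score in rows:
    print(state, name, round(score, 4))
```

With this sample data, "Ann" is over-represented in CA (score 1.875) and "Bob" in NY, which is exactly the signal the ranking step then sorts on.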


I wrote a similar one (forgot about partitioning because I need it so rarely; I adapted yours for SQL Server to compare). Mine ran only a third as fast as your solution (1.672 seconds vs. 0.545), but with a quarter of the reads (82,280 vs. 292,066) :D

And yeah, I agree. That wasn’t actually that hard and is far more readable if you know SQL.

For the sake of completeness, here’s mine:

    WITH CTE AS (SELECT main.Name,
                        main.State,
                        main.Births,
                        (CAST(main.Births AS float) / sb.births_in_state) /
                        (CAST(nb.births_of_name AS float) / total.all_births) AS popularity,
                     ROW_NUMBER() OVER (PARTITION BY State
               ORDER BY (CAST(main.Births AS float) / sb.births_in_state) /
                        (CAST(nb.births_of_name AS float) / total.all_births) DESC) AS RankPerState
                 FROM BirthCount main
                          CROSS APPLY (SELECT SUM(Births) as all_births FROM BirthCount) total
                          CROSS APPLY (SELECT SUM(Births) as births_in_state FROM BirthCount WHERE main.State = State) sb
                          CROSS APPLY (SELECT SUM(Births) as births_of_name FROM BirthCount WHERE main.Name = Name) nb
                 WHERE main.Births > 1000
                 GROUP BY main.Name, main.State, main.Births, total.all_births, sb.births_in_state, nb.births_of_name)
    
    SELECT State, STRING_AGG(Name + ' (' + CAST(ROUND(popularity, 3) AS varchar(50)) + ')', ', ')
    FROM CTE
    WHERE RankPerState <= 10
    GROUP BY State
    ORDER BY State

edit: Forgot to show the small view I created to make it more readable:

    CREATE VIEW BirthCount AS
    (
    SELECT Name, State, SUM(Number) AS Births
    FROM Names
    GROUP BY Name, State
    )


Your Twitter thread shows the SQL that is equivalent, so it clearly can be done. Just because Malloy produces SQL that is impossible to read or write doesn't mean that the problem is only solvable with impossible-to-read SQL.

That would be like saying that Dreamweaver lets you do things that HTML+CSS don't because "look at how impossible to read the HTML output is!"


Give it a shot. Malloy writes the query in the twitter thread with a single database read. It computes the query, simultaneously at 4 levels of dimensionalization but only touches the disk once. No joins.


I'm unclear: are you describing the compilation process or the way the database's query planner handles the output query?


My first thought was that the generated SQL looks a lot like Looker, and indeed this is from them.

Nothing in that query is remotely novel or difficult. The style isn't the most readable, but any experienced data engineer has read and written far more heinous queries.

This is obvious because someone had to write the generator for these queries, and so it's not surprising that the techniques that appear in the generated queries are recognizable.

"Malloy writes queries you wouldn't enjoy writing" is more accurate, but there's no evidence it's doing anything a human can't.


Malloy writes SQL that I wouldn't write, that's for sure. Nothing there is hard to do, let alone impossible, with commonly used SQL constructs.


SQL + recursive queries (standardized back in SQL:1999 and now supported by every major engine) is Turing complete. Thus every arrangement and selection of data is reachable by a query. Therefore there is nothing Malloy can do that SQL can't (I'm absolutely sure of it).
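For anyone who hasn't seen the trick, a recursive CTE feeds each row back into itself, which is what gives this fragment of SQL its computational power. A small runnable illustration (SQLite via Python, generating Fibonacci numbers):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Each step of the recursive CTE consumes the previous row and emits the next.
rows = conn.execute("""
    WITH RECURSIVE fib(n, a, b) AS (
        SELECT 1, 0, 1
        UNION ALL
        SELECT n + 1, b, a + b FROM fib WHERE n < 10
    )
    SELECT a FROM fib
""").fetchall()

print([a for (a,) in rows])  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```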


There is also nothing Java can do that Brainfuck cannot. It is very easy to implement an extensible web server in BF, you just need to F your B.


In any language there are going to be certain possible arrangements which take an exponential number of words to reach. By simple counting arguments, some things are always incompressible. I can contort myself if I so desired to find something easily reachable in Brainfuck that would be tedious to reach in JS.

The only "way out" is for one language to force rote memorization of sophisticated abstractions as part of its vocabulary. Eg you can make a language that is superficially more capable than Brainfuck because it contains all of Brainfuck as its vocabulary + a few extra words that encapsulate everything you need to implement an extensible web server. But that's not really an improvement in expressive ease, you've just swept the difficulty of constructing those expressions into learning the language in the first place, then showed off how easy it is to express the stuff you just defined words to express.

You don't get more utility out of the marginal extra vocabulary than the work you put into defining the marginal extra vocabulary.


I think you’ve completely written off human limits to produce and maintain correct code, while being technically correct.

People swear when they have to extract a symbol from an expression because that pollutes a scope and reduces readability (see “assembly” as an extreme example of it). You can’t expect them to be cool with a ladder of queries to do simplest things. They choose languages for comfort much more than for technical yadda yadda. SQL is only comfortable in comparison to other '80s tech.


I learned SQL in the late 00s. I like SQL. It's not Stockholm Syndrome either, I don't think. SQL isn't stylish, and it isn't beautiful, but it's pretty clear in spite of that. More than that, SQL is a lingua franca for databases, and it's very powerful -- this is a pretty difficult thing to beat, and the only thing that I've seen that can do it is LINQ, which isn't a language so much as an API for constructing query ASTs.


Human language and reasoning ability also runs into the problem that certain ideas really do take a whole book to convey whereas others fit into fifty words (like this post). We just don't notice the limitation on our expressiveness for the same reason fish don't notice they live in water.


I think OP meant "can't" as in "probably wouldn't be able to come up with the queries", not that it's literally impossible. Malloy compiles to SQL, so it's by definition impossible for it to express something SQL cannot.


By that definition, in that case, OP "can't" express themselves in plain English.


Pretty sure OP meant emphasis on _you_ can't. As in, things a human can't do.


In practice, it may be much easier to write a malloy query than the equivalent sql query.


Or it may not be.

Showcase the typical way you'd write a complicated SQL query and compare it against Malloy. It's like looking at EF4-6 output and thinking it's impressive and doing superhuman tasks because the generated queries were suboptimal and complicated.


Seems like all the magic is in this mysterious "all" thing, which is perfect for this particular problem... but I can't tell if it's part of a general coherent approach or just a handy gadget that makes Malloy look good on this problem.


Yeah the ? for equals is not better


None of it is.

Examples of things that would be better would be syntax for reference dereferencing. E.g., in SQL-like syntax:

  SELECT user->home_server->OS_version.version AS v,
         user->home_server->OS_version.is_obsolete AS obsolete
  FROM users
  WHERE name = 'whatever';
or

  SELECT user->home_server->OS_version.{version, is_obsolete} AS {version, obsolete}
  FROM users
  WHERE name = 'whatever';
where we don't have to join to the home servers table or the OS versions table, and also we don't have to repeat ourselves too much.

Where the relations are not 1-1 then an array_agg() aggregation could be inferred, or some alternative aggregation could be explicitly given:

  -- this would be an array of {name, type} values
  SELECT g.members.{name, type}
  FROM groups g
  WHERE g.name = 'whatever';

  -- ditto
  SELECT g.members.{name, type}.array_agg()
  FROM groups g
  WHERE g.name = 'whatever';


  SELECT g.members.name.count()
  FROM groups g
  WHERE g.name = 'whatever';
Now all that would be fantastic.

Tinkering with totally different syntax where the shape of the query looks just about the same as in SQL seems like a dead end to me.


https://1c-dn.com/library/query_language/

It was a pleasure to work with business logic on this platform, but sadly it went completely nuts trying to implement "async" web compatibility in a traditionally procedural language in recent versions. Also it is Russian-oriented (and I guess you can safely replace "oriented" with "only" for the next decade at least).

SQL definitely has a huge growth potential as a language, but you can’t tell internet about it without being instantly dismissed.


And this basically gets us to EF Navigation Properties with LINQ.

  users
    .Where( user => user.name == "whatever")
    .Select( user => new
    {
      v = user.home_server.OS_version.version,
      obsolete = user.home_server.OS_version.obsolete
    })


Tangential, but I never understood why SQL puts the FROM after the SELECT. It's extra confusing because sometimes people want to use aliased fields from the SELECT section in the FROM section, but that won't work because it's not evaluated in that order. It also makes much more sense to first name the table, then its fields.


There's no real reason. I suspect it would be a very small change to some SQL implementations to allow the various clauses of queries to come in any order.


The sparksql dialect does allow the from clause to come first, actually.


It's not about any single query. It's about the reusability of modularized definitions for BI analysis.


SQL already does reusability using VIEWs and functions.
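A quick runnable illustration of the point (SQLite via Python, hypothetical table): define an aggregation once as a view, then reuse it from any number of downstream queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE names (state TEXT, name TEXT, number INTEGER);
    INSERT INTO names VALUES ('CA', 'Ann', 30), ('CA', 'Ann', 10), ('NY', 'Bob', 5);
    -- Define the aggregation once...
    CREATE VIEW birth_count AS
        SELECT state, name, SUM(number) AS births
        FROM names
        GROUP BY state, name;
""")

# ...then reuse it like any table.
rows = conn.execute(
    "SELECT state, births FROM birth_count ORDER BY state"
).fetchall()
print(rows)  # [('CA', 40), ('NY', 5)]
```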


Aesthetics are largely subjective. Even though I like Malloy syntax better than SQL syntax, the majority of the actual post is dedicated to reasons above and beyond syntax why I think Malloy has promise; specifically, its integration of a semantic layer into the core of the language.


There is also PRQL which is more intuitive and simpler IMO - https://github.com/prql/prql


PRQL is very cool, but I think Malloy is meaningfully different (and more useful) because of its inclusion of a semantic layer into the core language.


Thanks for posting your thoughts. I'm having a bit of difficulty breaking down "semantic layer" into concrete technological concepts. An attempt:

* Schema definition. PK/FK columns and their relationships. Standard ER fare.

* Dimensions. Partially overlap with PK/FK structure, but may include fields that don't map to an explicit key column, e.g. [trunc] date or zip code or even binned measures. Can see the value of having dimensions documented across a team.

* Measures. Mainly named aggregations, e.g. value = sum(price * quant), which can be aggregated over many dimensional combinations. Definitely useful, though I'd expect PRQL "functions" to be usable in the same role.

* Formatting rules.

Am I missing something crucial?
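To make the "measures" bullet concrete, here's a minimal sketch of what reuse might look like: a named aggregation defined once and applied over arbitrary dimension combinations. This is a hypothetical helper, not Malloy's or any real tool's API.

```python
# A "measure" is just a named SQL aggregation expression.
MEASURES = {"value": "SUM(price * quant)"}

def rollup(table: str, dims: list[str], measure: str) -> str:
    """Generate a GROUP BY query applying a named measure to any dimension set."""
    cols = ", ".join(dims)
    return (f"SELECT {cols}, {MEASURES[measure]} AS {measure} "
            f"FROM {table} GROUP BY {cols}")

print(rollup("orders", ["state", "zip"], "value"))
print(rollup("orders", ["state"], "value"))
```

The value of a semantic layer is that `SUM(price * quant)` is written once, so every report agrees on what "value" means.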


Nope, those are the primary components of a semantic layer. Most other semantic layer products have three main components: input data sources -- typically SQL queries or table names, configuration -- the semantic layer describing all the things you mention above, relative to the input data sources, and the access layer -- usually a non-SQL API that consumers must use to consume data that has been modeled by the semantic layer. Check out the docs for Cube [1] for an example of this. Cube also has a SQL API, but it's not fully fleshed out yet.

What makes Malloy shine is that all 3 of these things are integrated in the same language, so users don't have to jump between different tools to model and explore their data. You can query/explore data in Malloy, iterate on functions to express your business logic, and immediately view the results. Doing this in something like Cube would require you to: (1) write SQL queries to prototype the function, (2) update the Cube configuration files with your changes, and (3) hit the REST API with a request to view results. In Malloy, it's all just writing and running Malloy queries.

[1]: https://cube.dev/docs/query-format


I would also suggest PRQL, which is also great. Plus: the license of PRQL is better (Apache v2 vs. GPL v2), and while having a modeling layer is good, something like dbt-metrics may be better; on that point, PRQL is more focused on the querying part.


Reading about the "semantic layer" it very much reminds me of the kind of things people do in an ORM. That is: how do tables relate, refinements of data types like strings where a column might have specific semantics... this post doesn't go into much so I don't know if Malloy also allows specifying things like how updates should happen (do you update in place or create new records?), reusable queries (especially given its nesting), knowledge of indexed vs unindexed queries, etc. All of this stuff usually either gets stuffed in the ORM layer, or exists only as folk wisdom about specific databases.

It is peculiar that databases typically lack referential integrity, something that we've decided is absolutely essential in other programming environments.


I am confused here. Referential integrity can only be implemented in the database. If you try it in the application, there are race conditions that will break it. RI is a major reason to use an RDBMS in the first place.


Semantic layers usually go an extra mile beyond just ORM/E-R modeling. It is best thought of as an abstraction layer between the actual physical data system and the consumption tool (usually a BI or analytics app.)

They do things like define KPIs (how do we as an organization calculate "cost of goods sold") and business logic (what is an "invalid" order?), harmonize entities and attributes across multiple data sources (System A calls something foo, System B calls the same thing bar), provide localization options (currency, date formatting) and so on.

It can also do "basic" things like referential integrity, E-R modeling, aliasing columns, fixing data types for downstream consumption, etc.

Usually it is agnostic to the actual underlying data system (warehouse, lake, SaaS API ...)


It looks closer to a Data Fabric, where you have an ORM as a service on top of all your heterogeneous data sources and services: a semantic layer that enables you to define models across all your sources, and a data virtualization query engine that gets the data from these different sources without replicating all the data.


Out of necessity I've started working with Microsoft's Kusto Query Language [1], which is used by various services in Azure (e.g. their Log Analytics Workspace).

At first I found the language rather awkward and wondered why we need yet another query language. But the more I used it, the more it grew on me. The thing I really like is that, unlike the clauses in SQL, the order of operators isn't really fixed, and it reads and feels like a pipe command in a Unix shell.

One example where I find this far superior is when doing aggregations. In SQL I would have to modify both the start and the end of the query, which is quite a nuisance.
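A rough Python analogy (not KQL itself, just an illustration of the composition style): pipeline stages compose left-to-right, so adding an aggregation means appending a stage, not editing both the SELECT list at the start and the GROUP BY at the end of a statement.

```python
# Toy data standing in for a flights table.
flights = [
    {"origin": "SFO", "carrier": "UA"},
    {"origin": "SFO", "carrier": "UA"},
    {"origin": "SFO", "carrier": "AA"},
    {"origin": "LAX", "carrier": "AA"},
]

def where(rows, pred):
    """Filter stage: keep rows matching a predicate."""
    return [r for r in rows if pred(r)]

def summarize_count_by(rows, key):
    """Aggregation stage: count rows per value of `key`."""
    out = {}
    for r in rows:
        out[r[key]] = out.get(r[key], 0) + 1
    return out

# Conceptually: flights | where origin == "SFO" | summarize count() by carrier
result = summarize_count_by(where(flights, lambda r: r["origin"] == "SFO"), "carrier")
print(result)  # {'UA': 2, 'AA': 1}
```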

[1] https://docs.microsoft.com/en-us/azure/data-explorer/kusto/q...)


(disclaimer: Microsoft Employee, this is my opinion).

I've been using KQL for a long time, it really is a nice language both for querying and maintaining the data.

But, aside from fixing some serious language issues with SQL, I really enjoy the wide range of supported scenarios. You can use KQL to query a SQL database [1], you can use python [2], do all kinds of time-series analysis [3][4], do distinct counts on various fields without too much explicit query-authoring [5].

My main beef with Azure Data Explorer (which, as I understand it, is the engine that handles the query execution) is the price... I wish it was easy for hobbyist developers to launch and try out.

[1] https://docs.microsoft.com/en-us/azure/data-explorer/kusto/q... [2] https://docs.microsoft.com/en-us/azure/data-explorer/kusto/q... [3] https://docs.microsoft.com/en-us/azure/data-explorer/kusto/q... [4] https://docs.microsoft.com/en-us/azure/data-explorer/kusto/q... [5] https://docs.microsoft.com/en-us/azure/data-explorer/kusto/q...


There is this page about a free cluster [1]. It appears to be pretty beefy, but it seems to be for one year only, though IMHO that isn't entirely clear.

[1] https://docs.microsoft.com/en-us/azure/data-explorer/start-f...


Today Microsoft announced that an emulator for Azure Data Explorer is generally available [1]. It is a Windows docker container.

[1] https://aka.ms/adx.emulator.doc


Speaking of custom query languages by Big Tech companies, I've always been curious about Google's Yedalog[0]. I mean it's obviously inspired by a rich history of logic programming here, as it's trying to find a middle ground between Prolog and Datalog. I'm not sure what's come of it, and I certainly hope they eventually make it open source or at least available for BigQuery for something.

[0]: https://static.googleusercontent.com/media/research.google.c...


Kusto is used extensively within Microsoft and has been for a long time. I think it's generally really well liked and really productive, and while it tends to be quite quick, it has some similar performance pitfalls as SQL


The query structure does look nice indeed! The usage of the pipe character feels odd but I suppose there are benefits in the end


I had the same initial reaction regarding the pipe character. But once I started thinking of the query as a pipe (like in the terminal) through which the data flows, where stuff like ORDER BY, SELECT, and GROUP BY are just operators, it started making sense.


Related:

Malloy – A Better SQL, from Looker - https://news.ycombinator.com/item?id=30053860 - Jan 2022 (99 comments)

Malloy: An Experimental Language for Data - https://news.ycombinator.com/item?id=28926349 - Oct 2021 (1 comment)


My musings on why SQL has been so hard to displace, and what it might take to do so. Any and all feedback appreciated!


The thing that is absolutely essential to displace SQL is to have the replacement be the native interface to a high-performance database that people want to use.


The minimum enhancement I want for SQL is a version where no literal values are allowed in queries, as this would completely preclude SQL injection :)

To make that more tolerable for query planning purposes, there would have to be two types of query parameters: compile-time and run-time.

Next up: why not allow query clauses to come in any order? `SELECT .. FROM .. WHERE ..;` or `FROM .. SELECT .. WHERE;` and so on.

Since the query parser/planner has to see the whole thing anyway. The parser/planner can't start planning at `SELECT`, or at `WHERE`, since there might be a `GROUP BY` or an `ORDER BY` that affects the whole query plan, so all these clauses might as well come in any order. Wanna put `HAVING` first? Sure, why not. It's probably best to insist that table sources all come together rather than be scattered about, but I think even that doesn't have to be so.

Also, I'd like an out-of-band mechanism for expressing query planner hints. This would be a separate string or object passed along with the query, and which does things like: identify a table source to use as the outer-most table for the query plan, for each of some or all table sources identify an index to use or temp index to create, for each of some or all joins pick a join strategy, etc. Table sources would have to be addressed as {<CTE_name>, <table_source_name>}, naturally. Such a thing should also allow one to specify indices that should be created on CTEs.


Not sure that’s worth it? It’s not a difficult software engineering problem to prevent SQL injection categorically.


A lot of safety/security isn't hard, people just don't do it if it isn't forced on them/ easy.


This. If SQL injection was infeasible because the language had no literals, then there would be no SQL injection. Given that SQL injection is possible because the language does have literals, we should and do in fact still have SQL injection issues.


The fact that SQL has literals is not a problem. It’s true that eliminating literals would prevent SQL injection but it’s an absurd solution to the problem. How should we write queries by hand? What about things that actually should be constant (e.g where col is null)?


> this would completely preclude SQL injection :)

No it wouldn't! I regularly see web apps with URLs like this:

    someapp/reports/foo?start=1000&end=2000&sort=DESC
There are two fixes for this that mostly work:

1) Use a binary-only format that has no string representation at all. Instead, queries must be built up using a fluent builder style, or similar API surface. This is fine for "apps", but is less useful for ad-hoc queries obviously.

2) Use something similar to what web app frameworks do when generating HTML. Instead of accepting naked 'string' types, only accept something like an 'IEncodedHtml' type that automatically converts from strings, but escapes them in the process. Passing in a raw string as SQL would require some sort of scary-looking function call that is easy to grep in a codebase. E.g.: "QueryBuilder.UnsafeConvertTrustedString(...)" to really drive the point home. And then of course, IDEs and build tools would need to emit warnings or outright errors if HTTP inputs are directly passed to this function as-is.
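A sketch of option 2 in Python (all names here are hypothetical, echoing the scary-sounding call above): query fragments must be a dedicated type, and blessing a raw string takes one deliberately loud, greppable function.

```python
class SqlFragment:
    """A query fragment that has been vetted; the only thing builders accept."""
    def __init__(self, text: str):
        self.text = text

# Known-safe fragments can be pre-built constants.
ALLOWED_SORT = {"ASC": SqlFragment("ASC"), "DESC": SqlFragment("DESC")}

def unsafe_convert_trusted_string(s: str) -> SqlFragment:
    """The one loud escape hatch for raw strings -- easy to grep for in review."""
    return SqlFragment(s)

def order_by(sort) -> str:
    # A bare string fails fast instead of being appended blindly.
    if not isinstance(sort, SqlFragment):
        raise TypeError("sort must be a SqlFragment, not a raw string")
    return f"SELECT * FROM foo ORDER BY id {sort.text}"

print(order_by(ALLOWED_SORT["DESC"]))  # SELECT * FROM foo ORDER BY id DESC
```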

Having said that, nothing will ever be sufficient. Developers start sentences with: "I just want to...", which is basically to say that they don't want to have to "deal" with security and just get it "working".

Another GIANT mistake I see in most (all?) database clients is that there is no distinction between needing a read-write connection or a read-only connection.

Making read-only the default would have huge benefits, such as:

1) Making all apps automatically gain the ability to do read-scale-out when using systems such as SQL Server AlwaysOn Availability Groups with readable secondaries. Right now this takes a major application code change to have any benefit, which means "nobody" does it. I've been literally begging and pleading with developers to just add a second connection string with "ApplicationIntent=ReadOnly" in it for report generation, and NOBODY has done it. Not one developer.

2) Security! If you don't need your app, page, or specific query to write, then don't enable the write privilege! Again, the default here is super dangerous, but every developer does it because it's the easy happy path. I'd be shocked if even 1% of the web applications in the wild switch to a read-only account or connection string, because it is so fiddly in current systems.
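For what it's worth, some databases make the read-only half of this easy today. A SQLite analog of "ApplicationIntent=ReadOnly" (via Python): open the same file twice, once read-write and once read-only through a URI, and writes on the second connection are refused loudly instead of succeeding silently.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "app.db")

# The privileged connection sets up the schema.
rw = sqlite3.connect(path)
rw.execute("CREATE TABLE t (x INTEGER)")
rw.commit()

# The app-facing connection is opened read-only via SQLite's URI mode.
ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
try:
    ro.execute("INSERT INTO t VALUES (1)")
    blocked = False
except sqlite3.OperationalError as e:
    blocked = True
    print("write refused:", e)
```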

Even with all of the above, databases won't be perfectly secure against injection. The latest SQL injection attacks I've seen pen testers use is to add "DELAY" commands to exfiltrate data one bit at a time. This works similarly to micro-architectural side-channel attacks such as the famous Spectre and Meltdown vulnerabilities in Intel. Even if explicit injection isn't possible, timing attacks are often still possible if ad-hoc query generation is exposed.

Security is hard.


I'm not sure what your URL is showing? The ask was to make it so those have to be supplied as a parameter to the sql, such that all of those would be considered literals if put directly in the query. Right?

More directly, ad-hoc query generation is something that shouldn't be allowed from potentially hostile users. Which, by and large, means all users that don't otherwise have admin access to the system.


> I'm not sure what your URL is showing?

What most people mean by query parametrisation in my example (as pseudocode) is something like the following:

    conn.Query( "SELECT * FROM Foo WHERE id BETWEEN @From AND @To ORDER BY " + HttpRequest["sort"],
        new { From: HttpRequest["from"], 
              To: HttpRequest["to"] });
In typical SQL or DB client libraries you can parametrise literals, but not queries. The queries are almost always built up as strings. Any time you see "sort=DESC" in a URL, there's like a 90% chance that code like the above is floating around in the codebase, just waiting to be exploited. That's because there's no way to use a safe parameter for "sort order", or "which columns to sort by", or... anything else for that matter. You just have to be eternally vigilant. And by "you", I mean everyone everywhere.

Thankfully, most pen-testers cackle with glee when they see "DESC" or "ASC" in a URL, but what if the code above does some sort of simple check to see if the string is one of the allowed two values?

... I can tell you what happens: the pen test report will be ignored as "not actually vulnerable", and now you have landmine in the code. Any junior dev updating or rewriting the queries may forget the sanity check and... Boom!

Building up queries as strings is a pit of failure into which developers have been falling en-masse for decades and decades. Millions of them. Entire industries and products have been built out to detect, mitigate, and protect from this pit.

What we need is something more like the bytecode used by the JVM or the .NET Framework. A set of binary primitives that must be built up via a fluent API such as:

       conn.Query("Foo")
        .Where( new ColumnRange( "id", params["from"], params["to"] ))
        .OrderBy( params["sort"] == "ASC" ? Order.Ascending : Order.Descending );
Even if "OrderBy()" allows a string input, it'll just be doing the same kind of parsing logic instead of just "appending" to the end of the query blindly.
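The gap is easy to demonstrate with any real driver. Using SQLite through Python: literals parametrise fine, but sort direction is not an expression, so trying to bind it as a parameter is a syntax error, and that's exactly the hole people fall into string concatenation to fill.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (id INTEGER)")
conn.executemany("INSERT INTO foo VALUES (?)", [(1,), (2,), (3,)])

# Literals parametrise fine:
rows = conn.execute(
    "SELECT id FROM foo WHERE id BETWEEN ? AND ? ORDER BY id", (1, 2)
).fetchall()
print(rows)  # [(1,), (2,)]

# ...but sort direction can't be a bound parameter.
try:
    conn.execute("SELECT id FROM foo ORDER BY id ?", ("DESC",))
    parametrised = True
except sqlite3.OperationalError:
    parametrised = False
print("sort order parametrisable:", parametrised)
```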


I mean, fair on the sort order. I was distracted by the other items.

But there is no reason that couldn't be made one of the things that can't be included as a literal. Is there?


Ultimately if your language supports constructing strings from untrusted input + a query language, you have an injection problem. The question is mostly one of scope. As an example, CQL doesn't have nearly the same power of SQL, and so an injection is just not as big of a deal - it's still possible, and in some cases it could be bad, but it's not nearly the same kind of primitive as SQL.

Similarly, removing primitives would probably solve a lot of injection use cases, but if someone does something insane like inject DESC into their query, it's over.

Because SQL is so powerful it's really just up to the libraries to try to help where they can, we'll never truly be rid of injection vulns as they're fundamental to any program that treats data as code.


But, I honestly see very little reason to allow dynamic strings in building queries. Such that I would happily take a system that failed the build if you passed a non static string to the query constructor. I thought this was common with build systems and format strings?


But then where do you stop?

Fundamentally the problem is with any kind of dynamic query construction. It’s very hard to make this secure and general purpose.


I'd happily not allow any dynamic query construction. Just as I wouldn't allow dynamic format string construction.


u/jiggawatts is saying that you or others can always build a thing (a language, in this case URI q-params) that is itself vulnerable to injection attacks even if under the covers you were using an injection-proof SQL.

This is obviously true. In that sense my statement that u/jiggawatts was responding to is in some way very much incorrect. But my statement was specifically about SQL injection, and not about things one might build with an injection-proof SQL.

It's a quibble, though an important one.


You can build an injection-vulnerable language on top of an injection-proof, so in that sense, injection issues can't fully go away. So you're quite right. But it's harder to build an injection-vulnerable language than it is to use an injection-proof language, so having an injection-proof language should help reduce injection vulnerabilities.


It sounds like Malloy might be able to find a niche amongst some casual users, but it will never gain traction amongst DBAs. My reasoning is as follows.

Being a DBA is more than just writing queries, you have to be able to maintain the database, maintain the security settings, and (most importantly for this discussion) do performance tuning.

Performance tuning is incredibly important; I have, personally, seen a query that was running in only a few seconds have an absolutely catastrophic performance drop off when a few extra records were added. I had to get help from the DBA to find where it was going wrong and rewrite it to get the performance back to a reasonable state again. (I am talking about query run time going from seconds to minutes.)

I doubt you can do performance tuning in Malloy; you'll need to run the analyser and rewrite the produced SQL. Since you'll have to know SQL really well to do that, why bother learning Malloy as well?


You could make the same argument that we all should learn assembly language, because when our C compilers produce bad assembly we'll have to rewrite it by hand. But I haven't written assembly in 30 years, and even then it was just for fun.

As it turns out, having high-level abstractions means that in many ways it's easier to do automatic optimization. I look forward to automatic data-driven database tuning and query optimization.


Postgres has done an incredible job optimizing both the overall performance of queries and the query planner, but these things still make mistakes, and being able to fix those mistakes can make the difference between a two-minute query and a twenty-millisecond query. Complex database operations can usually make or break overall response times, and these optimizations usually depend on statistics being accurate, which can be hard to ensure.

If SQL optimization was as good as compilers I'd be all on this train, but I just don't think we're there yet.
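The day-to-day version of "fixing the planner's mistakes" starts with reading the plan. A tiny SQLite analog (via Python): the same predicate flips from a full scan to an index search once an index exists, which is visible in EXPLAIN QUERY PLAN output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, v TEXT)")

# Before the index: the planner has no choice but to scan.
before = conn.execute("EXPLAIN QUERY PLAN SELECT * FROM t WHERE id = 5").fetchall()
conn.execute("CREATE INDEX idx_id ON t (id)")
# After the index: the planner switches to an index search.
after = conn.execute("EXPLAIN QUERY PLAN SELECT * FROM t WHERE id = 5").fetchall()

print(before[-1][-1])  # e.g. "SCAN t"
print(after[-1][-1])   # e.g. "SEARCH t USING INDEX idx_id (id=?)"
```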


Your argument seems to assume that SQL is assembly language; it is not! The query plan or internal bytecode is the 'assembly language' of an RDBMS, so SQL is more akin to the RDBMS's version of C than to assembly language.

So Malloy is akin to a language that compiles to C, which is then compiled to assembly. Does that seem like a worthwhile pursuit to you?

Personally, I don't think any SQL replacement will really take off unless someone can get the major RDBMS makers to support it natively.

Just to be clear, I'm not saying it's not a worthwhile pursuit: SQL does have many problems and could do with a replacement that supports more modern approaches to problem solving. What I am saying is that unless someone comes up with something that can automatically do performance tuning for you, or that is supported natively by the RDBMS, SQL replacements will continue to fail to gain traction.


People do this, e.g. building a cost-based optimizer for queries, but it can take over two years for one to get really good. And even then it can only go so far, and you'll still want to manually tune certain queries.


I still don’t understand why so many people seem to talk about replacing SQL. It’s not complicated, can be learned in a few hours, is relatively easy to read, and is fairly portable across databases and stacks.

It’s been the most important detail of my job for the last 20 years.


SQL has a couple legitimate limitations as a query language, largely because we expect databases to do so much more today than when it was designed. I've been both a heavy user of SQL and implementor of SQL databases focused on data models like graph, spatial, et al, so I feel the pain from both directions.

Some useful things are difficult to effectively express in SQL because it was not designed to make them expressible. Adding support after the fact introduced a lot of compromises and complexity; you can tell which parts of SQL are second-class citizens. SQL would have been designed differently if its original scope had been broader and it is an inelegant mess for some types of data models. You can make it work but it isn't pleasant. Attempts to address these gaps explain much of why SQL databases all have their own non-standard SQL dialect, and it has made SQL effectively non-portable across databases.

In principle, a more modern query language could elegantly address the much broader scope of how we use databases today in a standardized way, instead of the zoo of non-standard extensions and mismatched parts that are grafted onto specific SQL implementations.


because as a language it's quite complicated, except in simple cases, the syntax is completely arbitrary and randomly injects keywords willy-nilly in an absurd attempt at being "english-readable"

because it lacks composability, making dynamic query generation a massive pain in the ass, as well as manual edits of queries.

because watching DBA's flailing strings about, smashing them together like apes with stones, just so they can struggle to feed it into an EXEC SQL call is perhaps one of the greatest tragedies of our programmatic times.

because it's not at all portable, with every database arbitrarily extending the SQL standard left and right, and even basic conveniences and functions end up breaking your portability goals; the ANSI SQL standard defines less of a standard and more of an aesthetic mold for each dialect to try to match.

because it lacks any portable ability to extend the language, with nearly every function added not as a stdlib function but instead as part of the language itself, often with custom keywords to go with it; there's a reason you find imports for anything in python, and next to nothing for rdbms's. Nearly any extension/package implemented is hooked into the DB internals, and not portable.

because the editor tooling is basically stuck in the 80's, with universally useless error messages, absurd limitations like select-clause ordering requirements (e.g. you can't use a select alias in the where clause, because the where clause gets evaluated first; this is exactly equivalent to C requiring forward declarations), select-before-from breaking autocomplete, and only advanced editors bother to even forward data-type mapping from schema definitions to your autocomplete because of the open-world assumption (the schema could change under your feet); hell, even SQL formatters have largely given up on trying to format anything but the most rudimentary constructs, because of the ridiculous amount of language they need to support.

because the language stuffs a trinary logic (true,false,null) into a binary logic system and grants you access to all of the benefits of both false positive results and false negatives.

because the relational algebra is elegant, in all the ways the SQL language is not.

But we'll stay with it, because at the end of the day, no one gives a shit about SQL. The SQL language is just the interface to the Glory of Codd, the relational algebra, the RDBMS, and in this we find all true value. But we can still dream of a better world. A world where RDBMS's support more than one query language.
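The three-valued-logic complaint above is easy to demonstrate; here's a minimal sketch using Python's bundled sqlite3 (the same behavior holds in Postgres, MySQL, and every ANSI-conformant dialect):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE t (x INTEGER)")
cur.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (None,)])

# NULL = NULL evaluates to NULL (unknown), not TRUE, so the NULL row
# matches neither the equality predicate nor its negation.
eq = cur.execute("SELECT count(*) FROM t WHERE x = NULL").fetchone()[0]
ne = cur.execute("SELECT count(*) FROM t WHERE x <> NULL").fetchone()[0]
print(eq, ne)  # 0 0

# The classic NOT IN trap: a single NULL in the list makes the predicate
# unknown for every row, silently returning an empty result.
rows = cur.execute("SELECT x FROM t WHERE x NOT IN (SELECT NULL)").fetchall()
print(rows)  # []
```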


SQL is uniquely challenging for many to learn because you cannot poke around in the bushes procedurally and still arrive at a good outcome. The declarative nature of SQL usually means all or nothing.

I do recall how frustrating it was to learn all of the join shapes. It's so trivial to me now, but it was a nightmare at the beginning of my journey. The popularity of ORMs is absolutely not a surprise to me.


And all frameworks migrate over time towards being more declarative with this exact failure mode.


Performance tuning is important. Things like `random_page_cost`, or `effective_cache_size` are important to understand – especially with nvme drives, or databases that partition data across multiple disks (eg. hot data on nvme, stale data on spinning disks).
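For example (values purely illustrative, for a database on NVMe storage):

```sql
SHOW random_page_cost;        -- Postgres defaults to 4.0, tuned for spinning disks
SHOW effective_cache_size;

-- On NVMe, random reads cost about the same as sequential ones, so a
-- lower random_page_cost makes the planner prefer index scans more often.
ALTER SYSTEM SET random_page_cost = 1.1;
ALTER SYSTEM SET effective_cache_size = '24GB';
SELECT pg_reload_conf();
```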

But...

> Being a DBA is more than just writing queries

> I had to get help from the DBA to find where it was going wrong and rewrite it

Just wanted to chuckle at this. I agree with the point — you want to write as close to the metal as possible in SQL. It's the same with ORMs. That means... writing SQL, not an intermediate language. Although https://prql-lang.org/ looks great!


For a much much more mature product in this area with a very strong team behind it, see EdgeDB


SQL has been around since the dawn of relational databases and it's hard to replace. The best option for mass adoption is drag-and-drop tools with visualizations, like no-code ETL. Template-like and markup languages or frameworks are easier for new developers to adopt, but the majority of the population still tends to stick with the original language.


> The best option for mass adoption is to have drag and drop tools with visualizations like no-code ETL.

No. I've worked with BI tools and when things get complicated you end up needing to go back to text to express the weird bits and every company has a few queries with weird bits in it.

I also will gladly agree that its network effect is what makes it so hard to replace (as opposed to some perceived perfection of the language; it definitely isn't perfect), but SQL has evolved significantly over time. Core SQL hasn't, but Postgres in particular has pushed the envelope on what can be done with WINDOWs, CTEs, and aggregate modifiers. I think it's a bit misleading to say the majority of the population still sticks with the original language: at a previous job we did attempt to write "neutral SQL" that would execute on MSSQL, Postgres, and MySQL, but in most shops you'll have a chosen dialect and you'll be able to make use of more recent and advanced language features. So the majority of the population is using modern SQL, just like the majority of programmers who'd describe themselves as C/C++ programmers can't grok ANSI C.


If we still use math 100 years from now, we'll still use SQL. It's a fine way to query relational data, and relational data is a fine way to model reality.

I'd like to hear from people that think I'm wrong.


> If we still use math 100 years from now, we'll still use SQL.

Those are incongruent. Do you mean Western notation?

> It's a fine way to query relational data

It's quirky, but good enough for ad-hoc queries that I think it will be hard to overcome the momentum in that area.

It's not fine for application work, where you need things like composition. We've tried to solve those problems with ORMs, but the ORM is starting to fall out of fashion due to a number of problems of its own. SQL is not a great compiler target. I do eventually see something lower level built for programmers, not data analysts, rising up here. If SQL is compared to Javascript, something akin to WASM, perhaps.


ORMs, when best used (and we use them even though we're pretty SQL literate and maintain a lot of SQL), will survive forever; nothing beats an ORM for really dirt-simple expressions that you want to be trivially testable. Never in my life do I want to see someone write an UPDATE query against a single table, with no shenanigans but with dynamic field support, using string gluing to stitch in all the columns. This is something a known tool can do better; it's a great opportunity for an ORM.

A non-great opportunity for an ORM is anything I'd call a "report query" (some complex read-only query involving a lot of JOINs, a bunch of WHERE clauses and possibly some nested aggregation for funsies) - this is where you pull out the SQL (or alternative query language!) because an ORM will struggle to properly support all the functionality you need and because trying to tune a query being produced by an ORM (even just to make sure it's well aligned with logical indices) is a task that yields nothing but endless frustration.
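To make the "simple UPDATE" case concrete, here's a toy sketch (hypothetical helper and table names, not any real ORM's API) of the placeholder stitching a query builder does for you; column names are assumed to come from a trusted schema, never from user input:

```python
import sqlite3

def build_update(table, fields, key):
    # Stitch "col = ?" assignments for a dynamic field set; values are
    # bound as parameters, so no value ever gets glued into the SQL text.
    assignments = ", ".join(f"{col} = ?" for col in fields)
    sql = f"UPDATE {table} SET {assignments} WHERE {key} = ?"
    return sql, list(fields.values())

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
con.execute("INSERT INTO users VALUES (1, 'Ada', 'ada@old.example')")

sql, params = build_update("users", {"name": "Ada L.", "email": "ada@new.example"}, "id")
con.execute(sql, params + [1])
row = con.execute("SELECT name, email FROM users WHERE id = 1").fetchone()
print(row)  # ('Ada L.', 'ada@new.example')
```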


> Never in my life do I want to see someone write an UPDATE query against a single table with no shenanigans with dynamic field support using string gluing to properly stitch in all the columns

These are just convenience features that most ORMs provide; they can exist entirely outside of ORMs and are not the primary purpose of ORMs.


You are correct about their design. But in practice I've found that to be by far the most valuable thing ORMs deliver. Using ORMs to power an ActiveRecord system in your codebase has only ever, in my observation, led to pain. Query builders equipped with more advanced functionality around type safety and response decoding are quite a valuable tool.


I meant "still use math" as a proxy for "still use formal languages to communicate".

Maybe in 100 years AI will be so powerful that we'll just ask our question in natural language and get the answer we need. Or maybe in 100 years AI will have harvested us for the iron in our blood. Either way we wouldn't need SQL anymore.


The GP is referring to SQL’s mathematical roots: https://en.wikipedia.org/wiki/Tuple_relational_calculus


Then the statement resolves to "If we still use math 100 years from now, we'll still use math.", which is a rather silly statement. SQL and Western notation are interesting to compare in that they are the dominant, but not exclusive, languages used to describe their respective mathematical domains.


SQL does not fully implement Relational Algebra or Calculus, which are isomorphic.

See many of Chris Date's books, and things like Tutorial D, which do meet the algebra.

SQL is near enough to the theory to work, and so much effort has been put into making it fast, reliable, and scalable that a new language has too much to overcome even if it can handle all the cases. So it won't be replaced soon.

However, 100 years is longer than SQL has been around, so a proper relational server could come along; there is just too much uncertainty.


It's pretty obviously silly to have an unavoidable text parser in between code and data. We've reached the point where it's fairly low overhead, but it's still not nothing.


> SQL is around since the dawn of relational database

There was a decent amount of competition back then, though. It seems the industry eventually settled on SQL to be compatible with Oracle's dominance. Postgres didn't gain SQL support until about a decade in.


What were some of the alternatives? What did Postgres start with?


Postgres used QUEL in the early days, as did its predecessor Ingres. MRDS, the first commercially available relational database, used a language known as Linus. Alpha was the language originally envisioned by Codd to describe his relational model.


Code source control is a vital aspect of software development in the modern era, and no-code tools are incompatible with that unless they are also able to output their representations as plain code so that tools like "diff" work as expected, in which case you might as well stick to SQL.


That is a limitation of text-based tools.

There have been code source control tools based on the AST; see Envy for Smalltalk.

Hopefully we will eventually dump the limitations of text-based tools and use ones based on the structure of programs. I don't want to know that line 123 has changed; I want to know that function fn in module m has changed, or that function X was added on this date.


That is so true! Have you ever used SSIS? A really powerful tool, but even small changes can cause hundreds of changes in the underlying XML, which makes change control a nightmare. Forget branching and merging anything other than the most minor changes.


For this reason, the vast majority of new-generation query languages are translated into SQL, but SQL is in fact not a great target language. I think SQL should focus more on being an efficient intermediate language rather than adding more and more ad hoc "convenient" features that don't really play well with other language features...


I don't think SQL gets replaced until we move to a different database paradigm. For example, I could imagine a functional database model [0] paired with a "grammar of graphics" type of metadata for the records, to create much more concise query/aggregate declarations, especially where time series are concerned.

[0] https://en.wikipedia.org/wiki/Functional_database_model


SQL, for the most part, has already been replaced as a language used by people with alternatives like query builders, ORMs, etc. Beyond the occasional ad-hoc query, SQL simply isn't suitable for the tasks we require of modern relational databases, lacking features like composition that today's applications need.

It remains as a compiler target for those tools, but I am not sure that is the level of abstraction in question.


I like the VS Code integration that Malloy has. There's pretty limited in-browser tooling for BigQuery so that bit of the extension is amazing.

But I've found in practice that it's very hard to get folks who write SQL day-to-day to adopt a new language on top of something they already understand inside and out, so I'm thinking about just pulling the BigQuery bits out of their VS Code extension to be able to write SQL in VS Code with auto-complete and references.


If I'm understanding the article correctly, VS Code integration is currently all Malloy has, and contrary to popular belief, that's not the only code editor in existence, so this seems like a huge limitation to me. That it apparently just compiles to SQL (I guess? The article seems to imply that's how it works, but the README.md on the GitHub repo doesn't seem to mention it anywhere I can find) is another limitation, and another check in the "why not just use SQL anyway?" column.

I'm all for a more humane SQL replacement, and maybe this has potential to be one, but right now it seems to be in the stage where it's little more than the code equivalent of a sketchpad doodle. Let's see where it goes.


If you would like to play with Malloy, you can use the Fiddle in just a web browser. The Malloy Fiddle uses DuckDB compiled to WASM.

  https://twitter.com/lloydtabb/status/1567671348306264064


> there are relatively few database targets that it must support

Heh. Oh wow.


This is the umpteenth attempt at replacing SQL. Just like all previous attempts, it may well address some weaknesses of SQL; however, it introduces a whole new range of issues. SQL has been around so long because it mostly works.


None of the Javascript alternatives (except maybe CoffeeScript) really gained any traction until Typescript did. Just because something is hard doesn't mean that it shouldn't be attempted, and just because something is unlikely to succeed doesn't mean that it is a lost cause. To look at a language like SQL and say "meh, that's good enough" is crazy to me. I estimate that a widely-adopted good SQL alternative could have saved me over 1000 hours over the last 10 years.


TypeScript is backward compatible with JavaScript, which means there is pretty much zero investment required to get started with it. Refactoring existing SQL codebases to Malloy would have cost me more than 1000 hours over the last ten years.

The problem with SQL alternatives is that they tend to lack features once you get beyond the shiny examples on their documentation pages.


I really do marvel at the number of times this has come up on HN over the past half decade I've been lurking/participating.

These kinds of approaches always have one very large flaw: The problem areas of SQL are well understood and there is lots of community support available. Why should this be traded away for a different syntax that has its own yet-undiscovered problem areas? What efficiencies are we actually losing by using SQL?


I think Malloy is meaningfully different from previous attempts (e.g., PRQL), and I describe why in the post: namely, the inclusion of a semantic layer as part of the language. Take a look at the post; I would love to hear whether you agree.


How is this better than the SQL equivalent? How can I break down this query and run parts of it for debugging purposes?

  query: sessionize is {
    group_by: flight_date is dep_time.day
    group_by: carrier
    aggregate: daily_flight_count is flight_count
    nest: per_plane_data is {
      top: 20
      group_by: tail_num
      aggregate: plane_flight_count is flight_count
      nest: flight_legs is {
        order_by: 2
        group_by: [
          tail_num
          dep_minute is dep_time.minute
          origin_code
          dest_code is destination_code
          dep_delay
          arr_delay
        ]
      }
    }
  }


That's an example with lots of model definitions inline. The idea is that you would define various components separately so they can be reused (and to your point, broken down and debugged separately). Malloy brings the modularization and reusability that is missing from SQL itself.
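As a rough sketch of that factoring (based on the 2022-era syntax in the project README quoted elsewhere in this thread; exact keywords may differ in current Malloy), the measures get defined once on a modeled source and reused by every query against it:

```
source: flights is table('malloy-data.faa.flights') {
  measure: flight_count is count()
  measure: average_flight_time is flight_time.avg()
}

query: flights -> {
  group_by: carrier
  aggregate: flight_count
  aggregate: average_flight_time
}
```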


It is the first thing I try to teach CS graduates: DRY (re-usability) is not applicable in the SQL world. Trying to re-use code as proposed in Malloy generally results in poorly performing queries. Malloy also seems needlessly complex compared to highly successful tools like dbt.

One other reason dbt is successful is the low threshold to migrate your existing codebase. It is a larger effort than TypeScript vs. JavaScript; however, it can generally be done in a week or two.


It only works if you are just running queries in a console. Any serious work with SQL requires all sorts of weird stuff built around it, like dbt or cube.js [1].

1. https://cube.dev/docs/


I don't really think that's true - we write quite a lot of SQL in-house and we have no issues doing it safely. There are some tools we rely on to make our lives easier but these mostly revolve around making batch operations easier to express (like IN(:array) as opposed to having to glue some string joining logic relying on an array count into every individual query). SQL definitely isn't the cleanest thing ever and I have a number of improvements I'd personally really appreciate (changing statement order, trailing commas, better aggregate and window modifier definitions) but doing Serious SQL is definitely an accomplishable thing - even weird statistical bucketing and aggregation that produces a query that's... 291 lines long.
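A minimal sketch of the IN(:array) convenience described above (a hypothetical helper, using Python's sqlite3 purely for illustration):

```python
import sqlite3

def in_clause(values):
    # Build "(?, ?, ?)" plus a flat parameter list, so nobody has to
    # hand-glue a count-dependent placeholder string into every query.
    placeholders = ", ".join("?" for _ in values)
    return f"({placeholders})", list(values)

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (id INTEGER)")
con.executemany("INSERT INTO items VALUES (?)", [(i,) for i in range(5)])

ph, params = in_clause([3, 1, 4])
rows = con.execute(f"SELECT id FROM items WHERE id IN {ph} ORDER BY id",
                   params).fetchall()
print(rows)  # [(1,), (3,), (4,)]
```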


what do you make of things like dbt which is awkward mix of python templating and sql mixed into each other.


I write highly complex SQL for a living, so I may not be the best person to comment on it. I have found that pretty much anything you want to do can be done with pure SQL, and the result usually yields much better performance than mixed solutions. But I haven't used dbt, so my comments are more directed at traditional mixed solutions (like sequentially submitted queries being stitched together in C++/PHP, or pruning and joining multiple ORM-delivered results into a full in-memory data set).

The way I usually like to write mixed SQL is preparation, execution, cleanup: have a blob of complex logic to preprocess the request into the appropriate SQL recipe; execute said recipe; then go back to imperative-programming land to apply any joins from other sources (i.e. combine a DB query result with something coming back from OpenSearch, a memcached query, or some other non-relational source), apply any complex customization and value cleanup (like pulling data out of a JSONB blob and actually sending it as a plain array to the consumer), and then shuffle it off to whoever requested it.

Again, I don't feel comfortable commenting on any tool I haven't personally used and I hope that was helpful - if you have any other questions I'm happy to try and answer.


As someone who uses dbt, I would recommend it for an analytics workflow. The templating is useful for shared code, like a CTE you use in multiple places. That said, you don't need to use a lot of templating for it to be useful.

dbt shines not because of the language/templating, but because it handles a lot of the scut work of building out a data warehouse. Write the select statements you want to populate the model, write tests to constrain the model, and build.
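For instance, a dbt model is just a SELECT in a .sql file (model and table names here are hypothetical); the `{{ ref() }}` call is what gives you shared, dependency-tracked building blocks:

```sql
-- models/daily_orders.sql
select
    order_date,
    count(*) as order_count
from {{ ref('stg_orders') }}
group by order_date
```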


It’s nice, but it’s hard to beat the clarity and expressiveness of dplyr and purrr.



