Hacker News new | past | comments | ask | show | jobs | submit login
Show HN: PRQL 0.2 – a better SQL (github.com/prql)
378 points by maximilianroos on June 27, 2022 | hide | past | favorite | 159 comments
Hi everyone — thanks for your interest in PRQL — let us know any questions or feedback!

We're excited to be releasing 0.2[1], the first version of PRQL you can use in your own projects. It wouldn't exist without the feedback we got from HackerNews when we originally posted the proposal.

[1]: https://github.com/prql/prql/releases/tag/0.2.0




Previous discussion: PRQL – A proposal for a better SQL https://news.ycombinator.com/item?id=30060784


I'm surprised that none of the examples on Github or the website deals with join. I eventually found some in the "book" here: https://prql-lang.org/book/transforms/join.html

    from employees
    join side:left positions [id==employee_id]
turns into

    SELECT
      employees.*,
      positions.*
    FROM
      employees
      LEFT JOIN positions ON id = employee_id
I would love to see joins worked into the main learning examples. Without join, the examples lack a bit of the "relation" part; we could just as easily be compiling a DSL to a chain of `array.filter`, `array.reduce`, `array.map` calls. Joins are what makes relational modeling interesting!

I would love to see Datalog/SPARQL-style implicit joins to make graph traversals like "which users have edited documents I own?" less verbose.


Great point, we'll add that.

I don't think we do joins that much better than SQL does. We're thinking about whether there's potential there, maybe through understanding foreign keys — but we're being conservative about introducing change without value.


I looked at the book after this and have to say, I'd heavily recommend spending the next few months just improving joins (and complex joins especially). Like GP says, relational modelling is the interesting bit about SQL, and I don't think it's an exaggeration to say the only reason I use SQL is joins. So the only reason I'd introduce the complexity of your project into my stack would be if it makes handling joins, views and other aspects of relational modelling and slicing nicer. One example could be many-to-many relationships, or the gradient between graph/document-based and normalised table-based modelling.


I second this. However, it's important that we don't make it so easy that we hide the cost of the join itself.


The biggest failure of SQL joins is not using declared foreign keys. For example, it should be something like "JOIN USING fk_invoice_customer c" instead of repeating, every time, the relationship between invoice and customer rows that is already defined in the foreign key.


Most engines support NATURAL JOIN, which isn't perfect and has drawbacks, but allows for shorter joins.
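For the curious, NATURAL JOIN's behavior (and its main drawback) is easy to see in a small sketch; SQLite and hypothetical employees/positions tables are used here purely for illustration:

```python
import sqlite3

# NATURAL JOIN joins on every column name the two tables share.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE employees (employee_id INTEGER, name TEXT);
    CREATE TABLE positions (employee_id INTEGER, title TEXT);
    INSERT INTO employees VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO positions VALUES (1, 'Engineer');
""")
rows = con.execute(
    "SELECT name, title FROM employees NATURAL JOIN positions"
).fetchall()
print(rows)  # [('Ada', 'Engineer')] -- joined on the shared employee_id
```

The drawback: the join key is implicit in the column names, so adding another shared column (say, updated_at) to both tables silently changes the join condition.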


I always found it surprising that joining on foreign key is not possible in SQL. I'm no expert, but looking at PRQL this feels like it should fit in quite well with the philosophy.



> Joins are what makes relational modeling interesting!

It is the central part of RM, difficult to model using other methods, and it requires high expertise in non-trivial use cases. One alternative for how multiple tables can be analyzed without joins is proposed in the concept-oriented model [1], which relies on two equal modeling constructs: sets (like RM) and functions. In particular, it is implemented in the Prosto data processing toolkit [2] and its Column-SQL language [3]. The idea is that links between tables are used instead of joins. A link is formally a function from one set to another set.

[1] Joins vs. Links or Relational Join Considered Harmful https://www.researchgate.net/publication/301764816_Joins_vs_...

[2] https://github.com/asavinov/prosto data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby

[3] Column-SQL https://prosto.readthedocs.io/en/latest/text/column-sql.html


I always found that side:left/right should also be expressible as rapport:antecedent/consequent as in propositional logic, rather than limiting these relationships to the geometric representation of Venn diagram.

And maybe a shorter alternative might be tie:arm/leg.


I'm not sure if this is a joke, but we actually had a serious idea to replace side:left/right with nulls_left:true and nulls_right:true.

This part of the join operation should be an afterthought: just a flag after the central argument of the transform, which should be the condition you join over.


Have you considered using "optional"? I feel that this would be a more natural syntax for joins. It could default to "left join", which is probably more frequently used than a right join.

from employees join optional positions [id==employee_id] --> LEFT JOIN

from employees join positions [id==employee_id] --> JOIN

Then you'd use "optional right" or something similar for the "right join" case


That is an interesting suggestion.

Thank you!


I didn’t mean to make a joke here; what was the funny part? Reading it again, maybe tie:arm/leg sounded too much like BDSM stuff?

Not my initial idea though: I was just looking for short words that might carry the analogy. From "relationship" you easily come to "tie", and then "arm/leg" for "anterior/posterior" seems pretty straightforward and analogous to "antecedent/(consequent|postcedent|succedent)".


I transformed this informal suggestion into an issue, see https://github.com/prql/prql/issues/718


For those interested in database query languages, it is worth knowing about Datalog, the query language behind Datomic, XTDB and Datahike: http://www.learndatalogtoday.org/

E.g. a parameterised aggregate query that retrieves the name and average rating of a film starring cast members whose names match the input names:

    [:find ?name (avg ?rating)
     :in $ [?name ...] [[?title ?rating]]
     :where
     [?p :person/name ?name]
     [?m :movie/cast ?p]
     [?m :movie/title ?title]]
To reveal the answer, click on tab labelled "3" and then "I give up!": http://www.learndatalogtoday.org/chapter/7


This is the Datomic/Clojure dialect of Datalog. I had an easier time learning a stand-alone Datalog variant. I think compiling Datalog to SQL is an interesting idea.

I wrote a toy Datalog -> SQLite compiler: https://percival.jake.tl/

Other Datalog -> SQL compilers I know of:

- Originally from Mozilla, now independent: https://github.com/qpdb/mentat

- From Google: https://logica.dev/



I've thought about building a better query language too. I'd love the ability to model sum types in databases, something like:

    enum SchoolType {
       College {
           degrees: Vec<Degree>
       },
       HighSchool
    }
It's such a common pattern and yet it's so annoying to model in a normal relational database. I wouldn't be surprised if the rise of NoSQL is tied to the inability of relational databases to model basic patterns like this.

Part of me has wondered if a language is the solution. Maybe just a better query builder with support for sum types is necessary. But I suppose there's something useful about having a consistent model based around a language, even if people aren't writing the language directly.


Doesn’t Postgres support this with table inheritance https://www.postgresql.org/docs/current/tutorial-inheritance... I don’t know if they’re recommended, but they are an option.


They're not recommended: https://wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use....

They're incomplete (e.g. don't work with foreign keys) and are essentially unmaintained.


Thank you for this. I'm currently modeling an annoying number of similar entities, and this just gave me what I was looking for.


IMHO categorical data model (https://www.categoricaldata.net/) has much better support for sum types than relational model (as well as other advantages, roughly along the lines why to prefer type theory to set theory in math foundations), unfortunately practical databases and query languages are not yet well developed for it.


I totally agree with this. Interesting point about NoSQL!

I'm not sure it's just the query language though - the definition language needs to make creating columns that are sum types trivial. For one-to-many data this might be a slight generalization of a foreign key (a compound of table tag + foreign key for that table). This can work for one-to-one data too, but it can be a bit annoying to have lots of tables compared to adding a couple of nullable columns (plus there are also data locality differences). I suppose a wrapper language that covers both DDL and DML could work.


I'm not sure that relational databases are "unable" to model something like this, since I recall from years ago Date describing how to do something like this. Don't remember the details, but you might want to look into Date's writings.


You'd have 1 table per sum type which requires extra data. Then polymorphic foreign key (aka a pair of fields school_type, school_id).

(No foreign key constraints, but those are falling out of use in some cases due to inability to online migrate mysql schemas anyways.)


You can retain foreign key constraints by having one column per type of reference. It is also possible to ensure that exactly one column of several is NOT NULL, so that the columns can always be mapped to an enum in application code. Also, in PostgreSQL, the storage for the extra NULLs uses just one bit per column in a bitmap.

    CREATE TABLE dirents (
        parent         bigint   NOT NULL,
        child_dir      bigint,
        child_file     bigint,
        child_symlink  bigint,
        basename       text     NOT NULL,
    
        -- Ensure exactly one type of child is set
        CHECK (num_nonnulls(child_dir, child_file, child_symlink) = 1),
    
        CONSTRAINT dirents_child_dir_fkey     FOREIGN KEY (child_dir)     REFERENCES dirs (id),
        CONSTRAINT dirents_child_file_fkey    FOREIGN KEY (child_file)    REFERENCES files (id),
        CONSTRAINT dirents_child_symlink_fkey FOREIGN KEY (child_symlink) REFERENCES symlinks (id),
    
        PRIMARY KEY (parent, basename)
    );
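The same "exactly one child set" invariant can be sketched outside Postgres too; here's a hedged SQLite transliteration (SQLite has no num_nonnulls(), so the NOT NULL tests are summed directly):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE dirents (
        parent        INTEGER NOT NULL,
        child_dir     INTEGER,
        child_file    INTEGER,
        child_symlink INTEGER,
        basename      TEXT NOT NULL,
        -- sum the NOT NULL tests; boolean expressions are 0/1 in SQLite
        CHECK ((child_dir IS NOT NULL) + (child_file IS NOT NULL)
               + (child_symlink IS NOT NULL) = 1),
        PRIMARY KEY (parent, basename)
    )
""")
con.execute("INSERT INTO dirents VALUES (1, NULL, 42, NULL, 'notes.txt')")  # one child: ok
rejected = False
try:
    con.execute("INSERT INTO dirents VALUES (1, 7, 42, NULL, 'bad')")  # two children set
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```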


Yep, and you can also solve this problem by having a separate table for each variant and joining whenever you need to deal with common stuff. Personally, I think it's usually better just to have one giant table and allow the different columns to be null since it lets you avoid a bunch of verbose joins.


What is annoying about implementing something like this in a relational database?


It's not straightforward to do polymorphic joins: one common pattern is to have child tables for each case of the union, but there's no integrity constraint such that each parent must only have one child, e.g.

  CREATE TABLE schools (id SERIAL PRIMARY KEY);
  CREATE TABLE colleges (id INTEGER NOT NULL REFERENCES schools (id));
  CREATE TABLE high_schools (id INTEGER NOT NULL REFERENCES schools (id));
How can you ensure that a school is either a college or high_school but not both?

Another alternative is to make one big table with check constraints but that's also hairy in its own right:

  CREATE TYPE school_type AS ENUM ('college', 'high_school');
  CREATE TABLE schools (
    id SERIAL PRIMARY KEY,
    type school_type,
    /* college columns */,
    /* high school columns */,
    CHECK (type <> 'college' OR /* college column constraints */),
    CHECK (type <> 'high_school' OR /* high school column constraints */)
  );
The other thing in the grandparent's comment that's a constant pain in SQL is representing an ordered list: how do you insert items into the middle of the list? Depending on your database, it can also be painful to renumber the other items.


A combined approach works if you want to encode the exclusivity constraint:

    CREATE TYPE school_type AS ENUM ('college', 'high_school');
    CREATE TABLE schools (
      id SERIAL PRIMARY KEY,
      type school_type,
      unique (id, type)
    );
    CREATE TABLE colleges (
      id INTEGER NOT NULL,
      type school_type default 'college',
      check (type='college'),
      foreign key (id, type) references schools (id, type)
    );
Yeah, the syntax is annoying and repetitive. It would be nice if the foreign key could take a literal, removing the extra column altogether, e.g.:

    foreign key (id, 'college') references schools (id, type)
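That combined approach can be exercised end to end; here's a hedged sketch transliterated to SQLite (the PRAGMA is needed because SQLite leaves foreign key enforcement off by default):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
    CREATE TABLE schools (
        id   INTEGER PRIMARY KEY,
        type TEXT NOT NULL CHECK (type IN ('college', 'high_school')),
        UNIQUE (id, type)
    );
    CREATE TABLE colleges (
        id   INTEGER NOT NULL,
        type TEXT NOT NULL DEFAULT 'college' CHECK (type = 'college'),
        FOREIGN KEY (id, type) REFERENCES schools (id, type)
    );
    INSERT INTO schools VALUES (1, 'college'), (2, 'high_school');
""")
con.execute("INSERT INTO colleges (id) VALUES (1)")  # school 1 is a college: ok
rejected = False
try:
    con.execute("INSERT INTO colleges (id) VALUES (2)")  # school 2 is a high school
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True: a high school can't get a colleges row
```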


Good point, I hadn't thought of that. Thanks!


As go_prodev indicated, the former just isn't how you reason about data modeled in a relational form.

The latter makes little sense. What constraints are you placing on a college that do not also apply to a highschool, given they're both schools?

> The other thing in the grandparent's comment that's a constant pain in SQL is representing an ordered list: how do you insert items into the middle of the list? Depending on your database, it can also be painful to renumber the other items.

I'm unclear on what you mean by this. If you want a list of records ordered in a certain fashion, there's an entire "ORDER BY" clause for that express purpose. If you're trying to add "extra" data into the middle of some list that is not otherwise represented in the data in the database, that's essentially business logic and you should be using some kind of custom view or procedure to do that or doing it inside your application code.

If it's just a question of how you add data into the middle of a result set based on some arbitrary ordering, you can do that too; people solved that problem ages ago by simply having an ORDER column or similar that's just an int with whatever spacing you like. E.g. the default step might be 1000, so you can insert 999 items between two others before needing to renumber the column. These are dumb tricks, but needing to "insert a record between two other records" is often also a dumb trick someone is trying in the database because they haven't designed things well elsewhere.

I'd venture the 99.9% case is querying real live data in the database and ordering it by factual things like record names, dates of update or creation, status, etc.
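The gapped ORDER-column trick described above can be sketched in a few lines (the names and the step of 1000 are illustrative, not from any particular system):

```python
# Leave room between position values so an item can be inserted
# without renumbering its neighbours.
def insert_between(items, new_item, after_idx):
    """items: list of (name, pos) sorted by pos; insert new_item after after_idx."""
    lo = items[after_idx][1]
    hi = items[after_idx + 1][1]
    if hi - lo < 2:
        raise ValueError("no gap left; renumber first")
    items.insert(after_idx + 1, (new_item, (lo + hi) // 2))
    return items

playlist = [("intro", 1000), ("outro", 2000)]
insert_between(playlist, "bridge", 0)
print(playlist)  # [('intro', 1000), ('bridge', 1500), ('outro', 2000)]
```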


That's exactly it though: having to do dumb tricks is the painful part. There's plenty of things with user-defined order where an explicit index isn't surfaced, like to-do lists, playlists, etc.


A user-defined order is an explicit index.

EDIT: To expand further, I would generally model a playlist as its own tables anyways. Something like:

  CREATE TABLE playlist (
    id ID_TYPE_FOR_DB PRIMARY KEY,
    name varchar(4096), /* Or whatever storage size makes sense */
  )

  CREATE TABLE playlist_entry (
    id ID_TYPE_FOR_DB PRIMARY KEY,
    song_ref ID_TYPE_FOR_DB FOREIGN KEY REFERENCES songs(id),
    order INT, /* Or bigint or whatever you want */
  )


> How can you ensure that a school is either a college or high_school but not both?

Do you have any real world scenarios where you've faced this problem?

In your example, you wouldn't model it like that. A school just needs an attribute that identifies the type of school (high school or college), and other attributes that would be common to both.

I'm sure there's lots of examples but it's late and I'm struggling to think of one that a good normalized data model couldn't handle.


I'm modeling the GP's comment, but I think a common case is something like a polymorphic user/organization entity like GitHub: there is some kind of base user that is usable in a lot of places (e.g. https://github.com/<username>) but there's also a load of distinct organization-specific fields and a load of distinct user-specific fields.


Not to mention actually using this schema ends up being really verbose since you have to do a bunch of joins. Additionally, besides being verbose, these joins can wreak havoc with the optimizer since join optimization is exponential. The optimizer might play nicely and just join all these on the ID column in whatever query you're doing, but that is very dependent on how the optimizer understands the input queries. Having a single table instead of 3 limits the number of ways the optimizer can think about a particular query.


> How can you ensure that a school is either a college or high_school but not both?

If the DBMS supports it, you can add check constraints that query the other tables. See for example here: https://stackoverflow.com/a/2588427


You add an XOR-style non-null check on the foreign keys?


Auto-completion sucks in a lot of SQL statements because the table name, which provides all the hints good auto-completion needs to make suggestions, comes after the columns you're typing.

That sounds like a nitpick, but man is it useful when you need it.

Notice how the first thing in PRQL is the table declaration.

The fact that UPDATE and INSERT have different syntaxes for basically specifying the same mutation operation is pretty dumb.
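For illustration, the two divergent syntaxes side by side, plus the upsert form that at least folds them into one statement (SQLite shown; Postgres is similar):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
# INSERT: column list plus a VALUES tuple...
con.execute("INSERT INTO users (id, name, age) VALUES (1, 'Ada', 36)")
# ...UPDATE: col = val pairs, for the same logical mutation.
con.execute("UPDATE users SET name = 'Ada', age = 37 WHERE id = 1")
# Upsert at least expresses both in one statement:
con.execute("""
    INSERT INTO users (id, name, age) VALUES (1, 'Ada', 38)
    ON CONFLICT (id) DO UPDATE SET name = excluded.name, age = excluded.age
""")
row = con.execute("SELECT age FROM users WHERE id = 1").fetchone()
print(row)  # (38,)
```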


That’s not necessarily a showstopper. Oracle’s SQL Developer editor for example provides useful completion in the SELECT clause if the FROM clause is already present.


Have you tried Datagrip?


Thanks, I've frequently wanted a query language that was designed after the 70s. The ideas are sound, but a modernized syntax with variables to reuse subqueries would be lovely. This looks like it.

I noticed one issue though... please don't copy the prefix of f-strings! That only exists because Python boxed itself in and it was literally the only ascii syntax left that could be used for string interpolation. It's mildly ugly but the best that could be done given those requirements. Not so here.

The way shells do it with single quotes producing literal strings and double quotes available for interpolation has not been topped imho. Triple quotes are a nice extension as well, not sure if that made it in.
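A quick check of the shell convention being praised, driven from Python just so it's self-contained:

```python
import subprocess

# Double quotes interpolate; single quotes are literal.
cmd = "NAME=world; echo \"hello $NAME\"; echo 'hello $NAME'"
out = subprocess.run(["sh", "-c", cmd], capture_output=True, text=True).stdout
print(out)  # hello world / hello $NAME
```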


Interesting suggestion. We added f-strings because we already had s-strings (pass-through to SQL) and r-strings (for raw multi-line text).

And would you rather see "My {name}" or "My ${name}"? I personally dislike the $ prefix for all variables and interpolations...


Languages like Perl, Ruby and more offer a plethora of additional ways to encode interpolated strings. I especially love the squiggly heredoc for multi-line quotations.

https://infinum.com/blog/multiline-strings-ruby-2-3-0-the-sq...


The first one: the $ is redundant if braces are required. Multi-line could be triple-quoted. As for SQL, that one I'm not so sure about.


Choose backticks as the quote style for interpolation to attract Markdown fans and confuse the hell out of MySQL users :D


> a modernized syntax with variables to reuse subqueries would be lovely.

CTEs provide this functionality already, don't they?


They're a bit awkward, and I've found that some DBs (cough cough MSSQL cough cough) do a horrific job of optimizing them.


When I've needed them I've needed them for multiple statements, once is not enough. Currently have to use plpgsql for this, which is half awesome, half abomination. :-D A single simple language sounds easier to learn.


Put a comma between them, postgres has been able to do multiple CTEs in a single query for quite some time: https://stackoverflow.com/questions/35248217/multiple-cte-in...

Or did you mean like using the same CTE across multiple queries? Views / materialized views are good for that.
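For reference, a minimal sketch of several comma-separated CTEs in one WITH clause (SQLite here; the sales table is invented for the example):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("north", 10), ("north", 20), ("south", 5)])
row = con.execute("""
    WITH totals AS (
        SELECT region, SUM(amount) AS total FROM sales GROUP BY region
    ),
    best AS (
        SELECT MAX(total) AS top FROM totals
    )
    SELECT region, total FROM totals JOIN best ON total = top
""").fetchone()
print(row)  # ('north', 30)
```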


The second one, yes.

We need to delete from multiple tables that have foreign keys back to a single primary table, before deleting from the primary table itself, for consistency.

We often get a "list" of pks, then use it in multiple "delete key in" statements. A kludge, but these are for one-off tests on a dev database.


I'm not totally sure I follow, as you can re-reference/manipulate the subquery as much as needed. Is it for some kind of dynamic programming, like finding a column containing a certain value

  SELECT cols from table where <ANY_COLUMN> like '%foobar%'
which would need to dynamically insert values into the query

  select col1 from table where col1 like '%foobar%' union select col2 from table where col2 like '%foobar%' union ...
This type of usage is not possible/prohibitively difficult in standard SQL but I'm interested to know if it's a different use-case.


See my comment under the sibling comment.


Here's one suggestion: SQL tediously requires specifying the equality condition on joins, when 90% of the time you just want to join on the fk defined between the tables.

  from a
  join b
should implicitly join on the FK if no condition is given.

It would require knowledge of the schema. I don't know if this is possible in PRQL, or if the transpilation to SQL has to be stateless.


This is something you might end up regretting later.

It’s annoying adding another foreign key later and then having previously working queries fail at runtime due to an ambiguous join condition.


Agree about not implicitly finding the join key. But as long as we're brainstorming imaginary features, then maybe as part of the schema, we could somehow declare the default join key to use, for any given two tables. In most cases it's pretty obvious what the best join key would be.


The risk here is that if one table has two fks to another table, the syntax becomes ambiguous. And the number of fks two tables have to each other may change over time. This means that an append-only change to a table may break existing queries that have no knowledge of the new column.

SQL addresses this via the natural join keyword `using`, where you enumerate the common columns between the two tables being joined. It isn't too convenient for your example unless your pk naming pattern happens to be `<entity>_id` instead of just `id` (note: this naming pattern has all sorts of other adverse consequences though). But it does provide convenience in some cases without introducing backwards compatibility risks as the schema evolves.


If you're willing to sacrifice ergonomics elsewhere, repeating the table's name in the id column is one workaround:

  from a
  join b on b.a_id = a.a_id
You can even use NATURAL JOIN if you can guarantee that only fkey/pkey names will overlap between tables.

An unreasonable way to achieve that is to put the table name in every column. A more palatable way is to write some clever functions in your schema that scan the information schema looking for column name clashes (you essentially write a tiny "linter" inside your schema).


If you have identical field names, you can do in sql:

  Select * from a join b using (a_id)
Don't do this in Oracle though, pain follows when you try to touch an a_id column.


You can also do this in PRQL:

    from a
    join b [a_id]
is the equivalent query.


The problem is when one has multiple FKs between the same pair of tables.

Of course, if you allow naming the relation when you create a foreign key, then you could use the source table-qualified relationship name for joins rather than the target table name, which would be unambiguous (and more communicative of intent).

E.g., for a hypothetical table with two self-fks:

  FROM employees ee
  INNER JOIN ee.manager mgr
  INNER JOIN ee.team_lead lead


Hello, another contributor here!

Compilation does have to be stateless (for performance reasons), but we are planning to add some kind of schema definitions which could also specify foreign keys.

So joins without conditions would be possible, we'll look into it!

What do you think should happen if there are multiple foreign keys connecting the two tables? Should this also work for many-to-many relations with an intermediate table?


"What do you think should happen if there are multiple foreign keys connecting the two tables? Should this also work for many-to-many relations with an intermediate table?"

If it's not ambiguous, then let me do it. If I rely on ambiguity then throw an exception. In the case of multiple foreign keys, throw an exception, as there's no way to know which one I mean. It'd be nice if I could disambiguate the situation though. Normal SQL allows the `on` clause.

  from TableA
  inner join TableB on <expression>
What if I could specify a foreign key constraint just as easily...

  from TableA
  inner join TableB by ConstraintC
Where ConstraintC is the name of a foreign key constraint between Table A and Table B. It'd be nice to specify the constraint without having to specify the column name details.

The same goes for the many to many relationship with an intermediate table. It could look something like this...

  from TableA
  inner join TableB through TableC
I wouldn't introduce TableC into the scope of the statement. It's not in the FROM clause. It's used in the query but is not available for selecting from. If you want to bring in columns from it, join on it the usual way.

As applications grow, and initially simple lookup table semantics get more nuanced, it might be nice to be able to constrain the join on the lookup table like this...

  from TableA
  inner join TableB through TableC where <expression>
That way if my TableC has some extra columns, such as effective dates, or deleted flags, or that sort of thing, then I can filter out some of the joins that might usually happen.


Unambiguous things can become ambiguous later. As soon as you add a second relation between the tables, what was once unambiguous becomes ambiguous, and because of something that may be entirely unrelated to the specifics of the original query.

This is where many conveniences that use implicit data run into problems. A small convenience now for the possibility of accidentally breaking because of mostly unrelated changes later is a poor trade off for anyone that wants to have stable and consistent software.

This is likely one of those cases where you're better off with tooling to help make writing the correct unambiguous code easier (or automated away) than introducing a feature which leads to less stable systems in some cases.

Edit: Along the lines of what you note at the end, I would rather see joins able to use named relations as defined in the schema. If there's a relation from table movie to table actor specifically named roles in the schema, I would rather be able to join movie on roles and have actors joined correctly using that relation, with an alias roles which I could then use. Then you're using features that are designed and stable, not implicit and subject to changes in how or whether they function based on semi-unrelated schema changes.

That might look like: "from movie relate roles", which is equivalent to "from movie join actor roles on movie.id = roles.movie_id", because actor.movie_id already has a constraint in the schema named roles which restricts it to a movie.id.


Agreed! But rather than making the query compiler infer join paths from the schema, wouldn't it make more sense to support defining common join paths (like your 'movie relate roles') in the language, and build a tool, that generates such definitions from the schema, as a separate step?

I don't have a specific syntax in mind yet; for illustrative purposes:

    defjoin r,m,a = %prejoin_roles() -> {     # define a common join path between three relations r,m,a:
      from r=ROLES                            # can hard-code table names or use parameters (which may refer to other parameters)
      join m=MOVIES [r.movie_id == m.movie_id]
      join a=ACTORS [r.actor_id == a.actor_id]
    }

    from r,m,a = %prejoin_roles()
    select m.title, a.character_name
This `defjoin` thing is a limited version of PRQL `table`, which -- unlike a CTE -- remembers which relation each attribute comes from. Perhaps one can instead figure out how to extend `table` to support this.


That sounds like the perfect solution!


One way to avoid constraint name collisions is to include the base table and foreign table names and keys in the constraint name separated by underscores, at which point you don’t save much by using the constraint in a join.


If I may suggest an entirely different direction (that also requires either schema knowledge or adherence to a convention), I've always been mildly annoyed that foreign keys are kind of like pointers that don't behave like pointers. What if:

    create table users (id int primary key, name text);
    create table things (id int primary key, creator int references users);

    from things select [id, creator.name];
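A hedged sketch of what that hypothetical creator.name traversal might desugar to, i.e. a join driven by the declared foreign key (SQLite used for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE things (id INTEGER PRIMARY KEY,
                         creator INTEGER REFERENCES users (id));
    INSERT INTO users VALUES (1, 'Ada');
    INSERT INTO things VALUES (10, 1);
""")
rows = con.execute("""
    SELECT things.id, users.name        -- i.e. [id, creator.name]
    FROM things JOIN users ON things.creator = users.id
""").fetchall()
print(rows)  # [(10, 'Ada')]
```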


> What do you think should happen if there are multiple foreign keys connecting the two tables?

Compilation should fail and require you to explicitly specify what key to use. Please don’t do anything magic.


You can’t add an additional foreign key to an existing table without potentially breaking all existing queries.

Probably the biggest constraint on SQL language design is that it targets a live system: queries and schema are not compiled at the same time.


SQL has that actually

select * from a natural join b

(not based on fk constraints though, it will join on all attributes with the same name in the relations)


I agree with you that it's a pain writing join conditions with many fields...

But I think that's a shortcoming of the client tool, rather than the language.

If SQL tools auto completed the join conditions as best as they could it would probably be a great help.


I recently started implementing the Postgres protocol in Rust (https://github.com/dmeijboom/postgres-conn). So I guess I’ll be experimenting with creating a Postgres proxy which translates PRQL on-the-fly.


That looks really exciting! Please keep us posted on your efforts and let us know if there's any way we can be helpful.


Wow! This is cool stuff.

I was looking through the documentation for conditional logic, control flow, IF(), and the CASE...WHEN...THEN operator. It seems like the ternary operator is the only way to implement conditionality?

At https://prql-lang.org/book/examples/functions.html I found the example `func if_valid x`, which is then used in the `derive` expression `prices_adj | ret | if_valid`. This usage of `if_valid` at the end of the pipeline seems a bit awkward: if I want to do the whole calculation only if some condition is met, I'd like to write it at the front of the pipeline. Can I nest the `ret` function in the if-function like so?: `func ret_if_valid x -> is_valid_price ? (ret x) : null` Then I guess I'd have to do:

    return_total = prices_adj | ret_if_valid
Next, can a string variable be used as part of a column name? It's something I needed recently in order to categorise values and use them to do a pivot. For example, I had to do:

    SELECT CASE category WHEN 'a' THEN 'x' END AS newCategory
    ...
    SELECT SUM( IF(newCategory = 'x', revenue, 0) ) AS xRevenue, ...
Ideally I'd like to do:

    aggregate [
        for x in ['a', 'b', 'c']:
            (concat x 'Revenue') = sum ( [new_category == x] ? revenue : 0 ),
    ...]
i.e. I'd like to use x as part of the final column name. A pivot example would be great (or maybe you already have a PIVOT implementation in mind).
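For comparison, the conditional-aggregation form of that pivot with the per-category column names written out by hand (SQLite; the orders table is invented for the example):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (category TEXT, revenue INTEGER)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("a", 10), ("a", 5), ("b", 7)])
row = con.execute("""
    SELECT
        SUM(CASE WHEN category = 'a' THEN revenue ELSE 0 END) AS aRevenue,
        SUM(CASE WHEN category = 'b' THEN revenue ELSE 0 END) AS bRevenue
    FROM orders
""").fetchone()
print(row)  # (15, 7)
```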

I'd be interested in contributing to this, guess I'll take a look at the code and any community chat you've set up!


I've been contributing small things to this project, given my limited knowledge, but I felt at home with such a good company of people!

I'm not experienced enough to fully appreciate the technical potential of the project, but the good atmosphere, the kindness and the growth potential are for sure worthwhile. I truly encourage everyone to contribute!


I think this is looking pretty good, and I'd want to see even more complicated examples. For example, what do window functions end up looking like? [1]

What about crazy operations like calculating percentile_cont? [2]

Or just in general, how would "implementation specific" queries end up looking?

[1] https://www.postgresql.org/docs/current/tutorial-window.html

[2] https://docs.microsoft.com/en-us/sql/t-sql/functions/percent...


Great questions!

Window functions are here [1]. (We should add these to the homepage too)

Implementation specific queries can be handled by the Dialect parameter [2], though there's still lots of work to do to build that out.

[1]: https://prql-lang.org/book/transforms/window.html

[2]: https://prql-lang.org/book/queries/dialect_and_version.html


Window example is now on the homepage, thanks for the question: https://github.com/prql/prql/pull/692


This looks great. I've thought about something similar to this for quite a while now. Column autocomplete is key for me from a quality of life perspective and to make it truly usable.

I'd absolutely love to see the next level of this pipeline be continued where something like Observable Plot or ggplot2 like functionality where you can take your pipeline data analysis and directly plot it to visualize it.


I also recommend looking at EdgeQL -- https://www.edgedb.com/showcase/edgeql -- a new query language aimed to eliminate some of the SQL quirks.

(I'm a co-founder)


I'm a huge fan of EdgeDB!

Possibly our focus is a bit different — I see EdgeDB as primarily focused on transactional queries, whereas PRQL is very focused on analytical queries. PRQL doesn't do quite as much — e.g. we don't model the relationships between entities, which is less functional but more compatible.

Feel free to reach out on Twitter if you think there's some way of us collaborating, or if you have any feedback or guidance for us.


Replied on Twitter!

> I see EdgeDB as primarily focused on transactional queries, whereas PRQL is very focused on analytical queries.

That's true to an extent currently, but we actually envisioned EdgeQL to be a capable analytical query language too. We'll release EdgeDB 2.0 in a couple of weeks and it will feature a powerful GROUP BY statement (read more about it here [1]) and in 3.0 we might ship window functions (or some equivalent).

With all that said PRQL looks cool!

[1] https://github.com/edgedb/rfcs/blob/master/text/1009-group.r...


I've seen similar solutions being built internally in multiple companies, none with a syntax as well thought out as this. Amazing work!


Somehow I didn't see it coming -

> pronounced "Prequel".

- and I burst out laughing. Very good.


If you appreciated that, check out "Franchise - a notebook SQL client. What you get when you have a lot of sequels": https://github.com/MobilityDB/Franchise


A good example might be a groupwise maximum. Those always tend to be a bit of a PITA in SQL if you're not writing them regularly. Be interesting to see what it transpiles to, as well.


If you only want maximum of one column, the PRQL is quite simple:

    from my_table
    group column_a (
      aggregate (max column_b)
    )
If you want the row with the maximum value it gets interesting:

    from my_table
    group column_a (
      sort [-column_b]
      take 1
    )
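
As for what it transpiles to: I'd expect the second form to compile to a window function, roughly like this (a sketch, not verified against the current compiler output):

    SELECT *
    FROM (
      SELECT
        *,
        -- number rows within each group, best column_b first
        ROW_NUMBER() OVER (PARTITION BY column_a ORDER BY column_b DESC) AS _rn
      FROM my_table
    ) t
    WHERE _rn = 1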
You can read more about group here: https://prql-lang.org/book/transforms/group.html


Opened an issue as I couldn't get this to work against sqlite:

https://github.com/prql/prql/issues/695


Thanks a lot for testing and opening an issue! This is now fixed and released [1]. Let us know if you still face any problems.

(We need better tests against real DBs, which is very much on our roadmap)

[1]: https://github.com/prql/prql/pull/698


Thank you for the amazingly quick fix!


It seems like an obvious next step for Postgres support would be to make PRQL a stored Procedure Language.

    create function foo() returns bar as language prql $$<prql code here>$$;


Looks awesome! I don't think it adds much to SQL when the queries are simple, but when you have this looong and complex query I can totally see the appeal.


Tangentially related, but does anyone know of a sql alternative that carries the execution plan with it? Sometimes you don't want a black box interpreter sitting between you and the database, so it would be nice to specify not only what you want, but also how to run it.


I don’t know if this directly answers your question, but Gurjeet is working on a Postgres extension that allows you to “lock in” a query plan: https://github.com/DrPostgres/pg_plan_guarantee


I see that the JavaScript package is at [1] and it's implemented by compiling the Rust code to WASM. That should eventually make it pretty easy to run it.

It has a typescript definition file, but it looks like it's autogenerated and a bit clunky. You get back a CompileResult and have to call free() explicitly, it seems? That doesn't seem very idiomatic for JavaScript.

Also, the links to the documentation and examples in the README are broken.

[1] https://www.npmjs.com/package/prql-js


That's true - the package is auto-generated using [wasm-pack](https://github.com/rustwasm/wasm-pack), which is why the TypeScript definitions are clunky. I did the initial prql-js release and I'm actually not sure about the free() issue you are talking about.

We are currently working on compiling it for both the Node.js and browser targets, and would be happy to get some advice if you are familiar with WASM!


I didn't actually try it out and I'm not all that familiar with WASM. Here is the typescript I see (stripped of boilerplate comments):

    export function compile(s: string): CompileResult;

    export class CompileResult {
      free(): void;
      readonly error: CompileError | undefined;
      readonly sql: string | undefined;
    }

What is the purpose of the free() method?


Thanks a lot — issue added: https://github.com/prql/prql/issues/708


Is it true that this is somehow an analogue to how JS development is really ultimately targeting browser-compatible-JS in the end, even though we use the latest ECMAScript features & TypeScript in development? I.e. is it expected someone writes PRQL and then transpiles before executing against the database? Is there a REPL one can use against a local Postgres or something?


> Is there a REPL one can use against a local Postgres or something?

Somewhat — you can use it in Jupyter now[1]; e.g.:

    %%prql
    from p = products.csv
    group categoryID (
      aggregate [average unitPrice]
    )
This doesn't yet have the benefits we'd get from e.g. autocomplete, so there's much more to do there.

There's also a cool TUI in PyPrql[2]

[1]: https://pyprql.readthedocs.io/en/latest/magic_readme.html

[2]: https://pyprql.readthedocs.io/en/latest/readme.html


Why should I use this instead of SQL?


I work on a TUI logfile viewer that uses SQLite as a backend for doing analysis on the log messages (https://lnav.org). However, writing SQL interactively is painful since you can't really provide good auto-complete or preview, which is something I try to provide for most other operations.

The PRQL pipeline syntax would make for a much better experience for lnav since you're able to progressively refine a query without having to jump around. (You've probably noticed that many log services, like Sumologic, already provide a pipeline-style syntax instead of something SQL-like.) The nice thing is that you can simply keep typing to get the results you want and get a preview at each stage. For example, entering "from" and then pressing <TAB> would make it clear to the program that table-names should be suggested. The program could then show the first few lines of the table. Typing "from syslog_log | filter " and then pressing <TAB> would make it clear that columns from the syslog_log table should be suggested (along with some other expression stuff). And, then, the preview of the filtered output could be shown.

In the current implementation, pressing <TAB> just suggests every possible thing in the universe, whether it's appropriate or not. This leaves the poor user with not much help after they've typed "SELECT". I find myself having to look up docs/source to figure out column names or whatever, and I wrote the darn thing. Ultimately, I think the analysis functionality just doesn't get used because interactively writing SQL is so user-hostile. So, I'm looking forward to seeing this succeed so that I can integrate it and still be able to use SQLite in the backend.


That's a really good question! (and one we should probably answer explicitly in the [FAQ](https://prql-lang.org/faq/) rather than just implicitly)

The README states that "PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement. Like SQL, it's readable, explicit and declarative. Unlike SQL, it forms a logical pipeline of transformations, and supports abstractions such as variables and functions. It can be used with any database that uses SQL, since it transpiles to SQL."

What that means to me is that PRQL more naturally maps onto how I think about and work with data.

Say I have some dataset, `employees`, and I want to answer some questions about it like, for US employees, what is the maximum and minimum salary and how many employees are there:

    from employees
    filter country == "USA"                       # Each line transforms the previous result.
    aggregate [                                   # `aggregate` reduces column to a value.
      max salary,
      min salary,
      count,                                      # Trailing commas are allowed :)
    ]
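
For comparison, the SQL this compiles to would look something like (a rough sketch, not the compiler's exact output):

    SELECT
      MAX(salary),
      MIN(salary),
      COUNT(*)
    FROM employees
    WHERE country = 'USA'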

Moreover, after each line you have a valid pipeline which you can transform further by adding more steps/lines to your pipeline. This matches more closely how people construct data pipelines in R using dplyr/tidyverse and in Python using Pandas.

If you find that it doesn't map well onto how you think about data pipelines then please let us know as we're constantly looking for more real world examples to help us iterate on the language!


One benefit of SQL is that the Database Engine will do the hard work of optimizing the query plan.

Do you think the SQL compiled by PRQL could be as effective and optimized by the database engine as directly-written SQL?


As you said, let the Database Engine do the hard work of optimizing the query plan for you.

I currently have no reason to believe that the PRQL generated SQL would be any worse than hand written SQL. That said, I don't think we've currently looked at any ways of passing hints to the query planner. We're always open to suggestions!

In the worst case, you have full access to the generated SQL, and for absolutely crucial queries you can hand modify that SQL. At least PRQL might have saved you the trouble of writing a cumbersome window function or something like that (see for example the example of picking the top row by some GROUP BY expression).


This reminds me of Kusto. I'm not sure how it compares to SQL in general, but it was really fun to work with for querying Azure Application Insights.


Because of things like WHERE/HAVING in SQL.


To avoid working with SQL strings.


SQL/Jinja like dbt could also avoid working with SQL strings. What would be the advantage?


There is already an integration for dbt: https://github.com/prql/dbt-prql

For example

    {% prql %}
    from source = {{ source('salesforce', 'in_process') }}
    derive expected_sales = probability * value
    join {{ ref('team', 'team_sales') }} [name]
    group name (
      aggregate (sum expected_sales)
    )
    {% endprql %}
would appear to dbt as

    SELECT
      name,
      SUM(source.probability * source.value) AS expected_sales
    FROM
      {{ source('salesforce', 'in_process') }} AS source
      JOIN {{ ref('team', 'team_sales') }} USING(name)
    GROUP BY
      name
dbt is definitely a use case we are very aware of and I am personally very keen on (since I use that in my $dayjob). With some of the ideas in https://github.com/prql/prql/issues/381 , I think PRQL could really shine in this area!

With your contribution we can get there faster!


I mostly work using T-SQL but I like PRQL.

Are you able to add in the examples the following:

    1) Use of delimiters for names that include spaces, etc. I don't know if PRQL uses double quotes or square brackets. 
    2) Use of the two-/three-/four-part naming convention to refer to servers, databases, tables and columns.


Awesome to see the progress here. Looks like the language has significantly matured since last it popped up on HN.


I wonder, why would you go for a "pipeline" of relational operations, when it's strictly weaker than allowing for a tree of operations? The way the examples are written, a stack machine would subsume the existing syntax, since you first specify an operand (like "from employees") and then you specify an operation (like "filter country == "USA""). In a stack machine, an operation such as "from X" would put the relation X onto the top of the stack, whereas an operation such as "filter" would replace the top of the stack with a transformed relation. This could be extended by, for example, "join on ..." being simply an operation that consumes two relations from the top of the stack and puts one result back, joining two pipelines into one.


Is it correct there's no CASE WHEN and instead you have to define a function using a ternary operator? CASE WHENs may be verbose but when you have a dozen of them they're more readable and the waterfall nature is far preferable to a giant block of ternary clauses.
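
For illustration (column names hypothetical), the waterfall form I mean:

    CASE
      WHEN score >= 90 THEN 'A'
      WHEN score >= 80 THEN 'B'
      WHEN score >= 70 THEN 'C'
      ELSE 'F'
    END

versus the same logic as nested ternaries, which gets hard to scan once the list grows:

    score >= 90 ? 'A' : (score >= 80 ? 'B' : (score >= 70 ? 'C' : 'F'))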


Looks fantastic.

There are a lot of rough edges when building a string representing an SQL query in the programming language that you're using. You have to be careful to avoid SQL injections, for starters. Do the bindings for PRQL innovate at this level?


SQL injections will always be a thing, regardless of SQL vs Not-SQL, if you’re building strings to represent programs. Parameterization is precisely how you properly differentiate between code and data, and it’d be the same strategy no matter the language/system.
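
For example (table and column names hypothetical), in the parameterized form the value is bound separately by the driver and never spliced into the query text:

    -- the ? placeholder is filled in at execute time,
    -- so the input can never be parsed as SQL
    SELECT * FROM users WHERE email = ?;

The same placeholder mechanism would apply whether the string being built is SQL or PRQL.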


From a mathematical point-of-view are there any transforms/operations (note: not end results, but actual operations) that this can do that SQL can't or vice-versa?


As said, currently PRQL transpiles to SQL, so all expressions in PRQL can be expressed in SQL. But not all SQL expressions can be translated back into PRQL — some intentionally, and some are just not yet implemented (UNION, i.e. vertical concat).

But we also have plans for doing things that some SQL databases may not support, such as pivot (rows to columns).


hopefully you'll forgive my pedantry - "union all" is vertical concat - "union" without the "all" gives you the distinct list


One of the reasons why SQL is crap: there should be no distinction between the two in relational algebra. A set of {A, B, C, B, C} is the same as {A, B, C}.


Detecting duplicates has a cost you can't just hand-wave away. UNION ALL tells the engine not to worry about it and just output as it sees it, usually going faster. Depends on your data needs.
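
Concretely (a sketch):

    -- deduplicates, typically via a sort or hash over the combined result:
    SELECT x FROM a UNION SELECT x FROM b;

    -- plain concatenation, can stream rows straight through:
    SELECT x FROM a UNION ALL SELECT x FROM b;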


And not treating relations as sets has costs of its own: for example, it breaks relational formula equivalences that could be used for query optimization.


For some definitions of same as.


I assume it can't do anything SQL can't, because they write "It can be used with any database that uses SQL, since it transpiles to SQL." Not sure about the reverse.

I'm used to SQL syntax, but this has definite appeal. As a small example, I like that it starts with the "from" clause, so autocomplete is more viable.


Transpiling to SQL doesn't mean all the underlying SQL features are being exposed to you.


Yes, that's what I meant by "not sure about the reverse".


Thanks, I missed that it transpiles to SQL.


This can be transpiled into SQL which makes it then trivial that it can do everything SQL can do. SQL is Turing complete so it can do anything PRQL can do.

EDIT: I'm sorry, I didn't realize that even if something transpiles from one language to another it does not guarantee that one language can generate all strings of another language. But taking a look at the abstractions PRQL offers I would be very surprised to find it not capable of it.


The second part of your statement is fine, but the first part is just a complete fallacy.

I can transpile a pure language exposing only `if`, `while`, and `for` with no standard library and no interop to C - that definitely does not make it "trivial" that it can do everything SQL can do.


I realized this after posting and edited the post. Thanks.


No problem. (I didn't downvote.)


I find SQL harder to work with because I'm used to reading and writing functions - functions that have explicit parameters with (hopefully) explicit types, and explicit return types.

So I was hoping PRQL might have some sugar like that.

It'd be nicer for my brain if I could treat any select like a function that transforms a set. Instead, even with PRQL, I'm stuck examining source tables and fully understanding them before being able to understand the query.


Is this like what ORM for?


Yeah, but without the objects.


Is there a representation in PRQL of UNNEST and a sub-select over an array within a row? We use it quite extensively in the DB that I most often work with.
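
For context, the kind of Postgres-flavoured SQL I mean (table and column names hypothetical):

    -- expand each row's array column into one row per element
    SELECT o.id, item
    FROM orders AS o,
         UNNEST(o.items) AS item
    WHERE item LIKE 'widget%';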


Nice. A few years ago I designed a query language to do aggregations in RediSearch, and it's quite similar in its structure. https://redis.io/docs/stack/search/reference/aggregations/


This reads a lot like the style of Pandas I teach.[0]

This seems to have resolved a problem with SQL: you can't read it in a linear fashion.

When we start getting query optimizers for Pandas, much of the benefit of SQL will go away.

[0] https://store.metasnake.com/effective-pandas-book


What is a query optimizer for Pandas?


Congrats for the milestone!! The syntax looks more intuitive than SQL. Great to see viable alternatives to SQL!


I’m a raw-SQL-no-ORM snob and I really like this. I’d like to see it become more mainstream.


SQL doesn't need fixing or improving if you ask me; it's well supported and just works.


I just noticed the play on words "prequel" vs "sequel". Nice.


This sounds like LINQ, or SparkSQL. Instead of a full new language, it seems better to create libraries in languages that support embedded DSLs easily.


Great project, I was looking for something like this,

but I wish it were somehow encoded as JSON, so you could easily build a pipeline UI for complex SQL generation.


I don't really find SQL that difficult, other than NULL != NULL. I wish <=> could be switched to the default!
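
e.g. (standard SQL behaviour; <=> is the MySQL spelling):

    SELECT NULL = NULL;                     -- NULL, not TRUE
    SELECT NULL <=> NULL;                   -- 1 in MySQL (null-safe equality)
    SELECT NULL IS NOT DISTINCT FROM NULL;  -- TRUE, the standard spelling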


Doesn't the pipelining mean that it's not declarative? At least not in that part.


This is really a SQL renaissance.


Obligatory dismissive comment:

> 0.2

No it ain't (in production).

Anyway, this looks great. I LOVE the fact that you've provided a book too. Consider me a fan!


We're definitely not ready for production! Sorry if that was implied.

But we are ready for people to start using it in their development work. Lmk if there's a better way of describing that.


Looks a lot like LINQ


[deleted]


Hot takes: SQL is great, actually. This thing isn't better.


As much as I love SQL it can often be a pain and involve lots of nested subqueries to do 'simple' things. I like this way this abstracts it.

Would I use this instead of proper SQL in a data warehouse / large app? Maybe not.

Would I use it to manually query DBs when I need some ad hoc info? For sure.


If this becomes a full-fledged DQL that can be used in a proc or function in a running Postgres instance, I would use it in production.


Not sure if this is better, but SQL is HORRIBLE... we probably put up with it because it's based on sane math & theory, and we almost never write it by hand.

There's zero thought to any kind of ergonomics: there's no way to say "join table Y but prefix all its columns with employee_", it's expressed backwards ffs (instead of starting with FROM), and results of queries with joins are forced to be flat tables, with no way to get the trees you need 99% of the time in app code. Hence all the repetitive app code to "nest" entities in results, which also needs to make brittle assumptions about ordering and uniqueness, because people couldn't standardize on a "RESULT AS TREE [NESTING <Y> INTO <X>]" clause or something equivalent, etc.

PRQL, though, seems to also lack all the essential features you would expect around joins.

Stuff like ArangoDB's AQL seems to be a nice example of adding the missing features to SQL, probably more out of the need to accommodate graph data too, but it actually solves SQL's problems even in relational contexts; see https://www.arangodb.com/docs/stable/aql/tutorial-join.html#... .


No, but it's not as foreign a paradigm or language to (presumably?) its main target audience - developers.



