PostgreSQL’s New LATERAL Join Type (heapanalytics.com)
278 points by drob on Dec 2, 2014 | hide | past | favorite | 39 comments



LATERAL is awesome. It makes a lot of queries that required sub-select joins much simpler to write and later read.

It's also great for set returning functions. Even cooler, you don't need to explicitly specify the LATERAL keyword. The query planner will add it for you automatically:

    -- NOTE: WITH clause is just to fake a table with data:
    WITH foo AS (
      SELECT 'a' AS name
           , 2 AS quantity
      UNION ALL
      SELECT 'b' AS name
           , 4 AS quantity)
    SELECT t.*
         , x
    FROM foo t
      -- No need to say "LATERAL" here as it's added automatically
      , generate_series(1,quantity) x;
    
    name | quantity | x
    ------+----------+---
    a    |        2 | 1
    a    |        2 | 2
    b    |        4 | 1
    b    |        4 | 2
    b    |        4 | 3
    b    |        4 | 4
    (6 rows)


Thanks for the example. It seems to be the same as MS SQL Server's CROSS APPLY / OUTER APPLY?

Anyway, good that Postgres has it too, now. There are several Postgres features I'd love in SQL Server, like range types...


Usually when I've had to use cross/outer apply it's been to work around overly normalized, and somewhat bad data.

Agreed on range types.. proper enums in T-SQL would be nice too. I'm really liking where PL/v8 is going, and would like to see something similar in MS-SQL Server as well.. the .NET extensions are just too much of a pain to do much with. It'd be nice to have a syntax that makes working with custom data types, or even just JSON and XML, easier.

If PostgreSQL adds built-in replication to the open-source version that isn't a hacky add-on, with failover similar to, for example, MongoDB's replica sets, I'll be pushing pg for most new projects.

Maria/MySQL seem to be getting interesting as well. Honestly, I like MS-SQL until the cost of running it gets a little wonky (Azure pricing going from a single instance to anything that can have replication for example). Some of Amazon's offerings are really getting compelling here.


Am I right in thinking that this does not increase Postgres's expressive power but allows more concise implementation?


No, you can write queries that are not really possible to express without it. Basically, it allows you to execute a table-valued function for each row in an earlier query.

For example, in SQL Server I find a common use of CROSS APPLY (which appears to be the same thing) is where the "table-valued function" is a SELECT with a WHERE clause referencing the earlier query, an ORDER BY, and a TOP (=LIMIT) 1. (In fact, this is exactly the example given in the article.) It allows you to do things like "for each row in table A, join the last row in table B where NaturalKey(A) = NaturalKey(B) and Value1(A) is greater than or equal to Value2(B)".
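The row-at-a-time semantics described above can be sketched in Python as a nested loop. This is only an illustration, with made-up row lists (`a_rows`, `b_rows`) and made-up column names standing in for NaturalKey, Value1, and Value2:

```python
# Hypothetical rows: each dict stands in for a table row.
a_rows = [{"key": 1, "value1": 10}, {"key": 2, "value1": 5}]
b_rows = [{"key": 1, "value2": 3, "ts": 1},
          {"key": 1, "value2": 8, "ts": 2},
          {"key": 2, "value2": 9, "ts": 1}]

result = []
for a in a_rows:
    # The "table-valued function": WHERE referencing the outer row,
    # ORDER BY, then TOP 1.
    matches = [b for b in b_rows
               if b["key"] == a["key"] and a["value1"] >= b["value2"]]
    matches.sort(key=lambda b: b["ts"], reverse=True)  # last row first
    if matches:  # CROSS APPLY drops unmatched rows; OUTER APPLY would keep them
        result.append({**a, **matches[0]})
```

Here only the first `a` row finds a match (the `b` row with `ts=2`), which mirrors how CROSS APPLY filters out outer rows whose subquery returns nothing.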


That's not true. Anything you can do with LATERAL you can also do with correlated scalar subqueries in the SELECT list. LATERAL simply makes writing these kinds of queries easier and more intuitive.


The syntax for this is pretty horrible, however. And if you want to return more than one column from the subquery, you would have to duplicate the subquery definition for each column, right? Then you'd have to have faith that the optimizer can work out what you meant and reconstruct just a single subquery.


There's no faith required; the planner is guaranteed not to do that. The "normal" way is to create a composite type containing each of the columns you need, and then "unpack" it to separate columns. Horrible? Yeah, but it's possible.


Is it possible with scalar subqueries to perform anything other than a tree of correlation? With CROSS APPLY one can correlate a DAG of subqueries, e.g. a diamond where B and C depend on A, and D depends on B and C.


What if the limits on the lateral subqueries were 2 instead of 1, and they were doing select * instead of select sum() in the outer query? How would you recreate that with correlated SCALAR subqueries? There's no such thing as a non-scalar correlated subquery, is there?


Untested, but this is the general approach:

  SELECT unnest(ar).* FROM
    (SELECT ARRAY(SELECT tbl FROM tbl
                  WHERE .. ORDER BY .. LIMIT 2) AS ar
     FROM .. OFFSET 0) ss;
If you want a specific set of columns instead of *, you'd need to create a custom composite type to create an array of, since it's not possible to "unpack" anonymous records.


Thanks, joining to the last row is an instructive example.


Looking at the examples from the official documentation, I agree with your sentiment. Indeed, the conciseness can cause some confusion to people familiar with the existing scoping rules.

IMHO, it's a good thing if LATERAL is only added as some kind of syntactic sugar. I once had to use LATERAL in DB2 as a band-aid solution for its broken scoping rules: https://www.ibm.com/developerworks/mydeveloperworks/blogs/SQ...


I'm a huge believer in postgres' json store, but anything above the basic queries can get a bit messy. Lateral joins really clean up aggregate data queries on json columns. E.g.

        SELECT id, COUNT(keys)
        FROM users,
          LATERAL json_object_keys(login_history) keys
        GROUP BY id;
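For anyone unfamiliar with json_object_keys, here's a rough Python sketch of what that query computes, using a hypothetical `users` list where each login_history is a JSON object:

```python
# Hypothetical data: login_history is a JSON object per user.
users = [{"id": 1, "login_history": {"2014-11-30": "ok", "2014-12-01": "ok"}},
         {"id": 2, "login_history": {"2014-12-01": "ok"}}]

counts = {}
for u in users:                      # FROM users
    for key in u["login_history"]:   # LATERAL json_object_keys(login_history)
        counts[u["id"]] = counts.get(u["id"], 0) + 1  # COUNT(keys) GROUP BY id
```

The set-returning function fans each user row out into one row per key, and the GROUP BY collapses them back into a per-user count.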


As someone who only works now and then with Postgres, this is what made it click for me:

"Loosely, it means that a LATERAL join is like a SQL foreach loop, in which PostgreSQL will iterate over each row in a result set and evaluate a subquery using that row as a parameter."
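That foreach intuition translates almost directly into Python. A minimal sketch, reusing the generate_series example from upthread (the `foo` rows are the same fake data):

```python
# Outer query's rows: FROM foo t
foo = [{"name": "a", "quantity": 2}, {"name": "b", "quantity": 4}]

rows = []
for t in foo:
    # The lateral part, evaluated once per outer row,
    # with that row's columns in scope: generate_series(1, t.quantity)
    for x in range(1, t["quantity"] + 1):
        rows.append({**t, "x": x})
```

The outer loop is the driving table; the inner loop is the subquery that gets re-evaluated with each outer row as a parameter, producing the same 6 rows as the SQL version.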


Sounds like what Oracle calls a "correlated subquery" is it the same thing?


No, Postgres has supported those for a long time. For example:

    SELECT *
    FROM   employees e
    WHERE  EXISTS (SELECT 1
                   FROM   employee_projects ep
                   WHERE  ep.employee_id = e.id
                   AND    ep.project_id = 5)
    ;
This new feature is a lot like correlated subqueries except you can put that nested SELECT into the FROM clause and still access the employees table.
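If you want to play with the correlated-subquery example above without a Postgres instance, it runs as-is on SQLite too. A small demo with made-up sample data (table and column names taken from the example):

```python
import sqlite3

# In-memory database with the example's schema and some fake rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
  CREATE TABLE employees (id INTEGER, name TEXT);
  CREATE TABLE employee_projects (employee_id INTEGER, project_id INTEGER);
  INSERT INTO employees VALUES (1, 'Ada'), (2, 'Grace');
  INSERT INTO employee_projects VALUES (1, 5), (2, 7);
""")

# The correlated EXISTS subquery: ep.employee_id = e.id refers
# to the outer query's current row.
rows = conn.execute("""
  SELECT * FROM employees e
  WHERE EXISTS (SELECT 1 FROM employee_projects ep
                WHERE ep.employee_id = e.id AND ep.project_id = 5)
""").fetchall()
```

Only employee 1 is on project 5, so only that row survives the EXISTS filter.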


The `sum(1)` and `order by...limit` approach really isn't the best way to build the funnel. And if you take another approach then this could have easily been built with normal left joins.

Also, you should probably show some explain plans before making this claim: "Without lateral joins, we would need to resort to PL/pgSQL to do this analysis. Or, if our data set were small, we could get away with complex, inefficient queries."

Here's a comparison of the explain plan from your query without the `sum(1)` and `order by...limit` business and a query using only left joins (no use of lateral): https://gist.github.com/ajw0100/0d86715d6ce97389f306. Note, I ran this against an empty copy of your exact table (no data, no statistics). However, the explain plans are the same.

My understanding is that lateral was really meant for set returning functions like generate_series as others have already mentioned.

Edit: I should mention I know you were just trying to demonstrate how lateral works and that it is always good to see people writing about new Postgres features!


Looking at this a little further, the outer nested loop could cause issues in the left join vs. left join lateral version, depending on how many use_demo events there are in the week following the user's first view_homepage event. I added another query that uses CTEs which allows for intermediate aggregation which should make the size of the nested loops similar between both versions. However, I wouldn't be surprised if the CTEs take more memory than the lateral joins because CTEs are basically temp tables that only last for the duration of the query. Lateral may indeed be the best option but ideally I would populate this table with real data, gather statistics, and then run explain analyze on each query.


> Without lateral joins, we would need to resort to PL/pgSQL to do this analysis. Or, if our data set were small, we could get away with complex, inefficient queries.

I don't buy the performance benefit over derived tables with properly indexed fields for this example. However, I'd definitely use this more so with functions.


In MS-SQL I've done similar things with CROSS/OUTER APPLY, which seems to be about the same thing... It's usually to work around wonky data and has some overhead.

In my own case it was so I could create a view that would give me a denormalized view of most of the data for a given record... most of the fields were in a common PROPERTIES table with some funky data in it... Sometimes normalization goes too far.


Here's another explanation from the Postgres docs with additional examples: http://www.postgresql.org/docs/9.3/static/queries-table-expr...


seems similar to CROSS APPLY and OUTER APPLY from mssql http://technet.microsoft.com/en-us/library/ms175156%28v=sql....


Fantastic. This is exactly what I wanted earlier in the week. I ended up just writing some python to do the looping for me as it was only a few thousand rows and I was optimizing for developer time (i.e. mine).


How does this relate to WITH RECURSIVE (CTEs)?

http://www.postgresql.org/docs/8.4/static/queries-with.html


LATERAL allows a subquery or function to refer to the records that it's being joined to when computing results.

WITH RECURSIVE allows a query to refer to its own results when computing its results. That may sound mind-bending, but it's really just a poorly named way to do a "while" loop in SQL.

A WITH RECURSIVE has the form (base-case-query UNION ALL iterative-step-query). The query for the iterative step can refer to itself via the WITH alias. When it does so, it's actually operating on only those records produced by the previous step of the iteration. The iterative-step-query will execute possibly multiple times, stopping only when it doesn't produce any more records.

Here's the WITH RECURSIVE example from the Postgres docs, translated into Python:

  all_records   = []
  previous_step = [1] # base case, i.e. "VALUES (1)."
  while previous_step:
      all_records.extend(previous_step)
      # iterative step, i.e. "SELECT n+1 FROM t WHERE n < 100"
      current_step = [n+1 for n in previous_step if n < 100]
      previous_step = current_step
The confusing part is that the alias "t" in the SQL example means different things in different places. Outside the WITH RECURSIVE, "t" is equivalent to "all_records" in the example. Within the WITH RECURSIVE definition, "t" is the same as "previous_step".


CROSS JOIN LATERAL also works with json arrays. Select your json array, then CROSS JOIN LATERAL with jsonb_to_recordset() to turn the array elements into individual columns.


From the many specific comments on here, it sounds like most people don't use an ORM. Is that the case?

Not trolling or trying to start a flame, I'm genuinely curious as to how people here get stuff done.


Even folks who use an ORM still need to go outside the ORM for some operations -- especially reporting (like the example in the article). I find ORM's very helpful for basic CRUD operations, especially when one conceptual entity of data maps to one or two database tables directly. But there are just some things that can be expressed in SQL more efficiently than you could in OOP programming (and vice-versa).


Thanks, that's certainly true: using an ORM to avoid the boring CRUD stuff is definitely a good thing. However, this shows once again that, for non-trivial systems, you still have to resort to plain SQL, which kind of locks you in to your database of choice.


...we let people like you shuffle simple result sets in and out of web pages, freeing us up to do more interesting things?

(ha ha, only serious)


I don't get it. I don't think your comment has anything to do with mine.


"I'm genuinely curious as to how people here get stuff done." -- we're not doing what you're apparently doing.


Nice read.

Some things I would have liked to have read/seen, though, are the hardware specs, database dimensions (number of entries in tables, etc.), and query times.

Just my penny. Hope you can spend it still somewhere :)


So it seems this does the same thing as putting the subquery in the list of columns to return, except more efficient.

i.e.

   SELECT a, (SELECT b FROM ...) b FROM ....


Kind of.. it's more expensive than a direct join, but you can do more with it. AFAIK it's syntactic sugar around a correlated sub-query... It may be slightly more efficient now too. It should probably be used sparingly and over limited result sets.


I think it goes beyond syntax sugar, but perhaps there's a transformation between them that I'm not aware of.


Is this the same as a correlated sub-query (something that exist in Teradata)?


No:

https://news.ycombinator.com/item?id=8690389

But they are similar... Kind of like correlated subqueries in the from clause. But they allow a few new twists in the semantics.



