Hacker News new | past | comments | ask | show | jobs | submit login
SQL Join Types Explained in Visuals (dataschool.com)
152 points by notoriousarun on July 7, 2021 | hide | past | favorite | 53 comments



This seems like a terrible explanation.

Firstly, joins are substantially orthogonal to keys. You can join two relations with no keys at all. Keys might help you make sure that there are going to be results to your join, but they are not required, and i wouldn't start with them.

Secondly, union is not a join. If mentioned at all, it should be in an addendum about other ways to combine tables.

Thirdly, left and right joins are specific kinds of outer join. Surely it's much clearer to describe "left, right, and full outer joins" than "left, right, and outer joins".

Fourthly, surely the most logical way to explain joins is the progressive refinement cross -> inner -> equi -> natural -> full outer -> left/right outer. A cross join is a Cartesian product, simple to explain. Restrict that with a predicate and you have an inner join (i'd use an example where the predicate is not equality). Introduce ON as a synonym for WHERE in inner joins. Make the predicate equality on a set of the columns and it's an equijoin. Introduce USING as a shorthand for ON in equijoins. Make that set all the corresponding columns and it's a natural join. Now include all the rows that didn't match and it's a full outer join. Note that WHERE wouldn't work here. Now just the rows on one side or the other and it's a left/right outer join.


> A cross join is a Cartesian product

I think that's the most important detail that can't be explained visually like this.

Cartesians explode the row-counts - 10 rows x 10 rows is 100, not 10 or less than (which is the assumption for most people as they write joins with a primary key on at least one side).

If you have a cartesian and add on the additional entry needed for the null in outers, you can explain the output of the join as a filter over that set (full outers make more sense in that model).

In fact, that's a good way to think of when you see WHERE used to produce an inner join (cartesian + filter == inner).

Sometime ago, I had to optimize pure cartesian joins which had a user-defined function like ON overlaps(time_range1, time_range2) and this involved a deep-dive[1] into distributing a cartesian join to parallelize the filtering.

[1] - http://people.apache.org/~gopalv/xldb18/tez-xprod.pdf


1. They are also missing antijoins. There's not a dedicated SQL keyword for it, but I think it's worth mentioning (especially if you're going to talk about union/intersect/except).

2. Actually -- and I suppose I can't blame the article -- but from title I was expecting to see HashJoin, MergeJoin, NestedLoopJoin, etc. I.e. join implementations not specifications.


I really don't get the confusion around JOINs...

Any JOIN is what the author is calling a "CROSS JOIN", with "ON" (which is optional) basically being a WHERE condition, and "LEFT", "RIGHT", "INNER", "OUTER" words only describing whether or not to drop a result if the left/right side of the join would be a NULL reference.

And, of course, a "UNION JOIN" is not a join at all, because nothing is joined.


There is a semantic difference between a `where` clause and an `on` condition. If you're intending to do a left|right|full outer join, filtering via the `on` condition versus the `where` clause can produce drastically different results, even if the predicates used are exactly the same.


Before SQL/92, you would exclusively use the where clause. Here's an example of a left join:

  SELECT a.pk
  FROM table1 a, table2 b
  WHERE a.pk *= b.fk
I haven't tried this syntax in ages, but I imagine it's still supported.


I just tried this in MariaDB 10.5.10 on my personal blog database and it doesn't work. `post` <- many to many -> `tag`

With LEFT JOIN:

    SELECT posts.title,
    GROUP_CONCAT(tags.tag) AS tagarr
    FROM posts
    LEFT JOIN post_tags ON post_tags.post_id = posts.id
    LEFT JOIN tags ON tags.tag_id = post_tags.tag_id
With WHERE:

    SELECT posts.title,
    GROUP_CONCAT(tags.tag) AS tagarr
    FROM posts, post_tags, tags
    WHERE posts.id *= post_tags.post_id
    AND tags.tag_id *= post_tags.tag_id
It gives:

    ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near '= post_tags.post_id [...]


The trouble you're having there is that there is no standard syntax for ANSI 89 outer joins at all. All outer joins were RDBMS-specific syntax. Indeed, if I remember right, left outer comma joins are *= for Oracle and =* for DB2 (or was it =+). MySQL has never supported any other syntax for outer joins, but it does support ANSI 89 (comma) inner joins.

The ANSI 89 syntax for inner joins is pretty ubiquitous, as it can be a little easier to write code generators that way.


I guess the star is for outer join? Or maybe it's just vendor syntax. Interestingly the postgresql tutorial introduces joins without using the join keyword initially:

https://www.postgresql.org/docs/13/tutorial-join.html

See also join_type: https://www.postgresql.org/docs/13/sql-select.html


It is and it was (is?) for a long the time mandatory join style for UPDATE statement.

Reason being that it's used to declare and restrict which table should be updated in the UPDATE section as opposed to optional joined tables listed in the FROM section.

PS: The *= is plateform specific however (have no clue which). Usually you just write a more SQL standard WHERE section.


It is supported and still in common use in some circles. For reasons I don't fully understand this is called "oracle style".


PostgreSQL doesn't support this; IDK if it ever did.


> I really don't get the confusion around JOINs

Nearly all programmers get locked into thinking in terms of sequential, imperative programming, and the big differences are procedural vs OOP. They think that's always what programming looks like. They never learn to think in any other terms. They think functional programming is a weird outlier and looks like Lisp.

It's not entirely their fault. This is all they're typically taught in school. They're taught they don't need to think any other way than how Java, C, or C# tells them they should.

So they're not used to thinking in terms of sets. They're not used to declarative programming. Indeed, they'll often argue that anything other than general purpose, imperative programming isn't "real" programming. That's why Professor Brailsford from Computerphile got so much flack for saying that HTML was a programming language.


To be fair, I wouldn't consider HTML (or Markdown) a programming language. It's in the name, [] Markup Languange


Yes, that's just what the YouTube comments insisted.

I'm not really interested in the argument, but all you've said so far is "it's domain specific" or "it's not Turing complete", neither of which are actually requirements for programming languages as a whole. It's not particularly difficult to devise a sufficiently loose definition of "program" that HTML fits into, too. Or, rather, it's difficult to envision a definition that LaTeX fits into and HTML+CSS fits into and SQL fits into and LISP fits into and so on, but HTML does not without your argument simply restating "it's domain specific" or "it's not Turing complete".

However, it's also not a particularly interesting discussion, either, because the only purpose behind such categorization is to assign value judgements like "a 'real' programming language can be used for systems programming".


> [...] because the only purpose behind such categorization is to assign value judgements like "a 'real' programming language can be used for systems programming".

And the other unstated purpose is gatekeeping. Oof, for such a relatively risk-free, comfortable and privileged profession, we're such an insecure lot.


But CSS is a programming language because it's Turning complete*?

And Regex isn't a programming language because it's not Turning complete?

It's natural to for me to say that writing a regular expression is programming a state machine.

* https://stackoverflow.com/a/5239256/1212596


I taught myself SQL on a need-to-know basis. One thing that blew my mind and really helped making certain non-trivial queries more manageable was when I realized I could join subqueries. It never occurred to me to try, until one day I did and it worked.

A contrived example for illustration:

    select oh.oid, oh.order_ref, ol.num_lines, ol.total_value
    from order_head oh
    join (
      select oid, count() as num_lines, sum(value) as total_value 
      from order_lines
      where status = 'ready'
      group by oid
    ) ol on ol.oid = oh.oid
    where total_value > 500
For certain queries this has led to huge speed-ups, by allowing rows to be filtered sooner rather than post-massive-join. For others it makes it easier to integrate with libraries, as aggregate fields like in the example can be filtered using the main where clause rather than in a having clause.


Yep, I've used this with huge success in PostgreSQL when needing to join the contents of tables containing potentially millions of records that could (usually) then be filtered via search criteria. Basically I'd do my search-filtering first via a sub-query, then simply join the result of that vs doing it in the opposite order. The performance gains were orders of magnitude.


It's a nice visualization, but it assumes the keys on both sides are unique. Usually the key on at least one side of a join is not unique, and sometimes it's not unique on either side. The behaviour in these cases is an important part of the behaviour of the overall join.


I understand what you're driving at and I think you know this, but your terminology sounds confused to me. I don't think you're wrong, I just wouldn't phrase things the way you have.

At least in RDBMS terms it's not a key if it's not unique. "Key" means "a set of columns whose values uniquely identify a record in the table".

You might have a column which is a key in another table, a.k.a., a foreign key. But a foreign key is [nearly always] not itself a key in the table it's foreign in. You might also have a composite key with a subset of fields that are foreign keys, too. But I wouldn't describe a foreign key as a key. It's unique in another table but not unique to this one. It's a key there, but not a key here.

I would say that in nearly all cases in a well-design system, a proper JOIN condition is going to involve the key of either the left or right table. It's almost always incorrect not to have at least one JOIN condition involve a key. You'll usually know when not to do that because you know you're intentionally ignoring the data model.

I do agree that the visualization doesn't adequately communicate the row "duplication" that often happens and which new SQL users often find confusing or undesirable.


Yes, perhaps "key" is the wrong term. I intended to refer to the column that is used in the join condition in not so many words.

> It's almost always incorrect not to have at least one JOIN condition involve a key.

Eh, almost always seems a little strong. Sometimes you can reduce the number of joins you need to do by "skipping" a table that would serve as the unique key on both sides of the join.

ie. instead of joining:

    `A --[many-to-one]--> B <--[one-to-many]-- C`
You can directly join on `A.b_id = C.b_id`, and that's perfectly valid.


It's a bit dangerous to visualize Union like that, people might confuse it with concatenation which it's not.

You'll want to use "UNION ALL" if your intent is to concatenate the two sets of rows, and not to calculate the (deduplicated) union. If only because it's much faster if your database doesn't have to check for duplicates.


I've always liked Jeff Atwoods visual explanation of joins [1]

[1] https://blog.codinghorror.com/a-visual-explanation-of-sql-jo...


That's a great article - however my issue with Atwood's explanation is in the classic use of overlapping circles to describe joins.

After ~2 decades of using SQL in some capacity, the circles metaphor still makes no fucking sense to me.

Dataschool's article takes a somewhat non-orthodox approach with coloring rows / columns that would be included in the results of the joins - which I appreciate a lot more.


Yeah the overlapping circles make absolutely no sense because typically they are used to visualize operations on sets of the same kind resulting in a third set of also the same kind.

But when joining two tables the two base sets are of different type (containing tuples/rows of different length/element type) and the resulting set containes neither tuples from the one nor from the other original set but totally new concatinated tuples combined from the base sets.

It makes absolutely no sense and everytime I see an article explaining joins I just check if it contains those venn-diagrams and if so, just skip it.


If you view the tables in the join as sets, then the join is the cartesian product of the sets. The venn-digrams work on that cartesian product. The circle labeled Table A is everything in the cartesian product that has data from table A, same for the table B circle. The over lap of the two circles is the part of the cartesian product that contains elements from both tables.


Still does not make sense to me:

OrigTableA = {(ruby, 5), (rust,7), (php, 3)}

OrigTableB = {(red, 5), (black, 7), (purple, 3)}

OrigTableA \cap OrigTableB would be empty, makes no sense

OrigTableA \cup OrigTableB would conain all 6 elements, would be a SQL UNION, makes no sense for join

OrigTableA \times OrigTableB would be the the cross product:

{ ((ruby, 5), (red, 5)), ((ruby, 5), (black, 7)), ((ruby, 5), (purple, 3)), ((rust,7), (red, 5)), ((rust,7), (black, 7)), ((rust,7), (purple, 3)), ((php, 3), (red, 5)), ((php, 3), (black, 7)), ((php, 3), (purple, 3)) }

Yes the cross product is what happens on a cross join (with the nested tuples being flattened) but now you have a single large set of all combinations and the filtering of only matching tuples (eg where the second element each tuple-pair matches) is simply not a binary operation on two sets but a unary operation on the single cross product set. Or if you tink of it as a binary pairing operation on the two original sets it is neither a set-union nor a set-intersection.


We start with the cross product set, and apply the join on criteria as a subset operation to create the "background" of the venn diagram.

We can then define a set "Table A" as the subset of the cross product where (as you have written it) the first element is not null. Same for Table B and the second element. Then joins are set-union and set-intersection between these two sets. I would agree that most people just flip up the diagrams without explaining what they are actually working with.


Would the set analogy hold if the "parts" where first all "promoted" to the ssame type via cross join, then filtered down, and finally joined? (I'm not sure if this would make sense in general for sql, but it just occurred to me in terms of the type/category of sets involved in a join.. )



So much better than the usual Venn diagram mess.

Props to the author


Not discussed in this article: 1-many, 1-1 and many-many relationships. Also: Fan traps and chasm traps. These issues are important to understand, to avoid serious mistakes.


thanks, I hadn't heard of the traps before. Cheers


Just google "sql joins diagram" and you will be rewarded by a quick one image visual representation of each type of join. Here is one in stackoverflow: https://stackoverflow.com/questions/13997365/sql-joins-as-ve...


That is a smooth site. I like how you can suggest edits to their articles right in google docs.

What framework/service is being used to build it?


Off Topic slightly and perhaps deserves a "Ask HN:" post - but I'd love any recommendations for best resources to quickly get to "medium proficiency" with SQL. Thanks!


Good, clear explanation of these SQL joins. Works well as a quick way to mentally visualize what each join would do in your own implementation.

Thanks for the learning resource.


Nice visualization. Never heard/used cross join. What would be use case for something like that in practice?


I mostly use CROSS JOIN when I need "one of every combination".

Suppose we have a log of events, and each event has a user, a date, and an event action associated with it. Now I want to transform this log into a tabulation, where I have one record for each combination of user, date, and action type with an associated count. I can do something like:

    WITH tabulation AS (
        SELECT
            user,
            action,
            date,
            COUNT(*) AS num_events
        FROM events
        GROUP BY
            user,
            action,
            date
    ),
    combinations AS (
        SELECT
            users.user,
            actions.action,
            dates.date
        FROM (SELECT DISTINCT user FROM tabulation) users
        CROSS JOIN (SELECT DISTINCT action FROM tabulation) actions
        CROSS JOIN UNNEST(
            SEQUENCE(
                (SELECT MIN(date) FROM tabulation),
                (SELECT MAX(date) FROM tabulation)
            )
        ) AS dates(date)
    )
    SELECT
        user,
        action,
        date,
        COALESCE(num_events, 0) AS num_events
    FROM combinations
    LEFT JOIN tabulation USING (user, action, date)


+1 for being the only comment so far to mention USING!


They're fairly rare, but they do come up. It allows you to very quickly create data, and since it's multiplication it can very quickly create a lot of data. It's mostly a way of thinking about things rather than something you actually do.

Usually they can be written as an INNER JOIN because what you really want is a non-standard join or a partial CROSS JOIN. You want a Cartesian product. For example, maybe you want to join an employee table to a calendar table to produce a list of dates for each employee. Usually you want a subset of the calendar though, so you end up with something like:

  SELECT *
  FROM Employee e 
  INNER JOIN Calendar c 
      ON c.Cal_Date BETWEEN '2021-01-01' AND '2021-12-31'
They also come up when you define an inline table with a VALUES statement, which is like creating a virtual table in a query. So if you want one row for M-F, you could do something like this:

  SELECT *
  FROM Employee e
  CROSS JOIN (VALUES ('M'),('T'),('W'),('R'),('F')) d (Day_of_week)
Note that we have to give the system an alias for the "table" we're making and a name for the column we created, too.

Another common use is with a tally table, which is a utility table that stereotypically has a single column with integers from 1 to some arbitrary value (10,000 is common). It's got many uses. Say you want to look for non-English letters in names. You have a UNICODE() function that returns the character code of a single character, and a SUBSTRING() function that returns a substring of a character field.

  SELECT *
  FROM Employee e
  INNER JOIN Tally t
      ON t.N BETWEEN 1 AND LEN(e.Name)
  WHERE UNICODE(SUBSTRING(e.Name,t.N,1)) NOT BETWEEN 32 AND 126
Note again that it's not actually written as a CROSS JOIN here, but it's the same line of logic that leads to CROSS JOINs. We're creating a Cartesian product. You could write it as a CROSS JOIN, but this method makes it a bit more readable to understand the logic.

(The above is written with SQL Server or T-SQL. Other RDBMSs have different syntax, but can generally do the same things.)


Cross Join is also called Cartesian Join.

A (vague, handwavy) use case would be if you had a table of Towns, and a table of Pizza_Chains; by iterating over the joined table, you would see every permutation of Towns and Pizza_Chains, and (by inputting address info) you could produce a new table giving the address for each Pizza_Chain in each Town.

The cross-join would be used once, to create the new table. You want to use cross joins as little as possible, so a single-use cross join is reasonable. I admit it's a pretty stretched example; I suspect a better example might be found in a more abstract field, like maths, but I'm not feeling abstract enough right now to work one out.

You make a cross join like this:

SELECT Towns.id, Towns.name, Pizza_Chains.id, Pizza_Chains.name FROM Towns, Pizza_Chains

I.e. you simply don't have a WHERE clause.

I don't think I've ever had occasion to use a cross join. It's not a very useful query, because the resultset doesn't give you any information - or rather, it tells you every possible result you could get, if you tried the query with every possible WHERE clause.


When you want to view (or display) all possible combinations.

It doesn't make much sense for example with relations that describe things that happened or are (facts). For example you wouldn't cross join a users and addresses table.

But when relations are composable, say a product with selectable features, then it can make sense to cross join.

An example:

You ask a user a few questions and want to suggest a solution or product based on the combinations of the answers. The suggestions are provided by domain experts in advance.

In this case you might want to use a cross join to provide a matrix of all combinations to your domain experts. They can fill in valid suggestions for some and invalid ones for others.

Based on their input you can now automatically generate a questionnaire.


cross joins create data (facts) that dont exist in real life

you will probably never use them, except in rare cases for example

return employee sales by products, show 0 if no sales in a normal db, you will only have records for products with sales above 0

you will need a cross join to join every employee with all products even those with no sales

and full outer join will not work, because it will only return products with some sales for at least one employee

only a cross join will return all products

i will do this in two steps 1 cross join emp and product left join the result from 1 to product sales replace nulls with 0

rare case, stupid case too


I've only used them while using certain frameworks to dynamically generate queries referencing multiple tables. They end up getting turned into an implicit inner join though. The following queries are equivalent.

  SELECT a.*, b.*
  FROM a CROSS JOIN b
  WHERE a.id = b.a_id
  
  SELECT a.*, b.*
  FROM a, b
  WHERE a.id = b.a_id

  SELECT a.*, b.*
  FROM a INNER JOIN b ON a.id = b.a_id
IIRC .NET's Entity Framework will (or at least used to) generate queries similar to this as well.


I've only used it a few times, but it was usually to combine data from two tables that didn't have a column I could join them on.


Enumerating a cafe menu:

  select *
  from
    (values ('egg'), ('sausage'), ('spam')) main,
    (values ('baked beans'), ('chips'), ('spam')) side;
Of course, to get the complete menu, you would need to use a recursive common table expression.


In Oracle in some cases cross join works as cross apply, so you can cross join table A with table-returning structure and have it compute column based on table A.


I like this one page visual for explaining joins much more succinctly https://www.codeproject.com/KB/database/Visual_SQL_Joins/Vis...


Venn diagrams accurately describe set operations but most joins in common use in SQL are filtered Cartesian products.

https://blog.jooq.org/2016/07/05/say-no-to-venn-diagrams-whe...


There's a very big problem in the industry with so many developers not understanding SQL at all, let alone how the internals work. Conceptually most developers only need to understand the set operations, relational algebra, and first order predicate calculus. RDBMS technology is already (incorrectly) viewed by many as obsolete and too complicated. Even experienced developers sometimes need reminding of which seldom used join they need. Do you want to read an article or glance at a picture to do this?


Missed self join




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: