Hacker News new | past | comments | ask | show | jobs | submit login

This is silly. They're a cheatsheet for those who know how SQL joins basically work and what they represent actually makes perfect sense if you don't misrepresent what sets the circles refer to.

Two tables (A & B) are somehow being joined (in the sql sense, by some query) ON a column from each table expressing the same type of id key.

Consider the union of values for those columns in either table. That is the set containing everything in either circle in the Venn diagram. Then consider the values which appear just in A or just in B -- those, respectively are the non-overlapping sides. Then consider the intersection (values which appear as values in both A & B), that is the middle part shared by both circles.

The highlighting on the Venn diagram shows which sets of values for the key could be made to appear in the resulting join for each type of query (and, yes, by extension indicating whether some other non-null data could also be coming from a given table in the cases where the given value is present for the table with the ON'ed column in question)

That's perfectly valid for a Venn diagram. The representation is meaningful and concise.

If you don't understand the fundamental nature of what is happening in a SQL JOIN I can see why you may need a more detailed/verbose diagram for each type of join, but the diagrams don't purport to be about the sets of rows from a table in the result and anyone who grasps the basic principle behind the joins won't struggle with the Venn diagrams because they strike the nail on the head with respect to the differences between each type of join without the need for verbosity & redundancy.




Because it's the wrong model.

Take two tables of 1m rows each and do a cross join on them. The Venn diagram model says you would have at most 2m output rows. But that's wrong. You would have 1t rows output. When when you're using a normal join on a specific value matching, if the columns are unsorted, that's 1t comparisons potentially.

The right mental model leads to the right intuition of what's going under the surface. And it's all well and good to say "If you don't understand the fundamental nature of what is happening in a SQL JOIN..." But that is exactly who these diagrams are intended for! People who don't understand yet and are trying to.

Teaching people with the right model means they don't have to unlearn the wrong model later when questions get more complicated.


I have yet to see a cross join represented by a Venn diagram, so this strikes me as a red herring on the topic of whether Venn diagrams for the joins that are usually represented with them make sense.


The author is arguing for a mental model in which all joins are just cross joins with a filter added, and that Venn Diagrams are a bad representation because they depict set differences, which is not what SQL joins are.

Say tbl_1 has 5 rows, and column "A" is the numbers 1 to 5. Then tbl_2 has the numbers 1 and 2, each repeated 5 times. Now how many rows are in the result of the query below?

    select tbl_1.A from tbl_1 left join tbl_2 using(A)
The fact that the result has 13 rows (more rows than either of the tables!) seems like it would be pretty surprising to someone using the left-join venn diagram as a mental model. The Venn diagram depicts a left-join as a set difference, which it clearly is not.


Yes. And, having taught SQL to Excel power users, I can tell you it is a very effective learning model.

The "filtered cross-join" model allows students who have learned SELECT and WHERE to think of a JOIN as an extension of those primitives, a combination of two tables on which they can filter.

Venn diagrams might be useful to visualize an outcome, but they will not support stepping through primitives to a solution.

In "filtered cross-join", JOIN can be an extension of SELECT that combines two tables. The combination is a set of rows, each each of which combines all the fields of one row of one table with all the fields of one row of the second. This is easily visualized with two two-column tables of three rows.

They can then use WHERE to find those rows with matches on the key field.

With this model, students can build up JOIN as an abstraction of simpler primitives. When they are struggling with a problem, you can ask them to first step through those primitives to accumulate the solution. Say, query the "raw" join and examine a few rows to see which they want returned. What is true of those and not true of the ones you do not want? ("I want those where these two fields match, and none that do not.") Ok, how do you express that condition in SQL? ("Where . . . this equals that?" "Hmm, try that" "HEY THAT WORKED!")

With that basis, they have a model that can extend to more complexity -- joins across several tables, joins on the same table, joins with conditions other than field equality.

Building up to and using this model, you can have the vast majority of students writing joins with confidence in two days.

My suggestion to the site would be, use an example that has two columns on each table, to provide the key field on which the join will be performed.


For your example, the Venn diagram is a good visualization of the filter being applied. Its sets are the values that are being filtered over. I.e., the left circle contains the values {1, 2, 3, 4, 5}, and the right circle contains the values {1, 2}. No, this doesn't explain the entire join operation, but that wasn't ever the goal.

The diagrams that the author of the article comes up with (with the exception of the cross join) are just Venn diagrams rotated 90 degrees.


For people that already have the correct mental model, the Venn diagram is a quick way to identify which name applies to each join type. The JOIN diagrams are much harder to read.


Diagrams and notation serve similar symbolic roles. Just because you can represent a real number, say 5, in a complex number's notation (5 + 0i) and you can't represent a complex number as a real number doesn't mean that you should always use the notation for complex numbers when trying to represent different real numbers. You typically want to use a concise representation that gets the message across without adding extra noise. Things can always be abstracted further into a more generalist, but more verbose representation. Complex numbers can be represented as quarternions, and so on. For the same reason, it makes little sense to use the existence of a JOIN that doesn't use ON (and is frankly not as commonly used) as a reason to complicate the representation of the 7 JOIN queries that do use ON. If translating real numbers into a notation that represents the complex numbers helps you anchor your understanding of complex numbers, great, but I wouldn't buy that our natural representation of real numbers is deficient as a result. I also don't buy that using Venn diagrams to represent JOINs that use ON are "PLAIN WRONG!" because there are things that they can't represent and aren't used to represent.


The fact that I don't use "on" is just syntax, it doesn't change the query. The following query is equivalent.

    select tbl_1.A 
    from tbl_1 
    left join tbl_2 
      on tbl_1.A = tbl_2.A
I don't think this is a special complex example. It's about as simple as you can get.


This was incorrect:”The Venn diagram depicts a left-join as a set difference, which it clearly is not.”

It is in fact a set difference and the sets in question are values for the column being ONed on in each table (as I explained in the top level comment)

My reply was with respect to these being cross joins with filters. That’s just one way of thinking about a JOIN that actually adds notational complexity (and does not actually represent how your database would be doing the join because it would be unnecessarily expensive computationally). Thinking of it in terms of a set difference of the values in thr ON column is arguably the better way of thinking of it if you are interested in how such a join should be implemented. Alternatively you can interpret the venn diagrams as just the filtration if you prefer interpreting as a cross join with filters.


f you have one set with 5 elements and another with 10, the set difference cannot have 13 elements.

-------

In anticipation of some possible technical replies, let me be more explicit:

It isn't a set difference, no matter how you view it.

Viewpoint 1: The original lists of objects are not actually sets because the elements are not distinct. From this viewpoint, it isn't a set difference because they aren't even sets to begin with.

Viewpoint 2: There is an implicit "row number" feature of each object so in that sense the objects are distinct and do form sets. From this viewpoint, it still isn't a set difference. If you have one set with 5 elements and another with 10, the set difference cannot have 13 elements.


This was my first reaction too, but then I thought back to when I was first learning SQL and, after hours of work trying to tabulate events from users across multiple tables, discovered that I seemed to be double-counting (or more) in mysterious ways. I was using Venn diagrams to understand how joins worked and I think the "intersection" depiction of inner joins _did_ cause me to think that "duplicates would be handled," whatever that would even mean. Really it just caused me to skip over the complexity of thinking about it.

If you think of inner joins as a cross join with a filter, then we have all seen cross joins represented by Venn diagrams - and for me at least I lost several hours of work restructuring my understanding of what I was working with!


The examples given into article also suffer from this problem because they only illustrate a special case when the ON column is unique. This is not the problem with the representation, just with their particular examples.


thank you for this explanation / clarification.

i could tell the link had an important distinction to make, but i was unable to understand it without your supplemental explanation.


The article says not to use Venn diagrams when explaining joins – that is, conveying the concepts to someone who doesn't already understand them.

You dismiss that as "silly" because anyone who already "grasps the basic principle" could figure it out. You're talking about the use of Venn diagrams by people who already "know how SQL joins basically work".

It sounds like you missed the point the article was making.


I know early in my career I made a lot of hay rewriting database queries that were outer joins that wanted an inner join. DISTINCT in the same statement as OUTER still gives me hives.

My point is that people aren’t really getting it. Is that the fault of the pupil or the teacher? If someone wants to investigate the latter I say let them. All your surprises will be good.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: