Honestly, if it were not for Venn diagrams, I may have never learned how SQL joins work. It might not be perfect, but if you'd tried to make 16-year-old me understand relational algebra in order to grok how the joins in his Django site were working, he'd have given up before he began.
That's not to say that other ways of illustrating things don't have merit, but broad statements like "say NO to Venn diagrams" just aren't helpful. If people find Venn diagrams a useful tool (and their proliferation indicates they do), I don't see the harm. They didn't stop me going deeper on stuff like relational algebra when I was ready to.
Venn diagrams are almost, but not quite, completely wrong. They hint at the idea of including failed joins vs only successful joins, but this is only a small part of the full logic behind Cartesian product + filter - even if it's the most commonly used pattern in practice in most database models that ultimately get used in an object oriented programming language.
I think it's OK to use the venn diagram only as an explanatory tool, briefly, and informally. If you raise it to the level of a complete description of semantics, then it's very inadequate. Any step off the beaten path, where the cardinality of the join is no longer 1:1 or 1:n, and you're back in the weeds again.
The first section is about how Venn works fine for AND and OR because it’s set theory and Venn works just fine for examples (maybe not every problem though).
That may be a fair critique. Should SQL be successful only to mathematicians and computer scientists? (Who else would even know about set theory or Venn a priori?)
In my underfunded rural American elementary school, they taught us Venn diagrams. Not set theory per-se, but they taught us to read and create Venn diagrams.
I don't know if that's representative, but my gut instinct is that Venn diagrams are very accessible to the general public.
A cross join is the union of 3 mutually exclusive sets: the left join tuples, the inner join tuples, and the right join tuples. Is this not the case? So then you can use Venn diagrams just fine.
Um no. The left and right joins both include the inner join tuples; they are not mutually exclusive. A cross join is the Cartesian product of two sets, which means it doesn't include the outer join tuples.
Ok I should have been more precise. By "left join tuples" I intended to mean the "left join - inner join" tuples. The point is you can use sets and Venn diagrams to talk about joins.
You're language is still not quite right. A cross product does not include those parts of the outer joins. An inner join is a subset of a cross product, but an outer join includes elements not in either. What you say would be true of a full outer join, but not a cross product. If one circle in the Venn diagram represents the left join and one circle right join, then the Venn diagrams would be correct but not if the circles represent separate tables.
I did read the article and I don't find what the author thinks is a "nice illustration of a cartesian product" easy to grok at all. Similarly, I think I would have found the the diagrams further down hard to understand versus a Venn diagram.
But that wasn't my point; if other people find these illustrations helpful, that's awesome and I wouldn't want to stop someone inventing something new and useful. My point was that I violently disagree with one-size-fits-all proclamations like "Say NO to Venn Diagrams."
The hardest part of joins is remembering which is which. Venn diagrams are used specifically because they solve that problem with the inner/outer/left/right terminology.
His visualizations ultimately help nothing. You have to stare at them for a while to see what the pattern is, but there's absolutely no mnemonic hook to remember what's what, so in the end all you learn from his visualizations are that there exist different ways to slice and dice the products of sets, which is already a given if you're trying to figure out what the differences are.
For me, I still have no idea what the visual of a venn diagram is trying to tell me within the context of SQL. Just having a visual that I can't even understand does nothing to benefit the situation either.
This is silly. They're a cheatsheet for those who know how SQL joins basically work and what they represent actually makes perfect sense if you don't misrepresent what sets the circles refer to.
Two tables (A & B) are somehow being joined (in the sql sense, by some query) ON a column from each table expressing the same type of id key.
Consider the union of values for those columns in either table. That is the set containing everything in either circle in the Venn diagram.
Then consider the values which appear just in A or just in B -- those, respectively are the non-overlapping sides.
Then consider the intersection (values which appear as values in both A & B), that is the middle part shared by both circles.
The highlighting on the Venn diagram shows which sets of values for the key could be made to appear in the resulting join for each type of query (and, yes, by extension indicating whether some other non-null data could also be coming from a given table in the cases where the given value is present for the table with the ON'ed column in question)
That's perfectly valid for a Venn diagram. The representation is meaningful and concise.
If you don't understand the fundamental nature of what is happening in a SQL JOIN I can see why you may need a more detailed/verbose diagram for each type of join, but the diagrams don't purport to be about the sets of rows from a table in the result and anyone who grasps the basic principle behind the joins won't struggle with the Venn diagrams because they strike the nail on the head with respect to the differences between each type of join without the need for verbosity & redundancy.
Take two tables of 1m rows each and do a cross join on them. The Venn diagram model says you would have at most 2m output rows. But that's wrong. You would have 1t rows output. When when you're using a normal join on a specific value matching, if the columns are unsorted, that's 1t comparisons potentially.
The right mental model leads to the right intuition of what's going under the surface. And it's all well and good to say "If you don't understand the fundamental nature of what is happening in a SQL JOIN..." But that is exactly who these diagrams are intended for! People who don't understand yet and are trying to.
Teaching people with the right model means they don't have to unlearn the wrong model later when questions get more complicated.
I have yet to see a cross join represented by a Venn diagram, so this strikes me as a red herring on the topic of whether Venn diagrams for the joins that are usually represented with them make sense.
The author is arguing for a mental model in which all joins are just cross joins with a filter added, and that Venn Diagrams are a bad representation because they depict set differences, which is not what SQL joins are.
Say tbl_1 has 5 rows, and column "A" is the numbers 1 to 5. Then tbl_2 has the numbers 1 and 2, each repeated 5 times. Now how many rows are in the result of the query below?
select tbl_1.A from tbl_1 left join tbl_2 using(A)
The fact that the result has 13 rows (more rows than either of the tables!) seems like it would be pretty surprising to someone using the left-join venn diagram as a mental model. The Venn diagram depicts a left-join as a set difference, which it clearly is not.
Yes. And, having taught SQL to Excel power users, I can tell you it is a very effective learning model.
The "filtered cross-join" model allows students who have learned SELECT and WHERE to think of a JOIN as an extension of those primitives, a combination of two tables on which they can filter.
Venn diagrams might be useful to visualize an outcome, but they will not support stepping through primitives to a solution.
In "filtered cross-join", JOIN can be an extension of SELECT that combines two tables. The combination is a set of rows, each each of which combines all the fields of one row of one table with all the fields of one row of the second. This is easily visualized with two two-column tables of three rows.
They can then use WHERE to find those rows with matches on the key field.
With this model, students can build up JOIN as an abstraction of simpler primitives. When they are struggling with a problem, you can ask them to first step through those primitives to accumulate the solution. Say, query the "raw" join and examine a few rows to see which they want returned. What is true of those and not true of the ones you do not want? ("I want those where these two fields match, and none that do not.") Ok, how do you express that condition in SQL? ("Where . . . this equals that?" "Hmm, try that" "HEY THAT WORKED!")
With that basis, they have a model that can extend to more complexity -- joins across several tables, joins on the same table, joins with conditions other than field equality.
Building up to and using this model, you can have the vast majority of students writing joins with confidence in two days.
My suggestion to the site would be, use an example that has two columns on each table, to provide the key field on which the join will be performed.
For your example, the Venn diagram is a good visualization of the filter being applied. Its sets are the values that are being filtered over. I.e., the left circle contains the values {1, 2, 3, 4, 5}, and the right circle contains the values {1, 2}. No, this doesn't explain the entire join operation, but that wasn't ever the goal.
The diagrams that the author of the article comes up with (with the exception of the cross join) are just Venn diagrams rotated 90 degrees.
For people that already have the correct mental model, the Venn diagram is a quick way to identify which name applies to each join type. The JOIN diagrams are much harder to read.
Diagrams and notation serve similar symbolic roles. Just because you can represent a real number, say 5, in a complex number's notation (5 + 0i) and you can't represent a complex number as a real number doesn't mean that you should always use the notation for complex numbers when trying to represent different real numbers. You typically want to use a concise representation that gets the message across without adding extra noise. Things can always be abstracted further into a more generalist, but more verbose representation. Complex numbers can be represented as quarternions, and so on. For the same reason, it makes little sense to use the existence of a JOIN that doesn't use ON (and is frankly not as commonly used) as a reason to complicate the representation of the 7 JOIN queries that do use ON. If translating real numbers into a notation that represents the complex numbers helps you anchor your understanding of complex numbers, great, but I wouldn't buy that our natural representation of real numbers is deficient as a result. I also don't buy that using Venn diagrams to represent JOINs that use ON are "PLAIN WRONG!" because there are things that they can't represent and aren't used to represent.
This was incorrect:”The Venn diagram depicts a left-join as a set difference, which it clearly is not.”
It is in fact a set difference and the sets in question are values for the column being ONed on in each table (as I explained in the top level
comment)
My reply was with respect to these being cross joins with filters. That’s just one way of thinking about a JOIN that actually adds notational complexity (and does not actually represent how your database would be doing the join because it would be unnecessarily expensive computationally). Thinking of it in terms of a set difference of the values in thr ON column is arguably the better way of thinking of it if you are interested in how such a join should be implemented. Alternatively you can interpret the venn diagrams as just the filtration if you prefer interpreting as a cross join with filters.
f you have one set with 5 elements and another with 10, the set difference cannot have 13 elements.
-------
In anticipation of some possible technical replies, let me be more explicit:
It isn't a set difference, no matter how you view it.
Viewpoint 1:
The original lists of objects are not actually sets because the elements are not distinct. From this viewpoint, it isn't a set difference because they aren't even sets to begin with.
Viewpoint 2:
There is an implicit "row number" feature of each object so in that sense the objects are distinct and do form sets. From this viewpoint, it still isn't a set difference. If you have one set with 5 elements and another with 10, the set difference cannot have 13 elements.
This was my first reaction too, but then I thought back to when I was first learning SQL and, after hours of work trying to tabulate events from users across multiple tables, discovered that I seemed to be double-counting (or more) in mysterious ways. I was using Venn diagrams to understand how joins worked and I think the "intersection" depiction of inner joins _did_ cause me to think that "duplicates would be handled," whatever that would even mean. Really it just caused me to skip over the complexity of thinking about it.
If you think of inner joins as a cross join with a filter, then we have all seen cross joins represented by Venn diagrams - and for me at least I lost several hours of work restructuring my understanding of what I was working with!
The examples given into article also suffer from this problem because they only illustrate a special case when the ON column is unique. This is not the problem with the representation, just with their particular examples.
The article says not to use Venn diagrams when explaining joins – that is, conveying the concepts to someone who doesn't already understand them.
You dismiss that as "silly" because anyone who already "grasps the basic principle" could figure it out. You're talking about the use of Venn diagrams by people who already "know how SQL joins basically work".
It sounds like you missed the point the article was making.
I know early in my career I made a lot of hay rewriting database queries that were outer joins that wanted an inner join. DISTINCT in the same statement as OUTER still gives me hives.
My point is that people aren’t really getting it. Is that the fault of the pupil or the teacher? If someone wants to investigate the latter I say let them. All your surprises will be good.
I love this premise! I have always hated the venn diagram approach. When I started programming, I never found them useful. I agree it is the wrong model to teach what is happening with joins. It think it produces more confusion.
The author's approach is a better visual model. Using boxes makes so much more sense, since we are dealing with rows of data. The circles of a venn diagram don't represent a result set visually, and i think that's why I always found it confusing.
The the main point I think should be expressed to really explain what's happening in a join is about null values. If one table has nulls in the matching set, which ones do we want to keep, or discard? The box approach shows this beautifully, while a venn approach doesn't really make this obvious.
I remember being a third year CS student, studying for some database exam coming up, and a friend of mine (a wickedly smart second year student) explained joins as Cartesian products to me. Took her all of 60 second to turn my entire brain over. Blew my mind.
I had been using SQL but not understanding it. Mostly because of poor previous explanation via things like Venn diagrams- the wrong model lead me to think incorrectly and make mistakes.
With the right model in my head, the exam was dead easy. I don't even know that she realized how impactful that was to me. So on the off chance you're reading HN, thanks Wen!
> Mostly because of poor previous explanation via things like Venn diagrams- the wrong model lead me to think incorrectly and make mistakes.
To me the Venn diagrams made it quite clear to understands while this article diagram wouldn't have helped me at all.
The best way to learn is in a way that make you understands. Your friend found a way that make you understands, but do you believe his explanation would have been universally understood? We would live in a much better world if it was that simple.
For me, Venn worked to make me understands. For you, your friend explanation worked to make you understands.
Should we make your friend stop explaining because he can't help everyone? Or should we simply let both way coexist?
I see that those circles look like a Venn diagram. But for a Venn diagram to be useful in explaining something, there needs to be an easily understandable definition about what the 2 circles are. That doesn't seem to exist here. I can't come up with a logical definition for what the top circle or the bottom circle represent.
In a venn diagram the sum of the area combined by A and B can never be larger than A+B.
In a join the result can multiply and you get up to A*B rows instead.
Hence the geometrical representation can mislead you into thinking your data is smaller than what it actually is. As a quick mental model i think the diagrams can help but you have to understand they are only a reminder of how the join is being matched, not a representation of the complete data set.
Here's the key point I think the article misses: Very nearly always (I mean really really nearly always), when you do a left/right/inner join, the condition you use for the join is a unique (usally primary) key on one side and a foreign key on the other side. So although in theory you start with the full cartesian join and then filter some of the results out (picking out of n×m entries), in practice you are only left with at most max(n,m) entries. So something like a Venn diagram is much less wrong than it is for the outer join (but still a bit wrong).
What I would like to see is inner/left/right joins explained with colours used to indicate value (as the article already does) but in the form of entries in a PivotTable (as the article used only for outer join), perhaps with faded entries for those that were filtered out. Most of all, it would be great to see this with non-unique values in both tables, as rare as this admittedly is. This could be followed up by the more usual primary key / foreign key combination. At that point it would be visually clear that only max(n,m) entries are left, so you could finish up by showing those evil Venn diagrams after all, with the disclaimer that you understand that there's a full n×m combination under the hood in theory.
I once interviewed a QA person for a very heavy data centric project.
He didn’t know SQL well, but he described a left join problem perfectly by drawing a Venn diagram to explain the data that needed to be retrieved. When I was about to finish up the interview, he was so curious about the answer he insisted that I explained the answer so “he could sleep at night”.
I gave him a thumbs up just because he knew the concept and displayed intellectual curiosity. It was a great choice.
Since by far the most errors I see in SQL queries are places where Venn diagrams do not accurately describe SQL behavior... yeah, I'm gonna have to agree, though I'm not sold on the alternative here. Like 90%+ of the flaws I run across in queries are due to multiple matching rows on the right side, without `distinct` or an aggregation of some kind.
1:1 matches on rows are trivial. Venn diagrams can describe these too. But you really can't let people develop an intuition based on that, early on, or they kinda get stuck and it takes a while to unravel.
I don't think that is true. If you base your understanding on an inaccurate representation, then you may later on run into problems trying to fit new information into your inaccurate thought-model. For example explaining electric circuits with water analogies (like described here: https://learn.sparkfun.com/tutorials/voltage-current-resista...) cannot account for EMF.
Really well done, but I agree with one of the comments in there: need to add patterns to the solid colors; it would differentiate for color blind folks and in general make the relationships easier to grasp.
I tried pasting the cross join and inner join into a colorblindness simulator [1] and the only issue I found is that the A and the 2 in the cross join look similar if you have monochromacy. Looking at the monochromatic version, I don't think this really affects the clarity of the cross join diagram.
The inner join is already perfectly distinguishable by how dark the squares are, and anything other than monochromacy preserves the distinctions that the color adds in the cross join, unless the site I used is misleading.
SQL used to have an intriguing aura of exotic knowledge back in time when SQL was still a Language to highlight on a resume and to have an RDBMS server meant spending thousands.
So Venn diagrams were just adding to such aura. But it's a visual aid and is usually supplemented with examples. If it doesn't ring a bell, some hands-on or detailed explanation helps crack it.
Personally, I believe JOIN should first be touched upon when introducing normalization. It just makes sense: break tables apart then join them together. Here one could also grasp a utility of an index on join column.
Then progress to more advanced use of joins - outer join (needs a concept of NULL), union (basically a synthetic set) etc.
Finally, a cross join - as one of those cases that may lock your terminal if you don't know what you're doing. Of course learning Cartesian product concept here, if not yet.
As anything in Computer, the hands-on aspect usually clears thing much better then science-first. The science would ease in much better once the utility is grasped.
Teaching means helping someone to learn, not making them understand your way of teaching. So different aids may work for different audiences.
Either the author is not proposing an alternative, or the alternative is formulae and poorly formatted screenshots, or my ublock origin settings are blocking the perfect method to explain SQL joins in 100x100 pixels.
Why does SQL have both WHERE and ON, plus various kinds of JOINs? It seems to me that it would be simpler to have just have one JOIN which represents the Cartesian product, plus WHERE for filtering - and then the DB engine could apply optimizations so that queries would actually run as fast as in existing SQL. Am I missing something?
The bit your missing is the awkwardness of expressing outer joins. The way Oracle does it is an existence proof of its awkwardness.
The 'FROM' clause defaults to Cartesian product for multiple tables. You can then specify your join conditions in the where clause, but it makes outer joins harder to write - Oracle does them with special (+) operator, but it's clumsy.
It doesn't (shouldn't) make much difference at all to the database whether you specify your join condition after the `ON` or in the `WHERE` - if it doesn't hoist `WHERE` conditions into the join clause, and potentially change the join order because of it, it's missing big optimization opportunities.
Ah... that makes sense. I have experimented a bit since my original comment and so far have not been able to come up with a syntactically elegant way of implementing LEFT JOIN in an imaginary MySQL that had inner JOIN as the only JOIN. This actually makes me wonder... if SQL only had inner JOIN, what would be the syntactically most elegant way of implementing the other JOINs? The various replies in this comment tree make me suppose that the answer would depend on the precise flavor of SQL being used.
On the topic as such, besides ON and WHERE there is also USING and NATURAL.
Som very influential db-people have argued that NATURAL should be the only supported join. Checkout the third manifesto.
ON as such allows for outer joins, something you can’t do starting with a Cartesian product. But Oracle has syntax for outer joins using WHERE if you prefer that.
A lot of people are saying hang on, they're useful for explaining. But my experience when teaching relational databases is that they give the illusion of understanding. I'd rather have the students know that they do not know (and work harder) than look at a Venn diagram, think they get it, and move on.
This makes me glad I learned SQL first with relational algebra. Probably that wouldn't work for a lot of people, but as a math major it made the basic concepts pretty clear.
>Honestly, if it were not for Venn diagrams, I may have never learned how SQL joins work. It might not be perfect, but if you'd tried to make 16-year-old me understand relational algebra in order to grok how the joins in his Django site were working, he'd have given up before he began.
The whole point of the article is that there are different visualizations that are much better than Venn diagrams (in fact Venn diagrams are totally misleading), and they don't require any relational algebra.
That said, if someone doesn't know what a cartesian product is (high school level math), perhaps they shouldn't be allowed near a database...
>if someone doesn't know what a cartesian product is (high school level math), perhaps they shouldn't be allowed near a database...
This is such an intellectually snobbish position to take. I can tell you that I didn’t know this term when I first interacted with a database, and if someone had told me the above, I would have just felt stupid and given up.
The fact you learned what a Cartesian product was at high school shouldn’t preclude someone who hasn’t from trying to build stuff and experiment with tech. Not everyone learns things the same way, or in the same order, and that’s okay.
>This is such an intellectually snobbish position to take.
Intellectually snobbish is "everyone should listen to Mahler and read Epictetus in the original language".
This is just basic professional requirement...
>The fact you learned what a Cartesian product was at high school shouldn’t preclude someone who hasn’t from trying to build stuff and experiment with tech.
Just not in any capacity where people depend on their output...
You didn't say "should not be allowed near a production database" you said "database"
That is very different. This was a 16 year old kid trying to mess around and learn stuff, he should be able to be around databases all he wants. I wouldn't hire him as a DBA, but he should be able to learn without having to wait for more advanced math.
Well, you were replying to someone who said they learned SQL for their Django site when they were 16, and you admonished them for not knowing Cartesian products and using a database... so it was certainly not clear that you meant a production database.
You should be charitable to the creators of venn diagrams and the countless and diverse group of people who've found them useful for years. For instance, it obviously stands to reason that these sort of visual cheat-sheets are aimed primarily at novices, not experienced/professional DBAs.
Should we critique "lefty loosey righty tighty" because a professional mechanic knows that's not precisely true and should be able to analyze the context of the screw to determine whether it's left or right handed? Of course not, that mnemonic is for novices, not professional mechanics.
> if someone doesn't know what a cartesian product is (high school level math), perhaps they shouldn't be allowed near a database
Knowing what is meant when a term or symbol is used is not math, it's language.
You can know the meanings of all kinds of terms and symbols, and you will still not know anything about math, you'll only know something about mathematicians and how they like to communicate.
Conversely, someone may understand the math perfectly well but not know the correct terms or symbols (perhaps they learned under a different system, in a different country, etc.)
>Knowing what is meant when a term or symbol is used is not math, it's language.
When the symbol is a mathematical symbol, and describes a mathematical operation, then "knowing what is meant" by it means you know a piece of math.
Besides I wasn't describing what you "know" if you know what a Cartesian Product is. I was stating (as a fact) where you learn it (at high school math classes).
> I was stating (as a fact) where you learn it (at high school math classes).
Interestingly your "fact" is false - from my experience. You'd need to specific country where this high school is located, the year of the syllabus and possibly a region within a country. When I went to "high school" (we don't call it that) the syllabus differed by region.
A person can know the integral sign and know vaguely that it means "the area under a curve" without actually knowing calculus. Understanding the semantics of a term or symbol and being familiar with its referent are very different things. This is universally accepted and not controversial at all. If you'd like to continue arguing your point, I'd ask you to provide even a modicum of evidence.
In terms of what you're "stating (as a fact)," I can tell you with absolute certainty that you're wrong. Many high schools do not teach set theory and, among those that do, many - mine included - don't refer to this operation as the Cartesian product. It is simply "the product of A x B."
That's not to say that other ways of illustrating things don't have merit, but broad statements like "say NO to Venn diagrams" just aren't helpful. If people find Venn diagrams a useful tool (and their proliferation indicates they do), I don't see the harm. They didn't stop me going deeper on stuff like relational algebra when I was ready to.