Handling Null Values in Postgres (percona.com)
92 points by davidquilty on March 5, 2020 | 78 comments



They forgot "IS NOT DISTINCT FROM", which is, IMHO, the best god damn operator ever invented. It works like "=", but responds sensibly to NULL:

  postgres=# SELECT 1 IS NOT DISTINCT FROM 1;
   ?column? 
  ----------
   t
  (1 row)
  
  postgres=# SELECT NULL IS NOT DISTINCT FROM NULL;
   ?column? 
  ----------
   t
  (1 row)
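
For contrast, plain "=" on two NULLs yields NULL, which psql displays as blank:

  postgres=# SELECT NULL = NULL;
   ?column? 
  ----------
   
  (1 row)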


Came here to say the same thing. An article about NULL without even a passing mention of IS DISTINCT FROM is embarrassing.

Truth table here: https://wiki.postgresql.org/wiki/Is_distinct_from#Truth_Tabl...


Wait... You don't like:

`ISNULL(NULLIF(A, B), NULLIF(B, A)) IS NULL`

What are you? human?
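
In Postgres, which has no ISNULL, the same trick would presumably be spelled with COALESCE:

  SELECT COALESCE(NULLIF(a, b), NULLIF(b, a)) IS NULL;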


Oh my god. We had a bug a while back due to null equality comparison treating nulls as non-equal; I wish I'd known about MySQL's equivalent ("<=>") back then!
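
For reference, it looks like this in MySQL:

  SELECT NULL <=> NULL;  -- 1
  SELECT 1 <=> NULL;     -- 0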


Actually, this blog is not about how many ways you can use NULL in the database. It's aimed at people who have programming experience (C/C++/Java) and are trying to write SQL queries.


"SELECT NULL IS NOT DISTINCT FROM NULL" ?column? ---------- t

That bothers me. It is like saying that "infinity" (+inf) equals other infinity. Which is just wrong.

Some other SQL dialects will return "UNKNOWN" rather than "t" (true), which imho makes way more sense.


...I think you are misunderstanding the whole reason for the IS NOT DISTINCT FROM operator, which is to do exactly that.

"SELECT NULL = NULL" will return NULL, use that if you want the standard way. There are many cases where you want it to return true though, and the boilerplate for that sucks, so IS NOT DISTINCT FROM has a very useful place in the dialect.


As I wrote before:

http://thoughts.davisjeff.com/2009/08/02/what-is-the-deal-wi...

"NULL semantics are designed to make you think you understand them, and think that the semantics are intuitive, and think that it’s part of some ingenious consistent system for managing missing information. But none of those things are true."

Any explanation you come up with does not generalize to other cases. NULL is just a bunch of special cases piled on top of each other, and really only kind of works because SQL is not designed to be very extensible (and to the extent it is, the standard doesn't give much guidance on how NULL should behave in these extensions).


SQL needs to have NULL, because otherwise how does an outer join work? And once NULL is there, you need rules for how it interacts with other values. What is the alternative?


A left outer join duplicates the tuple on the LHS for each tuple on the RHS. It's almost exclusively used for grouping operations, so then you need to group the LHS tuples back into a single tuple and aggregate the RHS.

But that's just weird -- in a normal programming language you would never duplicate the tuples and make this mess; you would keep a single tuple on the LHS and group the RHS into a set (or array/list). Then you'd aggregate the set on the RHS. That model is simple because an empty set is a value with known semantics (sum of an empty set is zero), whereas a half-NULL row is just a hack.

Before someone asks, this has nothing to do with declarative/imperative, and nothing to do with some purity of the relational model. You can still have a declarative relational model with values that happen to be sets (or even relations).


An outer join is also used to query optional values and alternatives. There are probably many other use cases I'm not remembering now.

Some programming languages have a rich set of containers to replace what SQL does with outer joins, others simply lack them, and pile hacks over hacks that are always much worse than the SQL one.


NULL is actually not a good way to represent an outer join, because an outer join should not produce any "unknown" values - only "missing" values, but NULL's semantics is closer to "unknown".


It depends on your data ingestion semantics.

In a data-warehouse built up from event-sourced data, "missing" and "unknown" are the same thing: "I don't have this in here yet, so once I get it, it might turn out to be anything."

For a column with a value you know is "nothing", I'd suggest you keep a separate column to denote the type of the column, just as you would in a union data structure. After all, the explicit choice to have or not have a value is, itself, data, that you can either be in possession of, or not. You can't use one sentinel value to indicate both "absence of value" and "absence of knowledge of absence-state of value"; you need two! (Of course, if your knowledge of the absence-state is absent, the data is probably absent too, so you'd probably have a CHECK constraint to assert that if the tag is NULL, the data must also be NULL.)
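
A minimal sketch of that scheme (column names invented for illustration):

  CREATE TABLE samples (
    id int PRIMARY KEY,
    foo_present boolean,  -- NULL here means the absence-state itself is unknown
    foo int,
    -- if the tag is NULL, the data must also be NULL:
    CHECK (foo_present IS NOT NULL OR foo IS NULL)
  );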


There are in fact alternatives to NULL for outer joins. For example, the SAS data analysis system has an operation akin to a join which gives you an explicit variable saying whether a corresponding row in the joined table was found. In the result of the join, rows where no row was found get zero data values.

Whether you consider that "good" or not is a matter of taste - personally I quite like it because it is very explicit and non-magical and I think NULL semantics cause more problems than they solve.


Sum types?


Sum types is something that I really miss when working with SQL, they would make the data model much more powerful and consistent.


Why would you need outer joins any more than nulls?


Show me all schools in the system, and their associated teachers, including schools which have no teachers.


Easy:

   select school, (select array_agg(teacher) from teachers where teachers.school = schools.school) from schools;
If you don't like nulls any more than outer joins, you'd probably want:

   select school, (select coalesce(array_agg(teacher), '{}'::text[]) from teachers where teachers.school = schools.school) from schools;


You realize that joins are a part of the formal definition of relational algebra, while subqueries aren't, right? SQL has had outer joins since the beginning, but subqueries came only gradually, and more limited SQL engines still tend not to support them.


Why would SQL (fundamentally) need outer joins? "Because it has them" is not really a compelling answer in this context.

A "pure" relational model (like the one C.J. Date has been going on about for decades) has no outer join.


Since I'm getting downvoted: I find nulls way more useful than outer joins; if you disagree, I'd be interested in examples.


There are plenty of cases where the data model requires left joins (or some equivalent syntax using subqueries) to answer certain queries, and in any complex, long-lived application you will need queries you didn't anticipate when you originally modeled your data.


Examples to base this discussion on would help.

I'm not a hardcore DB person, but I've designed and modified schemas and queries for quite a few DBs over the years and not once can I recall feeling the need to do an outer join for production code, nor do I recall seeing a hardcore DB person using them (but that might just reflect limited exposure on my part).

On the other hand, I do see LEFT OUTER JOINs a lot in code written by inexperienced people.

Basically, the main use case I have for "dynamically creating" NULLs is quick inspection of data, by shoe-horning something with more dimensions into the rows-with-equal-numbers-of-columns model (CUBE and also OUTER JOIN) – but this is essentially just for display purposes.

From a data modelling perspective outer joins seem quite dodgy to me and an aggregating subquery (such as in the teachers example) is not really conceptually equivalent, even if you can implement them with left joins, and in some DBs such as MySQL they have been traditionally faster. An aggregating subquery is a perfectly conceptually "clean" operation, no ternary logic involved.


How do you handle an empty string vs a string that was never set?


What do you mean by never set? In the database, you can either enter a string (empty or non-empty) or NULL in a column. There's no such thing as "never set".


That's my question: if you have a field that represents something, and that field is a string that can be empty, null, or contain any string, and you don't use NULL, how do you tell the difference between an empty string and a null string (user set "" vs. never set anything)?


An easy way to think about database nulls is how you would answer a question in real life about unknowns.

I have a friend James, and another friend Jane. I don't know their ages. Are they of the same age?

Or Jane has a car. James has a blue car. Do car colours differ?

SQL provides human answers: it explicitly answers "I don't know".

Other languages treat null as if it was just another value, providing very confusing answers. I wish all programming languages would use SQL style of null.


Eh... it's exceedingly common to want to determine whether or not two expressions both evaluate to "unknown" (e.g. null = null). I literally wrote a query that needed to respect null equality yesterday. Extending your examples it would be the equivalent of asking:

"Is the answer to 'Are they of the same age?' the same as the answer to 'Do car colors differ?'"

Both answers are "I don't know". So... are they the same answer? The human in me says "Yes".

Of course you will recognize the similarity between the above and your first example. The interesting part is in the subtle difference between:

"Are they of the same age?" --> "I don't know"

and

2. "Is the answer to 'What is James' age?' the same as the answer to 'What is Jane's age?'" --> "Yes"

The answer to number 2 relies on the fact that I _know_ I don't know _either_ of their ages. This extra piece of knowledge allows me to "lift" null into a value rather than the absence of one. It's the difference between:

`Maybe<T>.value == Maybe<T>.value`

and

`Maybe<T> == Maybe<T>`


The answer to "Are they the same age?" is not really "I don't know". Try answering "I don't know" to a question during a quiz or an exam, it's very unlikely that your answer will be considered correct (even if it's actually true that you don't know it).

You don't know the answer to the question "Are they the same age?" (but it's either true or false), and you don't know the answer to the question "Do car colors differ?" (it's either true or false). If you don't know the answer to either question, then you don't know whether the answers differ either.


I thought that was a good way to lay out the argument, so thanks for that!


The problem is that the semantics of NULL in SQL are not those of "unknown". Well, they are in most cases, but there are also many where they're not - e.g. SUM().

Even if they were, it's just a horrible name for the concept. You could argue that back when the keyword was introduced into SQL, it didn't yet have a well-established meaning that was different... but I'm not even sure that is true - some form of NIL or NULL has been around since the early Algol extensions, and it never meant "unknown".

The way to fix it would be to have MISSING and UNKNOWN as two different concepts. Then:

   UNKNOWN + 1 -> UNKNOWN
   MISSING + 1 -> *error*

   TRUE AND UNKNOWN -> UNKNOWN
   TRUE AND MISSING -> *error*

   UNKNOWN = UNKNOWN -> UNKNOWN
   MISSING = MISSING -> TRUE
Well, and fix the aggregates - e.g. SUM() of no values should just be 0.
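
For contrast, today's behavior:

   SELECT SUM(x) FROM generate_series(1, 0) AS t(x);  -- NULL, not 0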


> some form of NIL or NULL has been around since the early Algol extensions, and it never meant "unknown".

NULL doesn't really mean unknown; it means that instead of having a pointer to a result-set, you have no such pointer.

But in the semantics of answering questions, ala Prolog, "the absence of a result set" means that you don't know something, since those "results" are the answers to a question.

An alternative way to think of NULL is that it's a short-circuiting of what would otherwise be the database blocking on a promise for data to resolve, that never resolves—waiting for its deductive algorithm to get the data necessary to definitively answer the question.

> have MISSING and UNKNOWN as two different concepts

You already do. Just have a boolean column "foo_present" (or an enum column "foo_type", where one type means "intentionally absent") beside your nullable whatever-typed column "foo". In this case, a foo_present=TRUE together with foo=NULL means you're missing the foo value that you really expected to be there—it's a semantically-present, in-practice-unknown value (which probably indicates that e.g. the enrichment step hasn't gotten around to filling it out yet.) Meanwhile, if foo_present=FALSE, the value of foo itself doesn't matter (though it probably should be NULL) since you won't be looking at it in the first place; whatever value it has in practice, it's not meant to have a value semantically. It's pretty easy to construct queries around this paradigm.

Or, if you're using a SQL RDBMS that supports custom types, you can just declare a sum type, e.g. Pet = Present pet_id | Nothing.

Or, if you're using an SQL RDBMS that supports array product-types (e.g. "int[]"), you could just use an array-typed column with a CHECK constraint on it limiting it to have either 0 or 1 value. Then a value can be [], [foo], or NULL.
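
A sketch of that last arrangement, with hypothetical names:

  CREATE TABLE people (
    id int PRIMARY KEY,
    middle_name text[] CHECK (cardinality(middle_name) <= 1)
    -- '{}' = known to have none, '{Q}' = known value, NULL = unknown
  );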

All of these allow you to see when the information of absence has been explicitly entered, separately from the query-response of unknown-ness which can come from a variety of places (such as e.g. joining several tables for sub-types of something together, where the columns that don't exist for a given sub-type will appear NULL.)


>> NULL doesn't really mean unknown; it means that instead of having a pointer to a result-set, you have no such pointer.

NULL in SQL has nothing to do with pointers.

It means that the value is missing or otherwise unknown.

Consider a column tracking the reading of a meter of some kind. And let's say that we could not get a reading -- due to some fault. We still attempted a reading, and we want to record that. What "value" should we record? 0 is no good - it may make you think that we recorded a reading of 0 - which is not true.

NULL is the solution - it specifically tells you that the reading is missing or inapplicable.

Obviously, this complicates things, but incidentally translates well to other languages that have a (completely unrelated) concept of a null pointer.


Yes, relational algebra isn't a pointer machine. I was trying to give a mental image of what is going on—the resolution or lack thereof of tuples in a tuple-space.

Let me try another analogy, that might suit you better: if you think of each row-tuple in a relation in SQL as an asserted "fact" in a logic knowledge-base, e.g.

    INSERT INTO parents VALUES ('x', 'y')
equates to the logical assertion:

    parents('x', 'y').
...then a NULL result in SQL is the equivalent of asking your theorem-prover to look at this knowledge-base and deduce the truth-state of some other fact you never told it anything about, e.g.:

    ?- parents('a', 'b').
...and getting back "unsat" — meaning that the theorem-prover couldn't prove your assertion true or false.

A theorem prover has the option to answer any question as Satisfied (Result) | Unsatisfied. This is what I meant by saying "a pointer to a result set, or a lack thereof." NULL is when you don't have such a pointer—which is why it's called NULL, by analogy to the DBMS's C implementation where the rowset pointer itself is likely set to be the pointer sentinel value ("NULL" or 0), rather than set to point to an empty rowset. The name NULL probably ended up in SQL after "bubbling up" from the implementations of the DBMS engines in this way.

The Weird Thing About SQL is that the grammar is phrased in terms of individual rows, whereas the query-planning occurs in terms of relations. So, when doing joins, you need a way to talk about what maps to "a join of a populated relation to an unpopulated relation", which in terms of individual rows, necessarily translates to "a real row-tuple value on one side of the join, along with some scalar sentinel row-tuple value [NULL] on the other side, standing in for 'a member of' the empty result-set."

That's the core semantics of NULL in SQL—to give you a 'scalar handle onto' the fact that there's no row on the other side of a join. That's where NULLs "come from", if you don't introduce them yourself (i.e. if you never make any of the columns in your tables nullable.) And so those semantics must be what you look at when figuring out what NULL "means." In joins, NULL doesn't quite mean "missing" or "unknown", but rather a combination of the two; it means "unsatisfied", just like in the theorem prover; it means that the query-planner has not yet been told enough to resolve your query fully.

(Yes, you can model an ingestion semantics where any row-tuple value that's not "in" a table, means that whatever predicate is represented by the presence of a value in the table should be considered to resolve false in the relation—but that's not really how SQL query-planning sees the world. It thinks of rows it doesn't have as "nobody bothered to tell me this yet"—unsatisfied—rather than "explicitly missing"—false. If you want false-ness semantics in SQL, it'd probably be better to define a view that generates rows with zero-values/false-values where the underlying table doesn't have values, and join to that instead.)


Philosophically that makes sense.. but in practice I honestly don't recall a time that I've found NULLs being incomparable to be useful. It's always been the case that I want NULL to match NULL.


If you GROUP BY some column, do you want the rows where the column is NULL to get grouped together? They're probably all distinct things.

For example, if your items represent tags, and you're grouping by parent_id to find subtags, the ones without parents (parent_id IS NULL) aren't all subtags of some root of a tag tree; they're all distinct root tags in a tag forest.
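
Concretely:

  SELECT parent_id, count(*) FROM tags GROUP BY parent_id;
  -- all the root tags (parent_id IS NULL) collapse into a single row,
  -- even though NULL = NULL is not true anywhere else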


Exactly. NULL != NULL is surprising, since pretty much every other language treats them as equal, and it seems to serve no purpose other than being philosophically "right".

Even SQL databases that adhere to ANSI NULLS must treat NULL==NULL sometimes (group by, order by)

Obviously you need to compare nulls, hence the crazy IS NOT DISTINCT FROM, which is just another way of saying "equal" in English.


Exactly right. For this reason, I don't think the article should have chosen middle names as an example. NULL is appropriate if you don't know a person's middle name, but if you know they don't have a middle name, the empty string is better, assuming you're not using Oracle where empty strings are NULL.


That's fine, but it still isn't quite intuitive when you consider:

"Do I know the same amount about whether their ages differ and whether their car colors differ?" --> "Yes"

And that is the crux of the intuitive breakdown of 3VL IMO.


Quiz:

    -- Find all customers with total orders
    -- less than $100 in 2019
    SELECT c.name
      FROM customers c LEFT JOIN orders o
      ON (c.id = o.customer_id)
      WHERE DATE_TRUNC('year', o.ts) = '2019-01-01'::timestamptz
      GROUP BY c.name
      HAVING SUM(o.price) < 100;
Can you find the problem?

[ SPOILER ]

The problem is if there is a customer with no orders. The left join will produce NULLs for the attributes of "orders", and SUM() will sum them up to NULL because it thinks NULL means "nothing". The predicate "NULL < 100" will then think the NULL means "unknown" and produce the result "unknown". Then, the HAVING clause treats NULL like FALSE, and does not produce the row.
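
Each step in isolation, for anyone who wants to verify:

    SELECT SUM(price) FROM (SELECT NULL::numeric AS price) o;  -- NULL, not 0
    SELECT NULL::numeric < 100;                                -- NULL ("unknown")
    -- ...and HAVING then treats that NULL as FALSE, dropping the row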


I'd like to add that, in the above example, you get bitten by NULL semantics even though you don't have any NULLs in your data at all.

So if you don't start out with any NULLs, and you have complete and unambiguous knowledge of your entire data set, how does "unknown" get injected into such an innocent-looking example?


This is not really a problem with NULLs though, but primarily with how SQL forces you to put any kind of result into a rectangular, table-shaped resultset. Because of this, a LEFT JOIN returns a single "order" row both when there is exactly 1 order for that customer and when there are no orders. That's just nuts.

If a JOIN could somehow return hierarchically structured data then there would've been no gotchas here.


Well, it's many problems, that's the fun of the example.

But yes, I think left join is fundamentally about grouping, and grouping would be better solved with set-valued attributes or relation-valued attributes.

I wouldn't call that "hierarchically-structured data" because that has the connotation that the data is actually stored in a hierarchy, which I think is generally a bad idea (though not always).


'HAVING sum(coalesce(o.price, 0)) < 100'


You are correct. The coalesce can go inside or outside the SUM().
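
That is, either of:

    HAVING SUM(COALESCE(o.price, 0)) < 100
    HAVING COALESCE(SUM(o.price), 0) < 100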


I'm self-taught when it comes to SQL so I'm probably doing stupid stuff, but I tend to use subqueries when I can.

So in your case I'd do

    SELECT c.name
    FROM customers c 
    WHERE EXISTS (
      SELECT 1 FROM orders o      
      WHERE c.id = o.customer_id
      AND DATE_TRUNC('year', o.ts) = '2019-01-01'::timestamptz
      HAVING SUM(o.price) < 100
   )
Again, might be dumb, but I usually find it easier to think in terms of subqueries rather than joins. Not that I find joining hard per se, just feels more natural and to the point to use subqueries. Database I use also seems to optimize it better most of the time.


Doesn't this produce different output? I could be wrong but I think your code is more equivalent to an inner join, not a left outer join.


Oh yeah you're right, I read the comment above the query and assumed he wanted only customers which had orders.

Could just modify it to exclude the orders above 100 ("where not exists"...), no?


Yes, that should work.
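
For the record, a sketch of that variant (assuming the intent is "total under $100, counting customers with no orders as $0"):

    SELECT c.name
    FROM customers c
    WHERE NOT EXISTS (
      SELECT 1 FROM orders o
      WHERE c.id = o.customer_id
        AND DATE_TRUNC('year', o.ts) = '2019-01-01'::timestamptz
      HAVING SUM(o.price) >= 100
    )

When there are no 2019 orders, the lone aggregate group has SUM = NULL, the HAVING drops it, and NOT EXISTS comes back true - so the customer is included.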


Sadly, I spotted it immediately. God I hate debugging queries at this point.

Edit: just read what you said was wrong with it, and realized you missed the one in the where clause.

Like you said, it really hammers home how hard it is to deal with null properly all the time.


You are only looking at Jan 1st 2019. (And really, a single second of that day, I believe)


DATE_TRUNC('year',...) cuts out the lower-order parts of the timestamp, so I believe my query is correct. More precisely, I believe my query is incorrect in the way I intended but not in other ways ;-)


Hey Jeff! Isn't it also incorrect because

  WHERE DATE_TRUNC('year', o.ts) = '2019-01-01'::timestamptz
will filter out customers without orders even before the HAVING (because DATE_TRUNC of NULL is NULL)? I mean, I guess you're still demonstrating the perils of NULL with that ;)


Ah, you got me!


Personally I like:

    WHERE EXTRACT(YEAR FROM o.ts) = 2019
...though I imagine some people dislike the way EXTRACT treats words like YEAR as keywords. Always interesting to see which solution people choose when there are several options. :)


The problem is likely being unaware COALESCE exists. :)


I don't think that's quite fair. It's pretty easy to know that COALESCE exists and still make the mistake.

A language should help you avoid making mistakes, not trick you into making them. The path of least resistance should be the safest and most normal thing, and SQL fails here.


I think it’s more unfair to say SQL “tricks” people into making mistakes than to lightheartedly see such a query and think whoever wrote it might not know about COALESCE. I see that query and the lack of COALESCE—in the context given where it is known the column we care about has NULLs—and assume there’s an opportunity to introduce someone to COALESCE: “Hey, check this out—it’s made for this situation!”

But saying SQL is tricking people into making mistakes? Nah. The real issue seems to be writing a query without a proper understanding of NULL, not being tricked by the language.


Null dates and null numbers are sometimes useful. Null strings? No. It basically never makes sense to distinguish between text/varchar '' and NULL. If you really need to distinguish between "no middle name" and "middle name not provided" (why would you need that?), just add a boolean middle_name_known. NULL creeps into data in all sorts of ways, so you can't trust that the people doing your entry are actually using the semantics you want for null text. Add a boolean field and make them explicitly commit to the semantics.
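
A sketch of that, with a hypothetical table:

    CREATE TABLE people (
      id int PRIMARY KEY,
      middle_name text NOT NULL DEFAULT '',
      middle_name_known boolean NOT NULL DEFAULT false
      -- middle_name is only meaningful when middle_name_known is true
    );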


Came here for this. I've had so many struggles with big enterprise data where the engineers who design the schema think empty strings and nulls should have meaningful distinctions, but the humans involved in creating that data (who are not engineers) somehow break those assumptions, so when the data migrates elsewhere, it inevitably gets collapsed back down to null and empty string being identical.


> it is quite evident that NULL is defined as “0” in C language.

It is actually the other way around. The bare numeral 0, in C, is defined to be the null pointer when the numeral 0 is encountered in a pointer context. This does not mean that the null pointer actually has anything to do with the value 0. The succeeding if clause in the C program in the article, the one which prints “NULL is 0”, is not even using any pointers! Since the C preprocessor is text-based, the token sequence “if(0 == NULL)” is replaced by the preprocessor with “if(0 == 0)”, which the compiler will interpret as an integer comparison, not a pointer comparison.

For further reading, see http://c-faq.com/null


The way null strings are dealt with under default functions and operators has been a huge industrial mistake. It creates code bloat and mistakes that otherwise would be rare. Oracle mostly got it right.

And in general, "x <> 5" should be "true" if x is null, and not return null.

The justifications are mostly theoretical, not practical. I'd rather have 10 bugs from misuse of theory than 100 from real world. Theorists seem to magnify the "sin" value of theory-related bugs compared to actual bugs. I count them the SAME, and under that scoring formula, we are doing it wrong.


Unless you're using an ORM that replaces "= NULL" with "IS NULL" for you, you should know that there's a configuration setting within Postgres to do so instead: https://postgresqlco.nf/en/doc/param/transform_null_equals/
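
A quick illustration of what the setting changes:

    SET transform_null_equals = on;
    SELECT 1 WHERE NULL = NULL;  -- rewritten to NULL IS NULL, so a row comes back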


I don't recommend that setting, personally. If you do use it, I recommend it only for migration purposes with the idea that you go back to the default setting soon.

Configuration settings that change SQL semantics are just too dangerous.

From the official docs:

"Note that this option only affects the exact form = NULL, not other comparison operators or other expressions that are computationally equivalent to some expression involving the equals operator (such as IN). Thus, this option is not a general fix for bad programming."

https://www.postgresql.org/docs/12/runtime-config-compatible...

I know it seems tempting to use this setting as a convenience, but I strongly recommend against it.


Thanks. "= NULL" is not benign, either. It causes its own set of issues.


Can someone explain the point of using the 'x = NULL' expression? Why does PostgreSQL not prohibit it?


It is valid syntax: you are comparing an "x" with "no value", and the result will be "no value".


I know that it's a valid expression, but are there any use cases for it?


>The output of the above program will be “NULL is 0”, so it is quite evident that NULL is defined as “0” in C language.

To be pedantic, all that's evident from the observed behaviour is that in C, NULL compares equal to the integer literal 0. The two expressions may be, and usually are, not mutually interchangeable, as they (may) have different types.


One thing that threw me off is upserts with NULL in Postgres. Since NULL != NULL, the conflict target never matches, so it will continually create new rows.
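
A minimal reproduction, as a sketch:

    CREATE TABLE t (a int, b int, UNIQUE (a, b));
    INSERT INTO t VALUES (1, NULL) ON CONFLICT (a, b) DO NOTHING;
    INSERT INTO t VALUES (1, NULL) ON CONFLICT (a, b) DO NOTHING;
    SELECT count(*) FROM t;  -- 2: the two NULLs never conflict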


30 yrs in the business and I've never run into NULL semantics issues... because I virtually never use NULL (!)

my experience: it's far better to explicitly reserve a regular value to stand in for NULL. In fact, I go the other way and ask when NULL is even worth adding.

example: NULL-like values exist in your datatype, but there's no easy way to represent them otherwise, i.e. you can't reserve a value because they're truly all used by real data.

example: you have a slew of BOOLEAN columns and NULL is a legitimate third value in the dataset. The alternative (char(1)) can be a lot of overhead vs the bitmask optimizations that most SQL engines offer (but test this!!!)

hope this helps!

p.s. having studied the history and known some of the people responsible... if we gave them a time machine, they'd first kill Hitler... and then undo this decision.


I wish SQL had used a Maybe/Option type, or other algebraic data types, instead of NULL. So much safer and cleaner.


What would the behavior of the option type be? I don't see how it would behave differently than what SQL does today.


Option types can require exhaustive matching. So where there can be a NULL, you would have to address it. The syntax can make it pretty easy to handle typical cases.


Related: Did you know that MySQL has a null safe operator? Absolutely life changing: https://www.knowledgewalls.com/johnpeter/books/mysql/nullsaf...


Null may have technical meanings in languages like C++ (0), but essentially the conceptual meaning of null is "No Value" in every language.

It's a huge mistake in every language including SQL.

If you have an entity of a type. Say Int. What does NULL mean in context of that type?

  X = NULL
  Y = NULL + 1
It doesn't matter what language you operate in, there is no NULL + 1, because NULL is not an INT. It just doesn't exist. So your programming language, if it's good, should throw an error. If it's shit, it will continue to propagate the Y deep into your program and you won't even know X was a null.

If you think a default to zero works imagine this:

  X = IntFunctionThatErrorsOutAndReturnsUnknownValueAkaNull()
  Y = X + 3
Does it make any sense to you that Y will be 3? X was unknown! You don't even know that the function errored out! Either IntFunctionThatErrorsOutAndReturnsUnknownValueAkaNull should have thrown an error or it should have returned a special type called an Optional Type. Throwing an error isn't ideal because you won't know something is wrong until runtime so the Optional Type is the better deal here. Either way neither approach is taken and instead you have a null and Y propagates through your program as if NOTHING happened.

Unfortunately SQL doesn't offer The Optional type or sum types therefore we are stuck with Null.

So if you do this in a language that defaults nulls to zeros:

   x = getFirstSqlColumnAsList("SELECT * FROM integerTableWithOnlyNullValues;")
   y = x[0] + 1
boom, you get a garbage value y propagating through your entire program thanks to SQL nulls. SQL isn't innocent here.

Now imagine Optional types with exhaustive pattern matching:

  x: Maybe(List(int)) = getFirstSqlColumnAsList("SELECT * FROM integerTableWithOnlyNullValues;")
  y = x match {
         Just xs: Just(xs[0] + 1),
         Nothing: Nothing
       }
If you can't understand why the above code is so much better than the previous code, then you need to look up the Maybe monad with algebraic data types. Basically, the code above makes sure that there is zero chance that an error goes unhandled.


Yes, that's the problem this blog tries to explain. NULL means "no value", but this is not true for all languages. So you'd better know what the value of NULL is in the language you are trying to use.



