Null Values in SQL Queries (mitchum.blog)
137 points by maynman on Feb 3, 2020 | 152 comments



>For example, Oracle database won’t allow you to have an empty string. Anytime Oracle database sees an empty string, it automatically converts the empty string into a NULL value.

Damn. This is how you do enterprise.

I might be the only person who likes SQL nulls. If you learn how they work up front, they're useful and not really that confusing. But if I ran into weird behaviors like this, I might hate them too.


There are a lot of real-world input situations in which I want NULL to indicate that a record field hasn't yet been collected, versus a record in which the answer to that field was "" – i.e. left blank by the respondent/data source. Sure, I could have the front-facing data app auto-convert blank values to "N/A" or "(blank)", but that's unnecessary complexity.


I've been really enjoying using Rust because it has no NULL. You instead have to use the semantics of Some(x) / None and handle both cases. I'm not sure how you'd work around the fact that an empty string is Null.

The problem with Null is that it can be either a value or show the lack of presence, and in JS, because both undefined and null are values, you can't depend on them being used for their semantics either (you can try to enforce their use, but that's no guarantee).

SQL suffers this exact same problem. I wonder what SQL would look like without Null.

    SELECT id, Option(key)
    FROM table

    INSERT ... Some(5), None::Int

From this perspective it still seems that an empty string should not be None, but I think you're right that many people do use a code like "N/A". You could instead add a supplemental Boolean field, which is cleaner for set / not set.

Interesting


> SQL suffers this exact same problem. I wonder what SQL would look like without Null.

As a base table value, it's easy to mechanically avoid NULL in SQL. Every nullable column is, in the most general case, simply a column that doesn't belong in the table but in a different table whose primary key is also a foreign key referencing the original table’s primary key.
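
For instance, a hypothetical schema, just to illustrate the decomposition:

    CREATE TABLE person (
        id INT PRIMARY KEY
    );

    CREATE TABLE person_middle_name (
        person_id   INT PRIMARY KEY REFERENCES person (id),
        middle_name VARCHAR(100) NOT NULL
    );

    -- A person with no row in person_middle_name simply has no middle name;
    -- NULL only reappears at query time, via an outer join.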


Your solution works, but I would hardly describe having a new table for every nullable column as "easy". It also would not be nice for maintenance -- often I want to add a new column as null initially, then soon after write a migration to convert it to non-null once it's populated. Having to create / remove a new table every time would be a pain.


>The problem with Null is that it can be either a value or show the lack of presence

No, the trouble with null is that it's a supertype that breaks the type system -- if you actually wanted to have safe code in the sense you expect type safety to give you, you should be checking for NULLs everywhere. Because a codebase can be changed from never-null to sometimes-null, and the type system is happy to let it go through, because never-null was not actually a thing -- we just imagined it for our convenience.

The problem with null is that, in general, if you didn't implement explicit handling for it, then its presence will do the wrong thing.

The primary benefit of Rust's Option is that you can safely not check for null.

SQL has in fact the same problem -- we just don't bother to handle the null until it bites us in the ass. For example, booleans don't actually exist, because a bool really has the values (TRUE, FALSE, NULL). That breaks your Boolean logic because you're using ternary logic (with unintuitive results, because there isn't really a good answer), which leads to innocent queries silently including both false positives and false negatives when nulls are present in your data [0]
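
For instance (a sketch, assuming a table t with a nullable int column x):

    -- Looks like a tautology, but rows where x IS NULL match neither branch,
    -- so they silently vanish from the result:
    SELECT count(*) FROM t WHERE x = 1 OR x <> 1;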

SQL does however allow for the column constraint NOT NULL, but nothing actually forces your queries to handle the possibility of nulls correctly (because, by default, everything is nullable, and your queries would be horrifically verbose), so queries tend not to handle the case correctly until they run into the problem.

What people tend not to understand is that we should be checking for nulls everywhere, and we only don't because it would be insanity. The primary benefit of moving nulls into an Enum is that we can now safely specify that a thing is NOT NULL (and our compiler will enforce it). And as it turns out, most of your data does not need, or want, to be nullable.

[0] https://www.google.com/url?q=https://sigmodrecord.org/public...


This still doesn't distinguish a user-input Null from "this value was never set". I think in that case you'd want a nullable boolean companion field that can then express set, actively not set, and passively not set (when that matters), but it's a bit ugly.


Value isn't set? Use Nothing. Data has been collected and the user didn't input anything? Use Just "".

If you really wanted to, in Haskell you could do something like:

    data AdvancedMaybe a = Just a | Nothing | PassiveNothing


Interesting example. I certainly think just using an empty "" is fine, and the coercion to NULL is not great.

I think allowing "empty" values, and not allowing Null, would be better.

If SQL had Option (Maybe) branch matching it would be an interesting replacement, and then coalesce would not be needed.

    SELECT id, MATCH k WITH (
      SOME(k) THEN k,
      NONE THEN default,
    END) AS something
    FROM ...
I can see it's terse, and it is just a wrapper around data that would be stored like NULL but it kind of makes for interesting semantics.

Just spitballing for fun on what it might look like.


Empty string isn't enough for this situation since it doesn't work for data types other than string. I actually prefer null vs undefined in JavaScript / JSON here; XML also has the concept of null vs undefined.

Empty string vs null is not that useful if you have null and undefined.


Undefined is a very underappreciated invention. I wish I had undefined in Java, throwing an exception on read. That would make a lot of software much more reliable. I'm going to write my own AutoValue library just to generate Java beans which will allow uninitialized values.

PS undefined in JavaScript is not that useful IMO. But it could be useful with more strict language.


Optional was added in Java 8.


Just to be clear, undefined is not valid in JSON. The only types in JSON are nulls, strings, numbers, arrays, booleans and objects.


"undefined" in this context means that the JSON object has had the relevant key omitted rather than explicitly set to null. This is valid JSON (it just so happens that JavaScript handles this scenario by returning "undefined" as with all undefined variables).


Ah, right, my bad.

In my defense, JavaScript treats a value of `undefined` differently from an unset field / variable (e.g. when testing `object.hasOwnProperty('field')`).


JavaScript has too many "non-visible" value types in my opinion. This makes for both messy code and messy debugging.


Yes, omitted value vs provided-with-null, which is easy to work with in JS because of null and undefined both being falsey.

Working with XML or JSON in a more typed language like C# is less fun because there is no distinction in the language; instead you end up using DOM-like checks to see if the key was provided or was set to null.

.NET code generators used to add an extra property called [property name]Specified for XML serialization so you could send null values or not send the node at all. I definitely prefer more direct language support for "no value provided" vs a null value.


I've never needed such in 3 decades of systems design. I'd like to hear the details. An explicit flag or time-stamp should be used to indicate when or if a record has been updated. To be frank, heavy reliance on Null strings usually means somebody is doing something wrong or awkward, in my opinion. Null strings cause 10 problems for every 1 they solve. I stand by that and will and have defended it for hours in debates. Bring it on! (Granted, most RDBMS don't offer enough tools to easily do it correctly.)


A common situation I've run into is with "default" values and overrides, especially for configuration-type settings. NULL indicates "use the parent record's value", while non-NULL, including empty string, means to use that value. By allowing empty string, you explicitly allow a user to basically say "don't inherit". Think along the lines of `COALESCE(userValue, tenantDefaultValue, 'Default Value')`.

One way of implementing the UI for this is to have a checkbox labelled "Inherit" or "Use default", and if it's checked, disable the actual textbox (and put NULL in the field).
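
A minimal sketch of that lookup (user_settings/tenant_settings are made-up names):

    SELECT COALESCE(u.value, t.value, 'Default Value') AS effective_value
    FROM user_settings u
    LEFT JOIN tenant_settings t ON t.setting_key = u.setting_key
    WHERE u.setting_key = 'theme';
    -- NULL in u.value means "inherit from the tenant";
    -- empty string means "explicitly blank, don't inherit".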

I've also run into similar patterns with data telemetry. I want to accurately represent the data I received, so if some older version doesn't include a particular value, I want to store that as NULL, because empty string is a potentially valid value. If you "normalize" NULLs to empty string, it becomes impossible to answer a question like "What % of systems have a value for this data item?" since it includes versions that can't report it at all.


Did the user leave the field blank in the form, or did they never even see it (multi-step processes etc.)?

Did you want to update the field to null, or not update it at all? That's another one.


What exactly does "not see" mean? Some use tab, others use the mouse. That doesn't tell us much, and there are better ways to track user hand/mouse movements if you need such telemetry. UI APIs often handle nulls/blanks differently, such that you don't want to over-rely on how one of them does it. Multi-screen tasks should track when each sub-screen is finished, not just the final transaction. I'd like to see a more specific use-case.

Re: Did you want to update the field to null or not update it at all is another one.

Who, the user or DBA? And why? Users don't know a Null from a horse; you shouldn't make them think about that, that's an internal guts thing.


I have to say I'm a bit amused at the insistence on tabs/mice/telemetry :)

In at least one project I worked on, there was a lot of survey data entered from paper surveys used in the 90s and early 2000s. The structure of the survey included pass-through questions: "If you have ever smoked a joint, please check yes and answer questions 42 and 43. If you have not ever smoked a joint, please check no and TURN THE PAGE."

One can certainly build logic to process these replies (check the answer to question 41, tally blanks in questions 42, 43 according to the answer to question 41), but since these questions and answers were also entered into the computer in the olden days, NULL was used if questions 42, 43 were passed through, while blank was used if questions 42, 43 were left blank (and 41 was yes).

How times have changed.


Was the user prompted to enter the value but left it blank, or were they never prompted? Yes, you could always have more data, but this extra two-value distinction is common and practical in my experience.

On updating, it's typically related to prompting again -- communicating user intent. Did the user update a field from having a value to not having a value (they purposely blanked it out), or did they not touch the field (maybe it wasn't even shown), so don't modify it? Basically, am I going to generate an update statement with "set field = null" or no set at all for that field? This is trivial to send through the application in JSON due to null vs undefined.
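
In SQL terms the two intents come out as different statements (hypothetical users table):

    -- payload contained "middle_name": null, i.e. the user cleared the field:
    UPDATE users SET middle_name = NULL WHERE id = 42;

    -- payload omitted "middle_name" entirely (undefined), so leave it alone:
    UPDATE users SET email = 'new@example.com' WHERE id = 42;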


Re: but this extra two value distinction is common and practical in my experience.

I'm still not following. What exact hand or keyboard/mouse movements constitute "left it blank"? "Leave" means to "move away from". What is moving away from what?

Usually one stores a time-stamp for input prompts (screens). If the prompt never appears, you then have no time-stamp and/or record for it. Why would a broken prompt create a record to begin with? I'm missing some context you seem to be assuming. You may be assuming behavior of a specific UI product or stack.


Not sure I can explain more simply; just think multi-part wizard interfaces, or forms with buttons that open up sub-forms.

I think you get it since you talk about time stamps. Just as you could record all mouse movements and keystrokes to get higher fidelity, you could break a single record into multiple records with time stamps to record each step in a form; then the lack of a record would be distinct from a record with all nulls along with time stamps. You could also do a record per field (which I have seen more than once) with metadata per field as other columns.

But without all that, a system that supports null and some empty value gives you more fidelity than null alone, which again in my experience is practical.

Empty string and null work fine for this purpose with strings, but for other data types you start needing to pick special values like the minimum value for a 32-bit int, or a 0 date, etc.


Usually you don't get the entire sub-screen's data, such that a blank versus null distinction wouldn't help. You can't "half submit" a dialog or sub-screen on the vast majority of systems I ever worked with, and one adds time-stamps to critical sub-screens if they can.

Maybe you encountered a specific product's edge-case where it helps, but I'd gladly accept your edge-case headache in exchange for avoiding 9 other headaches. It may be the case that Product X doesn't allow time-stamps, but should we gum up all strings to make just Product X happy? I don't see how the tradeoff math favors that. Plus, it encourages bad designs by plugging problems with funny non-printable codes. Let's not enable slop-heads.

Re: but for other datatypes

For the purpose of this discussion I'm limiting it to strings, where the majority of null-related headaches are found in my experience.


I just want to know if a boolean has been explicitly set, without having to check the value of another more complex data type that would never be used for anything else.


To clarify, my context is strings. If that was not clear, my apologies.


I’d go further than you and say they should be removed by default on all fields.

Want to know if a Boolean field is unset? Well it’s no longer Boolean because you now have 3 states for that field. So why not use a char, U/Y/N with the default being U?

NULL might have made more sense 30+ years ago when systems were more resource constrained but it doesn’t make sense now for most of the problems people are trying to solve day to day. If anything, it creates more problems.

Just to be clear, I’m not saying they should be removed entirely but rather that they shouldn’t have to be explicitly disabled on every CREATE.

I will say one use case for NULL that is hugely helpful is outer joins and nested queries. However, these don't generate high-performance queries, so if you're having to rely on them you might need to rethink your database schema anyway.
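
For instance (hypothetical tables), the NULLs an outer join produces are exactly how you find unmatched rows:

    SELECT e.id
    FROM employees e
    LEFT JOIN paychecks p ON p.employee_id = e.id
    WHERE p.employee_id IS NULL;  -- employees with no paychecks at all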

So essentially I don’t disagree with you, I just think you’re being too nice limiting your complaint to string fields.


> Well it’s no longer Boolean because you now have 3 states for that field. So why not use a char, U/Y/N with the default being U?

Well because instead of using a type that exactly encodes the concept of "yes/no/unset" (nullable boolean), you'd be using a type that encodes "any 1-character text, with arbitrary meaning and most of the values being nonsensical"


The problem is you need a boat load of additional code to cover unset. Not just in SQL (syntax is different for NULL than it is for comparing any other type) but often also in your importing language too (eg some languages will cast NULL to a nil value that can actually raise exceptions or even crash your application if not handled correctly).

Capturing those edge cases is non-trivial compared to checking the value of a char.

In an ideal world your unit tests and CI pipelines would catch all of those, but that depends on well-written tests. Thus, in my experience, having fewer hidden traps from the outset is automatically a better design than one that perfectly fits an academic theory but is harder to support in practice.


It'd probably be more sane than trying to stuff a 3VL into a bunch of 2VL operations because you refuse to acknowledge that you don't actually have a 2VL type.


I'm sceptical. Do you have examples?

Take the middle name example: "What was U.S. President Theodore Roosevelt’s middle name?" when you know he didn't have a middle name.

Are you suggesting that a blank is the correct choice here?

I don't think it's accurate to say they have a blank middle name. I think it's better to say they don't have a middle name.


That's entirely a schema/data policy question. Null could indicate no name, or it could just as easily indicate that data was not collected, with an empty string indicating it was collected but is empty.

Both are entirely valid methods of encoding the data, and require knowing how the data is intended to be collected, entered and used to determine the best choice.


Should there be an empty date, as a corollary to empty string, for representing dates that are "collected but empty" eg date of death?


If that's the data policy of the group collecting the data, then an empty date (which some systems support, 0000-00-00 is valid in MySQL for certain modes of strictness IIRC), or a sentinel value could be used (1111-11-11, depending on expected values).

If the data needs to differentiate between those cases and a separate boolean bit of data to track whether it was collected is not feasible (as it so often isn't in the real world), then you do what you must. How that works out in the real world would be the policy for how to interpret the data in that field. It's really no different than any time you've seen a field that is named something counter-intuitive to what it holds in some schema. Something tracks that difference, whether it be institutional knowledge, some conversion code, or a label prior to display. That's what I mean by it being a "schema/data policy question".


Yes absolutely a data schema question. And yes, you can set whatever arbitrary policy you want. eg "nmn" for no-middle-name. Or require that a horse's birthday is recorded as 1st August.

But SQL and the relational model is meant to be a logical system, and I'm interested in preserving some conceptual integrity in the domain modelling.


I would say there's a spectrum. My guess is that whatever schema you could come up with could have an extra layer of metadata applied to describe the data to some benefit, but at some point you have to make a decision about how something is best represented.

To keep with the example of asking questions: if a survey is given and each question is represented by a field in a table for that survey, but they also want to track whether any particular question is answered, there are multiple ways to track this information. It can be done through a separate table (or multiple), it can be done with an extra boolean field per original question field, or it can be done by making each question field nullable and ensuring that no question answered will ever be left null.

Which of those cases breaks the conceptual integrity of the domain modeling? I would argue that as long as they are applied consistently and accurately, none of them break it.


If your SQL query tells you the most popular name is "nmn", or that a whole lot of people died before they were born (DOD = 1111-11-11), then yes, I think there has been a loss of integrity.


That only matters if your SQL data is ever meant to be used in isolation. If the only appropriate way to access the data and retain integrity is either through the same set of applications that insert it, or through functions written to mediate access within the SQL instance, then how the data is stored on disk is mostly irrelevant.

If all your access is mediated by a layer that takes care of this, then complaining about how it's stored in SQL is no different than complaining about how your DBMS is storing the data on disk. It doesn't matter, you aren't supposed to be doing anything with it there anyway.

Data integrity is not something a database can ensure for any non-trivial use. ANSI SQL provides tools to help ensure data integrity -- data types, referential integrity checks if you define them and they are supported -- but ultimately those tools can only account for what the database supports. Will they prevent the favorite_color field from receiving the data that should have been entered into the pet_name field? Unlikely.

It's no different from programming languages and bugs (a data integrity problem is a bug in the data domain). Using C, Java, Rust or Haskell will not prevent all your bugs, and Perl, Python, JavaScript or Bash do not mean you will have bugs. A bug-free Perl script is possible, and so is a bug-riddled Rust program, as not all bugs are (or even can be) caught by the assurances the stricter languages provide.

Unless all your use of a SQL schema is through a directly connected SQL client that loads data directly from native SQL formats, runs queries, and exports data as the output of those queries, the applications that have ingress and egress to that database are what really matters for data integrity, and worrying that a date field might contain 1111-11-11 as a special sentinel value that means something other than what a null value in the same field means is being unable to see the forest for the trees.


No, what you suggest is counter intuitive and should never pass a code review.


> No, what you suggest is counter intuitive and should never pass a code review.

Your assertion is that it's counter-intuitive to use the database's ability to encode a lack of information to denote a lack of knowledge about the information, and that it should only be used to denote that the information universally does not exist? And additionally that this is so well accepted that to do otherwise would not be accepted by any peers?

If so, that's a fairly bold assertion to make when you haven't provided any evidence or reasoning to back it up.


This is the perfect example. In an enterprise, some forms might ask middle name and some forms might not.

In the database you need to represent whether the user entered a blank name “”, or the middle name was never even requested (NULL).

Only in Oracle is it not possible to make this distinction. Oracle is definitely broken in this regard.

Inserting a string like “unknown” or “unspecified” in the middle name column is a clunky workaround.


The problem is a bit more subtle. NULL is often used with two distinct meanings:

- I know that Theodore Roosevelt has no middle name.

- I don't know what Theodore Roosevelt's middle name is

The semantics of each case are different in a subtle but important way. Consider the following two predicates

'Jim' = [no value] vs 'Jim' = [I don't know]

The former predicate is obviously false. 'Jim' is not equivalent to a non-existent value. The latter predicate however, can't be assigned a truth value. It's entirely possible that the unknown value could later turn out to be 'Jim'.

The key issue is that SQL treats NULL values according to the latter ('I don't know') rule and ternary (Kleene) logic, as TFA discusses. That, in turn means that using NULL to represent [no value] will lead to painful and hard to diagnose logic bugs (e.g., WHERE x = 1 OR x != 1 not returning all records).

[edit: clarifying why bugs will arise]


I do recognize (1) there is a logical difference between "missing because absent" and "missing because unknown", (2) empty string is not an element from the domain of personal names.

In a sense, empty string is available as a marker, just as NULL is available as a marker, to satisfy either scenario. I think the possibility of three-valued logic applies in either scenario.

I'm not at all convinced an empty (non-null) string is appropriate for many real-world applications, just as I'm not convinced an empty (non-null) date is appropriate for similar scenarios.

Eg. Date of death: unknown - NULL; not dead yet - NULL


Yeah. That's what null is for (more or less).


Right, I am a fan of SQL NULL as well. It is nicely consistent -- anything NULL in, you get NULL out -- clearly telling you that you're getting undefined data. Silently converting to empty string, zero, or the like would eventually return garbage, for harder-to-debug reasons.

That Oracle behavior annoys me each time, though.


Re: Silently converting to empty string, zero, or therelike would eventually return garbage for harder to debug reasons.

It's never been a problem with strings in my many decades of experience, unless somebody does something which I consider poor system engineering. Nearby I invited a solid use-case illustrating a real string need.


No. I am tired of this: WHERE (a = b) OR (a IS NULL AND b IS NULL)

Null should equal null like in every other programming language. Even SQL's GROUP BY treats null as equal to null, which is even more inconsistent.


It is quite wordy but the SQL solution for this is "WHERE a IS NOT DISTINCT FROM b".

https://modern-sql.com/feature/is-distinct-from
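
That is, these two filters return the same rows (sketch with a hypothetical table t):

    SELECT * FROM t WHERE (a = b) OR (a IS NULL AND b IS NULL);

    SELECT * FROM t WHERE a IS NOT DISTINCT FROM b;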


> Null should equal null like in every other programming language

Think of SQL's NULL like it's a NaN.


Something to petition ANSI about.

(Microsoft's SQL Server still defaults to the non-ANSI NULL behavior where a = b when both are NULL, and that still shows up on checklists of whether SQL Server follows ANSI standards. SQL Server is kind enough to let you enable/disable the behavior, and likely that would persist even after the default switches to meet the standard, as the docs assure will happen "in some future version".)


Actually I find this a very practical behavior. It allows me to have NULLs in columns participating in UNIQUE indexes. At least with indexes, it's very intuitive and useful.


My preference would be a flag that you can set per connection.

NULL = Undefined or No Data. Whereas a blank field can, in and of itself, be data. It may indicate something is intentionally left blank.

But for those times where you want to consider them the same, it would be nice to have a setting.

(Note that I admit the possibility that this may exist already, like most my great ideas.)


In MSSQL you have SET ANSI_NULLS OFF to do this.


NULL is similar to floating point NaN (aka not-a-number), which likewise compares unequal to itself: NaN != NaN.


Sounds like a disaster when trying to correlate Ids/keys between tables.


Letting null equal null might work okay for a WHERE clause, but it turns JOINing into a terrible muddle. And I don't want null to behave different ways in different clauses.


Already behaves differently in GROUP BY, UNION, DISTINCT and PARTITION.


where coalesce(a, '') = coalesce(b, '')


Normally can't use an index as every value has to be coalesced before comparison.


You can index the results of functions (or other expressions) in at least some databases. PostgreSQL can do this, for example.

  CREATE INDEX idx ON tbl (coalesce(col, ''))
Combining this with UNIQUE can make for some neat tricks.
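
For example (a sketch, assuming a users table where NULL and '' emails should count as the same value):

    -- a plain UNIQUE constraint would admit any number of NULL emails;
    -- this admits at most one NULL-or-empty email:
    CREATE UNIQUE INDEX users_email_uniq ON users (coalesce(email, ''));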


I like nulls too and think they make perfect sense, especially with numeric fields.

Suppose we have an "age" field, but don't actually know the age of the person, null makes perfect sense.

Otherwise we'd have to do something like using a "magic number" like 0 or -1, or a separate field altogether, to indicate an unknown value.

Granted, they do need some handling in queries.
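
A sketch of that (hypothetical people table):

    CREATE TABLE people (
        id  INT PRIMARY KEY,
        age INT  -- NULL = unknown; no 0/-1 magic number needed
    );

    -- aggregates skip NULLs, so unknown ages don't skew the result:
    SELECT avg(age) FROM people;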


Except that age = 0 is valid if your database has toddlers. Hell, even -1 makes sense if you're running a query to get the age of someone at a specific time. This is why you should try to use NULLs whenever possible.


I also am a fan of nulls. But I also make very sure that I have defaults set up on every column for which a null value would make no sense. Which means pretty much every column that isn't an optional foreign key.

I'm writing in Go, so my structs all have zero values unless specifically initialised. This does sometimes mean that I get nil UUIDs (00000000-0000-0000-0000-000000000000) inserted into tables, which Postgres doesn't understand as null and cheerfully returns as a valid UUID. This has been my only real pain with using nulls.

I've contemplated modding the database driver to interpret nil-value uuid's as null, but that seems a little drastic. Anyone got any better ideas?


Not sure how this integrates with Go's database drivers but from the Postgres side:

1. NULLIF(the_uuid, '00000000-0000-0000-0000-000000000000')

https://www.postgresql.org/docs/current/functions-conditiona...

2. on insert/update rules to rewrite the uuid (probably using NULLIF)

https://www.postgresql.org/docs/current/sql-createrule.html


My typical pattern is to use pointer types as a pseudo-optional type. There is the additional cost of dealing with a pointer on the Go side though, which can get cumbersome at times.

`database/sql` does have an interface which lets you define your own marshaling code though. Using it is very simple and would let you marshal an all-zero UUID into NULL easily enough.


You can use a type that marshals the Go zero value to database null and vice versa. https://godoc.org/github.com/jackc/pgtype/zeronull provides a UUID type with that behavior.


pgx allows passing pointers to indicate which values are null. That works with arbitrary types and prevents accidentally storing an empty value as null.


I've recently gone and removed the "nullable" attribute from a bunch of SQL columns that previously had them (and arguably should have them), and the result has actually been rather pleasant.

One interesting problem that arises when you use nulls is that it can be difficult to ensure people actually use them when it's appropriate to do so. Case in point I have a field that is essentially an optional positive integer, so obviously I made it a nullable unsigned int. A few years later there's a pretty even spread of nulls and 0s in there to indicate the same thing -- to a lot of consumers, they behave the same because their business logic basically says "if (foo->field) do stuff" which works either way. In the end I changed it to a non-nullable field using 0 as the null stand-in, which is semantically worse, but ended up making interesting searches over this data set a lot easier.

On the one hand, perhaps the more correct answer would have been to yell at people putting 0s in when they should have been putting nulls in. On the other hand, we put constraints on fields for a reason...


AFAIK, this Oracle "feature" is only true for columns marked as "not nullable". So if you attempt to insert an empty string ("") into a not-nullable column, it will fail.

All other relational databases differentiate between empty strings and NULL.


Nah, it's true for all columns:

  SQL*Plus: Release 11.2.0.4.0 Production on Mon Feb 3 15:56:30 2020
  
  Copyright (c) 1982, 2013, Oracle.  All rights reserved.
  
  
  Connected to:
  Oracle Database 11g Express Edition Release 11.2.0.2.0 - 64bit Production
  
  SQL> create table foo(fook varchar2(10) not null);
  
  Table created.
  
  SQL> insert into foo values ('');
  insert into foo values ('')
                          *
  ERROR at line 1:
  ORA-01400: cannot insert NULL into ("SPORTSBOOK_DOCK"."FOO"."FOOK")
  
  
  SQL> create table bar(bark varchar2(10));
  
  Table created.
  
  SQL> insert into bar values ('');
  
  1 row created.
  
  SQL> insert into bar values ('a');
  
  1 row created.
  
  SQL> insert into bar values (null);
  
  1 row created.
  
  SQL> select * from bar;
  
  BARK
  ----------
  
  a
  
  
  SQL> select * from bar where bark = '';
  
  no rows selected


It is also true for nullable columns.


> if you attempt to insert an empty string ("") into a not-nullable column, it will fail

Yes, because it will convert the empty string into a NULL, and fail to insert that NULL into a non-NULLable column.


You are not the only one. I get that there are some exotic and theoretical corner cases where nulls actually are problematic, but for the vast, vast majority of practical cases, nulls and three state logic are very useful. In my humble opinion.


>For example, Oracle database won’t allow you to have an empty string.

It's also true for AWS' DynamoDB offering.


>I might be the only person who likes SQL nulls

From my understanding, null values are bad because they're a sign that the database design is flawed (see database normalization). Perhaps you have more practical experience?


That's not right. "Null" means that the value is not known. Suppose you have a table for employees, and you want to record the last time they were paid. What do you put in that column for people who just started this morning? The alternatives are to use null, indicating that they haven't been, or to formulate a codebase-wide sentinel value like "0000-01-01" and then account for that in every single database operation everywhere.

Further suppose that you have an external function in your codebase to estimate how many paychecks you've paid to someone, but the author doesn't know about any "0000-01-01" conventions your office uses. Without that, you'd see that Joe New Guy has worked here about 2,020 years, so we've probably issued him about 48,000 checks. If only you'd used null, then that function would have calculated "today() - null", which in any sane language would raise a type exception and alert you to the problem.
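
Sketched out (hypothetical employees table, PostgreSQL-flavored):

    CREATE TABLE employees (
        id        INT PRIMARY KEY,
        name      TEXT NOT NULL,
        last_paid DATE  -- NULL until the first paycheck goes out
    );

    -- new hires come back as NULL here, not as ~737,000 days:
    SELECT name, current_date - last_paid AS days_since_paid
    FROM employees;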

Nulls are beautiful. They have meaning. Lots of people misuse them, but that doesn't mean they're not valid and useful.


> The alternatives are to use null, indicating that they haven't been, or to formulate a codebase-wide sentinel value like "0000-01-01" and then accounting for that in every single database operation everywhere.

Arguably, you might consider that it is a bad design. Perhaps there should be a paycheck table that you can sort by date, instead of a field in the employee table that is updated.

(Of course this is hypothetical, so it is hard to argue use cases)

But a paycheck table does help with the second example because now you don’t need to estimate number of paychecks either. Which will help with unpaid leave or sabbatical situations to give you an accurate number.

But this is what they are getting at with bad design. (Potentially) As far as normalization goes.


Oh, sure. Your proposal is a better plan for this specific situation.

I certainly won't argue that people come up with some bad designs! Heaven knows I've dealt with some of them. I just wanted to point out that there are situations where null is perfectly reasonable, even though there are plenty where it surely isn't.


But data protection might not allow you to keep this for longer than legally mandated, i.e., ~10 years. How do you handle this for long-time employees?


this is 100% how I feel. It's mindboggling that some people here think NULL is "wrong" or should be avoided. Nulls are, as you said, beautiful.


A quick question I use to demonstrate the usefulness of NULL is "What color is the elephant on my desk?"

That question doesn't have an answer because there is no elephant on my desk. It can't be represented by any color, the answer needs to indicate that there is no value.


The result may be null when asking the question, but the tables representing this reality do not need a null column.

(Specifically, not returning any rows when queried, which is different from null)


Often in analytics you will create datasets which are an amalgamation of several upstream sources. For example, an Endangered Species database. You might have "wing length" and "number of eyes" as fields, and both of these need to be NULL for something like a sea urchin.

There's really no justification to avoid nulls as a principle.


Okay, but what if I `SELECT wood, lampshade, elephant_color FROM desks WHERE location = 'office'`? How does it represent my mahogany office desk with a beach-theme lampshade and no elephant on it, vs my teak bedroom writing desk with a pink elephant and white cream lampshade?


By normalizing into multiple tables.

It is hard to model without requirements, but you probably need a table for types of desks, another for desk instances with location, and another one-to-many linking table for items/colors.

Why? I like two lamps and no elephants. By normalizing, this new requirement is simple to represent compared to jamming more fields into the desk table.

Or if another person prefers penguins you don’t end up with penguin_color columns.

It is possible the one-table design works, but the introduction of nulls is a smell that indicates the design may be a problem.
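
A rough sketch of that shape (all names made up):

    CREATE TABLE desk (
        id       INT PRIMARY KEY,
        wood     TEXT NOT NULL,
        location TEXT NOT NULL
    );

    CREATE TABLE desk_item (
        id      INT PRIMARY KEY,
        desk_id INT NOT NULL REFERENCES desk (id),
        item    TEXT NOT NULL,  -- 'lampshade', 'elephant', 'penguin', ...
        color   TEXT NOT NULL
    );

    -- "no elephant" is simply the absence of a row, not a NULL color,
    -- and "two lamps" is just two rows.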


> That's not right. "Null" means that the value is not known.

According to my old SQL Server textbook (written by the SQL Server team), NULL should be avoided because it has so many different interpretations (I think forty was mentioned).


1. NOT NULL is very frequently used in schemas, because NULL is often an undesirable value (e.g. for a mandatory field). That doesn't make NULLs bad, and NULLs are still frequently used when you have optional fields or fields where NULL has some other special meaning.

2. NULLs are used by SQL functions and operators as an "unknown" value. So, for example, "NULL AND TRUE" is NULL, because we could substitute NULL with TRUE or with FALSE to get different results, but "NULL AND FALSE" is FALSE, because no matter what we substitute NULL with, the result will always be FALSE. (See the sketch after this list.)

3. Clearly all these valid uses of NULL do not indicate "flaws" in the database design.

4. Database normalization isn't always a good thing, and beyond a certain level it's almost always a bad thing, so using normalization methods as a standard for whether something is "flawed" is probably not the best idea.

5. No database normalization method, as far as I know, actually tries to eliminate NULLs, so I don't know what "(See database normalizations)" refers to. Can you clarify?
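
Regarding point 2, a quick sketch (PostgreSQL syntax):

    SELECT NULL AND TRUE;   -- NULL: the outcome depends on the unknown
    SELECT NULL AND FALSE;  -- FALSE: the outcome is FALSE either way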


Re: NOT NULL is very frequently used in schemas, because NULL is often an undesirable value

Talking strings, usually if you don't want a null string, you also don't want blanks (white space) either. I'd like to see auto-trim in the standard, and also a minimum length specifier. We then wouldn't need to deal with nulls. A single min-length-when-trimmed-and-denulled integer value would replace a lot of repetitious hubbub in typical CRUD apps. D.R.Y. it! You'd have one attribute "cell" that would replace the equivalent of:

     IF length(trim(denull(inputValue,''))) 
        < this_fields_min_length THEN
         raise_data_too_short_error(...);
That's the way you'd want it done in the vast vast majority of CRUD systems (if not doing weird things).


The fact that a database is not fully normalized is not a sign that it's flawed. In fact, there are cases where some tables being fully denormalized makes sense (although that's less common in my experience).


Null values and inequality are extremely counterintuitive (in postgres at least). If you run the query:

  SELECT * FROM my_table WHERE my_column != 5
You would expect it to return rows that have a null value for my_column, since null is not 5. However that is not the case.


NULL in SQL is often interpreted in many different ways. The most helpful I’ve found is to think of it as unknown. Postgres has the IS DISTINCT FROM operator to capture what you’ve intended above:

    ... WHERE my_column IS DISTINCT FROM 5


I wasn't aware, thanks for the tip. Is there any equivalent for sets of values? For example:

  SELECT * FROM my_table WHERE my_column NOT IN (5, 6)


I think what you are looking for is a compound condition:

     SELECT * FROM my_table WHERE my_column != 5 OR my_column IS NULL;
This is because what you are selecting for is two conditions: when the value is != 5, and when the value is NULL (so the result of != 5 is unknown).

FWIW I agree with you that NULL's are counter-intuitive. While I am more or less aware of all the various ways to account for them, I still gravitate towards SQL schemas without NULL's since I prefer the intuitiveness of 2VL when writing or reading SQL.


Something along the lines of

  COALESCE(my_column, -1)
might work for you in this case. See [1] for documentation on it in Postgres (it is in ANSI SQL though).

[1]: https://www.postgresql.org/docs/current/functions-conditiona...
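
Applied to the NOT IN example above (assuming -1 never occurs as a real value in my_column):

    SELECT * FROM my_table WHERE COALESCE(my_column, -1) NOT IN (5, 6);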


...and is not null


That will have no effect on the query.


fine, here is the full query since extrapolating from incomplete data is hard for you:

SELECT * FROM my_table WHERE my_column NOT IN (5, 6) AND my_column IS NOT NULL

happy now?


These two queries will always return the same results:

  SELECT * FROM my_table WHERE my_column NOT IN (5, 6)

  SELECT * FROM my_table WHERE my_column NOT IN (5, 6) AND my_column IS NOT NULL
Because "my_column NOT IN (5, 6)" will exclude NULL values.


Right, my bad. The upper parent was about using IN and having it return everything "but" 5 and 6. So here is the select that will do that:

select mumu from kaka where mumu not in (5, 6) or mumu is null.

That will return everything, null included, except for 5 and 6.

But wait, there is more. If, for example, performance is the main issue, the above query is quite slow. Even on a table indexed on column mumu, it will still do a full scan of the table before returning. How to improve performance in this case? Well, you use a LEFT join on itself. Implementation is left as an exercise for the reader :D.


does postgres really not index nulls in a useful way? mysql does, though it may only work efficiently on a single val-or-null comparison at a time.


nobody does. MySQL, Oracle, MSSQL, you name it. All sux. That's why I prefer to always declare NOT NULL and have a DEFAULT value when I create tables. Treat the default value as NULL and you'll increase performance a lot.


PostgreSQL will happily use an index when looking up nulls. Given an index over an integer column, a query for a constant non-null value appears like:

                               QUERY PLAN                              
  ---------------------------------------------------------------------
   Index Scan using foo_b_idx on foo  (cost=0.29..8.30 rows=1 width=8)
     Index Cond: (b = 333)
The exact same query plan is generated for a query looking for nulls:

                               QUERY PLAN                              
  ---------------------------------------------------------------------
   Index Scan using foo_b_idx on foo  (cost=0.29..8.30 rows=1 width=8)
     Index Cond: (b IS NULL)
(That is, the condition on the scan is the only thing that differs.)

(I would heavily suspect that both MSSQL and MySQL have similar behavior here; this is an easy optimization for a query planner.)


MySQL does too, yes, the comment is clearly incorrect: https://dev.mysql.com/doc/refman/8.0/en/is-null-optimization...

And it has done so for many versions, 8 is just the default redirect I got. The very first sentence on that page is:

>MySQL can perform the same optimization on col_name IS NULL that it can use for col_name = constant_value.

It works effectively everywhere, with the restriction that it can only do one null-optimized comparison per index (e.g. `(a = 1 or a is null) AND (b = 1 or b is null)` will only optimize one piece). Which is a potentially-significant restriction that makes me curious about the internals... but it does exist, and as is typical for MySQL, it is clearly documented.


So I guess "OR column IS NULL" would have been the correct answer?


Pretty sure most DBs do not treat that any differently than

`select * from table where column not in (5,6)`

Can you name one that does?


I think the issue here is that SQL should have more "NULL variants" to express why there is no concrete value.

A NULL value technically means it's unknown. An unknown value might be 5, hence why it's not in the result set. Some abuse NULL to mean "value doesn't exist". But a value that doesn't exist can't be 3, or 42, or any other value that's different from 5, so in that regard shouldn't be part of the result set either.

Others again abuse NULL to mean "doesn't apply". And in that case I think it makes sense to include the row in the result set. For example, if I write a query to get all people whose middle name is not "William", I'd most likely want people without middle names included.

Maybe we should have introduced NEX (non-existing) and NAP (non-applicable) as possible values in addition to NULL?


> Maybe we should have introduced NEX (non-existing) and NAP (non-applicable) as possible values in addition to NULL?

Codd (the inventor of relational algebra) actually suggested this. I think the primary source is a book that may not be on the web. There's some discussion here (mostly saying why they think it didn't happen and wouldn't work out): https://arxiv.org/html/1606.00740v1/


Re: I think the issue here is that SQL should have more "NULL variants" to express why there is no concrete value.

No, that would muddy things, in my opinion, like it did to JavaScript. Instead, have more operations/functions for dealing with them in a more "normal" way, so that we can say "WHERE x <> 5" and get the results one expects. I'm not sure about the syntax, and my drafts would take a lot of time to explain. To give a taste, maybe have something like "WHERE ~x <> 5", in which the tilde converts x's value to the type's default, such as a blank in the case of strings.

If the different reasons for "emptiness" matter, then usually it suggests the need for a "status" column of some kind so that queries can be done on the reasons. I'd need to study domain specifics to recommend something specific.


But that would mean you need to be aware that the column can have this property, no?

Continuing with my middle name example. Say I and everyone I knew had middle names, so I write a database including a required middle name column. Later I discover not everyone has middle names, and so I need to relax the restriction.

In your case, I would change the column to accept NULLs, and I'd have to remember to go over my query to add the ~ operator.

In my case, I'd change the column to accept NAPs (or whatever) and since a NAP value would behave differently to a NULL for <> (and other operators), I wouldn't need to change my query.


Re: I'd have to remember to go over my query to add the ~ operator.

I'd almost always use it no matter what. In fact if starting SQL over via a DeLorean, I'd reverse it to require "~" to make it null-sensitive.

It's analogous to case-sensitive comparing. The vast majority of the time you don't want case-sensitive comparisons such that case-insensitive should be the DEFAULT, and you only add extra specifiers/functions if and when you need to do case-based comparing.

If not, then you either end up with SQL full of "toUpperCase(x)", or problems keep popping up if somebody typed stuff wrong or the UI coders messed up.

Similarly, if your SQL is full of NVL() or DENULL() or whatnot, it means the language was poorly factored for your domain, or even most domains. It flunked D.R.Y. in design, such as comparisons having the wrong default behavior.


Yeah that'd be interesting.


Agreed. I have, off and on, labored on a still-incomplete and largely incoherent essay on this topic. NULL is overloaded to the point of some confusion.


NULL in SQL means "unknown value". This is different from most programming languages where null is a special value which typically indicate "nothing".

If a value is unknown, you don't know if it is different from 5, so it would be incorrect to return in the query.


Same in SQL Server.. this is documented behavior

Also, a null is not equal to anything... not even another null

This will print false in SQL Server

if null = null print 'true' else print 'false'


> Also a null is no equal to anything.

Wrong. It is equal to UNKNOWN:

https://docs.microsoft.com/en-us/sql/t-sql/queries/is-null-t...


I don't think that's what it says.

If I'm reading it right, (null = null) is unknown, which is falsy (except with ansi_nulls off, then it'll be true). (null is null) is true.

I don't think you can test null = (null = null), i.e. null = unknown. Let me know if that's possible somehow, I can't get it working.


Sorry brainfart, was responding to the second part, ie comparison to another null.


so?.. still not equal to anything, two unknowns are not equal

if null = null print 'true' else print 'false'


It's equal to something: the value UNKNOWN. This influences for example how the comparison result is used in compound expressions:

https://docs.microsoft.com/en-us/sql/t-sql/language-elements...


where coalesce(null_column,'') = ''

shortcuts "OR is null", works within functions.


I wouldn't expect null row values since you're doing a numeric comparison for my_column, and null isn't a number.


The idea is you don’t know if the null != 5, because null isn’t a value; it just marks the absence of a value.


Not sure what big deal is. You learn somewhere along the way that you check for null values with “is” vs “=“. Done, write it on a sticky note if you need, and move on.

“Why isn’t it consistent??” - well, a lot of systems have a lot of batshit-crazy inconsistencies, sometimes there for good reason. You learn to keep them straight and get your shit done.

If you want to learn the “why” every time you encounter a system design quirk, be my guest but you may be going down a time intensive rabbit hole with little pay off for yourself.


Sometimes it’s easier to remember the why than the what. Then it does make sense to learn about it!


Null values are so important in representing data. But they cause so much confusion in a) unexpected behaviors in queries and b) inconsistent handling across various engines... I sometimes wish <whisper> that they hadn't been included in the spec at all </whisper>. But then I come to my senses again, and go fix yet another bugged query for an analyst who didn't account for nulls in the data.


Does it make sense to coalesce them away in a view? I thought most analysts are given star schemas implemented by views or ETL'd data anyway.


Depends on the level of sophistication of the analyst, and if nulls have a meaning or value to the result.

Also, at a certain point, knowing that nulls are present gives you yet another measure of data quality: not knowing whether they are present and hidden, vs. visible and countable, can be the difference between a wrong answer and a merely uncertain one.


A pet peeve of mine is concatenating null strings. It's like a poison pill that nulls the whole result. 99.99% of the time that's NOT what one wants domain-wise. Maybe the standard should make another concatenation operator that treats null strings as zero length strings. Sure, one can de-null each string in the expression, but that's ugly anti-DRY code. Please fix it, I haaate that.


Agreed that it makes the pipe concat operator a lot less useful. PostgreSQL and MySQL both have CONCAT_WS, which skips NULLs rather than poisoning the result. It's also nice when you need a common separator between all elements.
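
For example, in MySQL or PostgreSQL:

    SELECT CONCAT_WS(', ', 'one', NULL, 'two');
    -- returns 'one, two'; the NULL is skipped, separator and all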


Now here's a blast from the past! 20 years ago this was common knowledge. Now it's making headline news on HN. SQL is back with a vengeance!


We recently moved from Oracle to Postgres. We had thousands of queries written based on the way Oracle handles NULLs and empty strings. It took us the better part of a year to rewrite all of them to the Postgres way. I am so glad to be off of Oracle.


I miss my past Oracle career. I've diagnosed this "= NULL" rather than "IS NULL" mistake in so many broken queries, and slow queries caused by the way Oracle indexing handles NULL.

There is a lot of discussion in this thread analysing whether this implementation of null checking in Oracle is appropriate, but the current implementation is just fine; it has been tested by time.

The internet does tend to rehash the same arguments over and over!!! The internet forgets. I remember these arguments 20 years ago.


[Ignore this comment. It was posted by mistake. I'm only leaving it here for the historical record.]

> the current implementation is just fine, it has been tested by time.

No, it isn't "just fine". It is broken. Just because something has been broken for a very long time and has spawned an entire industry devoted to dealing with the fact that it is broken does not change the fact that it is broken.


Do you have substantial experience with Oracle? or are you just blindly going on what everyone else says?

There is no mention of outer joins in this thread, no mention of the ability to MINUS the results of one query from another -- basic constructs which handle many of the issues discussed here. That says the people here are inexperienced with Oracle. Everyone here is trying to resolve issues using inner joins. Inexperience.

If people here had experience, not only would these topics have been discussed, but the real issues with NULL would have been discussed, one of which I mention in my previous post.


Sorry, I made a mistake: I thought I was responding to a different comment. (I'm currently on a very slow internet connection.) The "broken" thing I meant to refer to was Oracle's conflation of null and the empty string. But I think I hit the wrong "reply" link. Sorry about that.


Another one is that MIN and MAX ignore NULL values, which makes for some interesting rollback scenarios.

I also swear I have seen a gotcha involving UPDATE with WHERE ... IN not throwing an error where it should have, which is why I always quadruple-check my update statements; but I wasn't able to reproduce it and couldn't find any information online. I haven't seen the issue in so long that I forgot what it was, but it would update all rows in your table even if your WHERE clause was proper.


Also, OR/AND can return non-null results even if NULL is on one side of the operator:

    (NULL AND 0) gives 0
    (0 AND NULL) gives 0
    (NULL AND 1) gives NULL
    (1 AND NULL) gives NULL
    (NULL AND NULL) gives NULL
    (NULL OR 0) gives NULL
    (0 OR NULL) gives NULL
    (NULL OR 1) gives 1
    (1 OR NULL) gives 1
    (NULL OR NULL) gives NULL


What is a scenario where min or max should consider NULL values?


Theoretically NULL means "unknown" value. As it happens, most business applications do not have any requirement to deal with "unknown" values. These applications are only interested in acting on requests where all the required data are provided by the person responsible for entering the data. For example, when I transfer money from one bank account to another, the amount of the transfer can't be "unknown", the sending account can't be "unknown", and the receiving account can't be "unknown".

These same applications do have requirements to deal with empty values. Sometimes an empty value means "I haven't yet entered this value into the UI". But in that case the UI won't let you submit the form until you have supplied a valid value.

In other cases an empty value is a valid value. For example, "who is your spouse?" and the answer is "I'm not married".

Sometimes NULL represents "irrelevant", like for "who is your spouse?", where some of the records in the table represent people who can have spouses, and some of the records represent other person-like entities that aren't actually people and therefore they can't have spouses.

Given that NULL is _not_ being used to represent "unknown" values, and there is a requirement to represent empty values, and you don't want to have a whole extra column just to represent "emptiness", the most straightforward way to implement empty values is to use NULL. So that is what happens.

And you have to remember to use "is" instead of "=" when you want to test your empty NULL values for equality with other empty NULL values - because your SQL database is pretending that NULL really means "unknown", and it doesn't want to say that one unknown value is equal to another unknown value, because that would be theoretically incorrect.
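
That is (with a hypothetical table t):

    SELECT * FROM t WHERE x = NULL;   -- always UNKNOWN, so no rows come back
    SELECT * FROM t WHERE x IS NULL;  -- the test you actually meant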


There is something "missing": the SQL spec specifies `null = null` to be "unknown", where I sometimes expect "true". For MSSQL this can be configured using `SET ANSI_NULLS { ON | OFF }`. AFAIK MySQL can't be configured. Don't know about Postgres.


The standard makes sense if you go back to the theoretical basis of SQL. It seems somewhat counter-intuitive only when you think of NULL as a value you set in a cell.

When it's the result of a relational operation (such as a LEFT JOIN) however, the default makes sense while considering NULLs as equal to each other is typically not useful.


For what it's worth, don't do this -- pretty much all DB code and practitioners expect three-valued logic, not two.


until you have to work with a database created by an insane guy. Never needed it outside of that one project. (edit: small hint: composite primary key where parts can be null)


I have nothing to say to this but simply "I am so sorry."


For Postgres you can just use the separate operator IS NOT DISTINCT FROM to explicitly request this behaviour. In SQLite I think it's just IS. I assume most SQL databases have something similar, and that's a far better solution than applying a global config.


NULLs in subselects do bite me with distressing regularity: Writing

  SELECT ... FROM ... WHERE blah NOT IN (SELECT foo FROM bar);
getting no hits until I slap my forehead and add WHERE foo IS NOT NULL to the subselect.
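
That is, either filter the NULLs in the subselect, or sidestep the trap with NOT EXISTS (a sketch):

    SELECT ... FROM t WHERE blah NOT IN (SELECT foo FROM bar WHERE foo IS NOT NULL);

    -- or:
    SELECT ... FROM t WHERE NOT EXISTS (SELECT 1 FROM bar WHERE bar.foo = t.blah);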


DynamoDB, which is NoSQL, also doesn't accept empty strings. But at least Oracle automatically converts the empty string into NULL, whereas DynamoDB would actually fail the query.


With some columnar databases NULLs are 'free' because they are a default, absent state or compressed away. Can be another reason to prefer them with very large datasets.


In MySQL, NULL values are useful when using CONCAT_WS (concatenation with separator) or GROUP_CONCAT because NULL values will be ignored - so you don’t get e.g., “one,,two”.


I was dealing with NULLs all day in MySQL Workbench. It wasn't treating empty cells as NULL for int columns; I needed to make all empty cells 0 to be able to import the data properly.



