Unstructured Datatypes in Postgres – Hstore vs. JSON vs. JSONB (citusdata.com)
157 points by craigkerstiens on July 20, 2016 | 43 comments



I really wish someone would write an article about when to use regular columns instead of JSONB and whether you should split your data into multiple JSONB columns, etc.

So far the only thing I've figured out is that updating pieces of a JSONB structure seems like a bad idea at the moment since it will not always succeed (notably you can't add entries to an array once it has 999 elements already).
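
For concreteness, the kind of partial update I mean is something like this (table, column and path are made up):

    -- jsonb_set (9.5+) rewrites a single path inside the document
    UPDATE results
    SET data = jsonb_set(data, '{items,0,status}', '"done"')
    WHERE id = 42;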

In my case I've given each analysis tool I run and each API I query its own JSONB column, so each one only ever needs to be written once (or, if it changes, overwritten entirely). I keep a few normal columns for metadata, but I'm not sure if this is a good approach.


Your idea of a mixed system is perfectly fine: it takes the best of both approaches, keeping stored state that may be hard to compute (or complicated to figure out how to store) inside the JSONB column, and keeping often-updated data outside it, related either via a foreign key or another column in your table.

The thing you want to avoid in that type of system is having data in both locations that you need to update to keep in sync; that's a bad sign, and it usually means you need to move some data out of the JSONB column.

Also, I would almost never choose anything but normal row data unless I absolutely had to, and would produce JSON at the end of the pipeline instead. Your situation is different, though: you have expensive work you can save, and it would be silly to redo it all over again on each request.
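
For what it's worth, a sketch of that kind of hybrid table (all names made up):

    CREATE TABLE analyses (
        id          serial PRIMARY KEY,
        source      text NOT NULL,              -- normal column: metadata you filter on
        updated_at  timestamptz DEFAULT now(),  -- normal column: frequently updated
        tool_output jsonb                       -- expensive-to-compute state, written once
    );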


I'm actually halfway through writing this post! I was hoping to finish it today. :)


Having to work with unstructured datatypes in Postgres and other RDBMS products was quite hard a few years back. With hstore and JSONB in Postgres this has gotten much better, but it still feels a bit unnatural. When we started the Raijin database project (http://raijindb.com), one of the goals was to make it possible to treat these extra attributes (the ones you would shove into JSONB with Postgres) as normal "columns", so you can do this:

   CREATE TABLE tbl(eventtime datetime);
   INSERT INTO tbl {"eventtime":"2016-07-20 10:37:12","foo":"bar","whatever":"xx"};
   INSERT INTO tbl {"eventtime":"2016-07-20 10:38:22","foo":"bar2","intfield":42};
   SELECT foo, intfield FROM tbl;
   --
   {"foo":"bar","intfield":null}
   {"foo":"bar2","intfield":42}
It won't complain about the missing 'columns'.

Anyway, glad to see that Postgres is steadily making progress in this area.


Why does PG create fields with null values? `intfield` was undefined when the `"foo": "bar"` row was inserted. Does this mean that if you insert 10K JSON rows, each with a unique field/key, then a SELECT * will return 10K rows * 10K fields, most of which are null?

Having the DB automatically insert null fields also makes it harder to use JavaScript's `Object.assign` to populate undefined fields with default values:

    Object.assign({a: "default-a", b: "default-b"}, {a: "db-a"})
    // {a: "db-a", b: "default-b"}

vs.

    Object.assign({a: "default-a", b: "default-b"}, {a: "db-a", b: null})
    // {a: "db-a", b: null}

null and undefined are two different concepts in JS. It is like the difference between "there is nothing here" and "I don't know if there is anything here".


Note that this is not PG; Raijin is a new implementation and is not built on Postgres. Otherwise the question is valid. Raijin does not store NULL values, though. In the above example a specific field (column) was selected, and that's why it appears as null. Perhaps the output would be more appropriate this way, then (a modification for us to consider):

  {"foo":"bar"}
  {"foo":"bar2","intfield":42}
Note that when you select the full record with the star it does not return null values:

  select * from tbl;
  --
  {"_id":1,"eventtime":"2016-07-20 10:37:12","foo":"bar","whatever":"xx"}
  {"_id":2,"eventtime":"2016-07-20 10:38:22","foo":"bar2","intfield":42}
The undef/exists thing can get a bit confusing when you are mapping this into SQL.


We store key/value pairs in Hstore, since JSONB wasn't available when we started doing this.

We've since updated to PostgreSQL 9.5. Does anyone know if we would see a performance advantage switching this to JSONB (either in selects or updates)?

My sense is no, but I haven't been able to find any good answers one way or the other. If it's yes, then HStore doesn't really serve any purpose at all anymore.


> Does anyone know if we would see a performance advantage switching this to JSONB (either in selects or updates)?

I don't actually know anything from practical usage, but I think JSONB makes some kinds of membership tests more efficient (at the price of more expensive inserts). See 8.14.3. of https://www.postgresql.org/docs/9.4/static/datatype-json.htm...
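
For instance, these are the kinds of membership tests that jsonb (unlike plain json) supports natively:

    SELECT '{"a":1,"b":2}'::jsonb ? 'b';         -- key existence: true
    SELECT '{"a":1,"b":2}'::jsonb @> '{"a":1}';  -- containment: true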


I am in the inverse position: I used JSONB when 9.4 came out. I can't speak to comparative performance, but our 8 GB DB server has no performance issues at all querying an unindexed JSONB column across many millions of rows.

The one "gotcha" with the column type right now is that comparing hash values to certain data types can be a bit confusing at times, but simple string comparisons in arrays and hashes are relatively straightforward.


We've used both at Heap. I wouldn't expect much of a performance change from switching in either direction. JSONB has the advantage of being in a data format that lots of tools speak, and has generally been less annoying to use. But if your app is pretty stable and all of your code already speaks hstore, probably not worth it.


I believe JSONB values are stored more efficiently, so if your data set plus indices doesn't fit into RAM, you may see some gains.


Summary: use JSONB unless you need something non-JSON; then use hstore.


Or if you need GiST indexing, can't wait for someone to write the bindings for JSONB, and have non-nested data, in which case hstore is probably the right choice.


I’ve wondered before: why does hstore still exist? I get that JSON adds something over JSONB, but as far as I can tell (without ever having used either of them) hstore functionality is strictly inferior to JSONB; if that is the case, why would hstore not be deprecated and marked for eventual removal?


Today hstore does still have one thing over JSONB, which is GiST indexing. Yes, GIN is normally still preferred, but there are cases where you might want to use GiST. I do expect that JSONB will get GiST index support as well; at that point there will really be no reason at all to use hstore over JSONB.
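
For reference, this is roughly what it looks like today (table and column names assumed):

    CREATE INDEX tbl_attrs_gist ON tbl USING gist (attrs);  -- works for hstore, not jsonb
    CREATE INDEX tbl_attrs_gin  ON tbl USING gin  (attrs);  -- works for both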


hstore has some stuff that's missing from JSON.

My favorite hstore feature is that you can produce a diff simply by using the minus operator. For auditing a table, in an ON UPDATE trigger you can do

    previous_values = hstore(OLD) - hstore(NEW)

previous_values will only contain columns that have changed, with their old values.

You can then restore the row to its previous state by doing

    PREVIOUS_ROW = CURRENT_ROW #= previous_values
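
A minimal sketch of an audit trigger built on that trick (table and column names are made up):

    CREATE TABLE audit_log (
        table_name  text,
        changed_at  timestamptz DEFAULT now(),
        old_values  hstore
    );

    CREATE FUNCTION audit_changes() RETURNS trigger AS $$
    BEGIN
        -- pairs from OLD whose values changed (or were removed) in NEW
        IF hstore(OLD) - hstore(NEW) <> ''::hstore THEN
            INSERT INTO audit_log (table_name, old_values)
            VALUES (TG_TABLE_NAME, hstore(OLD) - hstore(NEW));
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER audit_tbl AFTER UPDATE ON tbl
        FOR EACH ROW EXECUTE PROCEDURE audit_changes();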


Any real reason why JSONB shouldn’t support those operators? (Does it do something else with them?)


hstore values are very different from JSON in the sense that they are a flat key/value structure, which is exactly what a record/row in the database is. This is why you have operators to convert and combine hstore values and actual records.
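
For example (table name assumed, hstore extension installed):

    SELECT hstore(t) FROM tbl t;           -- whole row as hstore
    SELECT hstore(t) -> 'foo' FROM tbl t;  -- then pick individual fields out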

The ability to diff might come at some point, but interest in it is probably low given what JSONB is for. Diffing works best for similar structures, and if you're storing things with a similar structure in Postgres, I doubt you're going to use JSONB. Diffing deep trees is also very costly.


Examples would make this a lot more interesting. Don't just say "you can query for events with email foo@bar.com", show me the operators!


Good feedback. In general we thought there were enough articles out there about the various operators you can use with JSONB, such as https://hasura.io/blog/the-unofficial-guide-to-jsonb-operato.... We may take another pass through, though, to see if we can add some more complex examples to highlight what's possible.
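
For instance, the email lookup from the article boils down to something like this (table and column names assumed):

    SELECT * FROM events WHERE data->>'email' = 'foo@bar.com';     -- extract as text
    SELECT * FROM events WHERE data @> '{"email":"foo@bar.com"}';  -- containment
    CREATE INDEX events_data_idx ON events USING gin (data);       -- makes @> fast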


Would it be possible to store (and query) date/time values as well?

I have a use case where either HStore or JSONB might be neat (the most basic requirement is K/V, so HStore might be good enough, though I could use nested structures as well). Unfortunately some of the values I need to store are datetime values and I'd need to query based on those:

"Give me all documents where the datetime is from this year"

The article claims HStore is basically just strings (which might work for bools as well, but not numbers/datetimes). JSON can express strings, numbers and booleans, but not datetime values.

Has anyone done something similar? Is that even possible?


Why not make the date values a real column? Does it need to be inside the JSON?


The use case I'm looking at is a legacy application that allows you to define various 'dynamic' fields. Think of a document with extracted data: one user wants to expose fieldA, fieldB and fieldC, while another wants D, E and F.

Right now that application stores this in a rather messy way. Exposing it as a table field would mean that configuration changes (a new field, a different field) cause a structural change (ALTER TABLE). Or you have to go down the 'Date1, Date2, Date3, Number1, Number2' ... path.

In short: I see no way to expose these pieces of information as fields without ugly hacks, hence my interest in HStore/JSON(B). But datetime-based filters would be quite nice to have.


Re datetime filters:

If the datetimes are ISO-formatted strings, would regex matching not work for basic filtering?

    where json_field->>'datetime' ~ '^2016-07-'
Of course, it'll probably be slow-tastic, and less convenient than native datetime manipulation, but hey, maybe it's enough to get unblocked?


You can use an expression index that casts to timestamptz; querying and filtering on it would then be very fast.
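
A sketch of that, assuming a jsonb column named "data" with an ISO-8601 string under the "datetime" key. The bare cast is only STABLE (it depends on session settings), so the usual pragmatic workaround is to wrap it in a function declared IMMUTABLE:

    CREATE FUNCTION jsonb_ts(j jsonb, k text) RETURNS timestamptz AS $$
        SELECT (j->>k)::timestamptz
    $$ LANGUAGE sql IMMUTABLE;

    CREATE INDEX tbl_datetime_idx ON tbl (jsonb_ts(data, 'datetime'));

    SELECT * FROM tbl
    WHERE jsonb_ts(data, 'datetime') >= '2016-01-01'
      AND jsonb_ts(data, 'datetime') <  '2017-01-01';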


Why not do both?


One thing you'll run into is that json doesn't have a datetime type, so you're left with using either a number of seconds since 1970 (fairly standard) or a string representation [1].

With either approach you'll have to do some mapping in and out of the system (though date probably serialises to string more or less out of the box for most json libs).

[1] http://stackoverflow.com/questions/10286204/the-right-json-d...


I mentioned the limitation ("no datetime in json") in my post :-)

That said, you and IanCal have a point: I could just drop the idea of DateTimes and use timestamps/integers, converting queries in my application.


Ah, yeah, sorry about that... I don't think I'd had my coffee!


Can you store the times as timestamps? You can query on things being in an integer range then.


Does jsonb work with CitusDB? Like can I have a sharded and indexed cluster of postgres dbs? That'd be really awesome if they work together.


Yep, JSONB works great with Citus. It's a model we've seen a few customers follow, and the examples towards the end of the article are the two cases where we see it used most commonly: event streams where data doesn't always exist across all columns, and schemas that commonly vary a bit but still need a shard key and data model that can be shared across them.


One other (big) difference between json and jsonb is that the former preserves extra white space. The exact text string of the json is stored. If that matters to you, say you're taking a signature of the original value, then you should use non-b json. Otherwise, stick to jsonb.


Probably the biggest gotcha of JSON is that there is no equality operator implemented for it. The reason is the Postgres philosophy of refusing to do something rather than doing it wrong, and you can't compare JSON as strings because of key order and whitespace.

In itself it doesn't seem like a big deal, but you'll lose generic things that normally always work. An example would be a trigger that updates an updated_at column, by checking

    IF OLD IS DISTINCT FROM NEW

This syntax normally works on everything, but will throw if a column is of type JSON.
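
A sketch of the failure and one workaround, assuming a json column named "doc"; casting to jsonb works because jsonb does have equality:

    IF OLD IS DISTINCT FROM NEW THEN ...
    -- ERROR: could not identify an equality operator for type json

    IF OLD.doc::jsonb IS DISTINCT FROM NEW.doc::jsonb THEN ...
    -- fine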


An even bigger difference is that with jsonb, strings and numbers are converted to their equivalent PostgreSQL types. That means that strings can't contain null characters or characters that violate the database's encoding rules. Consequently, not all valid JSON documents can be stored in jsonb, which is usually fine but can sometimes introduce headaches.
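
For example, a NUL escape is valid JSON but can't be converted to a Postgres text value:

    SELECT '{"name":"a\u0000b"}'::json;   -- fine: only validated, stored as text
    SELECT '{"name":"a\u0000b"}'::jsonb;  -- ERROR: unsupported Unicode escape sequence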


Why does whitespace (outside of string values) ever matter in JSON? Most servers will strip as much whitespace as possible to reduce response size. Also stripping before signing is pretty trivial.


Just throwing something out, but: if you're allowing users to enter their own JSON.


Why would you want to maintain extra whitespace (again outside of text strings) after a save? Any sort of JSON editor is going to have some sort of formatting function that could restore it on load.


One example: implementing a text editor like GitHub's Gist naively.

You can try to be clever and encode the incoming data to reduce whitespace, but the business requirement is that whatever the user entered is what the user expects to get back. So your best bet is to save the entire text.


But in that case, should you really be storing the JSON as JSON? Isn't it just a string as far as you're concerned? You're not going to be querying the user's JSON, are you?


You are correct; it depends on how you want to save your data (poorly or badly!). The article does use logs as an example; perhaps that would be a more convincing-ish example.


My admittedly anecdotal experience working with JSONB in Postgres: coding is generally easy and performance is great, and it even plays nice with ORMs like Hibernate (with just a bit more work). The downside is the query syntax: for complex queries where you need to search within the JSON, it's hard to write correct SQL. It's a trial-and-error process and the error messages aren't that great. Once you get it working you're fine, but it can be hard to get there sometimes.
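
For the record, the kind of query I mean, unnesting a jsonb array to filter on a nested value (names made up):

    SELECT e.id
    FROM events e,
         jsonb_array_elements(e.data->'items') AS item
    WHERE item->>'sku' = 'ABC-123';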


I had used PostgreSQL on a previous project. For my current side project, I developed the initial version using just Redis and some BoltDB.

After figuring out exactly what data I need to store, I decided to move back to PostgreSQL. I may still use Redis, as I like the geolocation features that come right out of the box; recompiling extensions for PostgreSQL is not my favorite thing.



