ZSON, PostgreSQL extension for compressing JSONB

brianolson · on Oct 4, 2016

Another way to make 'JSON' smaller is to instead use 'CBOR', schema compatible 'concise binary object representation'. (See IETF RFC 7049 or http://cbor.io/ ) CBOR encodes and decodes faster too. Or use the 'snappy' compressor.

afiskon · on Oct 4, 2016

I'm afraid Snappy will not help, at least a lot. PostgreSQL already has a build-in compression (PGLZ). I tested various different algorithms before this shared-dictionary-idea - lz4, bzip and others. Some compress a bit better, other a bit faster, but in general result is almost the same.

tveita · on Oct 4, 2016

Did you test e.g. LZ4 with a prebuilt dictionary? With a good way to find substrings for the dictionary it might generalize well to other kinds of data.

afiskon · on Oct 5, 2016

Frankly I don't remember, it was more then half a year ago.

LZ4 is a LZ77 family algorithm which means its dictionary is a "shifting window". I don't believe such kind of dictionary will fit in this case.

jkot · on Oct 4, 2016

It would be nice to automate dictionary training. Make it part of vacuum.

afiskon · on Oct 4, 2016

Thank you for an interesting idea! I can't promise I will implement it myself any time soon, however. You know the saying - pull requests are welcome :)

exo762 · on Oct 4, 2016

It would be nice to see a comparison to a general compression algorithm - e.g. deflate.

vog · on Oct 4, 2016

There is a comparison table at the end of the project's README (https://github.com/afiskon/zson). However, the table columns are not explained very well. It is not 100% clear to me whether "before" means "uncompressed" or "PGLZ compressed".

  Compression ratio could be different depending on documents,
  database schema, number of rows, etc. But in general ZSON
  compression is much better than build-in PostgreSQL
  compression (PGLZ):

     before   |   after    |      ratio       
  ------------+------------+------------------
   3961880576 | 1638834176 | 0.41365057440843
  (1 row)
  
     before   |   after    |       ratio       
  ------------+------------+-------------------
   8058904576 | 4916436992 | 0.610062688500061
  (1 row)
  
     before    |   after    |       ratio       
  -------------+------------+-------------------
   14204420096 | 9832841216 | 0.692238130775149

afiskon · on Oct 4, 2016

Frankly I don't remember all details since I did this benchmark in February. IIRC its "ZSON + PGLZ" vs "JSONB + PGLZ".

Please note that everything depends very much on your data. PostgreSQL is smart about what to compress and what not. In general it could be all combinations of "ZSON +/- PGLZ" vs "JSONB +/- PGLZ".

Don't believe any benchmark I or anyone else did. Re-check everything on your data, configuration, hardware, workload, etc.

X86BSD · on Oct 4, 2016

How does this work while running zfs with LZ4?

Has this even been tested on zfs?

afiskon · on Oct 5, 2016

No, I didn't test it on ZFS. Sorry for asking, but is there any reason to run DBMS (which I would like to remind has a build-in compression) on ZFS which itself is a small DBMS? Sounds like too much unnecessary overhead to me.

X86BSD · on Oct 7, 2016

Several reasons. Snapshots, Cloning, Replication. Customized record sizes per database. There are many good reasons to run PG on ZFS and almost everyone does. Except the MySQL/Linux crowd. The rest of have moved on from the pain of the well worn FS's of the distant past.

afiskon · on Oct 10, 2016

Very interesting. For some reason I always thought that no one runs PostgreSQL on ZFS. If it's not a secret, what company uses PostgreSQL on FreeBSD like this?