Redis: new disk storage to replace VM (groups.google.com)
135 points by DennisP on Jan 4, 2011 | 23 comments



So, as a non-Redis user, am I correct in my understanding that an admin may define a maximum delay before write-behind persistence must be attempted? For applications which can afford to lose a few seconds of data in the case of a failure, this seems like a great way to improve latency.

More generally, why are there no MySQL (or whatever) engines which offer similar capabilities? Wouldn't it be possible for DB clients to "commit" transactions to memory and then have them flushed to disk asynchronously with all other ACI(d) properties maintained? There are many applications which can survive a few seconds of data loss but need transactional properties to avoid data corruption.


All modern relational databases implement exactly what you've described... transactions are written to a transaction log, which is flushed to disk every few seconds (or whenever you want to guarantee that a txn is durable). Changes to the actual data need not be persisted in a timely manner, because in the event of a crash the data is recovered from the transaction log.


"in the event of a crash the data is recovered from the transaction log"

Doesn't this statement imply that a disk hit occurred before a client is told that a transaction committed (vs. being told that a unique key constraint was violated, etc.)? I'm talking about a more extreme form where I don't have to wait multiple milliseconds for a disk platter to spin around before continuing with my processing.


For full durability, you configure/ask the DB to fsync the transaction log before reporting the transaction committed to the client.

Most people can tolerate a few seconds of data loss, so a sensible config will only fsync every few seconds and will report a transaction committed before it hits the disk. If the DB crashes, you lose those recent transactions in this mode.

All (?) relational databases let you choose which fsync style you want. Most (?) ship with this setting set to the conservative 'fsync on every commit' mode. Once you configure a SQL database with a more relaxed setting you get a database that performs much more similarly to NoSQL. But some people need full durability - or want it for particular transactions. In that mode, you're basically bound by the number of IOPS your disk can do, but are guaranteed full durability.
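
To make that concrete, here's a rough sketch of what the two modes look like in e.g. PostgreSQL (parameter names from memory, so treat this as illustrative rather than copy-paste config):

    # full durability: every COMMIT waits until the WAL record is fsynced
    synchronous_commit = on

    # relaxed: COMMIT returns before the fsync; the WAL writer flushes in the
    # background, so a crash can lose the last fraction of a second of commits
    synchronous_commit = off
    wal_writer_delay = 200ms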


Also note that you can get the best of both worlds with a battery backed RAM cache contained in a SAN storage backend, such that the storage subsystem can be extremely low latency and yet "guarantee" that what it has accepted will get persisted to a disk for durability. (Predictably, this isn't cheap, but it's very effective.)

Your DB host tells the SAN to write a block, the SAN ingests the write into local RAM and reports "got it" to the DB server in sub-millisecond time. The SAN will then dump that data to the actual underlying disks over the next (hand-wavy) short timeframe, but from the DB's perspective, it got a durable fsync in under a millisecond.


On MySQL / InnoDB, this is innodb_flush_log_at_trx_commit, and how the log buffer is flushed can have a tremendous impact on the latency of writes.


So, no physical disk write need occur before a client can continue with processing? If so, cool.


http://dev.mysql.com/doc/refman/5.1/en/innodb-parameters.htm...

``If the value of innodb_flush_log_at_trx_commit is 0, the log buffer is written out to the log file once per second and the flush to disk operation is performed on the log file, but nothing is done at a transaction commit.''
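
In other words, something like the following in my.cnf trades a second or so of potential loss for much lower commit latency (illustrative sketch; check the docs for the exact semantics in your version):

    # 1 = fsync the log at every commit (the default, full durability)
    # 2 = write the log at commit, fsync it roughly once per second
    # 0 = write and fsync the log roughly once per second, nothing at commit
    innodb_flush_log_at_trx_commit = 2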


This is kinda what mongodb[0] does now, and it is a question of durability[1]. Current RDBMSes that lay claim to consistency must flush to disk to ensure ACID[2] compliance. Redis does not claim such compliance, and I am of the opinion that it should not.

[0]http://www.mongodb.org/

[1]http://en.wikipedia.org/wiki/Durability_(database_systems)

[2]http://en.wikipedia.org/wiki/ACID


Yeah, basically what you do is configure how often you want Redis to sync to the disk. You define a period of time and a number of records, and if that amount of time passes and that number of records have been changed, it saves to the disk.

This is the default:

    save 900 1
    save 300 10
    save 60 10000
After 60 seconds, if 10,000 records have been changed it will save. After 300 seconds, if 10 records have been changed it will save. And so on.
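
If you can only afford to lose a couple of seconds of changes, you could (hypothetically) add a more aggressive line to redis.conf, e.g.:

    # illustrative: snapshot if at least one key changed in the last 10 seconds
    save 10 1

Keep in mind each save forks and rewrites the whole dataset to disk, so very aggressive settings get expensive as the dataset grows.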


Are people aware of Python object databases like ZODB and Durus? I'm not very familiar with Redis. However, the model used by ZODB and Durus (on disk durable storage, in memory client caches) can be extremely efficient depending on workloads.


Sounds like he's trying to implement something similar to a 'dbm' (tokyo/kyoto being the current/modern implementations), which are k/v stores that seem to write to disk intelligently.

Presumably the keyset can still be in RAM (so that *foo*bar* searches on keys can work?), but the dbm model is a fairly efficient key/val implementation, and tokyo/kyoto is fast and fairly smart about writing to disk, although I haven't explicitly tested their limitations as you approach RAM limits in production.

Not sure what tradeoffs are in mind, but at least a feature/perf comparison of kyoto with diskstore as an internal backend for Redis would be interesting.



I don't think the keyset will be in ram: "Redis will never use more RAM, even if we have 2 MB of max memory and 1 billion of keys. This works since now we don't need to take keys in memory."

I haven't looked at the current code to see if there is a way to favor keeping the keys in memory, but it would seem that wildcard searches here can/will be disk-bound.


Tokyo supports a b-tree index on the keys written to disk, which would optimize blah* queries but not *foo*, and then writes become O(log N).


I am curious as to what people who use Redis in production think of these types of changes. Is this alarming or hopeful? Seems like a rather large shift in trade-offs and a whole new set of tuning parameters to play with.


These changes only affect the small percentage of users who need to run single Redis instances with datasets larger than available memory.

Our default back end is to run as an in-memory DB, and most of the design and goals are related to this mode. But I think that most of the value of Redis is its data model, and I bet it will survive Redis itself, so the idea is: let's look at alternatives that make this data model work well with data sets bigger than RAM.

Our old solution was VM, but we found it is not ideal: it does not work well with the Redis in-memory persistence ideas (which instead work well without VM). What to do then? Keep trying with the wrong solution? :) I guess not. Open source also means that if the cure for a disease is not good enough we throw it away and try again and again, as the sole goal should be the progress of the technology we are trying to put in the hands of users.

So we have a new model now, and we will test how it works in practice. What we said is: for write-heavy applications where performance matters, use Redis as an in-memory DB. It works well, it's well tested, and we can count many happy users.

But if the Redis data model solves your problems, and you have a read-heavy application with tons of data, we are going to provide an alternative that could work well.


Alarming, I'd imagine:

https://groups.google.com/forum/#!topic/redis-db/ZTSm-1w-6AQ

To quote from the end of the post:

    so, to sum this up -- after a while, you are stuck with an 
    in-memory database that you cannot backup, cannot replicate to 
    a standby machine, and that will eventually consume all memory 
    and crash (if it does not crash earlier).

    conclusion: redis with vm enabled is pretty much unusable, and we
    would really not recommend it to anybody else for production use
    at the moment. (at least not as a database, it might work better
    as a cache.)


Actually, that is talking about the existing virtual-memory (VM) implementation, which swaps data out to disk and back in, and doesn't work so great.

The change being talked about here is all about replacing that exact flakey VM with a more solid disk-backed approach.


I'm sorry, but that's what I meant. I'd be more alarmed than enticed to discover that the current implementation of datasets-larger-than-RAM for my chosen database was considered "flakey", and was going to be swapped out for a green-field approach in the next release.

For reference, this is the blog post that introduced the VM idea: http://antirez.com/post/redis-virtual-memory-story.html


> I'm sorry, but that's what I meant. I'd be more alarmed than enticed to discover that the current implementation of datasets-larger-than-RAM for my chosen database was considered "flakey", and was going to be swapped out for a green-field approach in the next release.

As Redis is mainly an in-memory DB, datasets larger than RAM were never our first goal, and there was even the idea of dropping support for this use case entirely. I think that what matters for most users is that the default mode of operation works great, and that for an alternative mode of operation the developers are not dogmatic and don't fear dropping what is not optimal to replace it with something better. In many other contexts this would be regarded as bad marketing and not done at all, but I try to follow a scientific way of making progress, and I accept that I and the other developers are not perfect and need to make mistakes and improve the design again and again ;)

I like the Redis data model and I think this is our biggest value; we need to find different underlying implementations for different use cases, and keep trying to provide more speed, better durability, better replication, and so forth, ad libitum.


If Redis was "my chosen database" then I should have done my homework better.

Redis has always been billed as an in-RAM database, which hard-implies that your dataset must fit in RAM, and thus that Redis is unlikely to be your only database.

VM - i.e. even the possibility of having a dataset larger than RAM - is a recent addition, and one flagged as experimental. It had some issues which meant (for us and for the author of the post you linked to above) Redis was still not feasible for datasets larger than RAM. This new diskstore model, and the associated rethink, is very encouraging news, as it means that scenario might one day be possible after all.


Based on mailing list traffic and comments from the developers, it seems like most people using Redis in production have datasets that fit in RAM, in which case they will be utterly indifferent to these changes.





