There is a lot in this article that shows the author has very little RDBMS knowledge. For instance:
"What is slow are the arbitrarily complex index interactions that come with the relational world. "
Index's are arbitrary if arbitrary created. It has no bearing on the relational world, no more than a bad driver makes all cars bad.
"Given the example above I think SQLite could be a great implementation leveraging the power of stored procedures to implement atomic datasets."
1, SQLite doesn't support SPs (You can use C to write extensions however) and 2, I've got over 10 years experience in RDBMS and I've never heard of an atomic dataset. Has anyone else? I struggle to understand how a set can be atomic.
"Using BerkleyDB or Innostore could remove the slowness and pain of interacting with the data through SQL."
This kind of comment needs to be addressed. There is NO pain in using SQL, the pain that is perceived is the lack of understanding and skill to use it on the part of the user. Don't call something painful simply because you haven't taken the time to understand it.
First off, I apologize for my thoughts being all over the place at 3AM in the morning ;)
I've written several transactional systems using Oracle and DB2 in the financial industry in a former life. The trappings of SQL are directly proportional to it's power. As demonstrated in mysql-sr-lib they can be packaged neatly to great effect however. Also I was talking about transactional operations on datasets.
Lastly, on slowness of SQL. I just meant that query plans cost and are not free. There have been several experiments of late of the amazing performance you can get by working with InnoDB directly instead of going through MySQL for example. Basho is just one of the players that has done work in the space. There is also hook that can be directly embedded in MySQL to bypass the standard workflow and interact directly with the datastore.
My main point in all this is that it will take some time to reach a 1.0 version of a viable, modern persistence solution in Redis. Time would be better spent now at least evaluating other open source persistent engines before diving head first into a very shallow pool. The main point is that most of the current persistence tools have been battle tested in a number of environments, antirez should leverage that considering the small size of his team; just two people.
Thanks for the article. It's always good to read the open opinions of others and freely debate them .. sort of like the scientific community :D
I agree on your point about query plans; they do cost a hell of a lot. This is one of the reasons why we use caching solutions such as Redis and Memcached on top of a well planned out relational databases that we can poke and prod with SQL.
Bret Taylor of FriendFeed/Facebook fame implemented a scalable system schema free system on top of MySQL where he could attach thin tables arbitrarily to JSON collections as the index. When migrations happened the collections would be scanned and the tables rebuilt. Digg has done similar work, but no open source project ever made the light of day. This schema free system powered most of the site (one true datastore).
After cracking open Redis and then looking at mysql-sr-lib it brought that FriendFeed idea back full circle. Basically I think mysql-sr-lib 2.0 could be a 1:1 implementation of Redis functions implemented as MySQL stored procedures. The only adaptation I would make is that functions would be namespaced (a table per namespace) so that collections could be made. It would be up to the client to decide how to take advantage of namespacing. Do use it for functional collections or sharding for example? Clients would also have to use a distribution strategy to map keys to a namespace/collection. The idea is a lot clearer in my mind now thanks to making the mysql-sr-lib discovery.
[MYSQL] SPs are a pain to implement, encapsulate and maintain. Looking at other RDBMS, SPs don't appear to be very portable and doesn't seem to offer speed advantages unless you only use simple queries.
It seems better to write SQL and make use of the DB drivers for me. Why would you consider SPs?
They offer a clean and easy (IMHO) way of making sure I've got the exact same code for common jobs encapsulated and tracked within the database. One routine to add a user (for example) which is common to all client interfaces, handles all the exceptions in exactly the same way, tracks all the security in exactly the same way. I've had my share of problems where you get an odd bug because the app has subtly different SQL for the same job in different places, which just don't happen with stored procs.
In theory you can get this benefit from any common code repository and indeed, where SPs haven't been available, I have written code with a common SQL cache held outside the database. SPs have the advantages though of being incredibly easy to update - no need for a build and deploy, you can correct the code on the live server without disturbing it almost immediately if you need - and of doing so in a way that is then fully trackable by the database itself. I can easily query the information schema to find all procedures that reference a certain object that haven't been updated in the last week, for example.
They've also got security benefits. It's common practice in some areas to lock the tables down completely and restrict access to purely through SPs by permission, which can offer major benefits because you've got far greater control over what user accounts can do what to your database.
They're not perfect and I agree they're not portable (though in practice I've needed that very, very rarely) but I find the benefits hugely outweigh the downsides and will pull a face if told not to use them because it's simpler just to treat the database like a slightly funny version of Excel (or words to that effect, which I have had and don't get me started...!)
One application is to allow very fast patching. If we need to patch an issue and our query is in code that will take N minutes to update, perhaps compile, and push out to every web server, it can be a tiny fraction of the time if all you had to do was update the SP in the DB server(s).
I can do an alter proc with literally seconds of overhead (beyond the obviously required time to actually change the SQL). If it doesn't work, or makes things worse, I'm seconds away from reverting or taking a second bite at the apple. If your SQL is in the app on your web servers, most places running at any sort of scale can't match that.
(as a result of being heavily [99.8%+] SP based, we've had to develop some compensating technology to allow us to do releases without downtime when each release needs its own specific SPs. Solved, but took work. We're also predominantly MS-SQL with limited mySQL, but I doubt that changes much.)
If it's not a proprietary secret, can you elaborate on the compensating technology you guys developed?
I prefer encapsulating database logic in SPs as well, and am always interested in learning how others have solved some of the problems incurred with them.
I've written about it briefly before on HN, and it's very beneficial but not proprietary tech. Basically we create a "shim" database containing our app's sprocs, views and functions, and additionally views to the transactional database tables in another database.
That means that release N can be pointed at the "real" database, while release N+1 can be pointed at the shim DB and they're both using the same transactional data. You can run both in parallel until you commit to N+1, whereupon you shutdown all N app servers, update the sprocs, views and functions in the main DB, and (optionally) point your web servers back at the main DB. There are a few other details to take care of, but that's the gist of it, and the details are minor, or at least we found them to be.
Disclaimer: the above is my experience on MS-SQL at a pretty good sized eCommerce site. Other RDBMSs may not work quite as well, or your app may use heavily features that don't work in "shim mode".
OK, I can accept that. But would it not remove instead of "could remove" the SQL layer?
The author(s) just doesn't seem to have a sense of understanding about the subject matter. It appears they are just throwing buzz words and jargon into sentences in the hope that some kind of substance ensues.
I'm sure the author(s) have good intentions and it all makes sense to them, but it is very misleading information at the end of the day. And to quote their About page "Do you love high quality, trusted, and authoritative news" ..
As far as "Atomic Datasets", he's referring to the ability to send off an atomic operation against a data structure. Yes, you can fake this by normalizing and storing it in a table (or any number of other permutations), but for counters, the ability to fire off an atomic increment is very useful.
If you have a bunch of counters, you don't want a hot counter to require a transaction every time you want to increment a page view. I'm sure there's a solution in high end DBs, but in mysql/postgres UPDATE somecounter SET value = value + 1 typically ends up being very painful for any significant write load. With Redis and Mongo, there are "atomic" increment operations that do a +X on a given field of a list or hash that is very fast.
It would also be interesting to find out what kind of logic is being performed in those stored procedures. It sounds like somebody's relying too much on them for logic (besides data-logic).
This is a favorite topic of mine, but the post is weird.
I had a bit of a problem gasping exactly what their problem was and how sqlite (of all things) fixes them. Some of the sentences seemed like they were edited multiple times before ending up not making sense. Take this bullet point for instance:
MySQL Stored Routines Library (AKA ”Arrays,Sets,Hashes,Queues,and Stacks” implemented efficiently in )
implemented efficiently in what? I'm dying from suspense! :)
What problem does sqlite solve that mysql doesn't?
edit: I can see that the author has already made some edits, so the sentence makes sense now. As in:
MySQL Stored Routines Library (AKA ”Arrays,Sets,Hashes,Queues,and Stacks” implemented efficiently in stored procedures )
Wait what? Is the author suggesting that antirez stop his efforts on Redis and switch to working on MySQL stored routines? It says, "It’s just proof that antirez has options about where to spend his time." And then it switches gears to talking about SQLite. And then about BDB and Membase and Innostore and Bitcask and Riak and Solr and his MIT class number.
I'm sure this author has a point he's trying to make, but I can't seem to make out what it is.
I think the point the author is trying to make is that antirez should not be wasting his time re-inventing the persistence wheel. As in, there are already many persistence plays available that are proven to work within known bounds.
PostgreSQL has to be one of the best kept open source secrets. I'm so tired of the attitude that if MySQL does not work for you, then you have to throw out a relational database entirely and run with one of these klugy NoSQL things. From everything I have seen the last few years, PostgreSQL will scale across multiple CPUs, where MySQL will not, due to differences in the locking mechanisms. To say nothing of the lagging feature impoverishment of MySQL. Yes, MySQL sucks less now, but how long did they simply not bother to implement transactions at all (no rollback, no isolation)? How long until they had stored procedures? Referential integrity to avoid dangling foreign keys?
Open source databases would have never gotten to where they are if there weren't two top-notch competing projects. Without MySQL, PostgreSQL developers would have probably not as heavily focused on performance. Without PostgreSQL, MySQL would have languished in MyISAM or broken InnoDB world forever.
MySQL would have languished in MyISAM or broken InnoDB world forever.
Ahem. Not to rain on anyone's parade but in my world InnoDB is still broken.
I maintain quite a few production MySQLs and they blow up with great regularity.
Usually it hits individual tables, sometimes (gladly more rarely since the 5 series) it suddenly feels insecure about the contents of ibdata altogether. And no, it's not the hardware.
Anecdote needed?
Just last week we had to drop/re-create an InnoDB-table in one project because it would not allow to add an index anymore, no matter what we tried. If some MySQL expert around here can shed some light, I'm all ear, the error message was:
Mysql::Error: Incorrect key file for table 'foo'; try to repair it: CREATE INDEX [...]
Google revealed a handful of familiar sounding bug-reports mentioning column-renames, which the table had indeed undergone, but nothing really helpful, so we resorted to the drop/create sledgehammer.
Gladly that worked. This time. And every time I wonder why people can't just use PostgreSQL, where such issues (believe it or not) simply don't exist.
This article makes no sense whatsoever. Redis is an in memory database, that's what it does. Obviously what this guy wants is something entirely different. This is like complaining that your apples lack the needed orange colour and make don't even produce orange juice!
Redis by itself is an in-memory database. The author is writing about Redis-vm, which supports paging to disk for datasets larger than ram. Basically all he's doing is recommending some alternatives to the current approach, other than writing your own btree implementation.
Redis is not just an in-memory database though. It's many things, of which persistence database (if you set it) is one. Although I disagree with the article, what this guy wants is what Redis seems to be heading towards, but as far as I know has not claimed publicly to be. He seems to be worrying too much over nothing.
Apparently stone tablets make a very effective data store. They are persistant, replicable, don't have arbitrarily complex index interactions, and of course don't require the pain of interacting via SQL.
The point of this article is that Redis cannot be the "one true datastore". By "one true datastore", he means that Redis is the only data store you need in the stack, because it has an append mode that persists data in memory to disk.
The author's mistake is that he is setting up a strawman argument. I've yet to hear Redis claiming to be a end all be all solution for data storage. In fact, it would make sense that most production environments use Redis as the non-persistent cache layer backed by an ACID compliant datastore for the time being.
The author's second point is that Redis fails "at scale in production environments". I'm curious to see some examples here of production Redis failing and I'd like to know how those companies were using Redis. Either way, the scalability debate should wait until 2.2 is released, because 2.2 is supposed to address primarily multi-server Redis instances.
I think the main problem is just believing that there can be "one true datastore". Different technologies excel at different things. You pick what you want your app to do, and then take a look at the tools given to you and pick the best one for the job.
Redis happens to be a great choice for many things, but ACID compliant storage is not really one of them at the moment, although I suppose you could set up 2 Redis instances, one for fast in-memory usage and one that syncs to disk on every command for now.
From my experience, Redis is great for the following:
- Session storage
- basic LRU cache
- Publish / subscribe
- key value store where you want to manipulate data in the form of common data structures
- simple / medium complex queries
Redis is not good for:
- ACID compliant data storage AND fast in-memory access at the same time (the holy grail)
- Complex sql-like queries
Right now, if you try to make Redis persistent in the event of a crash (fsync constantly), you will get a performance hit because the writes to disk will be too frequent.
Antirez is trying to solve the problem of persistence while still keeping fast in-memory access. The author does not feel like Antirez will succeed here, and became "angry", which I find is a bit unreasonable.
I personally use Redis in production for what it's currently good for. If the persistence with speed issue is solved, all the better. It's not an all or nothing type of situation. The worst thing the author could do in this situation is to pick SQLite as your one and only data store for the entire stack. There's nothing wrong with having multiple types of datastores. In fact it is the norm and trying to fight against it at this time is kind of pointless.
I couldn't agree more. One situation I'm using redis for is loads of writes (fast ones) from a rails app.
Because this data adds up and I later need to query / report on it, I'm taking it out of redis after it reaches a certain age (every night) and bulk inserting (archiving) it into postgresql with a bit of aggregation. This gives me the advantage of super fast writes using redis, but also super fast SQL queries for reports that don't have to be real time.
This is probably more hassle than you want to deal with for most situations, but when you do need those fast writes, and you're not super concerned if you lose some data if the server craps out, it seems to be a nice fit.
Funny - I use amazon simpleDB/RDS in the exact same way - the index on RDS makes insertion really slow, so I use simpleDB as a cache and load into RDS nightly.
the connection time makes it relatively slow (on the order of a few milliseconds), but you can write to multiple keys at once, and/or multithread your client.
You wouldn't. Redis doesn't support implementing an LRU cache. Redis does function well as a cache, and is far better than memcached for any data you don't want randomly disappearing, but it just doesn't do an LRU cache.
Redis will only evict data when it's explicitly removed or expired, and it will store as much data as you put into it.
memcached will evict the least-recently used data based upon memory pressure, and will only use as much memory as you configure it to.
They overlap in functionality, but I find they work better in complement to each other than having one replace the other. Redis for data you want to persist, possibly with a timeout -- user sessions, for example. memcached for data you want to cache, as long you have the free memory for it.
You can configure a redis DB using the maxmemory flag instead of setting expiries, so yes it can be used as an LRU cache.
Probably not as fast as memcached, but one situation where it would be appropriate is if you're already using redis in your stack for something else (fast writes). Rather than adding another piece of complexity to your stack you can instead get double use out of redis as a cache as well.
Why wouldn't you? As I understand it, Redis is a superset of memcached (since you can turn disk writes off) and comparable in speed. I, at least, use it as a memcached replacement for storing sessions with the added benefit of not having them all expire if I need to restart the server at some point.
Basically, I exactly agree with the GP's use cases. ACID for things I want to keep, redis for small things that need to be fast (sessions, stats counters, etc).
Agreed on that. Plus, i noticed a mistake made by a lot of people, Redis is NOT memcached. That's just as easy as that. You can use Redis the same way as you would use memcached, but that would be a waste.
The power of Redis resides in its data structures and the usage you make of it. You can make some pretty powerful stuff, easily and with performance.
Nested Comment system (https://gist.github.com/803251), autocomplete (http://pastebin.com/E5h1TLWz), all of that with high performance over large dataset, and much more, stuff that is not doable with Memcached. Also, the role of Redis is not to replace SQL databases, i personally view it, and use it, as a complement to it, and it works just perfectly. (I am fully aware that the examples given are possible with SQL solution, but experience in production/development as shown me that Redis is easier and faster when it comes to not too complex operations like these ones)
Also, complex queries should be handled by performing them periodically on your backend stroage offline, storing the results in something like redis and then simply read from the redis DB online.
I think simple redis queries have the perfect level of data logic you would need just to be able to retrieve the corect/relevant results from the complicated queries you produce.
There are so many people who have written about methods on how to do something like this, eg. facebook, disqus, etc.
Agree, I'm using:
PostgreSQL for normal SQL
Memcached for non-persistent fast changing cache
Redis for session storage, slowly changing persistent cache
A little offtopic but from my brief understanding of redis is that it manages it's own memory using LRU? Why not let the operating system handle this? It seems redundant to handle memory with LRU when Linux, etc use LRU or ARC or something similar.
"What is slow are the arbitrarily complex index interactions that come with the relational world. "
Index's are arbitrary if arbitrary created. It has no bearing on the relational world, no more than a bad driver makes all cars bad.
"Given the example above I think SQLite could be a great implementation leveraging the power of stored procedures to implement atomic datasets."
1, SQLite doesn't support SPs (You can use C to write extensions however) and 2, I've got over 10 years experience in RDBMS and I've never heard of an atomic dataset. Has anyone else? I struggle to understand how a set can be atomic.
"Using BerkleyDB or Innostore could remove the slowness and pain of interacting with the data through SQL."
This kind of comment needs to be addressed. There is NO pain in using SQL, the pain that is perceived is the lack of understanding and skill to use it on the part of the user. Don't call something painful simply because you haven't taken the time to understand it.