SQL Databases Are An Overapplied Solution (And What To Use Instead) (heroku.com)
56 points by conorgil145 on Feb 17, 2010 | 62 comments



There's definitely a lack of imagination when it comes to proponents of NoSQL solutions, and this article is no exception. While we do want to store, for example, an entire e-commerce order in a single operation, we don't necessarily want to retrieve it that way. That type of storage makes otherwise simple operations considerably more difficult. Do you want to know how much you made in sales today? How many of widget #453 are still in stock? The most popular items for sale? The least popular? Start writing code.

The other example, a user profile, seems like a perfect fit for storage in a single table, so I don't see why it's there. Now, do you want to know how many users logged in this week? Or to delete every account that hasn't been accessed in a year? More slow code to write.

Every article about NoSQL goes on and on about the supposed advantages, but rarely talks about the considerable disadvantages. And honestly, for most people, the advantages are simply not worth the trade-off. I fear that we'll have decades of e-commerce stores written with document stores and mountains of code slowly chugging away to calculate the basic stats that any business needs.


Have you used any NoSQL solution seriously? Let's see some examples (an imaginary Python-like interface, so I don't have to be language- or backend-specific). Like you say: start writing code.

Do you want to know how much you made in sales today?

    sum(r['amount'] for r in db.filter(type='order', date=xxx))
    -vs-
    SELECT SUM(amount) FROM orders WHERE date = ?
How many of widget #453 are still in stock? (IRL it's never that simple, but...)

    db.filter(type='stock_widget', part_id=453)[0]['amount']
    -vs-
    SELECT amount FROM storage WHERE id = 'widget 453'
Most popular:

    from collections import Counter
    histogram = Counter(r['part_id'] for r in db.filter(type='item', date=xxx))
    histogram.most_common(1)[0][0]
What I wanted to show is that you're writing the same amount of code in both cases. In some databases (like Tokyo Tyrant) you can also run the script server-side and just report the result if you prefer. Also, depending on the database, you don't need to read the whole record every time - in most of them you can just request a list of fields.

Have some fun with a NoSQL database before rejecting it for reasons like the ones you mentioned... It's also not always about processing speed - I could use either solution, but coding for Tokyo Tyrant is just simpler than coding for any SQL database in most of what I do. Notice how my db.filter examples give you the result in the host language, whereas SQL queries are just queries that you have to run and then retrieve results from (I'm ignoring LINQ to SQL here).


The question is what is this doing:

    sum(r['amount'] for r in db.filter(type='order', date=xxx))
If you're iterating every order in the database to apply your filter or to calculate the most popular items, then you're wasting a huge amount of processing power and RAM for something an RDBMS can do very efficiently. It's not an advantage that you're incrementing values in your programming language of choice either; it's a disaster.

As for the API, you can hide the SQL pretty well behind an abstraction as well. But there is an advantage to using SQL itself -- the database engine analyzes your query and produces an optimal plan for getting the answer you want. You're choosing to write poorly optimized code instead.


You don't seem to understand how stored views work. The computation cost is paid at insertion time, and it is no worse in theory than the exact same cost an RDBMS pays at insertion time to build almost identical index structures.

What current RDBMS implementations gain you is the ability to write completely ad-hoc queries and get reasonable performance most of the time. This is an implementation advantage, not a theoretical one.
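
To make the idea concrete, here's a minimal sketch using a plain dict as a stand-in key-value store (the key names and helpers are made up for illustration, not any real API): the aggregate is maintained when the order is written, so the later read is a single lookup.

    # Sketch of the "pay at insert time" idea: maintain a precomputed daily-sales
    # total alongside each order, analogous to a stored view or an index an RDBMS
    # would update on INSERT. The store and key names are illustrative only.
    store = {}

    def insert_order(order):
        store['order:' + order['id']] = order
        day_key = 'sales_by_day:' + order['date']
        store[day_key] = store.get(day_key, 0) + order['amount']

    def sales_for_day(date):
        # The read side is now a single key lookup rather than a scan.
        return store.get('sales_by_day:' + date, 0)

    insert_order({'id': '1', 'date': '2010-02-17', 'amount': 25.0})
    insert_order({'id': '2', 'date': '2010-02-17', 'amount': 10.0})
    print(sales_for_day('2010-02-17'))  # 35.0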


I understand that you can have indexes -- which, in my opinion, seems to defeat the purpose of NoSQL in the first place. You take your unstructured data store and structure it.

Also, in the author's example the items are properties of the order. How exactly would you index on those?


> If you're iterating every order in the database to apply your filter

I'll stop you there - filters can use database indexes, exactly the same way SQL databases do. The database driver chooses how it fetches those results. Getting records based on some criteria and indexed attributes does not iterate through every record.
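
For example, with MongoDB via pymongo (just one document store, substitute your favourite), a secondary index means the equality filter doesn't touch every record. A rough sketch, with made-up collection and field names, assuming a local MongoDB instance:

    # Sketch with pymongo: a secondary index on an attribute lets the filter
    # fetch only matching documents. Collection/field names are illustrative.
    from pymongo import MongoClient

    db = MongoClient().shop
    db.orders.create_index('date')            # secondary index on an attribute
    todays = db.orders.find({'type': 'order', 'date': '2010-02-17'})
    total = sum(o['amount'] for o in todays)  # only matching documents are read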


Yes, I did assume that was the case. However, once you've started adding indexes, aren't you already losing some of the advantage of NoSQL? An index is a well-defined set of columns.

In your "Most popular" example, you're still iterating records locally (and sorting locally) to calculate the results rather than having it done on the server. You've also presupposed that you have normalized data -- which in authors example, you do not.

My point about having the RDBMS optimize your query still stands. Your code and your SQL statements are not the same -- the SQL is a description of the result but your code is the actual process. And you've over-simplified the process; in reality it would be much more complicated.


The only thing you lose by adding indexes is insert speed. Indexes in most document stores don't change anything apart from shifting the get/put time balance. They don't define anything and can often be created, dropped, and recreated without affecting the whole application.

As mentioned before - I could put the "most popular" script on the server and request only the result if you wanted. Tokyo Tyrant can do that for sure, and I remember reading about another solution that may (Redis, I think? I may be wrong here).

About the normalised or not data - you have to design your data storage to your solution and needs. If you've made something hard to query, maybe it's time to redesign it, or add another denormalised store, or keep aggregate copies, or use some other solution... The same applies to RDBMSs, really.

You know how to run your queries on your DB efficiently (or you should start learning now). Sure - exploratory searching for patterns, custom reports, etc. - that's where SQL is great. If you need that, either you should have additional "archives" in some better data store (an SQL-capable column-based store?), or simply use an RDBMS for everything. Over-simplification is as bad as over-generalising ;) just use what you need, instead of what is popular.


> About the normalised or not data - you have to design your data storage to your solution and needs.

I understand your point. But once you have normalized data, don't you lose most/all the advantages of a NoSQL solution? RDBMS's are designed to handle the searching, joining, and filtering of normalized data -- NoSQL solutions are not.

My original comment about "lack of imagination" applies -- if you've designed your data to handle as many scenarios as possible (which obviously the author of the article did not) then you will end up with something that is fully normalized.

If you need performance, whether it be in an RDBMS or not, you'll need to sacrifice some flexibility. But it seems like with NoSQL you're prematurely optimizing -- picking a solution with a very specific use-case when you're most likely not going to need any of the supposed advantages.

I'm not suggesting that NoSQL doesn't have its uses, but most of these articles advocate using it by default and in situations where it's clearly not appropriate. The author of this article using it for e-commerce orders seems like a very inappropriate use to me.


I realized that most of your examples here have a flaw. They assume you've got a normalized data store. But in the author's example, that isn't the case. The items in the order are properties of the order.

So, for example, your "most popular" query is wrong. You have to iterate on the orders and then iterate on the items. I doubt you can use an index for that. And you have to fetch the orders when you really only wanted the items.

I was going for the same sort of thing with the number of items still in stock. A better example might be asking how much of a particular widget has been purchased.


Ok, so write me a query that shows me which orders were placed for a specific widget within a date range for a customer belonging to a particular company where the payment is still being processed.

You say it yourself: (IRL it's never that simple, but...)

What's the but?

And I'm still not certain how different applications access a NoSQL store without clobbering each other, whereas RDBMSs enforce regulated access.


"... but I'll show a trivial example, because methods of orders processing and querying could take a whole blog post."

The key to such a query is using the correct data representation, and the query itself is no different from the previous one: condition and condition and [...]. You can handle the date + payment-being-processed part by first getting a list of orders satisfying those conditions and then checking which of them contain the widget you need. That's still only 2 queries (if the db can handle multi-key gets). It's not slow/complicated/whatever, and it's more or less what your typical RDBMS will do anyway. How often do you write complicated new queries IRL?
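
Roughly, in the same imaginary interface as before (the field names for dates, payment status and line items are invented for illustration):

    # Step 1: use indexed attributes to narrow the candidate set.
    candidates = db.filter(type='order', customer_company='Acme',
                           date_from='2010-02-01', date_to='2010-02-17',
                           payment_status='processing')
    # Step 2: check the embedded line items client-side for the widget in question.
    matches = [o for o in candidates
               if any(item['part_id'] == 453 for item in o['items'])]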

Stopping document stores from clobbering each other? Reader/writer locks if you're working on a local store, or setting up an arbitrating db access manager if you're working on a remote one. Exactly the same as SQL databases (SQLite and any client/server RDBMS, respectively).


Do the same thing you would do in SQL: run the efficient query (you're almost guaranteed to have a view for dates limited to a specific customer), then filter it based on product. This is exactly what your RDBMS does, and I don't see how you're gaining anything except a different language.


2 words: Query Optimizer http://en.wikipedia.org/wiki/Query_optimizer

If you think you're doing exactly what an RDBMS does, you're kidding yourself. Even if you do, you're still wasting effort re-coding something that already exists.

The fact is, running the query on the order and filtering on the product may not be the most efficient way of getting the data. It might be faster to query on the product and then filter based on the order. RDBMSs make those sorts of decisions all the time.
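
You can actually watch an RDBMS make that choice. A small sketch with SQLite's EXPLAIN QUERY PLAN (table and column names are illustrative only):

    # Sketch: ask SQLite which access path it picked for a join.
    import sqlite3

    con = sqlite3.connect(':memory:')
    con.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INT, order_date TEXT);
        CREATE TABLE order_items (order_id INT, part_id INT);
        CREATE INDEX idx_items_part ON order_items(part_id);
    """)
    plan = con.execute("""
        EXPLAIN QUERY PLAN
        SELECT o.id FROM orders o
        JOIN order_items i ON i.order_id = o.id
        WHERE i.part_id = 453 AND o.order_date >= '2010-02-01'
    """).fetchall()
    for row in plan:
        print(row)  # e.g. a SEARCH on order_items using idx_items_part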


Look, I'm not going to argue with you because you have no idea what you're talking about. Go away.


I think more specific context on "SQL databases don't scale" might be appropriate here, especially for the noob hackers who are figuring out just which technology to use in building their first or second apps. Is it worth ignoring defaults and conventions on a simple Rails app and not using an SQL-based db? Probably not. And I reckon that Heroku itself hosts a good deal of apps where "will this db scale?" isn't really relevant. I could be wrong.

Another question worth knowing the answer to is how much of a hassle it is to switch horses midstream (from a well-normalized SQL db to something else) after there is already data in the tank.

Maybe this is Adam's recommendation specifically to the dreaming-big community, which I can certainly appreciate. And maybe everyone should be dreaming big.


Gridspy monitors power usage in real time. My original plan was to effectively stream the live data through the database as it passed from collection (sensors) to display (on the website). The frequent writing to and polling of the database to keep data "live" was absolutely killing performance.

Now I have moved to a Twisted application that aggregates the data and does occasional writes into the DB. It can answer webserver queries for the latest data out of its internal data structures, and it streams live data to the user's browser via Orbited.
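
The pattern, very roughly, is a periodic flush of in-memory aggregates; this sketch with Twisted's LoopingCall uses made-up sensor and flush details, not Gridspy's actual code:

    # Rough sketch of "aggregate in memory, write occasionally" with Twisted.
    from collections import defaultdict
    from twisted.internet import reactor, task

    latest = {}                  # sensor_id -> most recent reading (serves web queries)
    pending = defaultdict(list)  # sensor_id -> readings accumulated since last flush

    def on_reading(sensor_id, watts):
        # In the real app this would be wired to the sensor protocol's data events.
        latest[sensor_id] = watts
        pending[sensor_id].append(watts)

    def flush_to_db():
        # One bulk write per interval instead of one write per reading.
        for sensor_id, readings in pending.items():
            avg = sum(readings) / len(readings)
            print('would INSERT aggregate', sensor_id, avg)  # stand-in for the DB write
        pending.clear()

    task.LoopingCall(flush_to_db).start(60.0)  # flush once a minute
    reactor.run()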

See http://blog.gridspy.co.nz/2009/09/database-meet... (the database side)

and http://blog.gridspy.co.nz/2009/10/realtime-data... (the whole application structure)

[I posted this on the original site too]


I liked the definition of "transient data" - I've been promoting Redis for that kind of thing but I didn't have the vocabulary to explain what I meant (I've been saying "stuff like stats and counters"). Defining transient data as being frequently written, infrequently read is useful too.
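
For what it's worth, the Redis idiom for that kind of data is just an atomic counter, optionally with an expiry so it ages out. A sketch with redis-py (key names are made up):

    # Sketch: frequently written, infrequently read counters in Redis.
    import redis

    r = redis.Redis()
    key = 'pageviews:2010-02-17'
    r.incr(key)                       # one cheap atomic write per event
    r.expire(key, 60 * 60 * 24 * 30)  # let transient data age out after 30 days
    views_today = int(r.get(key) or 0)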


I have been describing this type of data as "high velocity" to my friends, and they seem to get it.


Hang on, SQL databases are poor at highly structured things like an order and that's the sort of thing suitable for a document database?


From the article, on where relational databases are appropriate:

> Small records with complex, well-defined, highly normalized relationships.

Why do the records need to be small? And honestly, in software development, a large amount of your data is going to be well-defined and easily normalized. The author provided 2 examples that would fit perfectly in a relational database.

> The type of queries you will be running on the data is largely unknown up front.

Or the types of queries you are running are more than just retrieving a single record or simple list of records. I'm afraid a very large number of queries fall into this category.

> The data is long-lived: it will be written once, updated infrequently relative to the number of reads/queries, and deleted either never, or many years in the future.

Yes, relational databases are for storing long-lived data. For temporary data, you could use an in-memory table or just some other solution entirely. There's no need to mix your permanent data with your temporary data. Databases handle writes and deletes extremely well (in bulk even) so I'm not sure what the author was getting at here.

> The database does not need round-the-clock availability (middle-of-the-night maintenance windows are no problem).

What kind of middle-of-the-night maintenance does a relational database need? I've been running at least one database for several years straight without any downtime or maintenance.

> Queries do not need to be particularly fast, they just need to return correct results eventually.

Relational database queries aren't particularly slow -- in fact, RDBMSs are heavily optimized to return data very quickly. In the vast majority of cases, this is going to be more than fast enough for nearly every application.

> Data integrity is 100% paramount, trumping all other concerns, such as performance and scalability.

Damn straight. I want the data coming from my data store to be 100% correct always. If I need to trade some correctness for performance, then I can easily add some caching. But I'm not sure how document stores would solve this any differently.


I can only assume the reason this was at negative 1 is that most people interested in this topic are already on the (not just SQL) bandwagon. IMO, scaling is a non-issue for most well-designed websites, and as computers get faster this only becomes more apparent. There is a significant advantage to separating complex sites into independent modular components, and only a tiny fraction of sites need to scale beyond that point. When you actually need to expand, fine, go down that rabbit hole, but for most people it's a complete waste of time.

PS: I suspect the main problem developers actually have with SQL databases is that their ORM is significantly less powerful than SQL. All too often developers focus on row-as-object and forget the power of more abstract data structures.


Although this topic is frequently discussed on HN, and the portion about document stores is well-hashed, I found the tidbit on binary assets/file uploads very interesting, as I had only used disk before.


Yeah, I was happy to find out about Tokyo Tyrant as well, had never seen it mentioned.


Consider the source: Heroku needs an alternative to Postgres to be a true competitor to AppEngine.

There's something that never gets brought up in these NoSQL discussions: SQL Databases don't scale down. They aren't very good in multitenant situations where you have a lot of random small-fry users -- you end up just sharding the users across a bunch of different master-slave pairs, and hope that they don't step on each other's toes. Because they take up real resources even if not being used, it's difficult to pull off a freemium model.


SQLite scales down fine, FWIW. It's just not appropriate past a certain amount of concurrent writes, but it's great for prototyping and smaller services.
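
The usual trick is one SQLite file per tenant, roughly like this (the directory layout and schema are invented for illustration):

    # Sketch of "scaling down": one SQLite file per small-fry tenant, costing
    # essentially nothing while idle.
    import os
    import sqlite3

    def tenant_db(tenant_id):
        os.makedirs('tenants', exist_ok=True)
        con = sqlite3.connect(os.path.join('tenants', '%s.db' % tenant_id))
        con.execute('CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT)')
        return con

    con = tenant_db('acme')
    con.execute('INSERT INTO notes (body) VALUES (?)', ('hello',))
    con.commit()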


That is a clever idea I'd not thought of, but it isn't really multitenant -- it just pushes the problem down into the filesystem.

It also requires a discontinuous transition to a different SQL database once you graduate from being a small-fry, and from there you're in the same boat as everyone else trying to scale that to multiple machines without application changes.


I'm not sure what you mean by, "it just pushes the problem down into the filesystem". Could you explain?

On the upside, SQLite will take disk space, memory, and CPU proportional to actual usage, and compared to (say) Ruby or Python, it's a drop in the bucket.

As to the discontinuous transition between it and a bigger system, sure. It's a trade-off. So is worrying about scalability plans during early prototyping, though.


If you're building a multitenant distributed system like GAE, the SQLite blob needs to be stored in a distributed system too, even if you don't need to handle concurrent access.


Got it. SQLite also explicitly notes that its locking does not work properly with many broken implementations of NFS (http://sqlite.org/faq.html#q5), so that's not an option.


It doesn't feel like the author has real-world experience with relational databases. Let's see:

"Small records with complex, well-defined, highly normalized relationships."

Maybe, but not necessarily. Example? Call Data Records for a phone billing application. While there may be relationships in the inserted rows, even to other databases (i.e. customer information), the data stands pretty much on its own. The challenge is to get floods of data into a single table so that it can later be sliced and diced for billing purposes.

"The type of queries you will be running on the data is largely unknown up front."

Excuse me? This is so very wrong. I don't even know where to start. EVERY good RDB application is designed in a way where (ideally) all queries are known up-front. If you have to guarantee response times (i.e. think of a cash withdrawal at an ATM) you absolutely MUST control the queries that run on the db. Ad-hoc analysis and reporting MUST be relegated to dedicated, possibly replicated, databases.

"The data is long-lived: it will be written once, updated infrequently relative to the number of reads/queries, and deleted either never, or many years in the future."

Mostly so, but absolutely not necessarily the case. I work on an application where the entire data set is toasted after a couple of weeks, and in fact today and yesterday would suffice.

"The database does not need round-the-clock availability (middle-of-the-night maintenance windows are no problem). Queries do not need to be particularly fast, they just need to return correct results eventually."

This is so full of crap, I won't even get into it.

"Data integrity is 100% paramount, trumping all other concerns, such as performance and scalability"

Yes, 100% integrity is paramount, but most certainly not at the cost of scalability, let alone performance.

Recently, methinks, there are a lot of proponents of new and improved data-management capabilities who see their little walled environment but seem to have no experience whatsoever running databases in a real business. A normal business (even a big, huge, or multinational one) does not have those "cloud data management requirements" that only very, very few companies really have.


There is no explanation of why a filesystem is a bad place to store "binary blobs".

If you're collapsing the metrics that you're storing in SQL, there is something really wrong going on.

Logs are OK to store in SQL, assuming you're scraping your logs properly and are logging the proper things. Logging every clickthrough in a relational database is somewhat insane. Logging 10 minutes' worth of aggregate clickthroughs is perfectly fine. If you think logs should be a ring buffer, I challenge you to tell that to any admin of a system that is subject to laws governing how long you must retain logs (which is pretty much all of e-commerce?).
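
Concretely, "10 minutes' worth of aggregate clickthroughs" can be as simple as a counter per (window, page) flushed as one row each. A sketch with invented table and key names:

    # Sketch: collapse clickthroughs into one row per (10-minute window, page)
    # before they ever hit SQL.
    import sqlite3
    import time
    from collections import Counter

    clicks = Counter()

    def record_click(page):
        window = int(time.time()) // 600 * 600  # start of the 10-minute bucket
        clicks[(window, page)] += 1

    def flush(con):
        con.executemany(
            'INSERT INTO clickthroughs (window_start, page, clicks) VALUES (?, ?, ?)',
            [(w, p, n) for (w, p), n in clicks.items()])
        con.commit()
        clicks.clear()

    con = sqlite3.connect(':memory:')
    con.execute('CREATE TABLE clickthroughs (window_start INT, page TEXT, clicks INT)')
    record_click('/pricing')
    record_click('/pricing')
    flush(con)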


Meh, I record a couple million page views to a MySQL database (as a new row for every page view) every day, and have for 6 years. Works fine. I don't expect a million new users to show up any time soon.


six billion five hundred seventy million rows?


Yes, I do actually have billions of rows. See http://www.w3counter.com


I agree wholeheartedly with this article except for the part about S3/CloudFront being the only tool for storing binary assets. MongoDB's GridFS extension does a fine job of handling files in the database, since they shard and replicate just like database records. As for serving them out, you can run it on a cheap unmetered server using nginx + http://github.com/mdirolf/nginx-gridfs
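
For reference, writing and reading a file through GridFS with pymongo looks roughly like this (the database name, filename, and bytes are illustrative; serving directly to browsers is what nginx-gridfs is for):

    # Sketch with pymongo's GridFS: binary assets live in the database and are
    # sharded/replicated like ordinary documents.
    import gridfs
    from pymongo import MongoClient

    db = MongoClient().assets
    fs = gridfs.GridFS(db)

    file_id = fs.put(b'...png bytes...', filename='logo.png')
    blob = fs.get(file_id).read()  # or let nginx-gridfs serve it straight from the db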


I can feel the sincerity of the author, but ultimately I think he's wrong that the solution to this is NoSQL.

It really would be nice to send an e-commerce order as JSON data and have my database know what to do with it. I think we still need the flexibility and power of a relational database behind it, but if someone extended PostgreSQL to accept records the way CouchDB and others do, and taught it how to store them into rigid, joinable, relational tables, that would be a great help. All of the advanced relational functionality would still be available when needed, but by default you could write and retrieve data in a document format that the database had previously, and transparently, been taught to map onto tables. That would be awesome.
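
A sketch of what that mapping layer might do by hand today: accept the document, shred it into normalized tables, and let SQL take over from there (the table names and document shape are invented for illustration).

    # Sketch: accept an order as a JSON-ish document and shred it into
    # normalized tables.
    import sqlite3

    con = sqlite3.connect(':memory:')
    con.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, order_date TEXT);
        CREATE TABLE order_items (order_id INT, part_id INT, qty INT, amount REAL);
    """)

    def store_order(doc):
        cur = con.execute('INSERT INTO orders (customer, order_date) VALUES (?, ?)',
                          (doc['customer'], doc['date']))
        con.executemany(
            'INSERT INTO order_items (order_id, part_id, qty, amount) VALUES (?, ?, ?, ?)',
            [(cur.lastrowid, i['part_id'], i['qty'], i['amount']) for i in doc['items']])
        con.commit()

    store_order({'customer': 'acme', 'date': '2010-02-17',
                 'items': [{'part_id': 453, 'qty': 2, 'amount': 19.90}]})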


I think you just described inserting into a SQL view. Never done it myself, but sounds helpful.


Does Heroku offer any alternate persistence engines? I thought they were PostgreSQL only.


I love the document record pattern. I use this on Google App Engine, which is a tad tricky to use as a relational db. It actually ends up being way simpler and faster. For instance, you can model an app with just one "table" (Kind). You can use, say, a user id as the primary key, then have binary blob fields that store your serialized document records. The reason it's so fast is that all you're doing is a simple index scan to look up all the data you want. It also makes it really simple to store items in memcache.
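
Roughly, with the (old) App Engine Python API that pattern looks like this; the Kind and field names are made up for illustration:

    # Sketch of the single-Kind document pattern on the App Engine datastore.
    import pickle
    from google.appengine.api import memcache
    from google.appengine.ext import db

    class UserDoc(db.Model):
        data = db.BlobProperty()  # the whole serialized "document record"

    def save_profile(user_id, profile_dict):
        UserDoc(key_name=user_id, data=db.Blob(pickle.dumps(profile_dict))).put()
        memcache.set('profile:' + user_id, profile_dict)

    def load_profile(user_id):
        cached = memcache.get('profile:' + user_id)
        if cached is not None:
            return cached
        doc = UserDoc.get_by_key_name(user_id)  # single key lookup, no joins
        return pickle.loads(doc.data) if doc else None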


I disagree with the comment "SQL databases don't scale". If you've got a decent database architect, scalability (not capacity) stops being an issue.


These NoSQL people are missing the point of relational modeling: that you can easily and incrementally evolve your data model. It's why object databases never caught on.

SQL databases are absolutely beautiful and elegant when you think of them in terms of the Codd relational model, in my opinion the best thing that computer science has produced so far.

The only limitation of relational databases currently is their lack of automatic infinite horizontal scaling on commodity servers, but hopefully someone will solve that soon.


Well, is this a complaint about relational SQL databases? The only issues the author has that actually are tied to the SQL language are #2 (queries are largely unknown up front) and possibly #1 (small records with normalized relationships). Most of the other complaints have to do with ACID data consistency and the overhead it imposes in simple implementations (e.g. no row-level locking).


By leaving SQL you're forgoing the ease-of-development aspects that it has. It's like leaving Python to program in C. Memory is cheap these days; for anything but the largest websites it makes more sense to use a relational database with all data in memory (and this will be as performant as any other solution you create).


Almost exactly the opposite. It's much easier to just get all my data related to some 'entity' in one place than to have to gather it from half a dozen columns with a weird syntax (I know SQL very well, I just hate it with a passion :)


I prefer Redis to SQL for all-in-memory data. Fast, easy, and simple.


And you can easily and incrementally evolve your data model in an object database too, with the added benefit that the application is updated to match the new data model at the same time.

> SQL databases are absolutely beautiful and elegant when you think of them in terms of the Codd relational model

And horrible and brittle in terms of the application model, which in 99% of cases is not relational.


"""And horrible and brittle in terms of the application model, which in 99% of cases is not relational."""

I refer you to Philip Greenspun's explanation of why object databases don't work -

"""After 10 years, the market for object database management systems is about $100 million a year, perhaps 1 percent the size of the relational database market. Why the fizzle? Object databases bring back some of the bad features of 1960s pre-relational database management systems. The programmer has to know a lot about the details of data storage. If you know the identities of the objects you're interested in, then the query is fast and simple. But it turns out that most database users don't care about object identities; they care about object attributes. Relational databases tend to be faster and better at coughing up aggregations based on attributes. The critical difference between RDBMS and ODBMS is the extent to which the programmer is constrained in interacting with the data. With an RDBMS the application program--written in a procedural language such as C, COBOL, Fortran, Perl, or Tcl--can have all kinds of catastrophic bugs. However, these bugs generally won't affect the information in the database because all communication with the RDBMS is constrained through SQL statements. With an ODBMS, the application program is directly writing slots in objects stored in the database. A bug in the application program may translate directly into corruption of the database, one of an organization's most valuable assets."""


Which boils down to "relational databases are easier for non-programmers to use", but I'm a programmer, I don't care. OODBs allow me to write applications much faster, and bugs affect both kinds of databases equally. Data corruption happens just as much in relational databases. I've used both, including various relational dbs, for more than 10 years. I'm well aware of their strengths and weaknesses. Given the choice, I'll take Gemstone over a relational DB every time.

Treating the data as if it's separate from the application just leads to a big-ball-of-mud schema that many apps share, and the whole mess becomes a big steaming pile of crap that no one wants to change for fear of breaking a bunch of applications. The temptation to share state through the database is too strong. By focusing on the program, OODBs push you towards having each app with its own db and programs interacting with each other through services - a much better architecture, and one in which applications are much easier to evolve independently.

Exporting data from the application to a relational database for reporting reasons is trivial for those times you need it. Greenspun is simply wrong: object databases do work, and they work well for their intended domain.


From your profile I can see that you're a Smalltalk fan. Well, Alan Kay was once asked where he thought the future of programming languages lies, and he replied something along the lines of knowledge representation/logic programming.

Guess what - relational databases are based on first-order logic, which is the most common method for representing 'knowledge'; look at Prolog or OpenCyc.

When you write SQL you're actually working in a very high-level language, higher than Python or Ruby or the language du jour.

I doubt Philip Greenspun is wrong. He learned programming directly from people at the MIT AI Lab.


The future may very well lie there; the present most certainly does not. Above all else I'm pragmatic, and no matter how pretty SQL might be for getting the data you want (even if I ignore that it isn't actually relational), it simply doesn't fit well with the rest of the environment necessary to put a working application in front of a user today.

While I admire Alan Kay greatly, he's always been a bit of a dreamer. Dreams don't pay my bills, but object databases most certainly make me more productive, make my apps better, and make my job much more enjoyable; SQL, however, despite the fact that I know it very well, is always an annoyance that makes any project take longer and require much more code than would otherwise be necessary.

As for Greenspun, everyone, no matter how famous or who taught them, is wrong about many things. Look at this website: we're having this discussion on a very popular site which uses a home-brew object database, precisely because for many such sites a relational database is simply unnecessary, massive overkill, and would create as many problems as it would solve. Relational databases have their place, but they are a massively overused and over-applied solution. No blog, small website, or small business application needs a relational database; they'd all be much simpler, work just as well with an OODB, and quite frankly be faster and cheaper to develop.


Nearly right. Relational databases are based on a solid logical foundation. SQL is not.


if by "solve" you mean "completely redesign modern RDBMS's" you may be waiting for a while. i'm no expert, but my limited understanding is that most of these services provide a singular interface to a database. to be "automatic infinite horizontal scaling" they'd need to support an infinite number of interfaces on any of these commodity servers. i could be wrong, because again i don't fully understand them, but i think that means basically implementing the equivalent of an 'intelligent' distributed parallel fault-tolerant file system. at that point i figure you could add "lots of glue" (heh, understatement?) to MemcacheDB, MemcacheD, MemcacheQ and some other crap and get something similar. but i could be crazy.


http://ostatic.com/blog/a-bright-future-for-drizzle

I guess this is old news, but this is basically a much more flexible form of what I was thinking of (I was only considering the infrastructure/fault-tolerant portion of an RDBMS). You could definitely design a well-scaling horizontal cluster using this design; even if the stock components don't support what you need, you just modify stuff outside the microkernel.


The Google App Engine datastore is moving in the direction of being a full SQL database, and I bet Oracle has people working on it (because App Engine, Salesforce, and AWS are a threat to their business), so I don't think it's all that far-fetched.


App Engine's GQL is a very limited subset of SQL. The only queries which can be performed are those that can scale well, so it's doubtful that even simple things like JOINs will be supported in the future.

http://code.google.com/appengine/docs/python/datastore/gqlre...


The App Engine datastore works by creating indexes for every query. In relational databases you make joins execute fast by creating indexes. I don't see why joins couldn't be added to App Engine.


I believe there are a few reasons. One is that the Google File System blocks that BigTable runs on, and the "reads" for your query, could be on any number of machines. It's been a while since I read the Google architecture papers, but I would recommend them to all; it is fascinating how they implemented GFS, BigTable, the Chubby lock service, etc. The workaround is essentially to write your own code that does what a database does - something to the effect of getting a list of keys for the records you are interested in and using that to scan just the area of your data structure you want.


It hasn't got anything to do with that. Creating an index on a join means that possibly hundreds of indexes need to be updated every time you make a write - but that is where App Engine would actually shine, since those writes can be parallelized across hundreds of machines. It could easily make Postgres/MySQL/Oracle redundant.


Yeah, Heroku should put up or shut up until they start offering alternate NoSQL databases (and no, MongoHQ is not an alternative because of network latency).

Interesting article though.


As they are hosted on EC2, couldn't you use Amazon CloudFront with them without too much trouble and with no latency issues? http://aws.amazon.com/cloudfront/

The main issue would be integrating CloudFront with Rails, not getting it to run on Heroku. (A quick search doesn't show much support for using CloudFront with Rails in general.)


If Heroku offers their own NoSQL alternative, they will put it on a different EC2 instance and have the same network latency, so MongoHQ seems like an equivalent solution. They provide private IP addresses to connect to, so it is the same as running it on your own separate EC2 instance. How does anyone separate their app and db servers without creating latency, in your opinion?



