The first law of DB comparisons says: to compare two database systems, a deep understanding of both is required; however, the more expert a user is, the less likely they are to compare two systems (they will just pick what seems more suitable for the task). Thus almost all comparison attempts are garbage.
Hey, at least he published his benchmark code on GitHub, where experts in either system are free to fork and improve it. That's already better than most of the benchmark comparisons I see.
He really should include his mongo and pg config files as well though.
Actually, I thought his comment was quite delicate.
I used to think that DB comparisons were bad because of vested interests, like Oracle refusing to allow anyone to benchmark them. But I now think that Antirez's view is much closer to the truth: poorly informed people can't set up benchmarks correctly because they don't understand the underlying DB model and possible optimizations, and well informed people can't set up benchmarks because they can't think of a fair benchmark that doesn't explicitly handicap one model or set of optimizations versus the other.
(Of course, in this case, the comparison is pretty bad. Other commenters have noted that an inverted index is more sensible here and that Postgres' text search support should have been used. But the big issue to me is that the graphs are incredibly misleading, with totally different units on the y axis, so that a 0.004 second win for Mongo looks like as big a win as a 3 second win for PG.)
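(For readers who haven't used it, Postgres' full-text search boils down to something like the following sketch; the table and column names are made up for illustration, not taken from the benchmark code.)

-- an expression GIN index over the tsvector of the message body
CREATE INDEX messages_body_fts_idx
    ON messages
 USING gin (to_tsvector('english', body));

-- queries that use the same expression can be served from the index
SELECT id
  FROM messages
 WHERE to_tsvector('english', body) @@ plainto_tsquery('english', 'mailing list');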
Hi 55pts, don't take me wrong, I understand what you mean, but it's like saying that because I happen to contribute to some popular software, I don't have the right to express my thoughts freely.
Garbage here means: not very useful. More politically correct people are going to use a different word, but the concept is the same. DB comparison / benchmarking is hard, and I read many blog posts by people who apparently don't even have a solid understanding of, or experience with, one of the systems (sometimes with both of the compared ones), so these articles are really not going to help; they only confuse other users.
Garbage means "shit", it never meant "not very useful".
You being Dennis Ritchie's reincarnation or John Doe changes nothing about that, or about the fact that you have been disrespectful of someone else's work.
Personal opinion: on HN you have the right to be disrespectful of a work that is of mediocre quality (which is the case here) if you provide the community with more details on why it is bad and how to improve it, and, more importantly, if you do not deny being disrespectful just two posts below by changing the meaning of a word on the fly.
That's your opinion about what HN should look like; mine is different, more like this:
1) This kind of puritanism you express is not part of the hacking culture at all.
2) Comparing two database systems that are the hard work of a lot of people, without a solid understanding, is the real problem here, because it means spreading shit about one or the other, and also influencing potential users who are trying to decide what to use.
So I think that the only people who should be worried about their behavior are those posting incorrect stuff.
But because I believe in freedom, I don't want to prevent them from posting with some kind of pre-post filter. Nor should you try to prevent me from saying that this kind of blog post is often garbage.
An expert engineer like antirez (or Linus, etc) calling your work garbage actually is constructive criticism, if you take it that way and act on it, instead of feeling entitled to some additional work or time contribution from them.
Which does suggest an interesting activity: a DB shootout. Presented is a set of data and specified queries. Build solutions against a database you're fond of and show us how fast it can be.
Would have to be something like a public github repo, because valid solutions would need to at least include DB configs and possibly post-query data massage.
> MongoDB being a NoSQL database using dictionary has a much more flexible structure.
Not really, Postgres has hstore[0] which provides query-able schemaless key:value storage.
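Roughly, it looks like this (names invented for illustration; hstore is a contrib extension you enable per database):

CREATE EXTENSION hstore;

-- schemaless key/value attributes in a single column
CREATE TABLE docs (
    id    serial PRIMARY KEY,
    attrs hstore
);

INSERT INTO docs (attrs)
VALUES ('author => "John Doe", size => "42"');

-- a GIN index supports the hstore operators ?, @>, ?|, ?&
CREATE INDEX docs_attrs_idx ON docs USING gin (attrs);

SELECT id FROM docs WHERE attrs ? 'author';          -- row has the key
SELECT id FROM docs WHERE attrs @> 'size => "42"';   -- contains the key/value pair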
Furthermore, testing is done with 8.4, current version of Postgres is 9.2 and has further performance improvements over 8.4 (and more flexibility for "unsafe" data storage in the form of unlogged tables).
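(If you haven't seen them, unlogged tables are a one-keyword change; a hypothetical example follows. They skip the WAL, so writes are cheaper, but the contents don't survive a crash.)

CREATE UNLOGGED TABLE import_staging (
    id      serial PRIMARY KEY,
    payload text
);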
Finally, there is no note on the configuration of Postgres and whether it's been tuned correctly, leaving only the inference that he's using the default configuration... which isn't exactly performant.
Not completely worthless: it does say something about the ease of getting good performance without having to know much about the ecosystem. But it doesn't speak to even the first "easy" 80% of what one would do to optimize performance.
I appreciate this work with numbers, and the code to run it. I'm sure it can be adapted to take on these two suggestions.
"Getting good performance without having to know much about the ecosystem" is pointless, because choosing a tool is all about learning how to wield it. It's true about any tool, even something like a hammer.
Software is complicated stuff and not something to treat superficially. Like everything in life, optimum results require a deep understanding.
This article is worthless because the results are not useful to anyone for anything. Doing serious benchmarks requires a particular and fairly rare set of skills.
> "Getting good performance without having to know much about the ecosystem" is pointless, because choosing a tool is all about learning how to wield it. It's true about any tool, even something like a hammer.
There is no complete understanding of any non-trivial system, even for its authors. There is a question of how much one can understand in a reasonable amount of time, and what fraction of a perceived ideal one can achieve. Some systems have steeper curves than others.
> This article is worthless because the results are not useful to anyone for anything. Doing serious benchmarks requires a particular and fairly rare set of skills.
It is not worthless because there is some code posted to run it. Someone could beef up the corpus (it's pretty small right now) and apply their tweaks (or try other data types like hstore, or other indexing strategies). This person tried his/her best and put something together.
I agree the benchmark is pretty flawed, but given the thankless (and often adversarial!) nature of running benchmarks and publishing results, I'm grateful for anyone who does one and then publishes the code.
Do you happen to know the performance characteristics of hstore?
I use mongo for some analytics, but it doesn't perform well on big datasets (not mongo's fault; memory on the machine is limited). I was wondering if hstore can replace it.
I read somewhere on the mailing list that hstore doesn't perform well on big datasets (> 500,000 rows, I think). Is that still true?
If there are no benchmarks, I am going to do them myself.
As is always the case, disk I/O rather sucks for performance. On the other hand, Postgres's I/O scheduling -- while far from state of the art compared to proprietary databases -- is probably better than leaving it to the operating system entirely, particularly for writing.
500,000 small hstores should perform just fine. We have tables with millions or quite possibly billions of not-entirely trivially sized hstores sitting around on a pretty unremarkable machine with a few indexes.
Hstore's storage format is probably not tuned as carefully as BSON (it's basically text), and yet its indexing can be richer (GIN/GiST). Binary formats can be overrated, though, especially when the amount of meta-information is quite small or people use them mostly for strings. However, it's an interesting area of optimization that I think will happen, because document storage in Postgres has become something people are actually using now.
Performance with concurrent requests is also not addressed, which seems crucially important for a web application and the area I'd expect to see the most interesting differences.
Not sure what you're trying to say, but hstore's querying most definitely isn't "only string matching", no. hstore values, written in queries, use string serialization for embedding in SQL, which is very different. And this serialization is only needed when adding content to an hstore or performing set operations on hstores; you can query hstore fields pretty much directly via the `->` operator:
SELECT id
FROM my_table
WHERE (hs->'size')::INT >= 42
AND (hs->'name') IS NOT NULL
As much as I like Mongo and comparisons to other DBs, I doubt that your comparison really helps anyone:
=> I found it hard to get your message by just skimming through the blog post: the first results lack measurement units—is it requests per second or total requests? The graphs were even more confusing—they do not quickly give the reader the relevant information, and the conclusion isn't clear.
=> I missed information on whether you set up the respective indexes for querying Mongo (correct me if I am wrong). And I don't know if you used the new hstore feature in Postgres.
=> This is still a micro benchmark with limited expressiveness. The overall use case, data model and queries are directed rather towards NoSQL and general full-text search capabilities. Take a use case with lots of joins/links and random reads and writes, and we will have a totally different picture (thus, your title is misleading).
So if one wants to make a comparison between Mongo and Postgres I would like to see how Postgres' hstore and the new JSON datatype feel in terms of daily usage and performance compared to Mongo (because these two new features compete directly with NoSQL DBs and the latter one with Mongo in particular).
Of course that post is not best practice for an SQL-database search and is not optimized at all. What it shows is that even then, on a single host, an SQL database is in the same ballpark as NoSQL databases.
> a NoSQL database has a much more flexible structure
In some cases you really have too much diversity of data structures, and then a NoSQL solution can be helpful. An example comes to mind: a notification engine with tens of diverse notification formats, each with its own variable data and localized display templates.
But if your data is well structured, like in this example, it is not true that SQL is inflexible. I have used PostgreSQL to manipulate on the fly the relatively big UNIHAN list (50,000 Chinese characters with their attributes) and could rearrange the data the way I wished—creating and updating columns, creating grouped tables—and I got a working database for an online dictionary without any pain.
If I had had to run a script for each SQL command, it would have taken much longer, and I would not have had the easy begin, modify, check, commit-or-rollback loop.
I sometimes wonder whether enough people who use SQL use it through the command-line interface.
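The loop I mean looks roughly like this in psql (the column names here are hypothetical, not the actual UNIHAN schema):

BEGIN;

-- try a restructuring step
ALTER TABLE unihan ADD COLUMN reading text;
UPDATE unihan SET reading = trim(kmandarin);

-- inspect the result before deciding
SELECT glyph, reading FROM unihan LIMIT 20;

COMMIT;   -- or ROLLBACK; and try something else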
> In some cases you really have too much diversity of data structures and then it a NoSQL solution can be helpful. An example can come to mind: a notification engine with tens of diverse notification format, each with their variable data and their localized display templates.
Actually the relational database solution to that problem -- an arbitrary mix of various data types -- is to use a table with a variety of columns of different data types. For instance: Str1, Str2, Str3, Str4, Int1, Int2, Int3, etc. -- for as many as you need to store what you want to store. Obviously they should all be nullable and you would need a PK (and FK) to whatever it is you are storing it with. Next, you build views on top of this for each different type of data stored in this mix table, and in this view you then name each column sensibly. Using triggers on the view (or without: Oracle can figure out most inserts if the SELECT query is trivial) you can insert data into the mix table as well.
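A stripped-down sketch of that pattern (all names are invented, shown in Postgres syntax, and the trigger part for inserts through the view is left out):

-- the row each piece of mixed data hangs off of
CREATE TABLE notifications (
    id   serial PRIMARY KEY,
    kind text NOT NULL
);

-- generic "mix" table: a handful of nullable columns of each type
CREATE TABLE notification_data (
    notification_id integer PRIMARY KEY REFERENCES notifications(id),
    str1 text,    str2 text,    str3 text,
    int1 integer, int2 integer, int3 integer
);

-- one view per notification type names the generic columns sensibly
CREATE VIEW password_reset_data AS
SELECT notification_id,
       str1 AS recipient_email,
       str2 AS reset_token,
       int1 AS expires_in_minutes
  FROM notification_data;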
1. it's confusing (at least for me)
2. a lot of his charts don't start at 0, which leaves the impression that e.g. MongoDB is 3 times faster than PG, when in reality the difference was more like 5%
3. one scrollable page with multiple pictures would have been much better, imo
4. About the page itself: I don't see how many times the benchmarks were run.
The conclusion shows that even the writer is misled in interpreting the charts...
"PostgreSQL performs worse than MongoDB in the retrieval queries but the time it takes is negligible (we are way under 1 second)"
What does it matter that it's under one second?! That's stupid. His charts show, for example, 0.001 seconds versus 0.005 seconds for get_email. That's 5 times faster! Saying "that's so small, I don't care" sounds bad. Unless the author meant some other charts...
Also, while I'm at it, I am not sure this is a good benchmark at all. 0.001 seconds is an extremely small value to measure; I doubt it is useful for comparison. It might be OK if this were some "real-world" example where queries don't take long. But the benchmark is not "real-world". At all.
I would have liked to see a more realistic one (the dataset is OK), with multiple concurrent queries and also with measurement of CPU and memory consumption.
I wouldn't say it's confusing, but the inconsistent baselines are very misleading. I'm also pretty surprised that some of the tests are anywhere near comparable, since they had different indexes.
166k is kid's play. At this volume we will not see any measurable advantage among serious contenders. The key data is how these perform on multiple servers, with sharding, master/slave configurations, etc.
(Not very informative, but...) I am not the one who wrote the blog post; I just found it interesting (reasons stated somewhere else), so I cannot answer any questions about it.
What exactly is the author's conclusion? To ask more questions?
It seems to me that in a post named X-vs-Y you should arrive at some sort of conclusion. Say, X is the way to go, or something like that.
I'm not a DB geek and I'm not pro-NoSQL or pro-RDBMS. Having no horse in the race, and a very modest knowledge of DBMSs, I wonder what would be the best choice in this specific use case (building a mailman archive)?