Why I love databases (medium.com/jeeyoungk)
238 points by strzalek on Nov 9, 2014 | 168 comments



Nice article. I love databases too for similar reasons but, as someone who designs database engines, some of the technical points are off the mark. I never really stop learning in this area; the technical range is incredibly deep and nuanced.

Some of the points that caught my eye as being quite off:

- Contrary to footnote 2, modern database designs bypass the OS file system cache and schedule their own I/O. This has an enormous performance impact versus the OS cache (2-3x is pretty typical) and is a good litmus test for the technical sophistication of the database implementation. It is the primary reason many open source database engines, even "write-oriented" ones, have relatively poor write performance to disk.

- The three data retrieval models enumerated are textbook but if you were designing a new engine today you probably would not use any of them for a general purpose design. Modern spatial access methods (ex: Hyperdex, SpaceCurve, Laminar) are superior in almost every way though the literature is much sparser. Also, a number of real-time analytical databases use bitmap-structured databases (ex: ParStream), which are incredibly fast for some types of query workloads.

- Many distributed database challenges are a side effect of "one server, one shard" type models. It is not necessary to do things this way, it is just simpler to implement; some distributed database systems have thousands of shards per server. The latter model is operationally more robust and better behaved under failure and load skew.

- Tombstones are usually trivial if the database engine is properly designed. Complications are a side effect of poor architecture. The big challenge for tombstones is deciding when and how tombstoned records are garbage collected. It is outside the scope of the normal execution pathways but you also don't want a Java-like GC thread in the background.

Of course, any of these is a long blog post in itself. :-)


Oracle talks directly to NFS servers these days, bypassing the OS file access layer altogether: it just uses a socket, reading and writing from a huge chunk of memory it grabs from the OS on startup and then manages itself.


Doesn't Postgres rely on OS caching? Would you say it's missing out on large performance gains based on that?


I believe that active data is stored twice (once in the Postgres buffer pool, and once in the FS cache). This is not ideal, because it is not making optimal use of RAM. To minimize this effect, PG recommends a relatively small buffer pool, which is not great if you believe the DB can do a better job than a generic OS.

I think this is quite fixable as well (I think I've even fixed it myself once) - just use O_DIRECT. Any PG maintainers able to tell me if PG still "double-buffers" and whether this patch would be useful if I could recreate it?
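
For the curious, here is a minimal sketch of what "just use O_DIRECT" means at the syscall level (Linux-only and purely illustrative - the file name and block size are made up, and this is obviously not the actual patch):

    import mmap
    import os

    BLOCK = 4096  # O_DIRECT wants block-aligned buffers, offsets, and lengths

    # Open the file so reads/writes bypass the kernel page cache entirely
    # (this fails on filesystems that don't support O_DIRECT, e.g. tmpfs).
    fd = os.open("datafile", os.O_RDWR | os.O_CREAT | os.O_DIRECT, 0o644)

    buf = mmap.mmap(-1, BLOCK)   # anonymous mmap gives a page-aligned buffer
    buf.write(b"x" * BLOCK)

    os.pwrite(fd, buf, 0)        # goes straight to storage, no double-buffering
    os.fsync(fd)                 # still needed to flush the device write cache
    os.close(fd)

The alignment rules and the loss of kernel readahead are exactly the I/O scheduling work that the replies below weigh against the double-buffering cost.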


It can be stored twice, but I don't think that's the ordinary case. Pages that are hot in PG's buffer cache are likely to stay there, making the same page in the OS buffer cache cold (because there aren't many requests for it). That's not always true, because writes to hot pages will end up going through the OS buffer cache maybe a couple times per checkpoint cycle, but it's still not (on average) stored twice. I don't have empirical numbers here, unfortunately, so someone else can correct me and fill in details.

At least from the discussions I've seen, there isn't a lot of interest in using O_DIRECT or otherwise taking the I/O scheduling problem into postgres. It's not particularly exciting to me, because:

* Takes on the I/O scheduling problem, rather than the current, simpler method of just handing pages to the kernel for lazy writing and fsync'ing at checkpoint time.

* Requires a lot of new code, tuning, configuration, etc. with a high maintenance cost.

* Not portable, so it's easy to make a tweak that helps on kernel and HW configuration X but hurts on kernel and HW configuration Y.

* Not very strategic: it helps some workloads with a lot of real I/O by some small constant factor, which doesn't necessarily open up new use cases or market opportunities.

In my opinion, it's much better to focus on innovative features, or more low-hanging performance gains (which exist in postgres), or scale-out features. All of those will be slowed down if the code becomes bulkier (making correct patches harder to write, especially for new developers) and the maintainers become distracted by I/O scheduling issues.

It seems more like something to do when innovation slows enough, otherwise it doesn't seem worth it.


Fair enough. Thank you for the very thorough answer - and for saving me the effort of putting together the patch!


PostgreSQL also uses a shared buffer cache, which seems to be what he's talking about?


By bitmap-structured, do you mean bitmap indexes?

Are there any open source DBs that schedule their own I/O? I guess PostgreSQL, InnoDB?

What is your opinion on TokuDB's fractal trees?


I hate databases. People tend to have way too much faith in them (or their surrounding marketing), and thus make poor database choices that don't actually fit the shape of their data. Persistence is fundamentally the programmer's responsibility; a magic box behind a socket can't design it for you.

Most applications I've seen wouldn't even need a database, but apparently a lot of programmers are conditioned into believing that writing anything to disk must involve building a database query string and transmitting it over a socket to another process which parses the string, executes it on an interpreter and stuffs the extracted data into a generic 1970s data model that finally gets written to disk in an opaque format from where it can only be retrieved by sending more strings over sockets. This stuff made sense when 1MB was a huge amount of memory, but today it's just not necessary.


Hmm. You are not thinking in business terms.

You run a software house: do you want your developers reinventing data storage for each application? Or using a fairly decent, ready-made data store, i.e. an RDBMS?

Most of the time an RDBMS is a very good choice. Think about the tooling, support, hiring knowledgeable people, etc. Let's face it: RDBMSes are good at everything from the very small (a single table, replacing a text file) up to the very large.

In the RDBMS/SQL world you have everything from SQLite, which saves to a single file, up to Oracle for your enterprise app.


First-Order Logic (and therefore SQL, which is based on it) sucks at describing graphs and highly heterogeneous structures. If you're working with these, you should have a better representation.


But if the storage is a storage library, the developers wouldn't be reinventing it every time. And if a new technology came along, it might be easier to start using it than to insist that all applications in an organization must use the same (often relational) DB. The latter seems to be very common, and the DBAs don't seem to push the use of newer technology much, or look into how much time it would save the developers.

Regarding tooling, you are right, it is a bit lacking with the newer database solutions, but they seem to be working on it, and I think it would help if everyone involved tried to describe what kind of tooling is needed/wanted. Right now a lot of discussions stop with saying the tooling isn't good enough and no one should ever use the new database solutions.


I really question your experience with databases. Because blindly grouping all SQL databases together is a sure fire way to get yourself into a world of trouble. They don't store data the same way. They all have subtle differences in their support for the standards. They all have proprietary features. And their operational characteristics couldn't be more wildly different.

I do agree that SQLite is an excellent choice for most small applications.


"I really question your experience with databases. "

I have used SQL Server, Oracle, MySQL, SQLite, MS Access(!), Sybase and ... CouchDB! I have optimized queries in SQL Server and Oracle. Most of them commercially (not CouchDB).

"Because blindly grouping all SQL databases together is a sure fire way to get yourself into a world of trouble. They don't store data the same way. They all have subtle differences in their support for the standards. They all have proprietary features. And their operational characteristics couldn't be more wildly different."

Spot on. Not sure how it relates to my post.


Most of the time a plain text file is good enough. Databases are overrated.


JWZ wrote a great article about choosing between flat files and a database for Netscape Mail:

http://www.jwz.org/doc/mailsum.html

In Netscape 4.0, the new team went both C++ and Database happy, threw away the tightly-tuned mail summary files I had designed, and generally screwed the pooch raw.

My code had summary files that were on the order of 2% of the size of the folder they were summarizing, and was blazingly fast in all respects. The 4.0 code had an overhead closer to 30% (last time I checked) and was insanely slow, not to mention extremely fragile: their summary files got corrupted all the time.


This is simply not true, otherwise we would all be using text files. Text files can work in some situations. I disagree that this is "most of the time."


We are using text files. How many programs in a typical Unix installation need to communicate to a database? Next to none. How many are communicating via text files? Almost all of them.


Try doing any remotely large data handling. I got midway through writing a direct disk CSV parser for some Python data analysis recently before I realized that this problem has been solved far better by SQLite.

A database is all about not reinventing the wheel for every problem. You shouldn't hand-roll your own crypto, and if you can avoid it, you shouldn't hand-roll the things other people have already made very good either.
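
As a rough sketch of what that ends up looking like (the CSV file name and columns here are invented for illustration), you load the data into SQLite once and let it own the indexing and querying:

    import csv
    import sqlite3

    conn = sqlite3.connect("analysis.db")
    conn.execute("CREATE TABLE IF NOT EXISTS samples (ts REAL, sensor TEXT, value REAL)")

    # One pass over the CSV; after this, SQLite owns the storage problem.
    with open("measurements.csv", newline="") as f:
        rows = ((r["ts"], r["sensor"], r["value"]) for r in csv.DictReader(f))
        conn.executemany("INSERT INTO samples VALUES (?, ?, ?)", rows)

    conn.execute("CREATE INDEX IF NOT EXISTS idx_sensor ON samples(sensor)")
    conn.commit()

    # Keyed, ordered access - the part that is painful with a hand-rolled parser.
    for ts, value in conn.execute(
            "SELECT ts, value FROM samples WHERE sensor = ? ORDER BY ts", ("s1",)):
        print(ts, value)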


> Try doing any remotely large data handling.

I was doing grid computing stuff in experimental particle physics. Data as large as it gets. Flat tuple storage (initially on tapes) was all we needed, and we already have a much better way of handling tuples than anything that DBMSes could ever offer.

No databases whatsoever. Only streams, no random access.


What's an example of work you've done that used database systems that could have been better done using flat files?


That's the point - I did not need any database system whatsoever for nearly anything I've ever done. Besides hierarchical DBMSes for CADs - but that's a totally different story; that's another area where RDBMSes fail miserably. There is no use for RDBMSes outside of the CRUD niche (which includes web and enterprise applications, and that's pretty much it).

I can't see any use for RDBMSes in the embedded world. No use for them in most data analysis tasks - stream storage works much better there. No RDBMSes at the system level (OSes, compilers, linkers, etc.). No RDBMSes in scientific computing.


I used to think the same thing, as much of my early career was spent in graphics. Then I got much more familiar with relational databases and realized I was an idiot. For example, I spent about 3 months hand-tuning code that read a quad from a TB-sized image stored on disk into a GPU so it could be analyzed by computer vision algorithms. Then I realized that using a DB would have been about as fast and I would have saved 3 months of time. Unfortunately I didn't realize this until I left graduate school.

In conclusion, yes, you often don't need a database and sometimes it is the wrong tool, but usually it saves you a lot of time because it is fast enough and general-purpose enough.


How exactly would using an RDBMS have helped with accessing a plain two-dimensional array?


Accessing is easy. Making it fast enough off disk to feed the beast (GPU) is not. The DB has already solved that problem and given a really simple interface: SQL. My hand tuned code was probably faster but not enough to warrant 3 months of effort. I should mention this was several years ago and on a laptop since that was what the researchers used who needed to analyze the data. So I couldn't really solve performance issues with more RAM, etc.


Sorry, I cannot believe this. SQL faster than mmap?!? Something unheard of. And I really do not understand why it should take 3 months to write a single line with an `mmap` call.


There are a lot of details I am leaving out because I don't have the time or patience to explain them in an HN comment. That 3 months turned into an entire chapter (~15 pages) of my thesis. But I will tell you a little bit about it.

The data we were analyzing was biological and it wasn't actually an image. However, it could be reformatted as an image so we could apply computer vision algorithms to it. The thesis was first about the computer vision stuff being applied to biology and second about moving the computation to the GPU to speed it up. It went from ~1 hour to 15 secs when the computation was moved to the GPU. The problem with the GPU though is it only has so much RAM, and the PCIe bus to transfer data to it is rather slow. To make matters worse, getting data off disk is even slower, and the GPU expects a certain data format for optimal processing.

The majority of the 3 months was not writing code. The majority of the 3 months was developing our own custom file format. The GPU wanted a particular format for optimal data transfer and the computer vision algorithms needed a particular format to be able to work. So it was about creating a file format that was fast off disk and through the PCIe bus and still had the correct formatting necessary for the computer vision algorithms. The exact specifics of the data format and why it had to be a certain way are what would take many pages of text to explain so I will stop.

If I used a DB instead, I could have simply put the data in a table and used rather simple SQL commands to get the data off the disk in the format the GPU and computer vision algorithms wanted. And it would have been about as fast and only taken maybe a week at most of effort (remember most of the time was spent on the data format). In other words, I could have graduated 3 months sooner which seems worth it to me.

Just to be clear, I never said SQL is faster than mmap although there are cases where it can be. What I said was it was usually not worth the effort to make it as fast or faster. It is kinda like C/C++ vs Python/Ruby/Javascript/etc. Yes your program will be a lot more efficient if you use C/C++, but is it really worth the time if you can do the same thing in a higher level language with little effort? There are cases where it is appropriate to use C/C++, but I would argue it is more the exception than the rule.

Databases are kinda like that for me now after learning how to use them properly. Can I beat them for my specific use case? Probably yes, but is it really worth the effort? Most of the time no, but there are exceptions.


Very interesting indeed. Is your thesis published?

Btw., have you considered using a uniform memory architecture (e.g., Intel HD, or an embedded GPU)?

And I still cannot understand why you need the "R" in "RDBMS" in your case. Sounds like you could have benefited from a simple tuple storage (see a bit of discussion in this thread). I doubt you need indexes and relations.


It is published and you can find it somewhere in Purdue's library. It is not online though. It is a regret of mine that we didn't publish at a conference or journal, but at the time I was too concerned with getting out of there. I like school and research, but I didn't like the poverty.

At the time (2006), embedded GPUs were not available or just becoming available.

In this case, I didn't need relations but, given the data size, indexes would have been very helpful since I had to retrieve quads so as not to overflow GPU RAM. Any DB (in the most general sense) with indexing would have been fine. Since I can set up Postgres in about two minutes, that is easiest for me.


So, back to my point - you did not need specifically an RDBMS. I have nothing against tuple storages, stream DBMSes, berkeley-db style caches, etc. All my hate is reserved for the relational crap.


Oh I see. Specifically for RDBMS and graphics: video game engines. Game engines are a bit weird because sometimes you need hierarchical data structures, stream processing, multi-core processing (e.g. physics), etc. I worked on a game engine for a few years, but again I did not know RDBMSes would be helpful. As far as I am aware, most game engines still do not use RDBMSes, but I think there are cases where they would make life a lot easier because parts of the game engine are relational.

For example, a game object often has many properties and those properties relate to other game objects or pieces of the game engine itself. A game object usually has a graphical representation, a physical representation, a representation related to animation, etc. And you need to keep these different representations in sync. If the physical representation moves, then the other representations have to be moved in sync with the physical one. If one of the representations gets deleted, then they all need to be deleted. So tables and foreign keys would make this a lot easier (see the sketch below). Is it necessary? No, because you can do it with OO, inheritance, trees - but it is actually a lot of effort to do it that way.

That is one example, but I think there are several areas where RDBMSes can be helpful inside a game engine. Just to be clear, there are several areas in a game engine where RDBMSes make no sense too.
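
A minimal sketch of the foreign-key idea from the game-object example above (all table and column names are invented, and no particular engine is implied), using SQLite so the cascade is visible end to end:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only when asked
    conn.executescript("""
        CREATE TABLE game_object (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE render_rep  (object_id INTEGER REFERENCES game_object(id) ON DELETE CASCADE,
                                  mesh TEXT);
        CREATE TABLE physics_rep (object_id INTEGER REFERENCES game_object(id) ON DELETE CASCADE,
                                  x REAL, y REAL);
    """)
    conn.execute("INSERT INTO game_object VALUES (1, 'crate')")
    conn.execute("INSERT INTO render_rep VALUES (1, 'crate.mesh')")
    conn.execute("INSERT INTO physics_rep VALUES (1, 0.0, 0.0)")

    # Deleting the object deletes every representation with it - they stay in sync.
    conn.execute("DELETE FROM game_object WHERE id = 1")
    print(conn.execute("SELECT COUNT(*) FROM physics_rep").fetchone())  # (0,)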


Disclaimer: I don't know much about the graphics stuff.

Judging only by my experience in the other areas (which include CAD/CAE and GPGPU as the closest things to graphics), every time I needed to do something relational in nature, it was only dynamic data, never a relational storage. So I used quite a lot of Datalog. Never needed anything like SQL.


Yep, "using mmap" is worse performance-wise than what database engines do. It's not a suitable interface for shoveling data onto disk in a manner such that you can say, "I am done writing this to disk," and "This file will never be in an inconsistent state in the event of kill -9 or powerloss," with maximum performance. Also, it's not enough to cry "use mmap," if you don't want a corrupted file.

I think your antipathy towards relational databases has more to do with your unwillingness to learn how to use them than anything else. Otherwise you'd be capable of speaking with other people here on their level.


I'm pretty well aware of how to use RDBMSes. I even implemented RDBMS engines before. In this case, nobody is writing into a file, so mmap should have been a perfect solution.


Why does iOS ship with SQLite then? I used it for an iPhone app that I wrote. It was more convenient than making my own flat file format.

> No RDBMSes in the system level (OSes, compilers, linkers, etc.).

OS X has them. Any OS with file search needs some kind of indexing technology.

As do mail clients, web browsers, and a bunch of other stuff that works with data.

It's a lot easier to add fast indexing after-the-fact when you didn't design up-front to use flat files.


All the stuff you've mentioned is pretty niche, and, as someone mentioned somewhere above in this thread, mostly done this way by the cargo cultists who defiled the sacred Unix Way. Look at a typical Unix distribution, count how many packages depend on sqlite or berkeley db - a tiny proportion of a total number.

And, no, file search and all that fancy UI stuff has absolutely nothing to do with OSes. It's an application level.


Enterprise applications typically need a data store that offers multiple connections, transactions and atomicity.

Even for very simple apps though I like using a DB to store data, and then text files for import/export to other programs if necessary. Once you have a DB you get so much useful functionality without having to code.


The fact that enterprisey coders got some weird habits does not mean they're doing it all the right way. Outside of the enterprise broken mindset there is very little use for the databases.


How many programs in a typical Unix installation are complex? Not many. Most are designed around the concept of pipes and filters. They don't need a database; they are transformative maps.

Complex systems often require complex data storage. If you have more than one process storing information into a file or set of files, you have to control that type of information. When faced with the option of writing complex, ACID like code or dropping in a PG or MySQL or SQLite option that takes from 1 MB to maybe 128 MB, most will go with using existing tools at the expense of RAM.

Is Oracle the solution for every project? Sure, if you're Oracle. However, one shouldn't throw out databases for alternatives simply because they are a bit complex. In fact, I think that raw file IO is more complex because it is not the norm.


And how many programs need to be complex? Most systems are better off when built from a number of simple components, communicating via text protocols. If multiple entities want to store data in the same place, most likely you're doing something wrong.

I very rarely met things that really needed a database. And in most such cases it was not really a RDBMS. For example, a CAD needs a hierarchical database, relational does not fit there at all.


The file system is just a shitty db. Why does it matter whether it writes to a file system or to sqlite3?


Because all you need most of the time is a flat stream of characters. You don't need any of the DB crap, no random access, no indexes, no structure.


It depends on the problem domain you are working in. I can't think of an application I've done where this wouldn't have required complex serialization and parsing. Those tools are already built for me in the form of ORMs and RDBMSes. Why should I write my own?


> It depends on the problem domain you are working in.

Unix is a pretty general purpose thing. And yet, you won't find anything in it that needs a database, nothing at all among hundreds of applications, including some fairly complex ones, like CAD/CAE tools, IDEs, compilers, etc.

> I can't think of an application I've done where this wouldn't have required complex serialization and parsing.

Are you doing CRUD mostly?

> Those tools are already built for me in the form of ORMs and RDMSes. Why should I write my own?

Parsing?!? In ORMs and DBMSes?!? C'mon, try to parse me some C++ with your Oracle.


> Unix is a pretty general purpose thing.

Yea, until you want to query a normalized data set. Then unix is pretty miserable, because text is very ill-suited for indices and rapid and/or random access.


Turns out, you need to do such things pretty rarely. Not "most of the time", as RDBMS proponents claim.


>Parsing?!? In ORMs and DBMSes?!?

What exactly do you think a database does? Provides a way to durably store tabular data on disk and read it back. Without a database, I have to make that myself by writing serialization routines to put my data structures on disk and parsing routines to go from "stream of characters" to my data structures.

A DBMS places some limits on your data structures but does all of that for you.


If serialisation is hard for you, then you're probably doing something badly wrong. In the tools I'm using it's done by a single line of code. And since most of the time data structures are trees or graphs, being sledgehammered into a tabular structure is not pleasant at all.


> Parsing?!? In ORMs and DBMSes?!?

Yes. Stuff stored in files has to be parsed when you read it out of files.


And how exactly could RDBMS crap help with parsing? Mind parsing a C++ source file with your RDBMS?


The grandparent was talking about the serialization and parsing of the data that you'd be storing in flat files, not parsing of C++ files. Work that you avoid by using a DBMS.


Parsing text files is not a problem at all. Never been a problem. And there are far too many benefits of having flat text streams vs. some inaccessible stupid binary ad hoc formats hidden inside some enterprisey RDBMS guts.


Maybe you can replace a document database with the file system (though reinventing indexes for reasonable performance--which in so many use cases becomes remarkably important past a few hundreds of thousands of records--will be fun for you), but modeling relational concepts on the file system is both hard and time-consuming. And I'm comfortable saying that most nontrivial business applications are fundamentally relational.


> And I'm comfortable saying that most nontrivial business applications are fundamentally relational.

Now try to assess how many applications are supposed to be "nontrivial business applications". It's quite a niche thing, and that's where all this DBMS stuff belongs. Please do not impose this crap on the rest of the world, we don't need your CRUD.


Why are you so randomly hostile about this? Your tone and attitude are super shitty throughout this thread. Chill.

As for CRUD--I'm a platform and devops engineer. I'm not a web developer (except incidentally). But I do often write relational things--like, say, "I have this node, running this application, using this artifact from that data store"--that are best modeled by a relational data store. They're otherwise trivial applications, but reinventing the relational data store would take orders of magnitude longer than writing the application against SQLite or Postgres. How do you propose to model relations across the file system in a way that doesn't make it the tail wagging the dog?

While I'm thinking about it, I'm pretty comfortable going further than "nontrivial business applications" to "a majority of networked applications". The world is relational. Modeling it relationally makes a lot of sense for a lot of use cases.


No, it's RDBMS proponents who are hostile. Their attitude is "you will be assimilated, resistance is futile, we already won, so shut up and obey". This cannot and will not be tolerated.

I was pressured many times into giving up doing things the right way and adopting RDBMSes, and that was not pleasant at all. In some cases someone else who picked up those projects tried to move to RDBMSes (from the hierarchical or document-oriented storages used before) and failed miserably.

So, again, I never came across anything that would need a relational storage. Not even when I was doing heavily distributed HPC (which is as "networked" as it gets). Yes, of course I did a lot of things that are relational naturally - but all of them were in Datalog, with no storage required.

Btw., wtf is "devops"? I've heard this a number of times recently, but nobody could explain the meaning.


Your anecdotes are interesting, but they don't present anything substantive with regard to this discussion that merits the spittle-flecking you're trotting out here. Nobody's telling you not to do as you feel is right, at least not around here - what you describe isn't happening in this thread. You're being strangely aggressive and hostile just at the idea of using a relational data store to store relational data, and I have no idea why.

As far as devops go, the lack of a solid term is why I use the term "platform engineering" for what I do. I build things to make other developers more productive and systems more responsive and scalable.


Wait a minute. The Cult of Codd is proclaiming that everything is relational and everyone should always use RDBMSes, and if they don't, they're doing it wrong and wasting their time. The OP went even further, claiming that RDBMSes cover pretty much everything in CS. If that's not an aggressive cult, I don't know what could be more aggressive.

I'm ok with using RDBMS for storing relational data, everything has its niche and every tool could be used efficiently when needed. What I'm not ok with is when cultists try to force everyone else to use their precious hammer everywhere, for every little nail. I'm not ok when I'm told that if I did not use RDBMSes I did not do any large scale data analysis.


How do you randomly access variable length fields in a text file and modify them and add new ones?


Yes, web is a niche thing. Get over it.


Why would I want to randomly access variable length fields in a file, to start with? It's a very niche thing to do. Filesystem driver may want to do something like this, some large scale caching proxy, probably. In most cases it's just an optimisation (like a berkeley db records backing text files such as /etc/passwd).


> Why would I want to randomly access variable length fields in a file, to start with? It's a very niche thing to do.

If by "niche" you mean every interactive website does it.


I agree. These days, I tend to approach a persistency implementation by asking myself:

* will it be read or write heavy?

* by what data/key will I access the persistent data?

Most commonly for the type of work that I do, it's write once, read N number of times. What I end up doing then is just to keep the data in-mem in a hash table while writing it async to disk. Concurrent reads hit the hash table while the disk write is being performed. The files are indexed by a hash of the data itself; multiple lookup paths can also be supported by using softlinks. Hashed lookup keys are almost always preferred because of possible locale/security reasons (it makes it harder to implement an accidental LFI if you only deal with base36). If needed, and/or depending on access patterns, I also keep the N most accessed items in a hash table in-mem over time. The API for this usually ends up as two functions, Get and Put.

The amount of code for doing this is a couple of hundred lines at most, and its complexity is something I've found by experience to be preferable to using a DB for the type of work that I do. YMMV.
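
For concreteness, something like the following sketch - the paths, hashing scheme, and single-writer executor here are illustrative assumptions, not the exact code described above:

    import hashlib
    import json
    import os
    from concurrent.futures import ThreadPoolExecutor

    class Store:
        def __init__(self, directory="data"):
            os.makedirs(directory, exist_ok=True)
            self.directory = directory
            self.cache = {}                     # concurrent reads hit this dict
            self.writer = ThreadPoolExecutor(max_workers=1)

        def _path(self, key):
            # File name is a hash of the key, sidestepping locale/escaping issues.
            return os.path.join(self.directory, hashlib.sha256(key.encode()).hexdigest())

        def put(self, key, value):
            self.cache[key] = value
            self.writer.submit(self._write, self._path(key), value)  # async disk write

        def _write(self, path, value):
            tmp = path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(value, f)
            os.replace(tmp, path)               # atomic rename: readers never see partial data

        def get(self, key):
            if key in self.cache:
                return self.cache[key]
            with open(self._path(key)) as f:    # fall back to disk on a cache miss
                return json.load(f)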


That's my usual approach as well. Usually an app's data will easily fit in memory within one process (and if it doesn't, that means the app has become so popular that adding an extra X GB of memory is an insignificant cost). Using the language's native data types and an append-only file for persistence is simple and offers high performance compared to doing all access over a socket.


RDBMS have completely dominated in a way that few other technologies have. Especially for anything that might be described as business data (which is the prime value market for databases). Everything else is relegated to share a small niche. If you count SQLite as an RDBMS, the domination is even more complete.

So, my question to you is: why?

It can't be some kind of vendor lock-in, because there is no one vendor; there are several commercial vendors and open source projects available to choose from (each individual vendor might lock in their users, but that doesn't explain why they are still all RDBMSs and not something "better"). It can't be legacy, because SQL is as popular as ever; perhaps more so, as it seems that using a normal filesystem to store data is not very common any more.

So you need to explain why they are so popular yet so wrong; otherwise it just sounds like a case of "everyone is dumb except me".


I can kind of see where you're coming from.

The problem is this - if your developers are good enough to understand all the operating system nuances of handling I/O (how dirty page writeback works, buffer cache vs. page cache, opening a file with O_DIRECT, what append-only mode does exactly to the file, how to use append-only mode effectively, which operations are atomic on the file system, how mmap works, what about fadvise, etc., etc.), they are probably smart enough not to use the database and to write their own storage layer, OR they are also smart enough to read through the fine print and pick a good database.

If they don't know those things, they will want to reach and pick a database off the shelf, so to speak. They might pick a bad database. Because they might not know how to read the fine print, how to benchmark, how to benchmark failure, how to not read into the marketing bullshit "But it said Webscale on the front page, how come it ate my data, I don't understand...".

Unfortunately, quite often databases are treated like black boxes and they are chosen not necessarily because of technical merits. Sometimes it is stupid marketing, sometimes it is golfware ("The CEO spent time playing golf with an Oracle salesman. The salesman lost just enough, but in the end everyone is forced to use Oracle from now on").


While I'd love to be done with passing strings around, fundamentally the relational model is still a great way to store and fetch data, and ACID guarantees are something my software should provide, so I use tools that provide ACID guarantees for me.

I'm down with throwing out SQL, but don't toss the baby with the bathwater. ACID + relational are great!


Thank you everybody for your valuable comments. Like I said in the post, this was my first attempt to explain my passion. I read every one of your comments, and spent the next few days fully understanding their implications.

This post is a bit biased towards what I've interacted with. I think I do have a good sample of various workflows, but there are obviously a large number of databases and use cases that I'm missing, and my view of what is "modern" may not incorporate a lot of bleeding-edge theory and technology.

Operational complexity is definitely my #1 concern. The service I originally maintained is an OLTP system, located at the top of the service & data dependency graph. Availability is the top concern.

The current system I'm writing is a metrics database. The operational burden is much lighter. It is almost a leaf node in the service & data dependency graph, and I can take downtime to restart the cluster. A very different workflow, indeed.

Thanks!


"Designing Data-Intensive Applications" is shaping up to be an excellent treatement of modern databases and their underpinnings. It's at an excellent level of abstraction, deep enough to convey database internals while high level enough (so far at least) to be able to cover a wide variety of database systems. It also has its feet firmly planted in database history, and is NoSQL-koolaid free. Highly recommended.

http://shop.oreilly.com/product/0636920032175.do


When you say "NoSQL-Koolaid free", do you mean "focuses on Relational" or "treats the topic objectively/academically"?


Not OP, but the latter. The author talks both about data models (relational, document, graph-based, with some nice historical tidbits about IMS and CODASYL) and storage models (from B-trees to SSTables, passing through bitmap encoding and everything else).

It is an amazing book so far, and I'm pretty excited for the rest of it to be written. It's the best text I've read so far on databases that gives all the references you need if you want to go deeper, but still allows you to get a very good higher-level understanding without it.


That sounds fantastic. I'll give it a look


It covers modern NoSQL databases but puts them in their proper historical context by comparing them to the (largely rejected) hierarchical database systems. (At least when we're talking about things like 'document stores.')


I've always loved databases, but after having discovered write-only timestamped databases like Datomic, I can't imagine going backwards. It's a real shame that Datomic isn't fully open source.

(Aren't BigTable and Spanner also write-only and timestamped?)


> Datomic isn't fully open source

Couldn't agree more.

I've worked at two Clojure enterprises recently and both really wanted to use Datomic. But in this day and age nobody wants to commit their data to another proprietary database.


Maybe it is more about paying and less about being open source?


Ok, assuming write-only means what I think it does, what is the point of a write-only database? /dev/null is write-only.


I think he means append-only. You can add data, and you can query data, but you can't mutate.


Probably just means "immutable"


Append-only means that every fact is timestamped and nothing is ever thrown away. (Actually, Datomic provides excision, because some regulatory use cases require it, but it's a "black magic" feature.) The idea is that you should never use UPDATE or DELETE, unless it's a hard requirement or a well-studied optimization. The fact that it's 58 degrees today doesn't invalidate the fact that it was 45 degrees yesterday. New facts may replace old ones relative to time-sensitive queries ("what is the most recent temperature?") but you never get rid of the old ones.

This means that queries often have an additional parameter, which is the time. So a type signature for a query type (in a Haskell-like language) would be:

    type Query a = ([Fact], Time) -> a
where the typical use case would use all stored facts and the current time. (Obviously, the DB itself isn't going to do a full-table scan for typical queries. There are optimizations involved. But, conceptually, a query is a function over all facts.)
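
A toy sketch of that idea (this is just the concept; it has nothing to do with Datomic's actual API): facts only accumulate, and queries take an as-of time.

    # Facts are (time, entity, attribute, value) tuples and are never updated in place.
    facts = []

    def assert_fact(t, entity, attribute, value):
        facts.append((t, entity, attribute, value))

    def as_of(entity, attribute, t):
        """Most recent value of (entity, attribute) at or before time t."""
        candidates = [(ft, v) for ft, e, a, v in facts
                      if e == entity and a == attribute and ft <= t]
        return max(candidates, key=lambda c: c[0])[1] if candidates else None

    assert_fact(1, "outside", "temperature", 45)
    assert_fact(2, "outside", "temperature", 58)
    print(as_of("outside", "temperature", 1))  # 45 - yesterday's fact is still there
    print(as_of("outside", "temperature", 2))  # 58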


I've actually recently been thinking of extending this logic to my Anki geography deck. Populations change, but population at some specific point in time shouldn't.


I agree. For me Datomic is the new standard, and I use something else only if there is a good reason to. As with everything, there are many good reasons to do so. It used to be PostgreSQL; now it's Datomic.


"The study of databases intersects almost every topic in computer science" - I've heard this before, especially for Compilers. But it has been false for a long time, CS is far more diverse now. For example, how do Databases intersect AI/Machine Learning/Computer Vision, Computer Graphics, Numerics/Simulation, Robotics, Bioinformatics, Computer Architecture or Cryptography?


Taking some leaps here but

AI/Machine Learning/Computer Vision - online index selection / query plan optimization.

Computer Graphics - geo/3D indexes.

Numerics/Simulation - query plan optimization.

Robotics - if you go meta here, there is a lot of overlap with distributed system coordination.

Bioinformatics - minimal, except that this by definition requires uniquely rich data structures that need to be able to be accessed and stored in a database.

Computer Architecture - databases certainly are themselves architected to perform on computer architectures, and the reverse is probably true.

Cryptography - databases often employ encryption and other security capabilities that are of critical importance, but true, this isn't the field you mention.


> how do Databases intersect ... Bioinformatics

You must be kidding. The whole field of bioinformatics was born when the biologists started to have large enough collections of protein and DNA sequences that working with them by hand was not practical anymore.


"But it has been false for a long time"

Actually it has been getting more true for a long time. With computers in nearly every aspect of society, we have more data than ever to sift through. AI/ML is a big data problem. Numerics and simulations become more effective with large amounts of data. Instead of simulating a result from 1 data point, try 100s, 1000s, or millions of data points.


> how do Databases intersect AI/Machine Learning

I am not in any way a computer-science expert, but surely these topics have substantial overlap? (For example, a machine that is learning must store its accumulated knowledge somewhere.)


There is data mining, which is like machine learning + databases. But for many, many machine learning problems, the learning occurs offline and the data is simply stored in files, then the learning algorithm is run. The result is a model, which in many cases has a small constant representation (common exceptions: kNN and SVM). There is no database involved. Now you can of course try to learn from an existing database (data mining) or from huge amounts of data, probably stored in a distributed database (big data). But many systems will not and even then the database part is usually not of much concern. So I still don't see much overlap.


The overlap goes well beyond the question of whether a literal database software system is used: knowledge representation (as in modern AI), feature representation (as in ML), and relational data modelling are essentially all different branches of the same domain of study.

Consider, for example, the connection with logic resolution languages like Prolog.


All of the fields you've listed are largely about the representation of data.

A database is simply data that's structured to support an algorithm that models some aspect of reality based on that data.


Sure you can say, everything inside a computer is about representing data, so all is the same. But it's not particularly useful to do that.


>you can say, everything inside a computer is about representing data

hence: "The study of databases intersects almost every topic in computer science"


I just claimed this point of view is useless, so you can't just pick it up as if I agreed.

Furthermore data representation in that broad sense is more related to data structures, not databases.

Really, we all know what a database is. It is something very different from, say, a block cipher.


You claimed that point of view was useless, but you didn't argue why it was useless. My point of view is that none of "AI/Machine Learning/Computer Vision, Computer Graphics, Numerics/Simulation, Robotics, Bioinformatics, Computer Architecture or Cryptography" can be implemented (or even well conceptualized) without figuring out how they are to store, retrieve, encode, efficiently memoize, guarantee consistency or atomicity, or apply bulk transformations to their data.

It's silly to dispute that databases touch all areas of computing by simply declaring that databases touching all areas of computing is meaningless, because it's obvious that databases would touch all areas of computing, because computers are data storage, retrieval, and processing devices.

Maybe you're restricting your definition to RDBMSes (while mentally drawing some arbitrary sharp distinction between data structures and databases.)

>Really, we all know what a database is.

This is simply a declaration. If you know what a database is, define it, then make your case about what areas of computing that it is irrelevant to. Argument by common sense is empty.


I love databases, but I loathe SQL. And no, I don't mean NoSQL is better - that's throwing out the baby with the bathwater.

To me, SQL is the Common Lisp of relational languages - a brilliant invention of its time that has long since overstayed its welcome and should be replaced by modern considerations of the problem it solves. The difference is that there are a million rethinks and descendants and redesigns of LISP out there that happily threw out the mistakes and made great strides in the language. You could argue that every modern programming language is a descendant of Lisp thanks to the prevalence of great concepts like lexical closures. SQL, on the other hand, has a teeny tiny few spiritual fringe descendants like the various attempts at Date and Darwen's "Tutorial D".

I love the relational model, but who says the only way to manage the relational model is this hoary old thing? It's immensely frustrating that every implementation of SQL bolts on a tacky and half-assed procedural language, but doesn't solve simple underlying frustrations.

Simply accessing related objects is immensely wordy for a "relational" language. In an algol-derived language, I can say Group.Manager.Person.Address.PostalCode to walk the graph. In SQL, I have to deal with a zillion joins.

Yes, some SQL variants let you join by the foreign key name to make the join a little more terse, but it's still hairy compared to every modern functional or procedural language.

And the APIs - maybe the reason so many sites have SQL injection problems is the hideous APIs. Ever tried to build a WHERE IN (id1, id2, id3... idN) statement with proper parametrized queries? Holy crap, what an icky mass of boilerplate. I mean, it's not a hard problem, but how many times have you solved it, and how many times have you found a tedious bug in your solution? Just give me a proper way to concatenate the parametrization inline with the query, FFS.

  db.RunQuery(
    "SELECT * " +
    "FROM MYTABLE " +
    "WHERE ID IN " + db.SomeParameterListFunc(a, b, c) + " " +
    "ORDER BY HOLYCRAP_WAS_THAT_SO_HARD");
The above syntax would be trivial in any language with operator overloading on the "+" sign, on the off chance that your SQL dialect is so messy it's impossible to safely build a properly-escaped initializer for the list containing a,b and c in text form.
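
For comparison, the boilerplate being complained about typically looks something like this (Python and sqlite3 purely for illustration; the table and ids are invented):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany("INSERT INTO mytable VALUES (?, ?)",
                     [(i, "row %d" % i) for i in range(50)])

    ids = [3, 17, 42]
    placeholders = ", ".join("?" for _ in ids)   # "?, ?, ?" - the tedious part
    query = "SELECT * FROM mytable WHERE id IN (%s) ORDER BY id" % placeholders
    print(conn.execute(query, ids).fetchall())   # [(3, 'row 3'), (17, 'row 17'), (42, 'row 42')]

Only the placeholders are interpolated into the query string; the values themselves still travel as bound parameters.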

And that's not even getting into real actual first-class language support like ORMs give you.

And speaking of APIs, the fact that a single "SELECT" is the baseline operation... that you work on one resultset at a time. I don't want a single pile of rows. This is not an Excel spreadsheet, it's a relational database, and that means I want a graph of data. I don't want to write three queries to get my Customers, their Personnel, and their Addresses, nor do I want a single row of CustomerPersonnelAddresses. Once and Only Once is good for the data, why the heck isn't it good for result sets?

Where's the code reuse? Why can't I have a pile of SELECTs and a pile of WHEREs and combine them however I see fit? Oh right, I can use a VIEW... but see the previous point, a VIEW is a single glorified Excel spreadsheet, not a proper graph of data. If I want to bundle a bunch of SELECTs together, I have to just write a stored procedure, but then I can't use the proc with a JOIN statement against my VIEW that provides a custom WHERE clause. You could do something monstrous with table-valued parameters, I guess, but those aren't generally well-supported at the API level. This is not a hard problem in every modern language (except Go, of course - yes, you do freaking need map/reduce/filter).

Namespaces. Real, actual, organizational tools for your giant list of 9000 tables and their related objects. No, schemas don't freaking count - you can't nest them and they're overly tied to the security model; using schemas for organization instead of security leads to madness.

And of course, so many common problems simply aren't nice to work with using the relational model. How do I make a nice audited row where I have the full history of all the row's changes? Well, I can insert it every time, but that's a lot of wasted space. Yes, again, there are ways to do this, but it's something I'd expect to come out of the box since it's such a common problem. Common problems should be solved by the standard library. But SQL can't solve things like this with the standard library, because SQL standard libraries are limited to crude things like functions and views and procs, and not actual large-scale reusable constructs. It's like a programming language where they gave you a bunch of general tools for manipulating unicode points and dynamically sized arrays but no coherent "string" object.

Or trees. Holy crap, you have a "relational" database where a relationship like a "tree" is a nightmare to actually query out! I know that's not what "relational" means, but still - this isn't exactly a rare edge case, y'know? But it's not in the standard library, because the standard library is limited to crude objects like data types, functions, procedures, etc. that work below the row level. Any concept of reusable schema concepts is completely left off the table.

/rant


The language issue you raise is in part solved by C# and Entity Framework.

Because of the lazy way it works, you can define a single select over a table, then apply extension methods, joins, etc. over the return of this select, and you end up with a highly reusable, clean, easily tested data layer. Done right, you get code like the following:

  var url = _urlRepo.GetUrl().OrderByCreateDate().ByCreatedUser("boyter");
  var url2 = _urlRepo.GetUrl().OrderByCreatedUser().ByDomain("geocities");
Or with joins,

  var users = _userRepo.GetUser();
  var locations = _locationRepo.GetLocation();

  var result = from user in users
               join location in locations on user.locationid equals location.id
               where location.name == "Parramatta"
               select user;
I wrote about this a bit here http://www.boyter.org/2013/07/clean-repository-data-access-c... and it is, to date, the cleanest way I have come across in any language to deal with SQL. Compile-time checked, reusable, and testable.

It's the best ORM I have found so far across any language. I would dearly love to have it ported to other languages and frameworks.


I use entity framework. Like any ORM it has some abstraction leaks and a few warts, but all in all it only cemented my opinion above that SQL is holding RDBMS systems back.


It's funny you mention entity framework. That's what blows up tempdb on one of my many environments with some of the stupidest queries I've ever seen hit the server.

Maybe your experience with entity is better than mine, but I personally think it and most other ORMs are complete garbage.


I agree with this (sort of). I find Entity Framework very usable and it does help you write descriptive and maintainable code, but the SQL queries it ends up writing make it a performance nightmare. YMMV of course, but for applications where performance is important I haven't found EF to perform adequately. As ORMs go, all I really want is an abstraction layer and the mapping from query to POCO classes without the performance hit. Currently using Dapper.


Although I understand your frustration, if SQL queries were blowing up tempdb, would you blame SQL or would you blame the developers writing those stupid queries?

ORMs are not a magic bullet and developers can do stupid things with any tool.


Your rant doesn't need a point-by-point response. Problems always exist when you attempt to force-fit two idiomatically divergent languages.

SQL was designed for reading and updating relational data. Data in tables that can be joined with relational intersections, unions and complements. For that it does its job. Idiomatically, it has its own "code modularity" paradigm. (And no, I don't mean stored procedures here).

Object-oriented folks want to overlay the object paradigm and the attendant idiomatic language aspects on top of SQL. Painful, but not SQL's fault.

This is your proverbial impedance mismatch. We have to "deal with it" without compromising the idiomatic strengths of either paradigm.


What?

> SQL was designed for reading and updating relational data. Data in tables that can be joined with relational intersections, unions and complements.

Obviously. And it fails miserably at providing you with tools to reuse this code.

> This is your proverbial impedance mismatch. We have to "deal with it" without compromising the idiomatic strengths of either paradigm.

That's a cop-out. Purist functional programming languages provide namespaces, packaging systems, etc. even though nowhere in the academia of "lambda calculus" is stuff like that mentioned. At some point the rubber hits the road and you have to interface with your beautiful kernel of relational algebra, and you have to organize all that relational stuff.

Nowhere in Date and Darwen's work does it say that a relational set must be defined by a name. Why can't it be defined by a tree of namespaces? Or packages? Or anything other than simply a flat list of names?

At some point it's time to stop making excuses for this antediluvian system. SQL platforms make exceptions from the SQL spec for stupid things like bolting on a procedural scripting language, but never for smart things like better organization tools for code-reuse.


With some small changes to SQL you can get something similar that probably gets closer to what he describes. It seems to be called NewSQL in a lot of cases. One thing some people like with SQL is that the query engine translates the language to the actual query; if this is what you want, you can build it on top of many databases. Being able to return something more advanced than simple tables would also be nice in a lot of cases.


Use a hierarchical DBMS if you need trees. Relational does not fit here.


    CREATE TABLE tree (
      id    INTEGER,
      data  VARCHAR(8),
      left  INTEGER,
      right INTEGER,
      PRIMARY KEY (id),
      FOREIGN KEY (left)  REFERENCES tree(id) ON DELETE CASCADE,
      FOREIGN KEY (right) REFERENCES tree(id) ON DELETE CASCADE
    );


Now try to convince any CAD company to use this abomination to store their naturally hierarchical geometrical data. Good luck. Your RDBMS will choke to death under even the most liberal performance constraints.


The point is that relational algebra happily models trees quite well.

The fact that the actual RDBMS and SQL support for trees is piss-poor should be considered a failing of those systems - users very often have data that mixes the relational model and a tree model, because a tree is a special case of a relational model. Splitting that data between two systems would be a bad solution.
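
For what it's worth, standard SQL can at least express the traversal in a single query these days via recursive CTEs. A hedged sketch (SQLite through Python, using a parent-pointer layout rather than the left/right columns above; the sample rows are invented):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE tree (
            id     INTEGER PRIMARY KEY,
            data   VARCHAR(8),
            parent INTEGER REFERENCES tree(id) ON DELETE CASCADE
        );
        INSERT INTO tree VALUES (1, 'root', NULL), (2, 'left', 1),
                                (3, 'right', 1), (4, 'leaf', 2);
    """)

    # Walk the whole subtree under node 1 in one query.
    rows = conn.execute("""
        WITH RECURSIVE subtree(id, data, depth) AS (
            SELECT id, data, 0 FROM tree WHERE id = 1
            UNION ALL
            SELECT t.id, t.data, s.depth + 1
            FROM tree t JOIN subtree s ON t.parent = s.id
        )
        SELECT id, data, depth FROM subtree
    """).fetchall()
    print(rows)  # e.g. [(1, 'root', 0), (2, 'left', 1), (3, 'right', 1), (4, 'leaf', 2)]

Whether the plans such queries produce are competitive with a tuned hierarchical engine is a separate question, which the replies below get into.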


> models trees quite well.

Not quite. Performance is far from acceptable compared with heavily tuned hierarchical DBMSes, and I suspect it won't be easy to ever overcome that. The overhead is too big.

> because a tree is a special case of a relational model

There are many applications which do not need a relational part, when everything fits the hierarchical model well. Of course, if there is a need to mix both worlds, there is no other choice but to stay within the relational model.


Seems like there might be an opportunity here. Definitely a lot of hard work, and adoption would be pretty hard, but you have to start somewhere. Is there nothing quite like what you described yet?


Amen amen amen.


> You cannot give up partition tolerance.

It'd be more accurate to say you don't want to give it up all the time. You don't want CAP; you want PACELC.

http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-an...


This is an interesting categorization - somewhat like adding an axis to the political compass. Thanks.


I wonder why RethinkDB hasn't gotten a mention here. Have folks here used it? Thoughts?


Since we're discussing databases... is there any "gold standard" learning resource/introduction to PostgreSQL? As a college student, I do not have much experience with them yet, but I am aware of how important it will be to be comfortable with them in day-to-day development. Something tells me that my usual approach of diving in and tinkering by building out an idea wouldn't serve me as well for DBs; it just seems to me that there are some fundamental database concepts that I would be missing if I went down that path.

Any advice as to where to start a structured approach to learning about databases would be highly appreciated :)


I would recommend "Architecture of a Database System" by Joe Hellerstein, Michael Stonebraker, and James Hamilton [0] as a start if you want a formal perspective without getting too much into theory.

[0] http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf


As a side note, Michael Stonebraker himself is nowadays building VoltDB, which is an in-memory DB that he thinks is much more evolved than relational systems. I haven't used it but it looks interesting. I did try OrigoDB, which is an open source .NET-based in-memory DB, and it worked great for my needs, and the queries are really fast.


(I work at VoltDB)

VoltDB is easy to try. There is a free 30-day trial available on voltdb.com, and a docker-based demo (http://voltdb.com/voltdb-demo).


Apologies for the self-plug, but you might find http://pgexercises.com/ useful for the SQL side of learning Postgres. In terms of rest of the concepts, the Postgres documentation is actually really rather good!


I have lived a very sheltered existence. I have never worked on an application which had a database cluster or sharding, rather than just running on a single server.

Of course, the servers are a little bigger now than they were back around say 1990.


These things are necessary only at internet scale. Your typical business application does not need a sharded, distributed database, even in a large enterprise system.


I love databases for the data they contain. And for the ability to make sense of that data more easily when it's in a nice, structured format.

I don't care as much for the dev-ops side of it.


What I find both fascinating and scary about databases is how to choose among the wide variety of databases without understanding exactly how they work. And it doesn't help that there are new databases springing up all the time.

Is there a way for application developers to understand these databases quickly without spending weeks working with them?


The simple answer is to use PostgreSQL, unless you have a good reason not to.

There are definitely cases where PostgreSQL is not the optimal choice, but if you don't know why you're not a good match for Postgres, you're probably not in that group.


Postgres, and MSSQL if you can afford it, are the Swiss Army Knives of database systems. And I mean that in a good way.


For a second I thought you said MySQL and was a bit puzzled.

I know indirectly that MSSQL is quite a good database, but since you are putting an open source and a proprietary database next to each other, it makes me wonder why you picked MSSQL and skipped, for example, a competitor such as Oracle.

This is an honest question; I know very little about MSSQL or Oracle, so I'm curious why you picked this one. I know that at least 10 years ago MSSQL and Oracle were close competitors - did Oracle lose that race?


MSSQL can be a great value for the money if you need all of the additional tools (SSIS, MSAS), and you are already in a Windows environment. For a straight rdbms, pg is a great solution.

Oracle is something you only use when you have reached some problem that only Oracle can solve. Even then, I would still try to work around Oracle. It is expensive and requires a large amount of tuning to make work well.

Oh, and pretend MySQL doesn't exist.


I've used Oracle since 1998 and SQLServer since early 2000s. Oracle has great defaults now and the free version is very good. Licenses for MSSQL and Oracle can be competitive - obviously if you are a MS shop SQLServer will become cheap(er) in comparison. I prefer Oracle because I know it better and I sometimes work with *nix. PL/SQL packages are far better than the comparable stored procedures in MSSQL. APEX is pretty unique too.

If I was starting on an open source or startup project with a modern web framework I'd definitely be looking at using Postgres - it looks great.


Fair enough. It has been a while since I used Oracle, and we used to joke that the defaults were set such that you had to hire Oracle as a consultant to set up the db and tune it :)

Clearly competition has helped there.

> If I was starting on an open source or startup project with a modern web framework I'd definitely be looking at using Postgres - it looks great.

I'm in complete agreement; if I was starting a new project today, pg would be the default, and the question would be why not to use it.


Actually if I was starting something from scratch AND I needed to get something done quick AND I had an Oracle licence/host I'd use Oracle APEX.

http://www.oracle.com/technetwork/developer-tools/apex/overv...

Which, as an Oracle user and PL/SQL programmer, I find a great way to build an app quickly.

I guess if I was building something (quickly) from open source, I'd use Django, Java Spring, or Grails.


I mentioned MSSQL specifically because of personal experience with it over the past 8 years. I see PostgreSQL as the best open source alternative. I have little to no experience with Oracle, so I can't speak for it, but I suspect it's just as functional and performant.


Everyone avoids MSSQL, Oracle, DB2, Teradata, etc. these days, and not for any technical reasons.

It purely comes down to cost, both in the short term and the long term. Even in enterprises, it's becoming hard to justify the vendor, consulting, and hardware costs.


Don't listen to anyone who blindly suggests one direction. Would you listen to someone who said "use C++ for everything"? No. Because it's dumb advice. Use the right tool for the right job. If you're an application developer you need to invest the time to try some of these databases. In every startup and enterprise I've worked at in the last 10 years it has been application developers who chose the database.

You can cover your bases by trying out a few: PostgreSQL, Cassandra, MongoDB, Couchbase. Sites like MongoHQ, Cloudant, Amazon RDS can spin up a database for a few dollars and give you an opportunity to try building a simple application.


It's pretty funny how the comment right above this one in the thread suggests the exact opposite.

I think "use the right tool for the job" is a notoriously misleading phrase. What does "right" mean, and what is "the job?" If you are at the point where certain databases simply won't work, then of course there's a way to measure this, but most of us are not in this situation. There is operational and mental overhead for each data store you choose. If you end up running half a dozen data stores because for some esoteric reason they didn't feel quite "right" for each individual problem, you're going to have a much different system if you compromised on having a "suboptimal" general purpose data store that could be re-used for several of them. There are all sorts of negative downstream effects of introducing a new data store, which can bleed all the way up to the higher level organizational structure.

When you consider that deciding something is the "best" tool is often an exercise in prediction, it's pretty clear that it's easy to get wrong, so a strong argument can be made that you should lean towards one general-purpose data store for most projects until incidental complexity from 'impedance mismatch' or performance needs dictates a change. If I had to choose, I'd say you should have an RDBMS and a persistent, durable log (like Kafka). There is something fundamental about representing your data as tables and events (one is just a projection of the other). From there, a document store, k-v store, graph, fancy distributed sharded system, etc., are essentially optimizations or 'sugar' over the same data, so YMMV as to when to introduce them.
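
To make the "projection" point concrete, here's a toy sketch (Python; the event names are made up, and nothing here is Kafka-specific): the table is literally a fold over the event log.

    # A table is just a projection (a fold) over an append-only event log.
    events = [
        {"type": "account_opened", "account": "a1", "balance": 0},
        {"type": "deposited",      "account": "a1", "amount": 50},
        {"type": "withdrawn",      "account": "a1", "amount": 20},
    ]

    def project(log):
        """Replay the log into a 'table': account -> current balance."""
        table = {}
        for e in log:
            if e["type"] == "account_opened":
                table[e["account"]] = e["balance"]
            elif e["type"] == "deposited":
                table[e["account"]] += e["amount"]
            elif e["type"] == "withdrawn":
                table[e["account"]] -= e["amount"]
        return table

    print(project(events))  # {'a1': 30}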


I just don't think you understand the expression. "Right tool for the right job" means from the perspective of the person choosing the tool. There is no "right" answer. And saying "use PostgreSQL for everything" is no different to saying "use C++ for everything". It's dumb and sloppy.

Choosing a database is an important decision. Just like choosing an application stack. People should be encouraged to spike different approaches before going down a path that we both agree will be difficult to unwind from.

And I agree that tables/events are somewhat fundamental, but they aren't exclusive to RDBMSes. I have built CQRS systems using Cassandra many times before.


Saying "use PostgreSQL for everything" is much more like saying "use Linux for everything" than it is like saying "use C++ for everything." It is a reasonable and profitable constraint to make on a given system architecture.


Use a "NoSQL" database when you have very large amounts of data (so you need 1000s of separate disks) or large volumes of queries (so you need a 1000s of separate servers). Otherwise MySQL, SQL Server, Oracle etc. is probably fine.

NoSQL has a learning and 'new technology' overhead that isn't worth paying unless the pain of using traditional databases is too high.

Don't forget SQLite too - nice for very small applications.


There are many other reasons to use "NoSQL", not just big data. Sure, MySQL would probably be fine, but somebody might just think that RandomDB is nicer to use, easier to set up, closer to their data model, or any number of other things.

> NoSQL has an learning and 'new technology' overhead that isn't worth paying unless the pain of using traditional databases is too high.

You say that as if everybody is born with knowledge of SQL. I would of course still teach SQL to every new programmer, but I'm just pointing out that this is not the best argument.


Support for tooling, and the ease of hiring both programmers familiar with querying and DBAs, is much better on the relational side.

The pain caused by relational being a 'bad fit' needs to be pretty high to consider NoSQL, in my opinion - except for hobby projects (I did some CouchDB stuff in my spare time for the thrill, but MySQL would have been adequate).

True, no one is born with SQL knowledge (learning to breathe is the first priority). However, stick your ad up for an experienced Cassandra DBA and then an experienced Oracle DBA and see how you do.


It's not necessarily just about the quantity of data. What you want to do with the data (the queries you want to run) should play an important role in deciding your database.

For example: a query workload with lots of deep traversals would make a graph database attractive even for a relatively small dataset.


I like SQL. It makes sense to me, personally. Even when a query gets really unwieldy, I can always take a step back and break it down into parts. If I'm allowed to use temp tables, I can do almost anything in it, and usually a lot faster than developers can in their own language.
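
For example, the "break it down into parts" trick with a temp table looks roughly like this (a minimal sqlite3 sketch; the table and column names are made up):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (id INTEGER, customer TEXT, total REAL);
        INSERT INTO orders VALUES (1, 'alice', 120.0), (2, 'bob', 30.0), (3, 'alice', 75.0);

        -- Step 1: materialize an intermediate result instead of one giant query.
        CREATE TEMP TABLE big_spenders AS
            SELECT customer, SUM(total) AS spent
            FROM orders
            GROUP BY customer
            HAVING SUM(total) > 100;
    """)

    # Step 2: work from the simpler intermediate result.
    for row in conn.execute("SELECT customer, spent FROM big_spenders"):
        print(row)  # ('alice', 195.0)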


This was a great article. Extremely helpful to me, coming from a position of being a server-side application developer often tasked with getting several disparate data stores talking to each other.

The author does a great job of showing the full depth of the field while providing useful hooks and links for further study. This one made my bookmarks folder. I'll surely be going back to check it out again.


This is what I needed. Great read.


In my opinion, something more important: does the database love your data? Only then can you truly love your database.


I'm glad he loves databases, databases have been the bane of my existence.

However, the torment they have given me has also led to a similar fascination - and now I'm writing my own database! So I've become very familiar with the topics he writes on, and they are very good points for anybody interested in the subject.

Why would I write my own database? Because databases are hard, and I am determined to make them easy (even if that means sacrificing years of my life doing all this crazy research). Check out http://github.com/amark/gun !

- CAP Theorem: he is correct, P cannot be sacrificed. GUN is AP with eventual consistency. The beauty of this, though, is that you can always build strong consistency out of eventual consistency (it just requires knowing X amount of peers in advance, and doing a trivial lock until you've heard back from all of them - in fact, I do this in one of the example apps), but you can never go from strong consistency down to an eventually consistent system.

- Distributed Systems: this is incredibly, incredibly important. I cannot repeat this enough: there should be no "master" or "single source of truth" in any database. If there is, you're going to have a nightmare of a time being woken up at 3am to fix it when it crashes (my personal experience with other databases). Why? Because single points of failure will fail, and centralized systems suck. Solution: distribute and decentralize! We make this easy for you.

- Correctness vs Efficiency: as he says, Paxos is difficult - all of them, Raft, Quorum, leader election, etc. DO NOT USE THEM unless you are Google, Amazon, Walmart, or what not. Even then, do not use them. Instead, I've solved this challenging problem by developing a new Conflict Resolution system that (very poorly) can be summarized as Vector Clocks + Timestamps; you get the advantages of both without either of their weaknesses (rough sketch at the bottom of this comment). What this means is that data integrity is guaranteed because every machine is using a deterministic algorithm, without any extra gossip between machines. Let me repeat: you'll get the same result on every machine, eventually consistent, without any multi-machine coordination. This means every peer is a master, and that is awesome, even if you are running it on an ephemeral server/cloud - completely resilient to failures, terminations, restarts, and reboots.

- Empowering the App. Yes. Databases should serve you, not the other way around. Answers to his questions about abstractions are at http://github.com/amark/gun .

- Operational Challenges. This is where I diverge from him. If something seems wrong, like things suddenly becoming slow, you can easily just restart it without any damage/harm/failure occurring. And then you can look through your logs, taking your time, to see what went wrong.

- Basic Building Blocks. Because GUN is a graph database, you get key-value-like access, as well as document and relational styles. That is because, mathematically, graphs are a superset of relational algebra and hierarchy trees.
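
Here's the rough sketch of the conflict resolution idea I promised above. To be clear, this is NOT GUN's actual algorithm - it's a toy showing the one property I care about: every peer runs the same deterministic merge, so they all converge on the same value without coordinating.

    # Toy only, not GUN's real algorithm. Each value carries a logical
    # "state" counter; higher state wins, and ties break deterministically
    # on the value itself, so every peer makes the identical choice.
    def merge(local, incoming):
        if incoming["state"] != local["state"]:
            return incoming if incoming["state"] > local["state"] else local
        return local if str(local["value"]) >= str(incoming["value"]) else incoming

    a = {"state": 3, "value": "alice"}
    b = {"state": 3, "value": "bob"}
    # Merging in either order gives the same winner on every peer:
    assert merge(a, b) == merge(b, a) == {"state": 3, "value": "bob"}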

Happy to answer any questions!


> you can never go from strong consistency down to an eventually consistent system

Sure you can. In fact, almost every eventually consistent database is built on a collection of strongly consistent single systems. Eventually consistent systems are even frequently built on top of consistent distributed systems, and some of the largest infrastructure on the planet works just like that.


I'm not quite sure where you get that. A strongly consistent system requires other database peers to lock on writes, which isn't needed for eventual consistency.

See http://aphyr.com/posts/313-strong-consistency-models for some great descriptions of how this works; he is quite the expert at testing these things.


> Paxos is difficult - all of them, Raft, Quorum, leader election, etc. DO NOT USE THEM unless you are Google, Amazon, Walmart, or what not. Even then, do not use them. Instead, I've solved this challenging problem by developing a new Conflict Resolution system

Just in case anyone else is considering this: it's a bad idea. Developing a correct consensus algorithm is a lot of hard, unusual work. It's also useless until you can prove it correct.

Paxos, Raft and ZAB (Zookeeper's protocol) have already been proven correct. Pick a library for one and use it. Raft is probably easiest to understand and use.
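
To illustrate what "pick a library and use it" looks like in practice, here's a hypothetical sketch - `raftlib`, its class names, and its arguments are all made up, not a real package - but consensus libraries generally follow this shape: you supply a deterministic state machine, and the library handles elections and log replication.

    import raftlib  # hypothetical library, for illustration only

    class KeyValueStore(raftlib.StateMachine):
        def __init__(self):
            self.data = {}

        def apply(self, command):
            # Called in the same order on every node once a command is committed.
            op, key, value = command
            if op == "set":
                self.data[key] = value
            elif op == "delete":
                self.data.pop(key, None)

    node = raftlib.Node(
        self_address="10.0.0.1:7000",
        peers=["10.0.0.2:7000", "10.0.0.3:7000"],
        state_machine=KeyValueStore(),
    )
    node.start()
    node.propose(("set", "greeting", "hello"))  # commits only once a majority agrees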


CRDTs are probably the best thing out there, and the most understandable. They focus on data that does not require complicated leader election algorithms - and this is coming from Facebook and the Cassandra team.

Why? Because these systems are very weak and prone to failure - just check out some of http://aphyr.com/ 's stuff; he tears a lot of popular databases apart. Even if the theories are "proven right", the implementations are often wrong.
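
For a taste of why CRDTs are so understandable, here is a toy grow-only counter (the classic G-Counter; a sketch only): each node increments only its own slot, and merge is an element-wise max, so replicas can merge in any order and still converge - no leader election involved.

    # Toy G-Counter CRDT: merge is commutative, associative and idempotent.
    def increment(counter, node_id, amount=1):
        counter = dict(counter)
        counter[node_id] = counter.get(node_id, 0) + amount
        return counter

    def merge(a, b):
        return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}

    def value(counter):
        return sum(counter.values())

    a = increment({}, "node-a")      # node A saw one event
    b = increment({}, "node-b", 2)   # node B saw two events
    assert value(merge(a, b)) == value(merge(b, a)) == 3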


None of the systems that failed aphyr's tests used a proven consensus algorithm correctly. If you look at the Zookeeper test, it performed perfectly, as expected. Even etcd/Consul would pass if they chose to; the defaults are merely bad by choice.

CRDTs are useful for salvaging consistency out of an inconsistent database, but they're not the only choice possible.


Good point. BUT that is a tautology - of course systems that implement it correctly are correct. Unfortunately, there are very few of those things out there... and Zookeeper, etcd, and Consul are all relatively new, too.

This second point goes back to my statement that "strong consistency can be built from eventually consistent systems, but you can't go from a strong system to an eventually consistent system." And this is where and why gun makes sense. Start with an eventually consistent system, because it is far, far easier and covers 98% of the use cases out there. And only then, when you need consensus on edge-case data, build it out - but make sure you understand what you are doing.


It's always fun the first time. Good luck.

Some notes from my experience implementing AP:

- The easiest way to support scaling out is pre-sharding the entire database at the beginning, so there will never be a need to reshard it. For example, have 16 shards spread manually across all nodes, and put in the configs which nodes synchronize with which. Once you add another node, simply move some shards to that node and be done with it (see the sketch after this list).

- Synchronization is easy with hash/Merkle trees, but you don't need to keep in-memory hashes of every single record or recalculate them on insertion or deletion. Instead, every time a new record is inserted, its hash should be added arithmetically to the one stored in the corresponding leaf; on deletion, it is subtracted arithmetically. No scanning of other records for no reason, or anything like that.

- Jitter is super important. Synchronizing data or doing anything periodically will cause lag spikes if you forget to add randomness to the schedule.

- Also, don't forget that random disk accesses are slow; organize your data in such a way that disks are read sequentially for most of the bulk access patterns, such as during synchronization.
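
Here's the sketch I mentioned, covering the first two points (the shard count and hash choice are made up; not production code):

    import hashlib

    NUM_SHARDS = 16  # fixed up front, so records never move between shards

    def shard_of(key):
        # Stable key -> shard mapping; only the shard -> node assignment changes.
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_SHARDS

    def record_hash(key, value):
        return int(hashlib.md5((key + "|" + value).encode()).hexdigest(), 16)

    # One running sum per shard, updated incrementally -- no rescans on insert/delete.
    shard_sums = [0] * NUM_SHARDS

    def insert(key, value):
        s = shard_of(key)
        shard_sums[s] = (shard_sums[s] + record_hash(key, value)) % (1 << 64)

    def delete(key, value):
        s = shard_of(key)
        shard_sums[s] = (shard_sums[s] - record_hash(key, value)) % (1 << 64)

    # During synchronization, two nodes compare shard_sums and only walk
    # the shards whose sums differ.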


Thanks! Although I would like to disagree with some of your points:

- Sharding is only necessary for centralized systems, where there exists some "Single Source of Truth" for the record. GUN can and will store things in multiple places, so it behaves more like BitTorrent. This reduces the complexity and thus makes the database significantly easier to use and scale; however, I had to solve some hard conflict resolution problems to get that to work.

- Yes, Merkle trees are awesome. But deletion is a very dangerous action in a database and should be avoided (as noted in the original article). It causes all sorts of potential data-corruption/divergence problems.

- All data syncs in real time, because GUN sports push notifications. However, if there is a network partition, the data will sync (without problems) after it is restored. So yes, agreed: don't do things periodically - do them immediately!

- Disk access is slow, correct. That is why you should keep as much of the working set in memory as possible. Doing so minimizes disk access, so the majority of your write/read ops are lightning fast.

Do you have any links to your stuff?


I'm not gonna argue with your points, just gonna warn you that you have too much confidence in your algorithms. Be careful.

> Do you have any links to your stuff?

No, it's not open source. Maybe with the next implementation I will finally be ready to share it.


Algorithms should never be used based on confidence, only on whether they work or not. Try them out and let me know your thoughts!


Hey, can you suggest some good reads for understanding databases enough to build a simple relational database ...


You can look at all the open source databases that are out there. You can also look at things like FoundationDB and its SQL layer. It shows how you can build a relational API on top of a NoSQL DB with transactions. I think that will be open source on GitHub: https://github.com/FoundationDB/sql-layer.
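
As a toy illustration of the idea (nothing like FoundationDB's real SQL layer): store each row under a (table, primary key) composite key in a key-value store, and a basic SELECT becomes a scan over that table's key range.

    kv = {}  # stand-in for a (distributed, transactional) key-value store

    def insert(table, pk, row):
        kv[(table, pk)] = row

    def select(table, predicate=lambda row: True):
        # A full "table scan" over the keys belonging to `table`.
        return [row for (t, _), row in kv.items() if t == table and predicate(row)]

    insert("users", 1, {"name": "alice", "age": 34})
    insert("users", 2, {"name": "bob", "age": 27})
    print(select("users", lambda r: r["age"] > 30))  # [{'name': 'alice', 'age': 34}]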


Coming soon! (I also write my own databases, because it's fun.)


Jerry, would love to see your list. Have any links to your database work?


A database... in JavaScript... The next thing (probably already done) is a mail server.


I think Haraka[0] pre-dates JS databases.

[0] https://haraka.github.io/


Since we have:

http://bellard.org/jslinux/

We should also have the Haraka package.


Joking aside, isn't node.js designed for streaming data, almost exactly the way a mail server needs to?


Zawinski's law of software envelopment:

Every program attempts to expand until it can read mail. Those programs which cannot so expand are replaced by ones which can.


"The beauty of this, though, is that you can always build strong consistency out of eventual consistency (it just requires knowing X amount of peers in advance, and doing a trivial lock until you've heard back from all of them)"

Sounds like a way to trade A for maybe-C, there - what happens if Godzilla stomps one of the servers you're waiting on?


Yes, you are right.

Lol, you get screwed. That is the danger of globally consistent systems, unless you have slaves of the master that got stomped that can do leader election to survive the failure.


The commit messages on that repo have me worried!


Haha, sorry about that. I'm in the middle of cleaning up the directory structure to make things more coherent.

Try out some of the examples and the readme tutorial and let me know your thoughts. It is MVP status, and I'm trying to add more docs and get things ready to be stable.


Well, the intro was promising and full of energy - like "I am really going to transmit some good love for databases to you and explain why it's so stimulating to deal with them!" - but then, right after this sparkling start, it's all just about the same old redundancy, consistency, scaling...



