Over the last couple of years, I've become a fan of in-process databases. Being able to use whatever data structures best fit your data, while avoiding the network and serialisation (and subsequent GC) overhead, tends to make for clean (and very fast) systems. Obviously it has drawbacks.
This is almost always not the best approach. Database software exists precisely to make developing software as painless as possible. Transactions (even at the single-write level), foreign keys, indices, and constraints are all there to ensure that, as far as possible, your database cannot end up in a broken state.
Writing a database yourself means immediately reinventing the wheel.
In terms of the advantages you mention:
- Data structures - this is a niche advantage. There are probably some exceptions, but every data structure I've seen maps nicely to a serial form - one that doesn't is probably a bad smell (though that may just be because I haven't seen any decent examples).
- Avoiding the network - you can already do this with (say) SQLite, which is probably ideal for what you're describing (see the sketch after this list).
- Serialization - bad luck: if you want to persist your data, you still have to serialize it.
- GC - with a modern runtime that does generational garbage collection, this is as good as free.
- Clean - usually at the expense of correctness. Given enough data and time, you'll start seeing interesting bugs. I agree that writing raw SQL is frustrating, but there are usually libraries for the standard tasks, whether that's configuration or an ORM.
- Fast - you sound like you're at a data scale where the difference in speed doesn't matter - and even if you did have data on the order of hundreds of millions of rows, you'd likely only see a very small perf difference (assuming sane batching, etc.).
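To make the SQLite point concrete, here's a minimal sketch in Node/TypeScript using the better-sqlite3 package (the schema and queries are invented for illustration): the database lives in your process, so there's no network hop and no wire protocol, and SQLite handles persistence for you.

```ts
import Database from 'better-sqlite3';

// In-process database: no network round trip, no wire serialization.
// SQLite persists to disk on its own.
const db = new Database('app.db');
db.pragma('journal_mode = WAL'); // readers don't block the writer

db.exec(`
  CREATE TABLE IF NOT EXISTS users (
    id    INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE
  )
`);

db.prepare('INSERT OR IGNORE INTO users (email) VALUES (?)')
  .run('alice@example.com');

// Point reads through a prepared statement are typically microseconds.
const byId = db.prepare('SELECT id, email FROM users WHERE id = ?');
console.log(byId.get(1)); // { id: 1, email: 'alice@example.com' }
```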
Writing your own database is probably fine if no existing database meets your needs (whether that's price, performance, storage capacity, etc.), but that's rarely true outside the most demanding applications.
I think you make good points. I wanted to clarify how I used in-process DBs because it clearly wasn't obvious from my post.
I still rely on a "normal" database as the authoritative source of all data, and all writes go to this DB. But changes to the data are queued; workers pick up these changes and update the in-process DB, which handles 90% of read requests (and reads, for most apps, are probably > 90% of their traffic).
In some ways, it's a cache. But it doesn't have misses (all the data is assumed to be there), it's kept relatively consistent (changes normally propagate in < 2s), and you have access to much more than just a GET.
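For concreteness, here's a minimal sketch of that setup in TypeScript. Everything here (ChangeEvent, pollChanges, readStore) is a hypothetical stand-in, not from my actual code: all writes go to the authoritative DB, and a worker drains the change queue into an in-process Map that serves the read path.

```ts
type User = { id: number; name: string };

type ChangeEvent =
  | { kind: 'upsert'; row: User }
  | { kind: 'delete'; id: number };

// The in-process read store. No misses: every row is assumed mirrored here.
const readStore = new Map<number, User>();

// pollChanges is a stand-in for however changes get queued
// (an outbox table, LISTEN/NOTIFY, Kafka, etc.).
async function worker(pollChanges: () => Promise<ChangeEvent[]>): Promise<never> {
  for (;;) {
    const events = await pollChanges();
    for (const ev of events) {
      if (ev.kind === 'upsert') readStore.set(ev.row.id, ev.row);
      else readStore.delete(ev.id);
    }
    // Idle backoff keeps propagation latency bounded (well under 2s).
    if (events.length === 0) await new Promise((r) => setTimeout(r, 200));
  }
}

// The read path is a plain data-structure lookup: no network, no parsing.
const getUser = (id: number): User | undefined => readStore.get(id);
```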
Also, about performance: this lets us handle read requests with sub-1ms response times vs., say, 50ms (or 200ms if we're talking about most ORMs). That can have a huge impact on (a) usability (especially for latency-sensitive features like autocomplete) and (b) cost (the difference between $200/month and $10K/month). You also don't need to put ALL your data in this - just the most frequently accessed.
> Transactions (even at the single-write level), foreign keys, indices, and constraints are all there to ensure that, as far as possible, your database cannot end up in a broken state.
But you can have all that in process. Consider the most obvious case of just inverting things: writing your app directly in PostgreSQL.
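To sketch the in-process half of that claim (an in-memory SQLite database via better-sqlite3; the schema is made up): transactions, foreign keys, CHECK constraints, and indices all work with no server involved.

```ts
import Database from 'better-sqlite3';

const db = new Database(':memory:');
db.pragma('foreign_keys = ON'); // enforce FKs (off by default in SQLite)

db.exec(`
  CREATE TABLE accounts (
    id      INTEGER PRIMARY KEY,
    balance INTEGER NOT NULL CHECK (balance >= 0)
  );
  CREATE TABLE transfers (
    id  INTEGER PRIMARY KEY,
    src INTEGER NOT NULL REFERENCES accounts(id),
    dst INTEGER NOT NULL REFERENCES accounts(id),
    amt INTEGER NOT NULL
  );
  CREATE INDEX transfers_src ON transfers(src);
`);

// A real transaction: if any statement fails (e.g. the CHECK constraint
// on an overdraft), the whole thing rolls back.
const move = db.transaction((src: number, dst: number, amt: number) => {
  db.prepare('UPDATE accounts SET balance = balance - ? WHERE id = ?').run(amt, src);
  db.prepare('UPDATE accounts SET balance = balance + ? WHERE id = ?').run(amt, dst);
  db.prepare('INSERT INTO transfers (src, dst, amt) VALUES (?, ?, ?)').run(src, dst, amt);
});
```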
I agree. Using rich data structures that you can manipulate instantly is awesome. And usually your language will have superior query and manipulation abilities, without the mental gymnastics of adapting your algorithm to the underlying data store's query interface.
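As a toy example (Order and topSpenders are invented names), a GROUP BY / ORDER BY / LIMIT query becomes plain Map and array code:

```ts
type Order = { userId: number; total: number };

// Equivalent to: SELECT user_id, SUM(total) FROM orders
//                GROUP BY user_id ORDER BY SUM(total) DESC LIMIT n
function topSpenders(orders: Order[], n: number): [number, number][] {
  const totals = new Map<number, number>();
  for (const o of orders) {
    totals.set(o.userId, (totals.get(o.userId) ?? 0) + o.total);
  }
  return [...totals.entries()].sort((a, b) => b[1] - a[1]).slice(0, n);
}
```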
I'm looking for a good Node library that handles serializing/restoring my program state. Have you seen any of note? It seems like an easy task, but I'm hesitant to deploy anything substantial without much experience with long-term operation.
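For scale, the naive version I'd otherwise hand-roll is just snapshot-and-atomic-rename (names are illustrative, and JSON.stringify obviously won't cover Maps, Dates, cycles, etc.):

```ts
import { promises as fs } from 'fs';

// Write to a temp file, then rename. Rename is atomic on most
// filesystems, so a crash mid-write never corrupts the last good snapshot.
async function saveState(path: string, state: unknown): Promise<void> {
  const tmp = `${path}.tmp`;
  await fs.writeFile(tmp, JSON.stringify(state));
  await fs.rename(tmp, path);
}

async function loadState<T>(path: string, fallback: T): Promise<T> {
  try {
    return JSON.parse(await fs.readFile(path, 'utf8')) as T;
  } catch {
    return fallback; // first run, or no snapshot yet
  }
}
```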