On Building A Stupidly Fast Graph Database (directededge.com)
159 points by wheels on Feb 27, 2009 | 43 comments



I wrote this article pretty much for Hacker News, since previous articles that made it to the home page prompted questions about what exactly our graph database is.


My first question whenever I read about a new database system: Does it scale horizontally, by throwing more machines at it?

That's the one basic requirement for use in a website backend these days.


That's the one basic requirement for use in a website backend these days.

There are plenty of quite profitable websites which do not have this requirement. It is almost peculiar to sites which are trying to show display advertising to groups of users larger than many nation-states.

You can make an awful lot of money with one commodity server if your business model supports it. I used to have an effective CPM of $80 and I know one which has in excess of $500. No, that is not a typo. (That is on six digits of pageviews per month.)

You know how much scaling you need when you're essentially getting 50 cents a pageview? Not much at all.

Fog Creek has, if I recall correctly, one database server. I haven't read how many total machines they're using recently, but it's a "count on your fingers and toes" number rather than a "zomg we need a colo facility to ourselves" number.


Well, a business model that pays 50 cents a page view sounds nice for sure - I fear most of us don't share that luxury.

I figured that most sites that would be interested in such a database fall either into the retail (recommendation) or the community (social graph) category. Both operate mostly on volume, and the last thing you want is a hard bottleneck just when you're on the verge of becoming successful.

But well, if your business model doesn't require scalability on the web tier then yes, these concerns of course don't apply.


I don't believe you'd put all of your information in this kind of custom db. The author remarks that they are running with memory-mapped disk back-end, which means a single machine should be able to take you pretty darn far.


At this point no, but that's a hurdle we'll cross when we get there. In theory it would be fairly easy to split things using one machine to do the hashing from name to indexes and then segmenting items across multiple machines.


No it isn't; it's a nice-to-have, but in no way is it a requirement. Most websites still run on standard RDBMSes that don't scale well across many machines, and frankly, most websites won't ever need to scale beyond one big beefy box. Stop drinking the Kool-Aid; it's killing your perspective.


Actually, it's well understood how to scale an RDBMS horizontally (sharding and partitioning). I simply asked whether there is a comparable strategy for this graph DB.

Also, most websites probably don't need a graph database. But the few that do will likely also need the scalability - at least I cannot imagine many interesting web applications where your one beefy box could possibly scale to a significant userbase (unless you're talking the zSeries category of beefy).

Of course there are still many interesting applications outside the public interweb.


Assuming a simple system, a single box could be averaging around 5k requests per second. That's 18 million pages per hour, and just under half a billion pages per day. That's around 2.5 pages per month to every person on the planet; but let's say you get 100 million people -- that's 150 pages per month each. Now, if you think you are going to get to that point in the next year then it's an issue, but IMO 99% of people are far from that point.

Granted, the real question becomes storing data, not handling that number of requests, but a database that knows where a bunch of dumb files are scales really well. (If you look into things, this is Facebook's basic approach.)
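To make that concrete, here's a minimal sketch of the idea (hypothetical names, not Facebook's actual design): keep a small index that maps each key to where its blob lives in a big dumb file, and let any box serve the files themselves.

    #include <cstdint>
    #include <string>
    #include <unordered_map>

    // Hypothetical: the only "smart" part is this small index; the data itself
    // sits in big dumb append-only files that any number of boxes can serve.
    struct BlobLocation
    {
        std::string file; // which data file holds the blob
        uint64_t offset;  // byte offset within that file
        uint32_t length;  // size of the blob in bytes
    };

    class BlobIndex
    {
    public:
        void add(const std::string &key, const BlobLocation &location)
        {
            m_index[key] = location;
        }

        // Fills in 'location' and returns true if the key is known.
        bool lookup(const std::string &key, BlobLocation *location) const
        {
            auto it = m_index.find(key);
            if(it == m_index.end())
                return false;
            *location = it->second;
            return true;
        }

    private:
        std::unordered_map<std::string, BlobLocation> m_index;
    };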


I wonder where you're getting those figures from. The article stated about 100 queries/sec on a MacBook without any write load (if I interpreted that correctly). If 5k/sec is sustainable on a single box, including writes, then yes, that will probably go a long way.

I'd venture the guess that you'd be talking quite a different budget than a bunch of pizza boxes in a horizontal setup, though. The SAN to handle 5k IOPS alone will set you back by an interesting amount (even more so when you consider mirroring, which you'd probably want to have at that scale). I'd also be worried about the network - GBit/s is probably not going to cut it at that rate anymore.

So, all in all this is precisely why I asked about horizontal scalability. A setup of 5 machines that handle 1000 reqs/sec each is usually cheaper than a single machine to handle all of the 5000/sec.


If his MacBook can handle the entire database, I don't see the need for a SAN (yet). I don't know how much benefit there would be to increasing his system's RAM, but if the database fits on a laptop's HDD, then you can probably get most of it into RAM, which would make things insanely faster. My guess is that upgrading to a $10,000 box with 64GB of RAM and a RAID 1+0 of SSDs could probably get a 50x increase in speed, which would be ~5k operations per second. Granted, he might develop issues with network bandwidth or some other bottleneck, but even just averaging 1k/second represents huge revenue potential relative to the cost of that system. And a back-of-the-envelope calculation should give him a rough estimate of its value.

PS: Upgrading to 10Gb Ethernet is not really that expensive nowadays if he is only linking a few web servers to two databases.

EDIT: To give you some idea what flash can do: http://advancedstorage.micronblogs.com/2008/11/iops-like-you... (Granted, it's a stupid video, but 150,000 read IOPS and 80,000 write IOPS and 800MB/second of bandwidth on two PCIe cards in '09/'10, with Fusion-io doing the same type of thing today).


If his MacBook can handle the entire database, I don't see the need for a SAN (yet).

The SAN comes into play when a single box can't deliver the IOPS anymore - remember, it's not just a matter of adding SSDs. At those rates you start touching controller and bus limits. Likewise, a saturated 10Gb Ethernet link causes a significant interrupt rate (older cards would bottleneck on a single core) that often exposes interesting corner-cases in your OS and hardware of choice.

I'm not saying it's not doable and I know what SSDs are capable of (we just fitted a server with X25s). I'm just saying that your estimate of $10,000 is very optimistic; add a zero and you'll be closer to home. That's because I still think you'd definitely be talking an xfire 4600-class machine and a SAN.

Anyways, this is all speculation. Wheels made some reasonable statements that they have it on their radar, and I'm definitely looking forward to some real-world benchmarks with a concurrent write-load.


So, assuming that by "scaling" in the web space you mean "be able to quickly respond to as many reads and writes as my app generates" and not "handle large volumes of data", then you can do what LinkedIn does, which is to just copy the exact same graph to multiple machines and rely on the fact that eventual consistency is good enough. You make things immediate for the user most concerned with immediate feedback (make them sticky to the graph their write updated), and for everybody else a couple minutes of lag is no biggie. This is even more true in retail and other applications.
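A rough sketch of that routing policy (made-up names, not LinkedIn's actual code): pin a user who just wrote to the replica that took the write for a little while, and let everyone else read from whichever replica is convenient.

    #include <cstddef>
    #include <ctime>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Hypothetical replica router: reads normally go to any replica, but a user
    // who just wrote stays "stuck" to the replica that accepted the write until
    // the others have had time to catch up.
    class ReplicaRouter
    {
    public:
        ReplicaRouter(const std::vector<std::string> &replicas, int stickySeconds = 120)
            : m_replicas(replicas), m_stickySeconds(stickySeconds), m_next(0) {}

        // Record that 'user' just wrote through the replica at 'replicaIndex'.
        void noteWrite(const std::string &user, std::size_t replicaIndex)
        {
            StickyEntry entry = { replicaIndex, std::time(0) };
            m_sticky[user] = entry;
        }

        // Pick a replica for a read: recent writers are pinned to the replica
        // that took their write; everyone else round-robins and tolerates a
        // couple of minutes of propagation lag.
        const std::string &replicaForRead(const std::string &user)
        {
            auto it = m_sticky.find(user);
            if(it != m_sticky.end() && std::time(0) - it->second.when < m_stickySeconds)
                return m_replicas[it->second.replica];

            m_next = (m_next + 1) % m_replicas.size();
            return m_replicas[m_next];
        }

    private:
        struct StickyEntry
        {
            std::size_t replica;
            std::time_t when;
        };

        std::vector<std::string> m_replicas;
        int m_stickySeconds;
        std::size_t m_next;
        std::unordered_map<std::string, StickyEntry> m_sticky;
    };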


Scaling in the web space usually means both; many reads/writes and large volumes of data.

Yes, mirroring may work to a point but falls down eventually in write-heavy applications. Ideally you want something that you can just add machines to and it will scale near-linearly. I'm not sure if that's entirely achievable for a graph search, but that's where my question was heading.


Yes, mirroring as I described will fall down when the throughput of writes into the network exceeds the ability of a single machine to just write, because you will never have the opportunity to "catch up."

However, that is a lot of data to be writing into your graph, especially with how crazy fast writable media and storage is getting these days. YAGNI.

Now, on the theoretical side of things.. "How would you create a linearly scalable graph database across machines?" I don't know how I would make it so that I could maintain the same kinds of speeds for interesting graph traversals.


You of course can do it, since all modern web search is graph-based. The real question, and one I don't have an answer to, is at what point the additional performance of multiple nodes begins to trump the induced network latencies.

Splitting things is the relatively easy part -- it's building the consistency model for multi-node systems that's tricky.

For read-write partitioning it's pretty simple -- each item is largely independent; it has its columns of data and is handled by an index, so once you're just reading / writing / updating items, it's no problem to do the hashing from key to index then using that index to locate the appropriate node.
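(As a generic illustration only, not the actual Directed Edge code: a stable hash of the item's name picks the node that owns the item, and that node's local index takes over from there.)

    #include <cstddef>
    #include <functional>
    #include <string>

    // Hypothetical shard lookup: every item name deterministically maps to one
    // node, so reads, writes and updates of a single item never span machines.
    // (A consistent-hashing ring would avoid reshuffling everything when a node
    // is added, but the idea is the same.)
    inline std::size_t nodeForItem(const std::string &name, std::size_t nodeCount)
    {
        return std::hash<std::string>()(name) % nodeCount;
    }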

The devil is of course in the details. If we see that barrier approaching we'll plan ahead for scaling out this way.

However, just doing some quick calculations, it looks like the English Wikipedia gets 5.4 billion page views per month, which translates to about 2100 per second. On my MacBook I get an average query time on our profile dataset for Wikipedia's graph of 2.5 ms per query -- meaning 400 requests per second. Extrapolating from there, scaling that up by 6x on a hefty server doesn't seem unreasonable, and that ignores the fact that we could go further by caching the results (since most of them would be duplicate requests) to push that number up even higher.

So, yeah, it's an issue that's in the back of our heads, but not one we're currently dreading.


On my MacBook I get an average query time on our profile dataset for Wikipedia's graph of 2.5 ms per query

Is the dataset changing (being written to) while you make those queries?


No, not in our profile set, but because of our locking system (where readers don't block writes and writes don't block reads) I believe it should hold up well under moderate write conditions (in the case of recommendations applications, we assume that writes are infrequent relative to reads).

Still an untested assumption, but the system is architected to hold up well in those situations and I think that it will scale reasonably there.
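(For illustration only -- this is not the Directed Edge locking code -- one common way to get the "readers don't block writers" property is to publish immutable snapshots and atomically swap a pointer:)

    #include <map>
    #include <memory>
    #include <string>
    #include <vector>

    // Toy snapshot store: readers grab a shared_ptr to the current immutable
    // snapshot without taking a lock; a single writer builds a new snapshot and
    // atomically publishes it. Old snapshots disappear when the last reader
    // drops its reference. A real database would copy at a much finer
    // granularity than the whole graph, but the locking property is the same.
    class GraphSnapshots
    {
    public:
        typedef std::map<std::string, std::vector<std::string> > Graph;

        GraphSnapshots() : m_current(std::make_shared<Graph>()) {}

        std::shared_ptr<const Graph> read() const
        {
            return std::atomic_load(&m_current); // readers never block
        }

        void addEdge(const std::string &from, const std::string &to)
        {
            // Copy, modify, publish: readers keep seeing the old snapshot
            // until the atomic store below. (Writers are assumed to be
            // serialized elsewhere.)
            std::shared_ptr<Graph> next = std::make_shared<Graph>(*std::atomic_load(&m_current));
            (*next)[from].push_back(to);
            std::atomic_store(&m_current, std::shared_ptr<const Graph>(next));
        }

    private:
        std::shared_ptr<const Graph> m_current;
    };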


What are the odds that you'll become bigger than plentyoffish without having an influx of money that will let you rebuild?


Thanks for taking the effort. Are tags implemented as just references from a tag "foo" node to the tagged item?


Yes and no. This is the "small lie" mentioned in there, where I said that there's only one sort of row. The StringListColumn maintains a separate row of string values and each reference to a tag just gets an index to the tag, but tags are not "first-class" in that they're not a separate node in the database.

The first time that I implemented a system like this, back in 2004, I did things that way. That's more flexible in theory, but since we had a specific class of applications in mind in this case, for our uses it's faster to check if an item has a given tag just by having a list of tags associated with each item. The typical access pattern for us means that we're already looking at an item and just want to know if it has a given tag.
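A toy version of that layout (a generic illustration, not the actual on-disk StringListColumn): one shared table of tag strings, and per item just a short list of indices into it, so the "does this item have tag X?" check is a scan over a few integers for an item we're already looking at.

    #include <algorithm>
    #include <cstdint>
    #include <string>
    #include <vector>

    // Hypothetical: tags are interned once in a shared string table.
    class TagTable
    {
    public:
        uint32_t internTag(const std::string &tag)
        {
            for(uint32_t i = 0; i < m_tags.size(); i++)
                if(m_tags[i] == tag)
                    return i;
            m_tags.push_back(tag);
            return uint32_t(m_tags.size() - 1);
        }

        const std::string &tagName(uint32_t index) const
        {
            return m_tags[index];
        }

    private:
        std::vector<std::string> m_tags;
    };

    // Each item carries only the indices of its tags, not separate tag nodes.
    struct Item
    {
        std::vector<uint32_t> tags;

        bool hasTag(uint32_t tagIndex) const
        {
            return std::find(tags.begin(), tags.end(), tagIndex) != tags.end();
        }
    };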


Thanks for the detailed answer. I've a start-up idea that needs a big DAG offline for the production of a smaller (in bytes, not nodes) one used online. My intention was to use Berkeley DB. It's good to see you say it's the fastest of the off-the-shelf options, but it only reaches 5% of your own code's performance! Any thoughts on what BDB is doing "wrong"? Were you using hash or B-tree with BDB?


We used hashes with BDB. I didn't dig down deep to see what was going on since we weren't really considering using BDB because of its licensing -- GPL, and applications link directly to it, unlike, say, MySQL, and while we're currently only offering access as a web-service, we'd like to have the option open to licensing the recommendations engine in other ways down the line. So mostly we were trying it out to have another data-point to see how our implementation stacked up.
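For anyone wanting to reproduce the comparison, opening a hash-backed store with the standard Berkeley DB C API only takes a few lines (DB_BTREE is the drop-in alternative to test against); a minimal example, not tied to anything Directed Edge specific:

    #include <cstring>
    #include <db.h> // Berkeley DB C API

    int main()
    {
        DB *db = 0;
        if(db_create(&db, 0, 0) != 0)
            return 1;

        // DB_HASH here; swap in DB_BTREE to compare the two access methods.
        if(db->open(db, 0, "graph.db", 0, DB_HASH, DB_CREATE, 0664) != 0)
            return 1;

        DBT key, value;
        std::memset(&key, 0, sizeof(key));
        std::memset(&value, 0, sizeof(value));

        key.data = (void *) "node:42";
        key.size = 7;
        value.data = (void *) "a,b,c";
        value.size = 5;

        db->put(db, 0, &key, &value, 0); // store an edge list under a node key
        db->get(db, 0, &key, &value, 0); // and read it back

        db->close(db, 0);
        return 0;
    }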


Interesting. Looks like Berkeley DB will be good enough for me to prove the concept then, and if, or seemingly when, it becomes the bottleneck I'll know it's possible to improve on it. Thanks again; it's great having first-hand access to those that have done it instead of just theorising about it, like me. :-)


This kind of storage should work very well with SSDs instead of disks. Is this the case?


I'm curious about this too, but the gap between now and being able to order SSDs on non-co-lo boxes means that we're not thinking about that too much just yet.

The place that would be most relevant would be if we were considering bypassing the file system altogether and moving to doing raw-I/O on the disk itself and tried to account for disk geometry, which would be less useful with an SSD. But in practice that's not on the near term radar anyway.


This is great stuff - many thanks!


This is a great article, and I don't doubt you have a stupidly fast graph database, and I am jealous that you get to spend all day working on graph-theoretic problems. That said:

I'm not so sure of your policework on mmap() vs. read():

* The "extra copies" you make with read happen in the L1/L2 cache, making them comparable to register spills. Buffer copying just isn't expensive.

* (and here I start paraphrasing Matt Dillon) On the other hand, you are absolutely going to take extra page faults and blow your TLB if you address a huge file using mmap, which is not only going to slow down I/O but also hurt performance elsewhere.

It seems to me like you did mmap() so you could easily externalize vectors and maps. Which actually leads me to a second question:

Isn't the reason people use things like B-Trees that they are optimized to touch the disk a minimal number of times? Isn't that kind of not the case with a C++ container "ported" to the disk?


To be honest, that rationale was applied after the fact. One of the many backend iterations that I wrote prior to the mmap backend used read() and friends, and after porting the code to mmap-based I/O it was faster on both warm and cold disk buffers.

I was planning a big blog entry just on the backend options that we tried, since I tested several combinations of I/O backends with different numbers of reader threads on our profiling dataset and with different I/O elevator scheduling algorithms (switching the algorithms, disappointingly, had a negligible effect on performance, and I/O throughput degraded when increasing the number of reader threads to more than twice the number of active cores) -- but that kind of slipped into the background as we started filling out the bits of the database to give it an acceptable level of robustness.

The hashing scheme that we're using is optimized for keeping a tight memory profile -- and hence disk profile. Again, much of the rationale was applied after the fact to try to explain the results of profiling. At first we tried things with B-trees and with the combination of the VM's paging and our access patterns the hashes were faster. It's possible that if we were using the direct I/O APIs that many databases use and doing all of our own caching internally that we'd be able to achieve higher throughput with B-trees.

In our case, letting the OS handle our caching and keeping identical C++ structures to our disk structures simplified the code enough to merit leaving things this way for the time being. At the end of the day, our product is a recommendation engine, not a database, so we'd like to keep the codebase relatively lean.

So, yeah, a lot of the explanations are applied after the fact from what I know of systems programming, but the results were validated through actual test runs through multiple competing backend implementations. The one we described gave the best overall results.


> I'm not so sure of your policework on mmap() vs. read()

Scott's conclusions agree with my experiences very well: if you design around mmap(), and let the system handle the caching, you can end up with something several times faster than the traditional alternatives. This isn't to say that your criticisms are completely wrong, just that they don't match up with the actual testing.

* "extra copies" [are cheap]

True, but the real cost is the greater memory footprint. Less application buffering means more room for cached pages. And this cache is transparent across multiple processes.

* extra page faults

I think the opposite turns out to be true. Letting the system handle the buffering results in more cache hits, since the memory is used more efficiently.

* blow your TLB

Theoretically a problem, but in practice one doesn't linearly access the entire file. The beauty of mmap() is that it allows for brilliantly efficient non-sequential access.

* B-trees vs C++ containers

While it's true that you have to think closely about the memory layout of your containers, if you do so the access patterns can be even better than a B-Tree. If the container has been designed for efficient memory-access with regard to cache-lines and cache-sizes, it tends to have great disk-access as well.

What's really beautiful about the mmap() approach is the simplicity it offers. In this model, RAM can be viewed as a 16 Gig L4 cache, and disk as a multi-Terabyte L5. Just as one currently writes code that doesn't distinguish between a fetch from L1 and a fetch from main memory, mmap() allows extending this syntax all the way to a fetch from disk.

Now, this doesn't mean that one can just substitute mmap() for fread() and get any significant improvement. One needs to re-optimize the data structures as well. But the nice part is that these techniques are the same techniques used to optimize existing cache accesses, and certain 'cache-oblivious' algorithms already work out of the box.

Anyway, thanks to Scott for the writeup!


When you say "fread()", I wonder whether you're considering that fread() does stdio buffering in userland above and beyond the small window of memory you reuse on every read (and that is going to stay in-cache) when you use the read(2) syscall directly.


Two simple test cases:

http://pastie.org/402608 (read)

http://pastie.org/402607 (mmap)

Each opens a 10M file and accesses aligned pages. Depending on how many bytes in the page you ask the mmap() case to touch, mmap ranges from 10x faster to 10x slower for me. Reading straight through without seeking, it's no contest for me; read() wins. But you knew that.
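(The pastie links have since rotted; a rough reconstruction of the mmap() case as described -- open the file and touch a few bytes at the start of each aligned page -- might look like this, with details guessed:)

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstdio>

    int main(int argc, char **argv)
    {
        if(argc < 2)
            return 1;

        int fd = open(argv[1], O_RDONLY);
        if(fd < 0)
            return 1;

        struct stat st;
        fstat(fd, &st);

        char *base = (char *) mmap(0, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if(base == MAP_FAILED)
            return 1;

        long pageSize = sysconf(_SC_PAGESIZE);
        unsigned long sum = 0;

        // Touch a handful of bytes at the start of every page; varying how many
        // bytes per page get read is what swings the comparison either way.
        for(off_t offset = 0; offset + pageSize <= st.st_size; offset += pageSize)
            for(int i = 0; i < 16; i++)
                sum += base[offset + i];

        std::printf("%lu\n", sum);

        munmap(base, st.st_size);
        close(fd);
        return 0;
    }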


Thanks for encouraging me to look at this closer. I was testing with this: http://pastie.org/402890

I was having trouble comparing results, so I combined your two into one, tried to make the cases more parallel, took out the alarm() stuff, and just ran it under oprofile.

My conclusions were that for cases like this, where the file is small enough to remain in cache, there really isn't any difference between the performance of read() and mmap(). I didn't find any of the 10x differences you found; instead the mmap() version ranged from twice as fast for small chunks to about equal for full pages.

You might argue that I'm cheating a little bit, as I'm using memcpy() to extract from the mmap(). When I don't do this, the read() version often comes out up to 10% faster. But I'm doing it so that the code in the loop can be more similar --- I presume that a buf[] can optimize better.

I'd be interested to know how you constructed the case where read() was 10x faster than mmap(). This doesn't fit my mental model, and if it's straight up, I'd be interested in understanding what causes this. For example, even when I go to linear access, I only see read() being 5% faster.


I went back and forth on whether to use read() or fread() in my example, and I wasn't sure which to choose. For the purpose of this example, I don't think there is a functional difference between them.

In current Linux, I'm pretty sure both of them use the same underlying page cache. fread() adds a small amount of management overhead, but read() does just as much system level buffering. mmap() uses the same cache, but just gives direct access to it.

But it's possible I'm wrong, and I don't seem to be able to find a solid source for this online. This page references this, though: http://duartes.org/gustavo/blog/post/page-cache-the-affair-b... I feel like I've read other more explicit descriptions, although possibly offline.


stdio does its own buffering, which is why you have to turn output buffering off with setbuf() when you want to do debug prints. But I may be on crack in the read case, vs. the write.

I don't follow the rest of your caching arguments, though. read(2) exploits the buffer cache; in fact, the rap on mmap() is that it makes worse use of the buffer cache, because it doesn't provide the kernel with enough information to read ahead. Apocryphal, though.

The big issue is that the mmap() case is much more demanding on the VM system. You're thinking only of the buffer cache when you talk about caching, but the X86 is also at pains to cache the page directory hierarchy (that's what the TLB is doing). Hopping all over your process' address space rips up the TLB, which is expensive. There are also hardware cycle penalties for dicking with page table entries.


You don't really get into why mmap is an unpopular choice. It's not as if other programmers just forgot to read the man page. Traditional RDBMSs dislike the OS's buffer cache because the dbms has information that could better drive those algorithms; e.g., streaming data should not be cached, and should not compete with useful items in the cache. The page replacement algorithm is similarly blind; yeah, madvise exists, but it rarely has teeth. mmap is convenient, and performant enough. But if you found yourself driving hard to get the last 1% of performance out of this system, I would argue that you'd end up doing explicit file I/O and manual management of memory; e.g., the only way to use large pages to reduce TLB misses on popular OS'es is to use funky APIs like hugetlbfs on Linux.
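(For reference, the hugetlbfs route looks roughly like this on Linux: mount a hugetlbfs filesystem -- the /mnt/huge path below is an assumption -- and mmap() a file from it, so the mapping is backed by 2MB pages instead of 4KB ones. A sketch:)

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstddef>
    #include <cstdio>

    // Assumes something like: mount -t hugetlbfs none /mnt/huge
    // and that pages have been reserved via /proc/sys/vm/nr_hugepages.
    int main()
    {
        const size_t length = 16 * 2 * 1024 * 1024; // multiple of the 2MB huge page size

        int fd = open("/mnt/huge/scratch", O_CREAT | O_RDWR, 0644);
        if(fd < 0)
            return 1;

        void *base = mmap(0, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if(base == MAP_FAILED)
        {
            std::perror("mmap");
            return 1;
        }

        // Walking a large structure in this mapping costs far fewer TLB entries
        // than the same walk over ordinary 4KB pages.
        static_cast<char *>(base)[0] = 1;

        munmap(base, length);
        close(fd);
        unlink("/mnt/huge/scratch");
        return 0;
    }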

Also, a pet peeve: mmap != "memory-mapped I/O." The latter refers to a style of hardware/software interface where device registers are accessed via loads and stores, rather than magical instructions. If you're not writing a device driver, you don't know or care whether you're using "memory-mapped I/O". mmap is ... just mmap.


I'd be interested in knowing more about why it's unpopular. I'm a fan of mmap() because I like the way it can simplify my code, and so far I've been pleased with the speed as well. But if there are subtle downsides I'd love to be aware of them. My instinct was that mmap() isn't used much because it's relatively new, and because it's traditionally had poor support on Windows.

I'm primarily a Linux user, but the best discussion I was able to find with a quick search was this exchange on freebsd-questions from several years ago: http://lists.freebsd.org/pipermail/freebsd-questions/2004-Ju...

Do you know of any updated articles about its performance tradeoffs?


Excellent article. I was wondering if you could help me understand why Franz's AllegroGraph or Aduna's Sesame were not sufficient for your needs. Have you had the opportunity to perform any benchmarks against these graph DBs?


Tried Sesame; it was one of the graph DBs that I mentioned not being up to snuff. Also looked at Franz's DB, but based on the benchmarks they publish on their site (they've also imported a smaller Wikipedia dump) it looked like it's about 5x slower than ours.


Did you investigate any other network DBMSes? If so why did you find them inadequate?

http://en.wikipedia.org/wiki/Network_model


Nope, or at least none that called themselves such. We tried neo4j, which exploded trying to import data on the order of what we're working with, and a couple of RDF databases, which survived the import but were a couple of orders of magnitude off from the performance we were hoping for.

After writing some 8 different backends for our store class and none being within an order of magnitude of our own prototype for the sorts of applications we're doing, it seemed more fruitful to round out our own application rather than continuing the seemingly endless recurse of possible data backends which ranged from mildly to amazingly disappointing.

If you've got something specific that you've worked with in the past that you think would be worth our while to evaluate, I'd consider investing the time to try it out. But just that there exist more options that we could evaluate at the moment doesn't necessarily imply that it's reasonable to keep writing new backends, which sometimes take a non-trivial amount of effort.


I'm part of the Neo4j team and I'm puzzled about the import problem. I don't know about the size requirements you have but you mention 2.5M nodes and 60M edges and we run systems in production with a LOT more data (billions range). So it definitely shouldn't blow up. Maybe you ran into some bug in an older release or something else was wrong.

It's also important to note that Neo4j through the normal API is optimized for the most common use cases: reading data and transactional updates. Those operations are executed all the time during normal operation, whereas an import is typically done once at system bootstrap and then never again.

To ease migration, as part of our 1.0 release (June time frame) we will expose a new "batch injection" API that is faster for one-time imports of data sets. This is currently being developed. If you have feedback on how an API like that should behave, feel free to join the discussions on the list:

   http://neo4j.org
Cheers,

-EE


I'm assuming this is proprietary, so any other comments regarding RDF databases would be helpful. I've used ARC (arc.semsol.org) before, and it works adequately. Though I haven't run performance tests personally, ARC is based on PHP so it probably gets blown away by this C++ version.



