
> The performance of a graph traversal in all current implementations will be at best equivalent to a well-crafted join, and at worst much slower.

A join must be implemented with a loop or scan to find the relevant target rows by foreign key. This is why foreign keys pretty much have to be indexed: without an index, the scan has no hope of acceptable performance. So remember, a join in a relational database generally means a join PLUS an index lookup.
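To make that concrete, here is a minimal Python sketch (hypothetical tables, not any real engine's internals) of an index nested-loop join. The index is what turns each probe into a cheap lookup instead of a full scan:

    # Hypothetical tables; the dict below plays the role of a foreign-key index.
    orders = [
        {"id": 1, "customer_id": 10, "total": 50},
        {"id": 2, "customer_id": 11, "total": 75},
        {"id": 3, "customer_id": 10, "total": 20},
    ]
    customers = [{"id": 10, "name": "Ada"}, {"id": 11, "name": "Grace"}]

    # Built once, maintained on every write -- this is the index's ongoing cost.
    fk_index = {}
    for pos, row in enumerate(orders):
        fk_index.setdefault(row["customer_id"], []).append(pos)

    # The join itself: each probe is an index lookup, not a scan of `orders`.
    for c in customers:
        for pos in fk_index.get(c["id"], []):
            print(c["name"], orders[pos]["total"])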

If it's not clear, indexes are not free: they have a storage cost and impose a cost on every write to the table. They also make lookups very fast, but still not free. On a large table, you may be incurring the cost of 20-30 comparisons to find the desired rows. That is not a made-up number: a B-tree index reduces a lookup to O(log_2(n)) comparisons, so you need about 20 comparisons to find a record in a table with 1,000,000 rows, and about 27 in a table with 100,000,000 rows.
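A quick back-of-the-envelope check of those numbers (assuming one binary comparison per step, i.e. ceil(log2(n)) comparisons in total):

    import math

    for n in (1_000_000, 100_000_000):
        print(f"{n:>11,} rows -> ~{math.ceil(math.log2(n))} comparisons")
    # output:
    #   1,000,000 rows -> ~20 comparisons
    # 100,000,000 rows -> ~27 comparisons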

Graph databases (at least Neo4j) always talk about index-free adjacency, and now you hopefully understand what that means. The edge between two nodes in a graph database stores the destination node directly, with no need for an index to find it; the traversal is index-free. This can be a tremendous advantage, and that's also not made up: you can subtract the cost of the index lookups I gave earlier.
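A toy illustration of the difference (this is illustrative Python, not Neo4j's actual storage format):

    # With indexed adjacency, edges store IDs, so every hop pays for a lookup:
    nodes_by_id = {"a": {"edges": ["b"]}, "b": {"edges": []}}
    dest_id = nodes_by_id["a"]["edges"][0]   # the edge yields an ID...
    neighbor = nodes_by_id[dest_id]          # ...which costs an index probe

    # With index-free adjacency, edges hold direct references to the
    # destination record, so a hop is a dereference with no lookup at all:
    class Node:
        def __init__(self, name):
            self.name = name
            self.neighbors = []              # direct references to Node objects

    a, b = Node("a"), Node("b")
    a.neighbors.append(b)
    print(a.neighbors[0].name)               # "b", reached without any index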




These methods only work for very small graphs, though. Secondary indexing structures are inefficient and don't scale regardless of the implementation details, which is why even non-graph systems avoid them at large scales on single machines. The scale threshold where indexes lose to the alternatives is low. Direct representations without indexes (like your Neo4j example) are even worse, because very poor locality introduces latency at every step. There is no version of these methods that works well; otherwise this would have been a solved problem 30 years ago.

Latency is the performance killer for graph traversals, so locality is everything, and preserving locality across recursive traversals is the critical computer-science trick that is usually missing. The computational cost of a comparison is irrelevant next to the memory latency of each traversal step.
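For what it's worth, one standard locality-preserving layout (a general technique, not specific to any product) is compressed sparse row: all neighbor lists live in one contiguous array, so expanding a node is a sequential read rather than a chain of scattered dereferences. A sketch:

    from collections import deque

    # CSR layout: node i's neighbors are targets[offsets[i]:offsets[i+1]].
    offsets = [0, 2, 3, 3, 4]
    targets = [1, 2, 3, 1]     # edges: 0->1, 0->2, 1->3, 3->1

    def neighbors(node):
        return targets[offsets[node]:offsets[node + 1]]

    def bfs(start):
        seen, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors(node):    # contiguous scan per node
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    print(bfs(0))   # {0, 1, 2, 3}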

The way most "pure" graph systems implement graph traversals is unusably slow outside of trivial cases because they mitigate latency poorly. In practical graph systems, common traversals are typically materialized ahead of time so that a graph traversal operation is not required to satisfy queries, which defeats the point in many cases.
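A sketch of that materialization idea (hypothetical data; a real system would have to maintain this incrementally on every write):

    friends = {
        "ann": {"bob", "cal"},
        "bob": {"ann", "dee"},
        "cal": {"ann"},
        "dee": {"bob"},
    }

    # Precompute friends-of-friends once, so the query below is a single
    # lookup instead of a two-hop traversal at read time.
    friends_of_friends = {
        person: set().union(*(friends[f] for f in fs)) - {person} - fs
        for person, fs in friends.items()
    }

    print(friends_of_friends["ann"])   # {'dee'}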


Very well summarized, particularly the point about locality.

Regarding the intractability of large graph problems, here’s an example of a scalable solution by Facebook for representing the social graph that delivers excellent performance (measured in response times, queries per second, etc.): https://www.usenix.org/system/files/conference/atc13/atc13-b...



