http://janusgraph.org/ Support for various storage backends: - Apache Cassandra®...

rajman187 · on Nov 29, 2017

There are two main paradigms here

1) "Native" graph DBs. Neo4j is an example of this. It takes advantage of index-free adjacency: each node knows which other nodes it is connected to, so traversals are very fast. The issues appear when you try to scale. Data that fits on a single machine is fine, and you can replicate your data for fast parallel reads/traversals across disparate regions of a massive graph. However, you no longer have the concept of sharding and distributing the graph, as index-free adjacency doesn't translate across physical machines. Another drawback is highly connected vertices: you will expend a tremendous amount of resources deleting or mutating a vertex with, say, 10^6 edges. But that vertex is probably a bot, so you should delete him anyway.

2) Inverted index graphs, non-native graphs, whatever anti-marketing name it might have. These rely on tables of vertices and other tables of edges. Indexes make them fast: not as fast for reads, but very fast for writes. And you get distributed databases (Cassandra, for example, a powerful workhorse of a backend with data sharding, a replication factor, etc.). But then you have yet another index to maintain, and the overhead can get expensive. This is the model adopted by DataStax, who acquired Aurelius, the company behind Titan (hence the public fork to JanusGraph), and integrated it with some optimisations and enterprise tools (monitoring etc., Solr search engine) to sit on top of Cassandra.
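To make the contrast between the two paradigms concrete, here is a minimal in-memory sketch in Python. It is not how Neo4j or JanusGraph are implemented; `Node`, `connect`, and `TableGraph` are hypothetical names invented for illustration. The first style follows direct references between node objects, while the second stores vertices and edges in separate tables and needs an extra index (which must be maintained on every write) to find a vertex's edges.

```python
from collections import defaultdict


# Style 1: index-free adjacency (illustrative only).
# Each node holds direct references to its neighbours, so a traversal
# step is a pointer dereference with no index lookup.
class Node:
    def __init__(self, name):
        self.name = name
        self.neighbors = []  # direct object references


def connect(a, b):
    """Add a directed edge a -> b by storing a reference to b on a."""
    a.neighbors.append(b)


# Style 2: vertex table + edge table + index (illustrative only).
# Edges are plain rows; an index maps a source vertex to its edge rows.
class TableGraph:
    def __init__(self):
        self.vertices = {}                  # vertex id -> properties
        self.edges = []                     # rows of (src, dst)
        self.out_index = defaultdict(list)  # src -> positions in self.edges

    def add_vertex(self, vid, **props):
        self.vertices[vid] = props

    def add_edge(self, src, dst):
        # A write appends a row and updates the index: the extra
        # structure that has to be kept in sync.
        self.edges.append((src, dst))
        self.out_index[src].append(len(self.edges) - 1)

    def neighbors(self, vid):
        # A read goes through the index first, then fetches the edge rows.
        return [self.edges[i][1] for i in self.out_index[vid]]
```

In the second style the tables and the index are ordinary keyed data, which is what makes them straightforward to shard across machines; the object references in the first style have no such obvious split.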

Both now have improved integration with things like Spark. Cypher is probably faster than TinkerPop Gremlin, especially with the Bolt serialisation introduced in recent versions of Neo4j.

So JanusGraph is a graph abstraction layer of the second type, and so it needs somewhere to save these relationships. It all comes down to use case (and marketing) to decide what works best for you.

Gulthor · on Nov 30, 2017

Recommended reads on the native vs non-native topic:

* https://www.datastax.com/dev/blog/a-letter-regarding-native-... (tldr; there is no such thing as a native graph database)

* https://neo4j.com/blog/note-native-graph-databases/ (tldr; native graph databases do exist)

Regarding Cypher vs Gremlin: serialization could be a factor, but what matters more, among other things, is efficient query optimization, algorithms, and the (physical) data model. Ultimately, databases are all reading from 1-dimensional spaces (RAM or disk), either randomly or (best) sequentially. If you can colocate vertices with their respective edges, you're fine: this is trivial for graphs with no edges or graphs that form a linear chain. If not, then things start to become fun, especially in a distributed setting. This will impact performance; the language, not so much.
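One way to picture the colocation point is a sorted, 1-dimensional key space in which a vertex's edges sort right next to the vertex record, so reading a vertex with its edges is one sequential scan. The sketch below is an assumption-laden toy, not any database's actual layout: `SortedStore`, `scan_prefix`, and the `(vertex, kind, target)` key shape are all made up for illustration.

```python
import bisect


class SortedStore:
    """Toy ordered key/value store: keys are kept in one sorted list."""

    def __init__(self):
        self.keys = []    # sorted list of tuple keys (the 1-D space)
        self.values = {}  # key -> value

    def put(self, key, value):
        if key not in self.values:
            bisect.insort(self.keys, key)  # keep the key list sorted
        self.values[key] = value

    def scan_prefix(self, prefix):
        """Return all (key, value) pairs whose key starts with prefix.

        The matching keys are contiguous, so this is one seek followed
        by a sequential read.
        """
        start = bisect.bisect_left(self.keys, prefix)
        out = []
        for key in self.keys[start:]:
            if key[:len(prefix)] != prefix:
                break
            out.append((key, self.values[key]))
        return out


store = SortedStore()
# Hypothetical key shape: (vertex id, kind, target).
# kind "" holds the vertex record, kind "e" holds an outgoing edge.
store.put(("bob", "e", "carol"), {})
store.put(("alice", "e", "carol"), {})
store.put(("alice", "", ""), {"age": 30})
store.put(("alice", "e", "bob"), {})
store.put(("bob", "", ""), {"age": 41})

alice_and_edges = store.scan_prefix(("alice",))
```

This only colocates outgoing edges with their source vertex. The target vertices still live elsewhere in the key space (or on another machine), which is where the "fun" in the comment above starts.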

rajman187 · on Nov 30, 2017

I'm familiar with Marko and his arguments hence my quotes around "native" ;) But it sounds fantastic for marketing

talove · on Nov 29, 2017

Here's one way to look at it: A graph database can be reflected with common, well-understood data structures. You can use a lot of backends to represent those data structures.

Graph database projects are often just an adapter for doing graph queries on top of another store.

At its core, a graph database can be represented simply with documents and adjacency lists: https://en.wikipedia.org/wiki/Adjacency_list
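As a rough illustration of the "documents plus adjacency lists" idea, the sketch below keeps vertex documents in one dict and an adjacency list in another, then runs a breadth-first traversal over them. The sample data and the `bfs` helper are hypothetical, and any key/value or document store could stand in for the two dicts.

```python
from collections import deque

# Vertex "documents": id -> arbitrary properties (sample data, made up).
documents = {
    "a": {"label": "person"},
    "b": {"label": "person"},
    "c": {"label": "company"},
    "d": {"label": "city"},
}

# Adjacency list: id -> ids of directly connected vertices.
adjacency = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": [],
}


def bfs(adjacency, start):
    """Return vertex ids in breadth-first order starting from start."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        vid = queue.popleft()
        order.append(vid)
        for nxt in adjacency.get(vid, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


# Join traversal results back to the documents.
reachable = [(vid, documents[vid]["label"]) for vid in bfs(adjacency, "a")]
```

The graph query here is just a loop over lookups, which is why so many different backends can serve as the store underneath.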

chiefalchemist · on Nov 29, 2017

Do you think it's fair to say traditional DBs are about data, but graph DBs are (more) about relationships (between the data)?

manigandham · on Nov 29, 2017

JanusGraph provides a graph data model on top of an existing storage layer. In this case it's using wide-column key/value systems. It works well, letting each layer do what it's good at while limiting the number of separate systems that need to be maintained.

riku_iki · on Nov 29, 2017

> What exactly does a graph database actually do if it doesn't manage the data fed to it?

It provides tools to run complex queries on graphs, and manages data models and indices to execute them fast.

barakm · on Nov 29, 2017

As per the other commenter, Cayley provides a graph data model on top of an existing storage layer.

However, when we get to manage the storage layer (Bolt, Level -- that's being generalized into local-KVs in the next release) we get to build our own indexes for better data management and performance. But there's no reason we can't hand that job off either -- hence supporting multiple (remote) backends. For the local stuff, though, at some point, Bolt is just a very good BTree implementation.