We've changed the title of this one back from "Graph databases 101".
Submitters: the HN guidelines ask you to "please use the original title unless it is misleading or linkbait". Note how that does not read "please change the title to make it more misleading and linkbait".
As I had zero experience with graph databases, this was generally a good intro, but the article could do with some polishing to save newbies like me from the underlying suspicion that they're missing something obvious. I tried just reading the tutorial without watching the video, and suffered some cognitive dissonance.
At the end of the article, there are two diagrams showing the behaviour of jcvd.out() and jcvd.outE(). The little gremlins are pointing at two vertices and two edges respectively, but based on the 15 lines of code earlier, those are the wrong connections, right? jcvd only has edges to kickboxer and bloodsport, but the diagrams show connections to kickboxer and timecop.
So I looked at the code again, and realized the timecop vertex was never created, which seems kinda odd if you're going to use it in the diagram.
I eventually watched the video and saw animations where the little gremlins go to all three vertices/edges, so it's probably just a badly timed screencap for the article. Not that that explains why timecop is not in the code example, but whatever.
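For what it's worth, here's a tiny plain-Python sketch of what I understood the example graph to be: just jcvd with edges to kickboxer and bloodsport, no timecop. This only mimics the Gremlin out()/outE() semantics; the class names and the "acted_in" label are my own guesses, not from the article:

```python
# Toy in-memory graph mimicking the article's example.
# (The "acted_in" label is a guess; the article's code isn't reproduced here.)
class Vertex:
    def __init__(self, name):
        self.name = name
        self.out_edges = []  # direct references to outgoing Edge objects

class Edge:
    def __init__(self, label, tail, head):
        self.label, self.tail, self.head = label, tail, head
        tail.out_edges.append(self)  # register on the tail vertex

jcvd = Vertex("jcvd")
kickboxer = Vertex("kickboxer")
bloodsport = Vertex("bloodsport")
Edge("acted_in", jcvd, kickboxer)
Edge("acted_in", jcvd, bloodsport)

def out(v):
    """Gremlin-style out(): the adjacent vertices."""
    return [e.head for e in v.out_edges]

def outE(v):
    """Gremlin-style outE(): the edges themselves."""
    return list(v.out_edges)

print([w.name for w in out(jcvd)])   # ['kickboxer', 'bloodsport']
print(len(outE(jcvd)))               # 2
```

So with only those 15-odd lines of code, the gremlins should land on exactly two vertices and two edges, which is what made the timecop diagram confusing.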
I was very interested in the subject, but hated this video.
Very superficial: it started off with a complicated relational schema to criticize relational databases, but never ended up explaining how a graph database would simplify the problem. I thought the graph database concepts and language were far more complex than the SQL schema and language.
The presenter talked and flipped through slides very fast; is this supposed to sound or look smart? On top of that, half the time the video was a close-up of the presenter's face moving left and right in an awkward fashion.
Good feedback. I might need similar feedback for my upcoming talk. I am about to give a talk about Cayley (an open source graph DB written in Go) and I am working on my slides:
http://oren.github.io/adventure-graphs
Let me know what you think and also join us on IRC (#cayley on freenode) if you find it interesting.
TL;DR: graphs are everywhere in the real world, so using a graph DB will be simpler and more efficient; examples of graph queries follow.
Thank you but may I ask who this presentation is for? Because from a quick glance, it's not very deep in technical details. I mean I'm curious about graph databases, but comparing them to vanilla SQL schemas isn't very informative. What I really want to know is what makes them different from denormalized schemas (which is what I expect most people would use).
Maybe I still just don't "get it", but this explanation didn't really show me how a graph database is any better than an RDBMS, apart from a somewhat simpler interface (which in my opinion is still no better than many ORMs).
For good performance, it sounds like you still need to make good decisions about what to index, as well as putting hard limits on your data, even if they're not strictly enforced by the data model. And if those kinds of things affect performance, then surely changes to the schema (or whatever you'd call it here) will result in a need for migration/reoptimization.

The trouble is, when that needs to happen, I personally would rather have tight control over when and how it happens (with a migration) than rely on a black box that supposedly makes everything simple. I'm assuming graph databases have ways to control that process, but that kind of proves my point: you don't get greater performance, simplicity, and flexibility for free, especially when you compare it to something as mature as the current RDBMSs. So what problem is it really solving?
Also, the comparison is a little unfair to RDBMS's - this makes it sound like you'd need separate join tables for every kind of person-media relationship, when you could certainly just use one join table with a column for various relationship types. And the complexity of TV shows with seasons and episodes? I'm pretty sure those distinctions would still need to be modeled in a thoughtful way with a graph database, but I could be wrong.
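To make that single-join-table point concrete, here's a rough sketch using sqlite3 (the schema, table names, and the 'acted_in' relation value are all made up for illustration, not taken from the article):

```python
import sqlite3

# One join table with a relationship-type column, instead of a separate
# join table per kind of person-media relationship.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE media  (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE person_media (
        person_id INTEGER REFERENCES person(id),
        media_id  INTEGER REFERENCES media(id),
        relation  TEXT  -- 'acted_in', 'directed', 'produced', ...
    );
""")
db.execute("INSERT INTO person VALUES (1, 'jcvd')")
db.execute("INSERT INTO media VALUES (1, 'Kickboxer'), (2, 'Bloodsport')")
db.execute("""INSERT INTO person_media
              VALUES (1, 1, 'acted_in'), (1, 2, 'acted_in')""")

rows = db.execute("""
    SELECT m.title FROM media m
    JOIN person_media pm ON pm.media_id = m.id
    WHERE pm.person_id = 1 AND pm.relation = 'acted_in'
    ORDER BY m.id
""").fetchall()
print(rows)  # [('Kickboxer',), ('Bloodsport',)]
```

Adding a new relationship kind is then just a new value in the relation column, not a new table, which is the comparison the article skipped.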
There are myriad pros/cons between graph/relational/NoSQL, but to me, a "real" graph DB will have index-free adjacency, allowing it to do deep traversals (friend-of-a-friend-of-a-friend, and so on) in constant time per hop. It finds its value in traversal of deeply connected datasets.
Any article or comparison that doesn't at least try to explain index-free adjacency isn't going to make a compelling case for a graph DB, let alone a native graph DB. One reason for that may be that many "graph" databases don't have index-free adjacency, so they have worse than expected deep-traversal characteristics.
That makes sense. I'm seeing index-free adjacency mentioned in some other comparisons. Sounds pretty cool.
So if each node has pointers directly to related nodes (without needing an index lookup), does that also mean that inserts and updates are slower? From what I understand, if you're bypassing the need for an index lookup at query time, you have to pay for that at some other point in time - specifically by looking up the appropriate pointers at the time of insert/update. Is that accurate?
That's right. However, there are still indexes, even if they aren't necessary for traversal. Ideally you'll use an index to find a start node and traverse on from there (or in the case of your question, update from there).
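To put that tradeoff in code: here's a minimal dict-based sketch (all names invented, not any particular product's API) of how an edge insert pays the pointer bookkeeping on both endpoints, while a traversal only needs one index lookup, for the start vertex:

```python
# vertices stands in for the "find a start node" index mentioned above.
vertices = {}  # name -> adjacency record

def add_vertex(name):
    vertices[name] = {"out": [], "in": []}

def add_edge(tail, head):
    # Write-time cost of index-free adjacency: both endpoints are updated
    # so the edge can later be traversed in either direction.
    vertices[tail]["out"].append(head)
    vertices[head]["in"].append(tail)

for name in ("a", "b", "c"):
    add_vertex(name)
add_edge("a", "b")
add_edge("b", "c")

# Traversal: one index lookup for the start node, then pointer-chasing only.
start = vertices["a"]
friend_of_friend = [vertices[f]["out"] for f in start["out"]]
print(friend_of_friend)  # [['c']]
```

So yes: the lookup work you skip at query time shows up as extra writes (one per endpoint, per direction) at insert/update time.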
Using a novel index that combines hashes with linked lists, it is possible to achieve the same O(n) complexity when traversing the whole graph.
Index-free adjacency is an implementation detail - with drawbacks:
If you store the neighbours of each vertex as a list of direct pointers, then traversing all neighbours has complexity O(k) if the vertex has k edges. Note that this is the best possible complexity, because O(k) is the size of the answer. But deleting a single edge also has complexity O(k) (even assuming a doubly linked list, since the edge entry must first be found in the list), which is much worse.
Furthermore, usually one will want to be able to traverse edges in both directions, which makes it necessary to store direct pointers on both vertices that are incident with an edge. A consequence of this is that deleting a supernode is even worse: To remove all incident edges one has to visit every adjacent vertex – and perform a potentially expensive removal operation for each of them.
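A quick toy illustration of that supernode point, in plain Python with all names invented: because each edge is mirrored on both endpoints, deleting a vertex with k edges has to visit all k neighbours to unlink the back-pointers, even when each individual unlink is O(1):

```python
from collections import defaultdict

# Edge stored on both endpoints so it can be traversed in either direction.
out_edges = defaultdict(set)
in_edges = defaultdict(set)

def add_edge(tail, head):
    out_edges[tail].add(head)
    in_edges[head].add(tail)

def delete_vertex(v):
    """Remove v and all incident edges; returns number of neighbours visited."""
    visits = 0
    for neighbor in out_edges.pop(v, set()):
        in_edges[neighbor].discard(v)   # unlink the back-pointer
        visits += 1
    for neighbor in in_edges.pop(v, set()):
        out_edges[neighbor].discard(v)
        visits += 1
    return visits  # O(k) total, even with O(1) per unlink

for i in range(1000):
    add_edge("hub", f"v{i}")
removed = delete_vertex("hub")
print(removed)  # 1000
```

With a supernode, that k can be huge, which is exactly the drawback described above.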
In general, a graph database is “a database that uses graph structures for semantic queries with nodes, edges and properties to represent and store data” (Wikipedia) – independent of the way the data is stored internally.
Your last sentence is perfect. Each chosen internal representation makes its own time/space tradeoffs. Users need to pick a graph database based on the tradeoffs they can live with for their application.
Is Titan going to survive even though DataStax bought out the team? Their GitHub repo hasn't been very active recently.
My issue with graph dbs is that as requirements change you usually have to add more granularity to the edges and nodes. Eventually the schema becomes much more complicated than a RDB.
I don't have any experience with this particular product, but I've done work with a bunch of semantic web software (which is graph based). The most difficult part of migration is related to the ontology, i.e. the edges. Feature creep is very easy, and if you don't set hard limits you can easily find yourself graphing metadata about graph metadata :) You can do this with relational databases as well (recursive logging tables and the like), but it is easier to catch because of the exploding table count. Authorization isn't as easy either, so you'll want to give that some thought before you jump in.
This article overly inflates the complexity of graphs and databases in order to sound fancy. I've written a response that is very direct and shows how simple a graph database can be: https://github.com/amark/gun/wiki/Graph-Databases-101 .
Hi. Presenter here. Honestly the goal wasn't to sound fancy. If you're going to work in the GraphDB world, you're going to come across this terminology.
If your concern around my intro is the complexity described of the relational world, well, that's kind of the point. Anyone with at least a few years' experience in the RDBMS world has probably come across a project that's spiraled completely out of control with an outrageous number of many-to-many relationships that are almost impossible to work with. The role of the DBA just to manage your queries and tables is a reflection of that difficulty.
GUN looks like a cool project. Good intro, & thanks for the feedback.
I didn't think it was overly fancy or confusing at all. I'm not even a particularly techie person and I found the presentation and material to be very palatable. I came away from the video wanting to learn some more about graph databases which I'd say was probably your goal. Very nice job with this!
Sorry for the abrasiveness, I actually liked your overview of table outrage (I should have been positive and mentioned this).

What I didn't like is that the article starts right away with mentioning Gremlin - which is very popular within the academic community but difficult for most developers. Honestly anything outside of SQL and MongoDB's query spec can be frightening for devs, because you are genuinely talking about another language you have to learn. This complexity makes graphs themselves look like they are hard and difficult and for serious people, like machine learning.

So I think it is dangerous to introduce people to complicated new query languages in a 101 article, because people will feel discouraged that if they can't get past Gremlin then they'll never be able to use graphs at all (despite the fact that they use them all the time, especially on the frontend, without even realizing it).

Elsewise I thought you did a good job explaining the problem and even talking about the basics of semantics. Just going down the Gremlin route either scares people off or appeals to the more academic elite.
Gremlin is a practical query language designed for developers and has very little (maybe zero) usage in academia. SPARQL is the more established option, which does get some academic (and production) usage, particularly as a lot of research on graph databases overlaps with semantic web research.
This isn't the first time I've seen you criticise the "academic elite". You seem to use it as a crutch, an excuse for sloppy thinking and poor quality software.
Hmm, maybe I'm missing something, but how do you deal with "edges" in your example? That was the cool part of the original article for me: the concept of specifying relationships between objects via edges between the vertices.
https://news.ycombinator.com/item?id=11257280 (4 hours ago, 15 comments)