We've changed the title of this one back from "Graph databases 101".
Submitters: the HN guidelines ask you to "please use the original title unless it is misleading or linkbait". Note how that does not read "please change the title to make it more misleading and linkbait".
As I had zero experience with graph databases, this was generally a good intro, but the article could do with some polishing to save newbies like me from the underlying suspicion that they're missing something obvious. I tried just reading the tutorial without watching the video, and suffered some cognitive dissonance.
At the end of the article, there are two diagrams showing the behaviour of jcvd.out() and jcvd.outE(). The little gremlins are pointing at two vertices and two edges respectively, but based on the 15 lines of code earlier, those are the wrong connections, right? jcvd only has edges to kickboxer and bloodsport, but the diagrams show connections to kickboxer and timecop.
So I looked at the code again, and realized the timecop vertex was never created, which seems kinda odd if you're going to use it in the diagram.
I eventually watched the video and saw animations where the little gremlins go to all three vertices/edges, so it's probably just a badly timed screencap for the article. Not that that explains why timecop is not in the code example, but whatever.
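For what it's worth, here's a tiny plain-Python sketch of what I understood the example graph to be: just jcvd with edges to kickboxer and bloodsport, no timecop. This only mimics the Gremlin out()/outE() semantics; the class names and the "acted_in" label are my own guesses, not from the article:

```python
# Toy in-memory graph mimicking the article's example.
# (The "acted_in" label is a guess; the article's code isn't reproduced here.)
class Vertex:
    def __init__(self, name):
        self.name = name
        self.out_edges = []  # direct references to outgoing Edge objects

class Edge:
    def __init__(self, label, tail, head):
        self.label, self.tail, self.head = label, tail, head
        tail.out_edges.append(self)  # register on the tail vertex

jcvd = Vertex("jcvd")
kickboxer = Vertex("kickboxer")
bloodsport = Vertex("bloodsport")
Edge("acted_in", jcvd, kickboxer)
Edge("acted_in", jcvd, bloodsport)

def out(v):
    """Gremlin-style out(): the adjacent vertices."""
    return [e.head for e in v.out_edges]

def outE(v):
    """Gremlin-style outE(): the edges themselves."""
    return list(v.out_edges)

print([w.name for w in out(jcvd)])   # ['kickboxer', 'bloodsport']
print(len(outE(jcvd)))               # 2
```

So with only those 15-odd lines of code, the gremlins should land on exactly two vertices and two edges, which is what made the timecop diagram confusing.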
I was very interested in the subject, but hated this video.
Very superficial: it started off with a complicated relational schema to criticize relational databases, but never ended up explaining how a graph database would simplify the problem. I thought the graph database concepts and language were far more complex than the SQL schema and language.
The presenter talked and flipped through slides very fast; is this supposed to sound or look smart? On top of that, half the time the video was a close-up of the presenter's face moving left and right in an awkward fashion.
Good feedback. I might need similar feedback for my upcoming talk. I am about to give a talk about Cayley (an open source graph DB written in Go) and I am working on my slides:
http://oren.github.io/adventure-graphs
Let me know what you think and also join us on IRC (#cayley on freenode) if you find it interesting.
TL;DR: graphs are everywhere in the real world, so using a graph DB will be simpler and more efficient; examples of graph queries follow.
Thank you but may I ask who this presentation is for? Because from a quick glance, it's not very deep in technical details. I mean I'm curious about graph databases, but comparing them to vanilla SQL schemas isn't very informative. What I really want to know is what makes them different from denormalized schemas (which is what I expect most people would use).
Maybe I still just don't "get it", but this explanation didn't really show me how a graph database is any better than an RDBMS, apart from a somewhat simpler interface (which in my opinion is still no better than many ORMs).
For good performance, it sounds like you still need to make good decisions about what to index, as well as putting hard limits on your data, even if they're not strictly enforced by the data model. And if those kinds of things affect performance, then surely changes to the schema (or whatever you'd call it here) will result in a need for migration/reoptimization.

The trouble is, when that needs to happen, I personally would rather have tight control over when and how it happens (with a migration) than rely on a black box that supposedly makes everything simple. I'm assuming graph databases have ways to control that process, but that kind of proves my point: you don't get greater performance, simplicity, and flexibility for free, especially when you compare it to something as mature as the current RDBMSs. So what problem is it really solving?
Also, the comparison is a little unfair to RDBMS's - this makes it sound like you'd need separate join tables for every kind of person-media relationship, when you could certainly just use one join table with a column for various relationship types. And the complexity of TV shows with seasons and episodes? I'm pretty sure those distinctions would still need to be modeled in a thoughtful way with a graph database, but I could be wrong.
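To make that single-join-table point concrete, here's a rough sketch using sqlite3 (the schema, table names, and the 'acted_in' relation value are all made up for illustration, not taken from the article):

```python
import sqlite3

# One join table with a relationship-type column, instead of a separate
# join table per kind of person-media relationship.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE media  (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE person_media (
        person_id INTEGER REFERENCES person(id),
        media_id  INTEGER REFERENCES media(id),
        relation  TEXT  -- 'acted_in', 'directed', 'produced', ...
    );
""")
db.execute("INSERT INTO person VALUES (1, 'jcvd')")
db.execute("INSERT INTO media VALUES (1, 'Kickboxer'), (2, 'Bloodsport')")
db.execute("""INSERT INTO person_media
              VALUES (1, 1, 'acted_in'), (1, 2, 'acted_in')""")

rows = db.execute("""
    SELECT m.title FROM media m
    JOIN person_media pm ON pm.media_id = m.id
    WHERE pm.person_id = 1 AND pm.relation = 'acted_in'
    ORDER BY m.id
""").fetchall()
print(rows)  # [('Kickboxer',), ('Bloodsport',)]
```

Adding a new relationship kind is then just a new value in the relation column, not a new table, which is the comparison the article skipped.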
There are myriad pros/cons between graph/relational/NoSQL, but to me, a "real" graph DB will have index-free adjacency, allowing it to do deep traversals (friend-of-a-friend-of-a-friend, and so on) in constant time per hop. It finds its value in traversal of deeply connected datasets.
Any article or comparison that doesn't at least try to explain index-free adjacency isn't going to make a compelling case for a graph DB, let alone a native graph DB. One reason for that may be that many "graph" databases don't have index-free adjacency, so they have worse than expected deep-traversal characteristics.
That makes sense. I'm seeing index-free adjacency mentioned in some other comparisons. Sounds pretty cool.
So if each node has pointers directly to related nodes (without needing an index lookup), does that also mean that inserts and updates are slower? From what I understand, if you're bypassing the need for an index lookup at query time, you have to pay for that at some other point in time - specifically by looking up the appropriate pointers at the time of insert/update. Is that accurate?
That's right. However, there are still indexes, even if they aren't necessary for traversal. Ideally you'll use an index to find a start node and traverse on from there (or in the case of your question, update from there).
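To put that tradeoff in code: here's a minimal dict-based sketch (all names invented, not any particular product's API) of how an edge insert pays the pointer bookkeeping on both endpoints, while a traversal only needs one index lookup, for the start vertex:

```python
# vertices stands in for the "find a start node" index mentioned above.
vertices = {}  # name -> adjacency record

def add_vertex(name):
    vertices[name] = {"out": [], "in": []}

def add_edge(tail, head):
    # Write-time cost of index-free adjacency: both endpoints are updated
    # so the edge can later be traversed in either direction.
    vertices[tail]["out"].append(head)
    vertices[head]["in"].append(tail)

for name in ("a", "b", "c"):
    add_vertex(name)
add_edge("a", "b")
add_edge("b", "c")

# Traversal: one index lookup for the start node, then pointer-chasing only.
start = vertices["a"]
friend_of_friend = [vertices[f]["out"] for f in start["out"]]
print(friend_of_friend)  # [['c']]
```

So yes: the lookup work you skip at query time shows up as extra writes (one per endpoint, per direction) at insert/update time.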
Using a novel index that combines hashes with linked lists, it is possible to achieve the same O(n) complexity when traversing the whole graph.
Index-free adjacency is an implementation detail - with drawbacks:
If you store the neighbours of each vertex as a list of direct pointers, then traversing all neighbours has complexity O(k) if the vertex has k edges. Note that this is the best possible complexity, because O(k) is the size of the answer. But deleting a single edge also has complexity O(k) (even assuming a doubly linked list, since the edge entry must first be found in the list), which is much worse.
Furthermore, usually one will want to be able to traverse edges in both directions, which makes it necessary to store direct pointers on both vertices that are incident with an edge. A consequence of this is that deleting a supernode is even worse: To remove all incident edges one has to visit every adjacent vertex – and perform a potentially expensive removal operation for each of them.
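A quick toy illustration of that supernode point, in plain Python with all names invented: because each edge is mirrored on both endpoints, deleting a vertex with k edges has to visit all k neighbours to unlink the back-pointers, even when each individual unlink is O(1):

```python
from collections import defaultdict

# Edge stored on both endpoints so it can be traversed in either direction.
out_edges = defaultdict(set)
in_edges = defaultdict(set)

def add_edge(tail, head):
    out_edges[tail].add(head)
    in_edges[head].add(tail)

def delete_vertex(v):
    """Remove v and all incident edges; returns number of neighbours visited."""
    visits = 0
    for neighbor in out_edges.pop(v, set()):
        in_edges[neighbor].discard(v)   # unlink the back-pointer
        visits += 1
    for neighbor in in_edges.pop(v, set()):
        out_edges[neighbor].discard(v)
        visits += 1
    return visits  # O(k) total, even with O(1) per unlink

for i in range(1000):
    add_edge("hub", f"v{i}")
removed = delete_vertex("hub")
print(removed)  # 1000
```

With a supernode, that k can be huge, which is exactly the drawback described above.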
In general, a graph database is “a database that uses graph structures for semantic queries with nodes, edges and properties to represent and store data” (Wikipedia) – independent of the way the data is stored internally.
Your last sentence is perfect. Each chosen internal representation makes its own time/space tradeoffs. Users need to pick a graph database based on the tradeoffs they can live with for their application.
Is Titan going to survive even though DataStax bought out the team? Their GitHub repo hasn't been very active recently.
My issue with graph dbs is that as requirements change you usually have to add more granularity to the edges and nodes. Eventually the schema becomes much more complicated than a RDB.
I don't have any experience with this particular product, but I've done work with a bunch of semantic web software (which is graph based). The most difficult part of migration is related to the ontology, i.e. the edges. Feature creep is very easy, and if you don't set hard limits you can easily find yourself graphing metadata about graph metadata :) You can do this with relational databases as well (recursive logging tables and the like), but it is easier to catch because of the exploding table count. Authorization isn't as easy either, so you'll want to give that some thought before you jump in.
This article overly inflates the complexity of graphs and databases in order to sound fancy. I've written a response that is very direct and shows how simple a graph database can be: https://github.com/amark/gun/wiki/Graph-Databases-101 .
Hi. Presenter here. Honestly the goal wasn't to sound fancy. If you're going to work in the GraphDB world, you're going to come across this terminology.
If your concern around my intro is the complexity described of the relational world, well, that's kind of the point. Anyone with at least a few years' experience in the RDBMS world has probably come across a project that's spiraled completely out of control with an outrageous number of many-to-many relationships that are almost impossible to work with. The role of the DBA just to manage your queries and tables is a reflection of that difficulty.
GUN looks like a cool project. Good intro, & thanks for the feedback.
I didn't think it was overly fancy or confusing at all. I'm not even a particularly techie person and I found the presentation and material to be very palatable. I came away from the video wanting to learn some more about graph databases which I'd say was probably your goal. Very nice job with this!
Sorry for the abrasiveness, I actually liked your overview of table outrage (I should have been positive and mentioned this).

What I didn't like is that the article starts right away with mentioning Gremlin - which is very popular within the academic community but difficult for most developers. Honestly anything outside of SQL and MongoDB's query spec can be frightening for devs, because you are genuinely talking about another language you have to learn. This complexity makes graphs themselves look like they are hard and difficult and for serious people, like machine learning.

So I think it is dangerous to introduce people to complicated new query languages in a 101 article, because people will feel discouraged that if they can't get past Gremlin then they'll never be able to use graphs at all (despite the fact that they use them all the time, especially on the frontend, without even realizing it).

Elsewise I thought you did a good job explaining the problem and even talking about the basics of semantics. Just going down the Gremlin route either scares people off or appeals to the more academic elite.
Gremlin is a practical query language designed for developers and has very little (maybe zero) usage in academia. SPARQL is the more established option, which does get some academic (and production) usage, particularly as a lot of research on graph databases overlaps with semantic web research.
This isn't the first time I've seen you criticise the "academic elite". You seem to use it as a crutch, an excuse for sloppy thinking and poor quality software.
Hmm, maybe I'm missing something, but how do you deal with "edges" in your example? That was the cool part of the original article for me: the concept of specifying relationships between objects via edges between the vertices.
https://news.ycombinator.com/item?id=11257280 (4 hours ago, 15 comments)