Introducing OpenCypher, the open graph query language project

klapinat0r · on Oct 22, 2015

> Why We Need a Common Graph Query Language

> I believe that graph query language is Cypher.

I, and others, believe Gremlin is that common graph query language.

My first encounter with graph databases was in fact Neo4j, but I quickly pivoted to Titan, and I'm now using Cayley for smaller projects (although its Cassandra model will make for an interesting future).

All except Neo4j supported Gremlin, which to me is expressive, formal and actually human-readable.

Cypher looks cool, and has an intuitive way of writing the edge parts of a query, but while it looks hip, I didn't find it obvious or human-parsable. Human-readable, yes: you recognize the terms. But translating it into an AST is simpler and easier in Gremlin than in Cypher (in my opinion).

okram · on Oct 22, 2015

Gremlin3 is now a part of Apache, and nearly every graph vendor supports TinkerPop3 (or is in the process of migrating to TinkerPop3). http://tinkerpop.incubator.apache.org/ Next, with TinkerPop/Gremlin you not only get a language but also a virtual machine that can execute other languages (e.g. SPARQL) http://www.datastax.com/dev/blog/the-benefits-of-the-gremlin... . The virtual machine executes over OLTP graph databases (Titan, Neo4j, OrientDB, etc.) and over OLAP graph processors (Spark, Giraph, etc.).

iamtherhino · on Oct 22, 2015

Neo4j does support Gremlin: https://github.com/neo4j-contrib/gremlin-plugin

Gremlin was originally developed by Neo4j co-founder Peter Neubauer and Marko Rodriguez (Titan founder) while they were both working at Neo4j.

optimuspaul · on Oct 22, 2015

Gremlin is certainly more readable in my opinion. Cypher has always felt like it was trying too hard to be different.

okram · on Oct 22, 2015

http://www.slideshare.net/slidarko/the-gremlin-traversal-lan...

th0ma5 · on Oct 22, 2015

I was thinking / hoping that language was SPARQL.

okram · on Oct 22, 2015

https://github.com/dkuppitz/sparql-gremlin

jerven · on Oct 22, 2015

I always wonder how readable those queries really are. It's a nice claim, but has anyone done any serious research on this at all? For example, the same in SPARQL 1.1:

  SELECT ?cypher_attributes
  WHERE {
    ?cypher a <QueryLanguage> ;
            <queries> ?graphs ;
            <attributes> ?cypher_attributes .
    ?user <USES> ?cypher .
    FILTER (?user IN ('Oracle', 'Apache Spark', 'Tableau', 'Structr'))
    ?opencypher <MAKES_AVAILBLE> ?cypher .        
  }

Instead of

  MATCH (cypher:QueryLanguage)-[:QUERIES]->(graphs)
  MATCH (cypher)<-[:USES]-(u:User) WHERE u.name IN ['Oracle', 'Apache Spark', 'Tableau', 'Structr']
  MATCH (openCypher)-[:MAKES_AVAILBLE]->(cypher)
  RETURN cypher.attributes

In the SPARQL case the graph flow does not reverse on the <USES> edge (it can, using ?cypher ^<USES> ?user, but that would be weird and, in larger queries, very confusing). The SPARQL version also tends to group related concepts together.

This assumes that a default base URI containing all the modeled relations is selected for the SPARQL version, which is fair in a straight comparison with Cypher.

I find Gremlin a lot nicer than Cypher, and a lot more powerful as well. Also, to date Neo4j just has not scaled all that well. I am awaiting the LDBC benchmark results for Neo4j to see if I am wrong.

What Neo4j has been great at is making a nice solid product that aims at solving developer problems. I believe as a database it has not been that great at solving enterprise or life science community problems. It's still a single database instance without federation on demand.

okram · on Oct 22, 2015

In Apache TinkerPop's Gremlin3.

  g.V().match(
    as("cypher").hasLabel("QueryLanguage").out("queries").as("graphs"),
    as("user").out("uses").as("cypher"),
      where("user", within(["Oracle","Apache Spark", "Tableau", "Structr"])),
    as("openCypher").out("makesAvailable").as("cypher")).
      select("cypher").by("attributes")

The original query is sort of an odd query, as you don't need all the unbound variables...

And if you want to use SPARQL over TinkerPop, just use a SPARQL->Gremlin Virtual Machine compiler. https://github.com/dkuppitz/sparql-gremlin

johnymontana · on Oct 22, 2015

Yeah, I actually wrote that query just as an example to show what Cypher looks like and some fun around the openCypher announcement. Wasn't expecting it to end up on HN, let alone have Marko convert it to Gremlin...

timwilliate · on Oct 23, 2015

I disagree with you on the statement that Neo4j does not work well in life sciences. I am a data scientist building large scale systems for mining genomic data, and we built a fairly critical piece of that infrastructure around Neo4j. I actually presented an overview of that work at GraphConnect this week:

http://speakerdeck.com/timwilliate/graphs-are-feeding-the-wo...

Many meaningful lineages in life sciences can be hundreds to thousands of levels deep (our datasets are great examples). Neo4j is the only graph database I have evaluated that handles traversals across lineages of this depth while still achieving the performance scalability promised by maintaining index-free adjacency across whichever node in the cluster a traversal is sent to.

a_bonobo · on Oct 23, 2015

The recent "huge open tree of life" paper uses a Neo4j database as well: http://www.pnas.org/content/112/41/12764.full

jerven · on Oct 23, 2015

I am just going to point to our work at sparql.uniprot.org. A graph database with 17 billion edges and 3 billion+ nodes. Containing in its whole the NCBI tax and GO tax trees. That you can access for free over HTTP using standard SPARQL 1.1. This does not run on a cluster but single nodes with Virtuoso 7.2.1.

I am not saying that Neo4j is a bad choice; I am just saying that, due to its lack of federation support, it is an expensive choice for the life sciences, i.e. an economic argument rather than a technical one, and not looking at one project at a time but at the community in general. Neo4j and Cypher will never support federation in the way that SPARQL allows. This is because all this URI business in RDF is annoying when modelling your data but critical when merging datasets on demand between separate databases, e.g. joining ChEMBL & UniProt & MeSH & PubChem etc.

We in the life sciences rarely do graph traversals for graph traversal sake, but tend to join trees. e.g. intersect a branch of a taxonomic tree with a branch of the GO tree. There are cases where real graph traversals are being done (assembly&variation graphs).
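
As a rough illustration of that kind of branch intersection, here is a minimal Python sketch. The two child-to-parent maps, the annotation table, and the names `descendants` and `proteins_in` are all made up for illustration; this is not how UniProt, Virtuoso, or any SPARQL engine implements it.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy hierarchies, stored as child -> parent maps.
taxonomy: Dict[str, str] = {
    "human": "primates",
    "chimp": "primates",
    "primates": "mammals",
    "mouse": "mammals",
}
go_terms: Dict[str, str] = {
    "kinase activity": "catalytic activity",
    "catalytic activity": "molecular function",
    "binding": "molecular function",
}
# Hypothetical annotations: protein -> (taxon, GO term).
annotations: Dict[str, Tuple[str, str]] = {
    "P1": ("human", "kinase activity"),
    "P2": ("mouse", "kinase activity"),
    "P3": ("chimp", "binding"),
}


def descendants(parent_of: Dict[str, str], root: str) -> Set[str]:
    """Return root plus every node whose chain of parents reaches root."""
    branch = {root}
    for node in parent_of:
        current, seen = node, set()
        # Walk upwards; the seen set guards against accidental cycles.
        while current in parent_of and current not in seen:
            seen.add(current)
            current = parent_of[current]
            if current == root:
                branch.add(node)
                break
    return branch


def proteins_in(tax_root: str, go_root: str) -> List[str]:
    """Proteins annotated inside both the taxonomic branch and the GO branch."""
    taxa = descendants(taxonomy, tax_root)
    terms = descendants(go_terms, go_root)
    return sorted(
        protein
        for protein, (taxon, term) in annotations.items()
        if taxon in taxa and term in terms
    )


print(proteins_in("primates", "catalytic activity"))
```

The point is only that the work is two subtree lookups plus a join, rather than an open-ended traversal.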

OpenCypher is a great step forward. Now Neo4j needs an open public standard for serializing graphs to disk that can be imported into Neo4j and other databases. RDF being supported by so many different databases allows us to support many more of our users (at UniProt) even if they don't use SPARQL or our choice of graph database themselves.

jakewins · on Oct 22, 2015

There's nothing stopping you from flipping that relationship order though, or making that pattern more compact. In fact, I'd prefer an overall reversed order, something like:

  MATCH (openCypher)-[:MAKES_AVAILBLE]->(cypher:QueryLanguage)-[:QUERIES]->(graphs),
        (u:User)-[:USES]->(cypher)
  WHERE u.name IN ['Oracle', 'Apache Spark', 'Tableau', 'Structr']
  RETURN cypher.attributes

I guess, in this particular case, it's subjective preference which language you feel expresses the query pattern most legibly. I certainly prefer the visual approach of cypher.

okram · on Oct 22, 2015

Ah, that's better, as you don't need all the unbound variables. However, you still don't need "graphs" except for the English reading of the promotion.

In Gremlin3, the above is:

  g.V().match(
    as("openCypher").out("makesAvailable").hasLabel("QueryLanguage").as("cypher").out("queries").as("graphs"),
    as("user").out("uses").as("cypher"),
      where("user", within(["Oracle","Apache Spark", "Tableau", "Structr"]))).
      select("cypher").by("attributes")

jerven · on Oct 23, 2015

I like Cypher, then Gremlin, for small queries. The problem is that the queries I see are much, much larger. And at about 10 lines in, the use of whitespace in SPARQL starts to make a real difference in readability, in my opinion.

That can of course also be affected by my slight reading disability where the shape of the words is important. This shape could be disturbed by the connecting sigils in both gremlin and Cypher. So I understand that my preference might not hold for the whole population :)

grandalf · on Oct 22, 2015

> I find Gremlin a lot nicer than Cypher, and a lot more powerful as well.

Do you have an example of a query that you think is better in Gremlin? I've yet to see one but haven't spent much time with Gremlin.

taylorbuley · on Oct 22, 2015

A major benefit of Gremlin is portability to e.g. Google's Cayley. If I was specializing in Neo4J I would specialize in Cypher too.

jonpaine · on Oct 23, 2015

I think that's exactly why openCypher is happening. A robust and widely adopted openCypher is a good thing for users, but also for Neo Technology.

grandalf · on Oct 25, 2015

Does anyone know whether the semantics available in Cypher would be practical to use when querying a distributed graph database? Or is it useful to have a closer to the metal implementation considering the distributed system tradeoffs?

mk3 · on Oct 22, 2015

Neo4j has this thing about ignoring other graph databases. One example is their O'Reilly book Graph Databases, covering, you guessed it, one graph database. I know it's good for marketing purposes. I like the way Cypher works, though I'm not sure whether I would like to see Neo4j controlling how the graph query language evolves.

jakewins · on Oct 22, 2015

So, it's to a large extent to address that last concern that we are doing this, right. We are moving the ownership of the language out into its own project, and inviting others to design the future of graph query languages together with us. We think that, while that means we lose a lot of control of our golden goose, so to speak, it is exactly the right choice if we want, as we do, for Cypher to be the go-to standard for querying graphs across the industry.

grandalf · on Oct 22, 2015

Thanks for all the work you guys are doing. I have used neo4j on a few projects, one in node and the other in python, and have really loved using cypher. The projects end up so simple with so many fewer lines of code it's amazing.

rdrey · on Oct 22, 2015

At least it _sounds like_ they've been talking to other graph DB vendors about adopting (Open)Cypher. Who knows if it's really going to happen, but it makes me slightly more inclined to invest time in Neo4J's Cypher when I finally start playing with GraphDBs.

mk3 · on Oct 22, 2015

In the whole article they mention only one other graph DB, which is not publicly available :) if you call this talking to other DB vendors :)

marknadal · on Oct 22, 2015

To add to your list (disclosure: I am the author): http://gunDB.io/, which is an open source graph database written in JavaScript. We have not implemented any query languages yet (we'll be adding SQL soon), so you can only do graph traversal as of now.

I'm curious to see what the popular vote would be on which query language the industry wants to standardize (I definitely hope not GraphQL; I personally think RQL and LINQ are interesting). Rather than some company attempting to standardize it, I want to see the users vote. What would you choose?

dmoreno · on Oct 22, 2015

Why is it similar to SQL but different? Wouldn't it be better to add a MATCH clause to SELECT in good old SQL?

From https://en.wikipedia.org/wiki/Cypher_Query_Language:

    MATCH (charlie:Person { name:'Charlie Sheen' })-[:ACTED_IN]-(movie:Movie)
    RETURN movie

to

  SELECT movie
  FROM Movie movie, Person person
  WHERE person.name = 'Charlie Sheen'
  MATCH person-[:ACTED_IN]->movie;

jakewins · on Oct 22, 2015

While Cypher borrows a lot of (good) ideas from SQL, the two languages are fundamentally different in their underlying model. SQL deals with Relational Algebra, while Cypher deals with Graph Theory.

For that pattern to work in SQL, your SQL engine would need to be able to view its relations (tables) as a graph. I'm not smart enough to sort this out in my head, but my instinct tells me that it's better to keep the two models separate; mixing them in one query seems like a recipe for confusion.

okram · on Oct 22, 2015

This is an exact replica in Gremlin3:

  g.V().has("name","Charlie Sheen").
    as("person").out("actedIn").as("movie").
      select("movie")

However, this can be expressed in a much simpler form as you don't need all the variables. Simply do:

  g.V().has("name","Charlie Sheen").out("actedIn")

cheerfulstoic · on Oct 23, 2015

Sometimes you do need the labels, depending on what you're trying to query for. If, for example, a person could have acted in both plays and movies, sometimes you might want one or the other specifically.

tsturge · on Oct 22, 2015

As the designer of MQL (Metaweb/Freebase's query language), which is a GraphQL-like language on top of a graph database, I'm always surprised by other graph query languages. Since JSON represents a tree easily, it is a natural model for non-cyclic queries.

Using JSON for the template also leads to a clear correspondence between the query and the result which is a very nice property few query languages have.
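
To make the query/result correspondence concrete, here is a toy Python sketch of template-style matching. It is only an illustration of the idea, not MQL's actual semantics: the convention that `None` or `[]` means "fill this in", the function names `match_template` and `run_query`, and the sample records are all assumptions made up for this example.

```python
from typing import Any, Dict, List, Optional

Record = Dict[str, Any]


def match_template(template: Record, record: Record) -> Optional[Record]:
    """Return the template filled in from record, or None if it does not match."""
    filled: Record = {}
    for key, wanted in template.items():
        if key not in record:
            return None
        actual = record[key]
        if wanted is None or wanted == []:
            # Placeholder: copy whatever the record holds for this key.
            filled[key] = actual
        elif isinstance(wanted, dict):
            # Nested template: recurse into the nested record.
            if not isinstance(actual, dict):
                return None
            nested = match_template(wanted, actual)
            if nested is None:
                return None
            filled[key] = nested
        elif wanted == actual:
            # Concrete value: acts as a constraint.
            filled[key] = actual
        else:
            return None
    return filled


def run_query(template: Record, records: List[Record]) -> List[Record]:
    """Apply the template to every record and keep the matches."""
    results = []
    for record in records:
        filled = match_template(template, record)
        if filled is not None:
            results.append(filled)
    return results


# Hypothetical sample data.
records = [
    {"type": "artist", "name": "The Police", "album": ["Outlandos d'Amour", "Synchronicity"]},
    {"type": "artist", "name": "Sting", "album": ["Ten Summoner's Tales"]},
]
template = {"type": "artist", "name": "The Police", "album": []}

print(run_query(template, records))
```

Each result has exactly the keys of the template, which is the property described above: the shape of the answer is the shape of the question.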

http://mql.freebaseapps.com/ch03.html

to take a trip back to 2007.

owyn · on Oct 22, 2015

Now that Freebase is shutting down, is there ever going to be an open source version of the server side implementation of MQL?

barakm · on Oct 24, 2015

I implemented a "light" version of MQL atop Cayley (https://github.com/google/cayley) that I'd be happy to extend :)

tsturge did a prescient job with the original (yeah, GraphQL is 2007 all over again) and I wanted to keep the flame alive.

tsturge · on Oct 23, 2015

I don't think so. Freebase became Google Knowledge Graph and I suspect the technology underneath is basically abandoned inside Google at this point. A pity, I think the 2008 version would still be the fastest and most compact graph database available today.

sireat · on Oct 23, 2015

I tried Cypher and Gremlin when working on a smallish graph of about 3M nodes and 20M edges a few years ago.

Back then both were pretty rough, Gremlin looked more polished but wasn't really (most promised features in the documentation were not working just yet).

What has changed since?

okram · on Oct 23, 2015

Apache TinkerPop3 has been in development for 2 years. It is a complete rewrite since TinkerPop2. It was just released in July 2015. It's light-years ahead of Gremlin2.

http://tinkerpop.incubator.apache.org/

The two big things:

1. Gremlin language is much cleaner and easier to use.

2. It supports OLTP graph databases (e.g. Titan/Neo4j/Stardog) and OLAP graph processors (e.g. Hadoop/Spark/Giraph).

interdrift · on Oct 22, 2015

I work with graphs on a daily basis and I can tell you that I was really surprised no one has come up with a solution to this.

rymohr · on Oct 22, 2015

What can Cypher do that GraphQL can't?

atombender · on Oct 23, 2015

GraphQL, surprisingly, is neither a graph query language nor a query language. The comparison isn't really valid.

It's better to understand GraphQL as a protocol that competes with REST. It's only a language in the sense that JSON is a language; i.e., it has a syntax that can be parsed.

For example, GraphQL supports queries like this:

  query movie {
    whereYear(max: 1985)
    actors {
      hasName(like: "goldblum")
    }
  }

But this is something the particular schema and implementation would need to implement. If you want to filter by arbitrary attributes, you're out of luck because the spec is just a syntax. I suppose you could do something like:

  where(what: "year", max: 1985)

but you still have to invent a standard set of parameters here: min, max, eq, notEq, lessThan, lessThanOrEq, like, etc. Again, totally ad hoc.

GraphQL, not being a language, also doesn't support variable bindings. So you cannot do self-referencing queries like "find all movies with a director who also acted in it", because that would require some kind of variable support.

(This is not a criticism of GraphQL, by the way. It's great at what it's defined for.)
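
To show what "variable binding" buys you in the self-referencing query mentioned above, here is a minimal Python sketch over two made-up edge sets. The name `self_directed` and the toy data are assumptions for illustration; the point is only that the same `person` and `movie` must be bound in both the "directed" and the "acted in" pattern.

```python
from typing import List, Set, Tuple

Edge = Tuple[str, str]  # (person, movie)

# Hypothetical toy data.
directed: Set[Edge] = {
    ("Clint Eastwood", "Unforgiven"),
    ("Steven Spielberg", "Jurassic Park"),
}
acted_in: Set[Edge] = {
    ("Clint Eastwood", "Unforgiven"),
    ("Gene Hackman", "Unforgiven"),
    ("Jeff Goldblum", "Jurassic Park"),
}


def self_directed(directed: Set[Edge], acted_in: Set[Edge]) -> List[str]:
    """Movies where one and the same person both directed and acted."""
    # The pair (person, movie) plays the role of the bound variables:
    # it must satisfy both edge patterns at once.
    return sorted({movie for (person, movie) in directed if (person, movie) in acted_in})


print(self_directed(directed, acted_in))
```

A filter on a single field cannot express this, because the constraint ties two different parts of the pattern to the same value.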

optimuspaul · on Oct 23, 2015

I believe that is coming. The latest spec has variables for some use cases.

atombender · on Oct 23, 2015

The latest spec's "variables" let you pass simple parameters to a query. I don't see it going in the direction of a general-purpose graph query language.

jonpaine · on Oct 23, 2015

I think the 'Graph' in GraphQL confuses people. GraphQL is no more relevant to a graph database than it is to a relational database. It is an intermediary that facilitates efficient communication by standardizing req/res structure.

So, for people more comfortable with SQL, your question could just as well be "what can SQL do that GraphQL can't"?

The answer is, of course, that they occupy different domains and have different functions. Both can do lots of things that the other can't.

jakewins · on Oct 22, 2015

GraphQL ends up being capable of whatever the backend exposes in the GraphQL schema, right - so in a way, they are isomorphic. However, in reality I'd argue GraphQL is a brilliant complement to Cypher.

Here are some use cases I think Cypher expresses nicely that I (as a GraphQL noob) don't know how to do in GraphQL:

Simple recommendation engine - suggest people with lots of friends in common that I don't already know:

  MATCH (me:User)-[:KNOWS]->(friend)-[:KNOWS]->(fof)
  WHERE NOT (me)-[:KNOWS]->(fof) 
  AND id(me) = blah
  RETURN fof.name, count(friend) AS friendsInCommon
  ORDER BY friendsInCommon DESC
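
For readers who know neither language, the same recommendation logic can be sketched in plain Python over an adjacency map. This is a rough illustration, not how Neo4j executes the query; the name `recommend` and the sample data are made up, and unlike the Cypher above the sketch also skips `me` as a candidate.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def recommend(knows: Dict[str, Set[str]], me: str) -> List[Tuple[str, int]]:
    """Rank friends-of-friends I don't already know by number of friends in common."""
    direct = knows.get(me, set())
    in_common: Counter = Counter()
    for friend in direct:
        for fof in knows.get(friend, set()):
            # Skip myself and people I already know (the WHERE NOT clause).
            if fof != me and fof not in direct:
                in_common[fof] += 1
    # Most friends in common first; ties broken by name for a stable order.
    return sorted(in_common.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample data: person -> people they know.
knows = {
    "me": {"ann", "bob"},
    "ann": {"cat", "dan", "me"},
    "bob": {"cat", "ann"},
}

print(recommend(knows, "me"))
```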

Basic routing - what's the shortest way for me to get to work?

  MATCH p = shortestPath( (home:Address)-[:ROAD*]-(work) )
  WHERE home.street = .. AND work.street = ..
  RETURN p
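
As a rough idea of what a shortest-path query over undirected, unweighted `ROAD` relationships has to compute, here is a minimal breadth-first search sketch in Python. It is an illustration only, not Neo4j's implementation; `undirected`, `shortest_path`, and the sample road list are hypothetical names and data.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional, Tuple


def undirected(edges: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Build an adjacency map that can be walked in both directions."""
    adjacency: Dict[str, List[str]] = {}
    for a, b in edges:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)
    return adjacency


def shortest_path(roads: Dict[str, List[str]], start: str, goal: str) -> Optional[List[str]]:
    """Breadth-first search; returns a path with the fewest hops, or None."""
    if start == goal:
        return [start]
    previous: Dict[str, Optional[str]] = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in roads.get(current, ()):
            if neighbour in previous:
                continue  # already reached by a path that is at least as short
            previous[neighbour] = current
            if neighbour == goal:
                # Walk the predecessor links back to the start.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


# Hypothetical road network.
roads = undirected([
    ("home", "a"), ("a", "b"), ("b", "work"),
    ("home", "c"), ("c", "work"),
])

print(shortest_path(roads, "home", "work"))
```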

okram · on Oct 22, 2015

In Gremlin3:

  g.V(blah).out("knows").aggregate("friends").
    out("knows").where(not(within("friends"))).
    select().
      by("name").
      by(count("friends")).
    order().by(valueDecr)

meowface · on Oct 23, 2015

As a developer who knows neither Cypher nor Gremlin, Cypher seems much more human-readable here, in my opinion.

    MATCH (me:User)-[:KNOWS]->(friend)-[:KNOWS]->(fof)
    WHERE NOT (me)-[:KNOWS]->(fof)

This easily reads as "get friends (AS FOF) [who know] friends [who know] me, where me [does not know] FOF" to me.

    out("knows").aggregate("friends").
    out("knows").where(not(within("friends")))

The Gremlin, on the other hand, reads as "get friends [who know] friends where... friend is not a friend???" to me.

I also don't easily see where something should be a method and where it should be a function. Why not `order(by(valueDecr))`? Why not `select("name", count("friends"))`? Why not `where(not().within("friends"))`?

okram · on Oct 23, 2015

Huh. That is a good point. In Gremlin you can chain steps together (e.g. out("knows").out("mother").out("worksFor")) and you can match patterns. So, to be clearer, I should have represented the chain as a one-liner or as a two-liner with an indent.

  out("knows").aggregate("friends").out("knows").where(not(within("friends")))

OR

  out("knows").aggregate("friends").
    out("knows").where(not(within("friends")))

Note the "." concatenation that ties the two lines together into a chain. When nesting parallel traversals (e.g. match()), the traversal patterns are delineated by ",".

  . = AND
  , = OR

Ha. That's a generally neat way to think of "." and "," in computing. mult and + ...the algebra.

jerven · on Oct 23, 2015

FYI in SPARQL

  SELECT ?foaf
  WHERE 
  {
    ?me a <USER> .
    ?me <KNOWS> ?friend . ?friend <KNOWS> ?foaf .
    MINUS { ?me <KNOWS> ?foaf }
  }

OR

  SELECT ?foaf
  WHERE 
  {
    ?me a <USER> .
    ?me <KNOWS>/<KNOWS> ?foaf .
    MINUS { ?me <KNOWS> ?foaf }
  }

optimuspaul · on Oct 23, 2015

GraphQL is still very undefined, I've found. It can do basic queries and very limited traversals. The biggest problem is that there aren't really any real implementations of it, the spec is still a working draft, and it hasn't evolved much past defining schemas and doing basic queries.

I find what they have defined so far to be far more approachable than Cypher or Gremlin. As they have been adding features, it is starting to sprawl and look just as nutty as the others. But I do like how it is defining the whole ecosystem around how graphs can be defined and interacted with, much like Gremlin has, but with a much more focused and disciplined approach.

baconner · on Oct 22, 2015

Wait so... It's not about crypto? That's a little confusing name wise.

johnymontana · on Oct 22, 2015

No, not crypto. Cypher is a graph query language, popularized in the Neo4j graph database. Its name may have been inspired by the Matrix movie...

phpnode · on Oct 22, 2015

This is great:

GraphQL has nothing to do with graphs.

OpenCypher has nothing to do with ciphers.

untothebreach · on Oct 22, 2015

Clearly, I need to start a project named 'GraphCypher,' which is a new, high-performance, HTTP2 server.

michaelmior · on Oct 22, 2015

I actually laughed out loud at this, although given the requirement for TLS, at least it would involve encryption.

untothebreach · on Oct 22, 2015

hmmm....guess it will have to be a javascript framework then. Pivot! :)

rakk · on Oct 22, 2015

Extra golden seeing your username - I bet you are a ruby dev? :D

jonpaine · on Oct 23, 2015

The database is called NEO (for Java). All the examples are graphs of relationships in The Matrix, or of movies in general.

Cypher is named for the character in the movies. :)

jzd · on Oct 22, 2015

https://xkcd.com/927/

jakewins · on Oct 22, 2015

Except, we're not introducing something new :) We're taking what's already used by 70-80% of graph database deployments and opening it up for other vendors to collaborate with us on the design and implementation of it.

optimuspaul · on Oct 23, 2015

citation needed.

keyboardwarrior · on Oct 23, 2015

Wouldn't touch Neo4j with a ten foot pole.

Between the OrientDB/Neo4j dick swinging contest and the whole O'Reilly "graph database" book of pure fluff, I am left with a very bad taste in my mouth.

It tries to do the whole vendor lock-in thing... badly.

jakewins · on Oct 24, 2015

http://db-engines.com/en/ranking/graph+dbms