Awesome surprise to see the embargo lifted -- sounds like I can now say the Graphistry team will be doing a follow-up talk at AWS re:Invent tomorrow (Thursday) on Amazon Neptune + Graphistry. We've been incorporating this into visual investigation workflows for security, fraud, health records, etc. They've been doing cool work on the managed graph layer and were early to graph GPU tech (the team includes former Blazegraph engineers), and our side brings that kind of thinking to visual GPU analytics & workflow automation.
If you're in town and into this stuff, ping me at leo [at] graphistry -- I'd love to catch up Thursday/Friday for coffee or drinks. Also happy to chat here or over email, of course!
There was no PR, but there are traces: Amazon acquired the domains, and many former Blazegraph engineers are now Amazon Neptune engineers according to LinkedIn. It was widely rumored in the graph DB world, FWIW.
And then after two years, when you're no longer a startup with a $100 bill but a bigger company, you're completely tied to a jungle of Amazon products, and your exit strategy is very, very costly.
There is some truth to this, but in a larger sense (at the ecosystem level, rather than from the perspective of an individual company), I can only be happy when AWS enters a new space. It makes that component table stakes in the IaaS game, which means every other big player is about to step up with their own offering, and the third-party SaaS and open-source self-hosted offerings in the same space are all going to heat up as well.
Consider the evolution of container hosting services: first we had PaaSes like Heroku with proprietary container formats; then we got Docker, but Docker Swarm was nascent and there was no serious Docker Swarm IaaS-cloud offering. But then, very quickly, AWS built ECS; Google responded with Kubernetes; and then Kubernetes became the open standard, made everyone forget about Docker Swarm, and took over (and is even replacing ECS now).
That's what happens when AWS enters a space. And it's great.
Just an off-topic comment:
I am the maintainer of a visual query builder for SPARQL queries.
cf http://datao.net
This tool lets you design query patterns from a graph data model via drag-and-drop. It can then compile the patterns to SPARQL, run them against an endpoint, and format the results as maps/forms/tables/graphs/HTML (via templating)/...
Another service of Datao (http://search.datao.net) offers a search-engine view of those queries: you can type the textual representation of an object in any public SPARQL endpoint, and the service will list the queries currently available in Datao that can be applied to that object.
You can then run these queries with a click, and get the HTML templating of the query results.
Feel free to have a look at the website, if you find any interest in this tool.
Any feedback is welcome.
PS: Sorry for the poor quality of the videos. I manage this project in my spare time :)
I've only had experience with Cypher, and really liked it. It will be interesting to see how Neo4j responds to this. Regardless of tech specs, fully-managed Neptune vs. a community version on the AWS Marketplace seems to give Neptune an unfair advantage.
I don't understand how a database doesn't have its own native store. What exactly does a graph database actually do if it doesn't manage the data fed to it? The same is true for Cayley (https://github.com/cayleygraph/cayley) and probably others.
1) "native" graph db
Neo4j is an example of this. It takes advantage of index-free adjacency: each node knows which other nodes it is connected to, so traversals are very fast. The issues arise when you try to scale. Data that fits on a single machine is fine, and you can replicate it for fast parallel reads/traversals across disparate regions of a massive graph. However, you lose the ability to shard and distribute the graph, since index-free adjacency doesn't translate across physical machines. Another drawback is highly connected vertices: you will expend a tremendous amount of resources deleting or mutating a vertex with, say, 10^6 edges. But that vertex is probably a bot, so you should delete it anyway.
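To make "index-free adjacency" concrete, here's a toy in-memory sketch in Python (illustrative only -- not how Neo4j is actually implemented): each node holds direct references to its neighbors, so a hop is a pointer dereference rather than an index lookup.

```python
# Toy sketch of index-free adjacency: nodes hold direct pointers to
# their neighbors, so traversal cost depends only on edges touched,
# not on the total size of the graph or any global index.
class Node:
    def __init__(self, name):
        self.name = name
        self.neighbors = []  # direct references, not foreign keys

    def connect(self, other):
        self.neighbors.append(other)

def friends_of_friends(node):
    # Two-hop traversal: follow pointers, no index consulted.
    return {fof.name for f in node.neighbors for fof in f.neighbors}

alice, bob, carol = Node("alice"), Node("bob"), Node("carol")
alice.connect(bob)
bob.connect(carol)
print(friends_of_friends(alice))  # {'carol'}
```

Note how deleting a vertex with 10^6 edges would mean visiting every neighbor to remove the back-pointers -- the supernode problem the parent comment describes.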
2) inverted index graphs, non-native graphs, whatever anti-marketing name it might have.
These rely on tables of vertices and separate tables of edges. Indexes make them fast -- not as fast for reads, but very fast for writes -- and you get distributed databases (Cassandra, for example: a powerful workhorse of a backend with data sharding, a replication factor, etc.). But then you have yet another index to maintain, and the overhead can get expensive. This is the model adopted by DataStax, who acquired the core Titan DB team (hence the public fork to Janus) and integrated it, with some optimisations and enterprise tools (monitoring, the Solr search engine, etc.), to sit on top of Cassandra.
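A minimal sketch of that table-plus-index model, using SQLite purely for illustration (Titan/Janus actually sit on wide-column stores like Cassandra, not a relational engine):

```python
# Toy "non-native" graph: vertices and edges in ordinary tables, with
# an index on the edge table's source column making traversals fast.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE vertices (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE edges (src INTEGER, dst INTEGER, label TEXT);
    CREATE INDEX idx_edges_src ON edges (src);  -- the index to maintain
""")
db.executemany("INSERT INTO vertices VALUES (?, ?)",
               [(1, "alice"), (2, "bob"), (3, "carol")])
db.executemany("INSERT INTO edges VALUES (?, ?, ?)",
               [(1, 2, "knows"), (2, 3, "knows")])

# One traversal hop = one indexed lookup plus a join back to vertices.
rows = db.execute("""
    SELECT v.name FROM edges e JOIN vertices v ON v.id = e.dst
    WHERE e.src = ?
""", (1,)).fetchall()
print(rows)  # [('bob',)]
```

Every write now has to update the index too, which is the maintenance overhead the parent comment mentions.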
Both now have improved integration with things like Spark. Cypher is probably faster than TinkerPop Gremlin, especially with the Bolt serialisation introduced in recent versions of Neo4j.
So Janus is a graph abstraction layer of the second type, and needs somewhere to store these relationships. It all comes down to use case (and marketing) when deciding what works best for you.
Regarding Cypher vs. Gremlin: serialization could be a factor, but what matters, among other things, is efficient query optimization, the algorithms, and the (physical) data model. Ultimately, databases all read from one-dimensional spaces (RAM or disk), either randomly or (best) sequentially. If you can colocate vertices with their respective edges, you're fine: this is trivial for graphs with no edges or graphs that form a linear chain. If not, things start to become fun, especially in a distributed setting. That will impact performance; the language, not so much.
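A toy illustration of the colocation point, assuming edges are stored sorted by source vertex: one vertex's edges become a contiguous run, so reading them is a sequential scan rather than scattered random reads.

```python
# Edges kept sorted by source vertex -- think of the list as the
# on-disk order. One vertex's out-edges are then a contiguous slice.
import bisect

edges = sorted([(1, 2), (3, 4), (1, 3), (2, 3), (1, 4)])

def out_edges(src):
    # Binary search to the start of the run, then read sequentially.
    lo = bisect.bisect_left(edges, (src,))
    hi = bisect.bisect_left(edges, (src + 1,))
    return edges[lo:hi]

print(out_edges(1))  # [(1, 2), (1, 3), (1, 4)] -- one contiguous slice
```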
Here's one way to look at it: a graph database can be represented with common, well-understood data structures, and you can use a lot of backends to represent those data structures.
Graph database projects are often just an adapter for doing graph queries on top of another store.
JanusGraph provides a graph data model on top of an existing storage layer -- in this case, wide-column key/value systems. It works well, letting each layer do what it's good at while limiting the number of separate systems that need to be maintained.
As per the other commenter, Cayley provides a graph data model on top of an existing storage layer.
However, when we get to manage the storage layer (Bolt, LevelDB -- that's being generalized into local KVs in the next release), we get to build our own indexes for better data management and performance. But there's no reason we can't hand that job off either -- hence supporting multiple (remote) backends. For the local stuff, though, at some point Bolt is just a very good B-tree implementation.
It's a database that's designed to store relationships between objects instead of just facts. It has efficient methods of following long chains of associations. So think of how you store tree structures in a relational database—there are a lot of different ways of doing it, and they're all frustrating. Storing trees is something graph databases do naturally.
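To make the tree example concrete, here's one of the standard relational approaches (an adjacency-list table walked with a recursive CTE) next to the kind of one-line traversal a graph database gives you. The :Category/:CHILD_OF schema in the Cypher comment is made up for illustration.

```python
# Walking a tree of unknown depth in SQL requires a recursive CTE;
# the graph-database equivalent is a single variable-length match.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE categories (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT);
    INSERT INTO categories VALUES
        (1, NULL, 'root'), (2, 1, 'books'), (3, 2, 'sci-fi');
""")

# All descendants of 'root', via a recursive CTE:
rows = db.execute("""
    WITH RECURSIVE subtree(id, name) AS (
        SELECT id, name FROM categories WHERE name = 'root'
        UNION ALL
        SELECT c.id, c.name FROM categories c
        JOIN subtree s ON c.parent_id = s.id
    )
    SELECT name FROM subtree
""").fetchall()
print(rows)  # [('root',), ('books',), ('sci-fi',)]

# Rough Cypher equivalent (hypothetical :Category/:CHILD_OF schema):
#   MATCH (root:Category {name: 'root'})<-[:CHILD_OF*0..]-(c)
#   RETURN c.name
```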
I've been playing around with graph databases for a while (I'm writing one of my own, which turns Postgres into a graph database, for kicks [1], [2]), and one of the things that became obvious after using the project in production was that it promotes functional reactive programming in a way that most other database paradigms don't.
Event propagation and node invalidation are awesome for rapid what-if style experimentation, and I am so psyched to see more and more attention being paid to graph computing in general.
I think you're concentrating on the wrong thing, here.
Just as an RDBMS can have tables about different things (Products, Users, Orders), graph databases can use labels on nodes for different things (so you can have :Product nodes, :User nodes, :Order nodes). Though with graph databases there is often less rigidity in the associated data than in an RDBMS, as there is no requirement for an explicit schema for properties on nodes of different types (plus you can multi-label nodes).
The real differentiator is how relationships are modeled, and how they're traversed in queries.
With an RDBMS/SQL you're going to be working with data in tables, and use join tables as the relationships between them. You're likely going to need to be explicit about what is being joined together, so the relationship chain is likely to be very rigid.
With graph databases, relationships and relationship traversal are used in place of join tables and table joins, which gives much more flexibility in how to traverse. You can certainly do friend-of-friend-of-friend queries much more easily, but you can also perform variable-length traversals using custom logic for which nodes are in the path and which relationships are traversed (type, direction, and count), and that can be very well-defined, very loosely defined, or a mix, as needed. I don't believe there are good ways to do that kind of ad-hoc table joining in an RDBMS.
As an example of very loosely defined traversals in queries, you can ask for a shortest path between two nodes, knowing nothing about the nodes or relationships that could be between them, and get a path back showing the connecting nodes, with the relationships between the nodes providing context.
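A sketch of both query styles, using the official Neo4j Python driver; the :Person/:KNOWS schema and the connection details are placeholders for illustration.

```python
# Variable-length traversal and shortest-path queries in Cypher,
# run through the official Neo4j Python driver.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

with driver.session() as session:
    # Well-defined variable-length traversal: KNOWS edges, 1 to 3 hops.
    session.run("""
        MATCH (me:Person {name: $name})-[:KNOWS*1..3]->(other)
        RETURN DISTINCT other.name
    """, name="alice")

    # Loosely defined query: shortest path between two nodes, knowing
    # nothing about the relationships that might connect them.
    session.run("""
        MATCH p = shortestPath((a:Person {name: $a})-[*]-(b:Person {name: $b}))
        RETURN p
    """, a="alice", b="bob")

driver.close()
```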
It seems a lot of Amazon services are managed instances of open source applications. For example, commenters are suggesting this may be based on Janus. Elastic load balancers, at least originally, were likely based on haproxy. Etc etc.
Has anyone ever considered the licensing implications of this? How is Amazon able to convert an open source product into a proprietary one and then charge for access to it?
Of course you can argue they’re charging for the infrastructure management, not the software itself. But that argument quickly breaks down as Amazon introduces new software, under new names, with a proprietary management interface over an open source core. Try to find the source code; you can’t.
And if you accept the premise that they’re just charging for hosting, then it leads to the question of why an open source project doesn’t reap any benefits from that hosting, or at the very least, from the management interface on top of it.
It seems like a better solution would be something akin to AWS marketplace, where open source projects are available to be hosted, and the maintainers can see some revenue from them.
It seems like unfair rent-seeking behavior that Amazon is able to slap a management interface on open source software and then charge for it under the guise of “hosting.”
> How is Amazon able to convert an open source product into a proprietary one and then charge for access to it?
Totally not a problem with liberally licensed open source software.
This is also the intended behaviour of such licenses.
Also, many of those big bad commercial companies contribute back big time to a number of projects. Why? I guess sometimes because the devs want to, and also because it makes sense business-wise, so they don't have to maintain the code themselves.
Depends on the license, doesn’t it? I’m not a licensing expert, but my understanding is that GPLv2 / copyleft licensing means that if you create a derivative product, you need to open source the new code along with the dependencies.
Seems like a management interface is a clear cut derivative product. Where’s the source code?
Or perhaps Amazon does consider licensing and only builds on top of, e.g., Apache-licensed projects?
Actually, the GPL allows you to keep your source code private as long as you don’t ship the software and only let users use it over the network. (The full truth is a bit more nuanced.)
The newer AGPL closes this loophole.
And yes: except for Linux and the GNU tools, I guess most companies stick with Apache-, BSD-, Eclipse-, and MIT-licensed software.
It seems clearly OK to sell managed services on top of open source platform X.
The question in Amazon's case is that since they sell the infrastructure, they can, and probably do, undercut any competing providers by charging themselves less.
So it seems monopolistic. OTOH, their service is good and their customers get at least reasonable prices. So... ?
Here is a great article on just this. It is commonly known as the "GPL loophole", which RMS is entirely fine with. If you want to prevent this, you license the software with the Affero GPL, which explicitly forbids it.
Amazon / Google / etc. are not redistributing the software, as it is running on their servers in their environment; therefore, there is nothing wrong under the existing licenses.
Don't free and open source licenses apply only during redistribution of the software? Unless it is licensed with the Affero GPL, just connecting to a service does not require its source code to be available. That is assuming Amazon modifies the software. If they don't, then there's nothing to argue.
Are they making money with software they didn't build? Yes, but so are we.
Time is money. It takes time to manage servers/infra -- services like this let people choose between spending their time or their money managing infra. The category of managed infra is huge and goes beyond Amazon.
My question is not about why someone would pay for the service. It’s about where Amazon got the right to charge for it without open sourcing their derivative work.
To be clear, it’s not the hosting of open source applications I see as the problem, but the closed source management/orchestration software built on top of it.
Janus is a fork of Titan, BTW. The core Titan devs got acquired by DataStax and redirected to their graph DB offering. Titan stagnated, then got forked as Janus under the Linux Foundation.
Or they choose a horizontally-scalable architecture, a la TitanDB.
BTW, does anyone know how such solutions handle cross-machine traversals? Are they schema-based, so the DB knows how to manage data locality and efficient joins/traversals?
I don't know about Neptune -- curious to hear what it's based on -- but TitanDB never really supported cross-machine traversals in the execution engine. The data was stored in a distributed fashion (across, say, a Cassandra cluster), but any instance of the execution engine was single-machine, with no easy way for multiple instances of the execution engine to talk to each other.
Gremlin is indeed the query language, but it requires a Gremlin engine. This generally means passing strings to the DB (which gives you advantages like predicate pushdown, essentially DB-side filtering), but there is associated overhead -- whereas something like Cypher is now serialised and very fast with the Bolt protocol.
I believe Gremlin is just the query language. There is an original backend that implemented it, which might be what you're thinking of as having performance issues. But the query language intrinsically doesn't have issues, I don't think.
Currently I work on a project with Neo4j and Cypher.
And I miss some of Gremlin's tricks for optimizing graph traversals (for example, stopping a sub-traversal once a given limit of matches has been reached).
While this isn't natively supported yet, there are some tricks to achieve something like it, either by using APOC Procedures for subqueries or, if your expansion-stop case is based on labels, APOC's path expansion procedures.
https://neo4j.com/developer/kb/limiting-match-results-per-ro...
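A sketch of the APOC trick mentioned above, assuming the APOC plugin is installed: apoc.path.expandConfig takes a config map whose `limit` option stops the expansion once that many paths have been found. The :Person/:KNOWS schema and connection details are made up for the example.

```python
# Bounded path expansion via APOC: stop after 10 matching paths
# instead of enumerating the whole neighborhood.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

with driver.session() as session:
    session.run("""
        MATCH (me:Person {name: $name})
        CALL apoc.path.expandConfig(me, {
            relationshipFilter: 'KNOWS>',
            maxLevel: 4,
            limit: 10
        }) YIELD path
        RETURN path
    """, name="alice")

driver.close()
```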
My point is not to criticize Cypher (at all). Its learning curve is perfect. And it covers most of the requirements really well with a compact readable syntax.
Plus it improves with each version of Neo4j, which is cool.
My point is that Gremlin has been super efficient for us at expressing (in its functional way) tricky traversals. So I don't see any reason to dismiss it as an "inefficient" technology.
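A sketch of the kind of functional Gremlin traversal meant here, using the gremlinpython client; the graph schema and endpoint are made up. A limit() applied mid-traversal short-circuits the walk as soon as enough matches are found, which is the trick missed above.

```python
# Bounded repeat/emit traversal in Gremlin: walk 'knows' edges up to
# 4 hops out, but stop as soon as 5 results have been produced.
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)

names = (g.V().has("person", "name", "alice")
          .repeat(__.out("knows")).times(4).emit()
          .limit(5)               # short-circuits the expansion
          .values("name")
          .toList())
print(names)

conn.close()
```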
I love GraphQL and use it quite a bit, but the "graph" part of it is a bit of a misnomer given all the existing graph database and query technologies. It doesn't really offer anything in terms of interacting with RDF triples or making complex graph queries. It has no relational algebra semantics or ability to query relationships between arbitrary nodes, which is what folks using graph databases typically want.
(I didn't downvote you though, it's a common misconception.)
I believe GraphQL is more of a protocol for interacting with an API, not something for graph databases, which is what Gremlin and SPARQL are used for.
I second all the comments about GraphQL not being a graph query language.
But we must agree that 90% of the queries we run on a graph database could be modeled with GraphQL
(retrieve nodes of a given type plus some of their properties).
For the other 10%, Gremlin or SPARQL are the way to go.
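An illustration of that 90/10 split (queries only; the schemas and field names are hypothetical):

```python
# The 90% case -- "nodes of a given type plus some properties" --
# maps naturally onto GraphQL:
graphql_query = """
{
  person(name: "alice") {
    name
    age
    friends { name }
  }
}
"""

# The other 10% -- arbitrary traversals such as variable-depth walks --
# needs a real graph query language; e.g. a SPARQL property path:
sparql_query = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?friendOfFriend WHERE {
  ?alice foaf:name "alice" .
  ?alice foaf:knows/foaf:knows ?friendOfFriend .
}
"""
```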