Amazon Neptune – Fast, reliable graph database built for the cloud (amazon.com)
286 points by irs on Nov 29, 2017 | 72 comments



Awesome surprise to see the embargo lifted -- sounds like I can now say the Graphistry team will be doing a follow-up talk at Amazon Re:Invent tomorrow (Thursday) on Amazon Neptune + Graphistry. We've been incorporating it into visual investigation workflows for security, fraud, health records, etc. The Neptune team has been doing cool work on the managed graph layer, and they were early to graph GPU tech (several are former Blazegraph team members); our side brings that kind of thinking to visual GPU analytics and workflow automation tech.

If you're in town and into this stuff, ping me at leo [at] graphistry -- would love to catch up Th/F for coffee+drinks. Also here + email, of course!


If you actually read the docs, it's not Janus-based, it's based on BlazeGraph, which Amazon reportedly acquihired last year.


Is that public information? I don't see any press releases about it.


There was no PR. But there are traces, like Amazon acquired the domains, etc. Many former Blazegraph engineers are now Amazon Neptune engineers according to LinkedIn, etc. It was rumored widely in the graph db world fwiw.


To add to Kendall's comment:

Amazon owns the BLAZEGRAPH trademark: https://www.trademarkia.com/blazegraph-86498414.html

Blazegraph's CEO is currently at Amazon as Principal Product Manager: https://www.linkedin.com/in/bradley-bebee-a15764b/


Excellent news! And congratulations to all the Blazegraph people. Afaik, the Wikidata SPARQL endpoint runs on Blazegraph. Am I right?

Then will Wikidata have to migrate to AWS to maintain the endpoint?


Yet another amazon service to lock you in.

And then after two years, when you're no longer a startup with a $100 bill but a bigger company, you're completely tied to a jungle of Amazon products, and your exit strategy is very, very costly.

clever amazon, clever.


There is some truth to this, but in a larger sense (on an ecosystem level, rather than from the perspective of an individual company), I can only be happy when AWS enters a new space. It makes that component into table stakes in the IaaS game, which means every other big player is about to step up with their own offering, and the third-party SaaS and open-source self-hosted offerings in the same space are all going to heat up too.

Consider the evolution of container hosting services: first we had PaaSes like Heroku with proprietary container formats; then we got Docker, but Docker Swarm was nascent and there was no serious Docker Swarm IaaS-cloud offering. But then, very quickly, AWS built ECS; Google responded with Kubernetes; and then Kubernetes became the open standard, made everyone forget about Docker Swarm, and took over (and is even replacing ECS now.)

That's what happens when AWS enters a space. And it's great.


It supports RDF/SPARQL, which gives you migration options to twenty or so triple stores such as rdflib, Jena, Virtuoso, AllegroGraph, etc.
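Since SPARQL is a W3C standard, a query written for one conformant triple store should run unchanged on any other -- that portability is the anti-lock-in argument. A trivial sketch (the data and vocabulary here are just illustrative):

```sparql
# A standard SPARQL query; any conformant store (Neptune, Jena, Virtuoso,
# AllegroGraph, ...) should accept it as-is. FOAF is a common RDF vocabulary.
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?name
WHERE {
  ?person a foaf:Person ;
          foaf:name ?name .
}
LIMIT 10
```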

No lock in at all.


Same with object stores: nobody had an offering before S3 proved practical for almost everything.


Yes and no.

If you factor in the cost of not taking "the easy way" you'd likely never get past the 100usd phase.

Your point is valid. I'm just suggesting the lens / context isn't as one-sided as you've presented.

Put another way, plenty of startups and VCs would love to have "getting out from under AWS" at the top of their good problems to have list.


Just an off-topic comment: I am the maintainer of a visual query builder for SPARQL queries; cf http://datao.net

This tool lets you design query patterns from a graph data model via drag and drop. It can then compile the patterns to SPARQL, run them against an endpoint, and format the results as maps/forms/tables/graphs/HTML (via templating)/...

Another service of Datao (http://search.datao.net) offers a search-engine view of those queries: you type the textual representation of an object in any public SPARQL endpoint, and the service lists the queries currently available in Datao that can be applied to that object. You can then run these queries with a click and get the HTML templating of the query results.

Feel free to have a look at the website if you find any interest in this tool. Any feedback is welcome.

PS: Sorry for the poor quality of the videos. I manage this project in my spare time :)


Only had experience with Cypher, really liked it. It will be interesting to see how Neo4j responds to this. Regardless of tech specs, the fully managed Neptune vs a community version on AWS Marketplace seems to give Neptune an unfair advantage.


> seems to give Neptune unfair advantage

What do you mean by "unfair" here?


Is this JanusGraph under the covers? Guessing it is, since Neptune is a nod to Janus.


http://janusgraph.org/

Support for various storage backends:

   - Apache Cassandra®
   - Apache HBase®
   - Google Cloud Bigtable
   - Oracle BerkeleyDB
I don't understand how a database doesn't have its own native store. What exactly does a graph database actually do if it doesn't manage the data fed to it? The same is true for CayleyGraph† https://github.com/cayleygraph/cayley and probably others.

†Plays well with multiple backend stores:

   - KVs: Bolt, LevelDB
   - NoSQL: MongoDB
   - SQL: PostgreSQL, CockroachDB, MySQL
   - In-memory, ephemeral


There are two main paradigms here:

1) "Native" graph DBs -- Neo4j is an example. These take advantage of index-free adjacency: each node knows which other nodes it is connected to, so traversals are very fast. The issues arise when you try to scale. Data that fits on a single machine is fine, and you can replicate your data for fast parallel reads/traversals across disparate regions of a massive graph. However, you lose the concept of data sharding and distributing the graph, as index-free adjacencies don't translate across physical machines. Another drawback is highly connected vertices: you will expend a tremendous amount of resources deleting or mutating a vertex with, say, 10^6 edges. But that vertex is probably a bot, so you should delete him anyway.

2) Inverted-index graphs, non-native graphs, or whatever anti-marketing name they might have. These rely on tables of vertices and other tables of edges. Indexes make them fast -- not as fast for reads, but very fast for writes. And you get distributed databases (Cassandra, for example: a powerful workhorse of a backend with data sharding, replication factor, etc.). But then you have yet another index to maintain, and the overhead can get expensive. This is the model adopted by DataStax, who bought Titan DB (hence the public fork to Janus) and integrated it with some optimisations and enterprise tools (monitoring etc., Solr search engine) to sit on top of Cassandra.

Both now have improved integration with things like Spark. Cypher is probably faster than TinkerPop Gremlin, especially with the Bolt serialisation introduced in recent versions of Neo4j.

So Janus is a graph abstraction layer of the second type and needs somewhere to save these relationships. It all comes down to use case (and marketing) to decide what works best for you.
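The two paradigms above can be caricatured in a few lines of Python. This is a sketch, not how Neo4j or Janus actually store data: in the first model a hop is a pointer chase; in the second it is an index lookup against an edge table, which is easy to shard but costs a lookup per hop.

```python
# 1) "Native" / index-free adjacency: each vertex holds direct references
#    to its neighbours, so a traversal hop is just following a pointer.
class Vertex:
    def __init__(self, name):
        self.name = name
        self.out_edges = []  # direct references to neighbour Vertex objects

a, b, c = Vertex("a"), Vertex("b"), Vertex("c")
a.out_edges.append(b)
b.out_edges.append(c)

def hop(v, n):
    """Follow the first outgoing edge n times (no index consulted)."""
    for _ in range(n):
        v = v.out_edges[0]
    return v

# 2) "Non-native" / edge-table model: vertices and edges live in tables,
#    and traversal is one index lookup per hop (shardable, extra cost).
edges_by_src = {"a": ["b"], "b": ["c"]}  # index: source vertex -> targets

def hop_table(v, n):
    for _ in range(n):
        v = edges_by_src[v][0]
    return v

assert hop(a, 2).name == "c"
assert hop_table("a", 2) == "c"
```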


Recommended reads on the native vs non-native topic:

* https://www.datastax.com/dev/blog/a-letter-regarding-native-... (tldr; there is no such thing as a native graph database)

* https://neo4j.com/blog/note-native-graph-databases/ (tldr; native graph databases do exist)

Regarding Cypher vs Gremlin: serialization could be a factor, but what matters, among other things, is efficient query optimization, algorithms, and the (physical) data model. Ultimately, databases are all reading from 1-dimensional spaces (RAM or disk), either randomly or (best) sequentially. If you can colocate vertices with their respective edges, you're fine: this is trivial for graphs with no edges or graphs that form a linear chain. If not, then things start to become fun, especially in a distributed setting. This will impact performance; the language, not so much.


I'm familiar with Marko and his arguments hence my quotes around "native" ;) But it sounds fantastic for marketing


Here's one way to look at it: a graph database can be represented with common, well-understood data structures, and you can use a lot of backends to store those data structures.

Graph database projects are often just an adapter for running graph queries on top of another store.

At its core, a graph database can be represented simply with documents and adjacency lists: https://en.wikipedia.org/wiki/Adjacency_list
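A toy version of that idea -- documents (property maps) plus an adjacency list -- fits in a few lines. The names and "query" here are purely illustrative, not any real graph DB's API:

```python
# Documents: per-node property maps, keyed by node id.
documents = {
    "alice": {"type": "User", "age": 34},
    "bob":   {"type": "User", "age": 28},
    "acme":  {"type": "Product"},
}

# Adjacency list: node id -> list of neighbour ids.
adjacency = {
    "alice": ["bob", "acme"],
    "bob":   ["acme"],
    "acme":  [],
}

def neighbors_of_type(node, type_):
    """A minimal 'graph query': neighbours filtered by a document property."""
    return [n for n in adjacency[node] if documents[n]["type"] == type_]

print(neighbors_of_type("alice", "User"))  # ['bob']
```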


Do you think it's fair to say traditional DBs are about data, but graph DBs are (more) about relationships (between the data)?


JanusGraph provides a graph data model on top of an existing storage layer, in this case wide-column key/value systems. It works well, letting each layer do what it's good at while limiting the number of separate systems that need to be maintained.


> What exactly does a graph database actually do if it doesn't manage the data fed to it?

It provides tools to run complex queries on graphs, and manages data models and indices to execute them fast.


As per the other commenter, Cayley provides a graph data model on top of an existing storage layer.

However, when we get to manage the storage layer (Bolt, Level -- that's being generalized into local-KVs in the next release) we get to build our own indexes for better data management and performance. But there's no reason we can't hand that job off either -- hence supporting multiple (remote) backends. For the local stuff, though, at some point, Bolt is just a very good BTree implementation.


Can someone explain to me in layman's terms what a graph database is?


It's a database that's designed to store relationships between objects instead of just facts. It has efficient methods of following long chains of associations. So think of how you store tree structures in a relational database—there are a lot of different ways of doing it, and they're all frustrating. Storing trees is something graph databases do naturally.
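One of those frustrating relational patterns is the "adjacency list" table, where each row stores its parent's id and finding all descendants takes one query (or scan) per level of the tree. A sketch of the pain, with made-up rows standing in for SQL results:

```python
# Rows as they might come back from SQL: (id, parent_id, name).
rows = [
    (1, None, "root"),
    (2, 1,    "child-a"),
    (3, 1,    "child-b"),
    (4, 2,    "grandchild"),
]

def descendants(node_id):
    """Collect all descendant ids; one extra scan per tree level."""
    found = []
    frontier = [node_id]
    while frontier:
        frontier = [rid for rid, parent, _ in rows if parent in frontier]
        found.extend(frontier)
    return found

assert descendants(1) == [2, 3, 4]
```

A graph database makes this a single variable-length traversal instead of a per-level loop.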


I've been playing around with graph databases for a while (for kicks I am writing my own, which turns Postgres into one[1], [2]), and one of the things that became obvious after using the project in production was that it promotes functional reactive programming in a way that most other database paradigms don't.

Event propagation and node invalidation are awesome for rapid what-if style experimentation, and I am so psyched to see more and more attention being paid to graph computing in general.

[1] https://www.github.com/kchoudhu/openarc [2] https://www.anserinae.net/whats-cooking-openarc-edition.html...


Trying to get this straight in my mind here.

Is it fair to say that traditional RDBMS/SQL are for storing different "sets" of related information (tables for products, users, orders)?

And that graph databases are for storing a single set of data as it interrelates with itself?

- a User and all their Friends (who are also users)

- a Keyword and all associated Terms (which are also keywords)

Is that right?


I think you're concentrating on the wrong thing, here.

Just as RDBMS can have tables about different things (Products, Users, Orders), graph databases can use labels on nodes for different things (so you can have :Product nodes, :User nodes, :Order nodes). Though with graph databases, there is often less rigidity in the associated data than in RDBMS, as there is no requirement for explicit schema for properties on nodes of different types in a graph db (plus you can multi-label nodes).

The real differentiator is how relationships are modeled, and how they're traversed in queries.

With RDBMS/SQL you're going to be working with data in tables, and use join tables as the relationships between them. You're likely going to need to be explicit about what is being joined together, so the relationship chain is likely to be very rigid.

With graph databases, relationships and relationship traversal are used in place of join tables and table joins, which gives much more flexibility over how to traverse. You can certainly do friend-of-friend-of-friend queries much more easily, but you can also perform variable-length traversals using custom logic for which nodes are in the path and which relationships are traversed (type, direction, and count), and that can be very well-defined, very loosely defined, or a mix, as needed. I don't believe there are good ways to do that kind of ad-hoc table joining in RDBMS.

As an example of very loosely defined traversals in queries, you can ask for a shortest path between two nodes, knowing nothing about the nodes or relationships that could be between them, and get a path back showing the connecting nodes, with the relationships between the nodes providing context.
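That "know nothing about the middle of the path" query is essentially breadth-first search over a heterogeneous graph. A sketch in Python (the graph data is invented; a real graph DB would also return relationship types along the path):

```python
from collections import deque

# A small heterogeneous graph: users, orders, and a product, stored as
# an undirected adjacency list.
graph = {
    "alice":   ["order-1"],
    "order-1": ["alice", "widget"],
    "widget":  ["order-1", "order-2"],
    "order-2": ["widget", "bob"],
    "bob":     ["order-2"],
}

def shortest_path(start, goal):
    """BFS: needs no knowledge of node types or edge semantics."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for n in graph[path[-1]]:
            if n not in seen:
                seen.add(n)
                queue.append(path + [n])
    return None

# alice and bob turn out to be connected through a shared product.
assert shortest_path("alice", "bob") == \
    ["alice", "order-1", "widget", "order-2", "bob"]
```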


It seems a lot of Amazon services are managed instances of open source applications. For example, commenters are suggesting this may be based on Janus. Elastic load balancers, at least originally, were likely based on haproxy. Etc etc.

Has anyone ever considered the licensing implications of this? How is amazon able to convert an open source product into a proprietary one and then charge for access to it?

Of course you can argue they’re charging for the infrastructure management, not the software itself. But that argument quickly breaks down as Amazon introduces new software, under new names, with a proprietary management interface over an open source core. Try to find the source code; you can’t.

And if you accept the premise that they’re just charging for hosting, then it leads to the question of why an open source project doesn’t reap any benefits from that hosting, or at the very least, from the management interface on top of it.

It seems like a better solution would be something akin to AWS marketplace, where open source projects are available to be hosted, and the maintainers can see some revenue from them.

It seems like unfair rent seeking behavior that amazon is able to slap a management interface on open source software and then charge for it under the guise of “hosting.”


> How is amazon able to convert an open source product into a proprietary one and then charge for access to it?

Totally not a problem with liberally licensed open source software.

This is also the intended behaviour of such licenses.

Also many of those big bad commercial companies contribute back big time to a number of projects. Why? I guess sometimes because devs want to and also because it makes sense business wise so they don’t have to maintain the code themselves.


Depends on the license, doesn't it? I'm not a licensing expert, but my understanding is that GPLv2 / copyleft licensing means that if you create a derivative product, you need to open source the new code along with the dependencies.

Seems like a management interface is a clear cut derivative product. Where’s the source code?

Or perhaps amazon does consider licensing and only builds on top of, eg Apache licensed projects?


Actually, the GPL allows you to keep your source code closed as long as you don't ship the software and only let users use it over the network. (The full truth is a bit more nuanced.)

The newer AGPL closes this loophole.

And yes: except for Linux and the GNU tools I guess most companies stick with Apache, BSD, Eclipse and MIT licensed software.


it seems clearly ok to sell managed services on open source platform x.

the question in amazon's case is that, since they sell the infrastructure, they can and probably do undercut any competing providers by charging themselves less.

so, it seems monopolistic. otoh their service is good and their customers get at least reasonable prices. so ... ?


Here is a great article on just this. It is commonly known as the "GPL loophole", which RMS is entirely fine with. If you want to prevent this, you license the software with the Affero GPL, which explicitly forbids it.

http://radar.oreilly.com/2007/07/the-gpl-and-software-as-a-s...

Amazon / Google / etc. are not redistributing the software, as it is running on their servers in their environment; therefore, there is nothing wrong under the existing licenses.


Don't free and open source licenses apply only during redistribution of the software? Unless it is licensed with the Affero GPL, just connecting to a service does not require its source code to be available. That is assuming Amazon modifies the software. If they don't, then there's nothing to argue.

Are they making money with software they didn't build? Yes, but so are we.


Time is money. It takes time to manage servers/infra - services like this let people make the choice between spending their time or their money managing infra. The category of managed infra is huge and goes beyond Amazon


My question is not about why someone would pay for the service. It’s about where amazon got the right to charge for it without open sourcing their derivative work.

To be clear, it’s not the hosting of open source applications I see as the problem, but the closed source management/orchestration software built on top of it.


So I get that this offers a simpler paradigm for graph data, but how should we interpret the "fast & scalable" claim? Is it...

a) Slower than RDBMS/NoSQL but still pretty respectable, so it's a good choice for things like offline analysis.

b) About the same as RDBMS/NoSQL, so you could use it to handle production traffic if you want.

c) Faster, so you should definitely prefer it in production, e.g. for fetching upvotes and comments on posts.


Why "Neptune"? Having a hard time riddling that name out.


Two other well-known graph databases are "Janus" and "Titan", both of which are named after ancient gods.


Janus is a fork of Titan, BTW. The core Titan devs got acquired by Datastax and redirected to their graphDB offering. Titan stagnated, then got forked as Janus under the Linux Foundation.


Good call, thx


Are they using X1 ? https://aws.amazon.com/ec2/instance-types/x1/

For efficient graph DBs it's better to have a lot of ram and cores ...


Or they chose a horizontally scalable architecture, a la TitanDB.

Btw, does anyone know how such solutions handle cross-machine traversals? Are they schema-based, so the DB knows how to manage data locality and efficient joins/traversals?


I don't know about Neptune -- curious to hear what it is based on -- but TitanDB never really supported cross-machine traversals for the execution engine. The data was stored in a distributed fashion (across say a Cassandra cluster), but any instance of the execution engine was single-machine, with no easy way to talk between multiple instances of the execution engine.


One database service that supports horizontally scaled graphs is Azure CosmosDB Graph API: https://docs.microsoft.com/en-us/azure/cosmos-db/graph-intro...

Worth a look if you need a managed Gremlin solution with some degree of global distribution.


Super excited about this! But the preview link (https://pages.aws.com/NeptunePreview.html) is broken -- can anyone on the AWS team help with that?



"Your storage cost will be $0.10 per GB-month," on https://aws.amazon.com/neptune/pricing/




I really hope Amazon will provide a facility to retrieve the RDFS data model of an endpoint in a uniform way.
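In the meantime, plain SPARQL can approximate an endpoint's data model by listing the classes and predicates actually in use -- a rough sketch, not a substitute for a real RDFS/ontology export:

```sparql
# Which classes appear in the data?
SELECT DISTINCT ?class   WHERE { ?s a ?class }

# Which predicates appear in the data?
SELECT DISTINCT ?property WHERE { ?s ?property ?o }
```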


What inferencing does it offer to RDF?

How would I bolt an inference engine onto this if none is offered, i.e. to provide OWL RL?


It could be a modified JanusGraph frontend backed by DynamoDB.


For wider context here, is this leading the pack or do other public clouds have competing products already?


Sadly they use Gremlin, which is so often said to have poor performance.


AFAIK Gremlin is just a query language -- it shouldn't have much to do with performance.


Gremlin is indeed the query language, but it requires a Gremlin engine. This generally means passing strings to the DB (which gives you advantages like predicate pushdown, essentially DB-side filtering), but there is associated overhead -- compare something like Cypher, which is now serialised and very fast with the Bolt protocol.


that was my point, but my rhetoric was not as good as yours :)


I believe Gremlin is just the query language. There is an original backend that implemented it, which might be what you are thinking of as having performance issues. But I don't think the query language intrinsically has issues.


Currently I work on a project with Neo4j and Cypher, and I miss some Gremlin tricks for optimizing graph traversals (for example, stopping a sub-traversal when a given limit of matches has been reached).
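For the curious, one Gremlin idiom for this kind of thing is putting limit() inside local(), which caps each sub-traversal rather than the whole result set (the labels below are hypothetical):

```groovy
// At most 5 associated terms per keyword, instead of 5 results total.
g.V().hasLabel('Keyword').
  local(out('ASSOCIATED_WITH').limit(5))
```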


While this isn't natively supported yet, there are some tricks to achieve something like it, either using APOC Procedures for subqueries or, if your expansion-stop case is based on labels, APOC's path expansion procedures. https://neo4j.com/developer/kb/limiting-match-results-per-ro...


My point is not to criticize Cypher (at all). Its learning curve is perfect, it covers most requirements really well with a compact, readable syntax, and it improves with each version of Neo4j. Which is cool.

My point is that Gremlin has been super efficient for us to express (in its functional way) tricky traversals. So I do not see any reason to discard it as an "inefficient" technology.


Interesting that it doesn't support GraphQL, but rather Gremlin and SPARQL. Surely that will impact adoption.


I love GraphQL and use it quite a bit, but the "graph" part of it is a bit of a misnomer given all the existing graph database and query technologies. It doesn't really offer anything in terms of interacting with RDF triples or making complex graph queries. It has no relational algebra semantics or ability to query relationships between arbitrary nodes, which is what folks using graph databases typically want.
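To illustrate the distinction: a GraphQL query's shape is fixed by the schema -- you ask for known fields on known types, one predeclared hop at a time (the schema and names below are hypothetical):

```graphql
# Fetch a user and one predeclared relationship -- not an arbitrary
# traversal, shortest path, or relational-algebra query.
{
  user(id: "42") {
    name
    friends {
      name
    }
  }
}
```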

(I didn't downvote you though, it's a common misconception.)


> It doesn't really offer anything in terms of interacting with RDF triples or making complex graph queries.

SPARQL is supported.


Yes, I was talking about GraphQL.


I believe GraphQL is more of a protocol for interacting with an API, not for graph databases, which is what Gremlin and SPARQL are used for.


GraphQL is not a graph database query language. It's an alternative to REST.


The lack of support for openCypher is also a bit intriguing.


GraphQL can in fact be a graph query language -- my company has built and open-sourced a tool that makes that possible. Here's a blog post that describes how it works: https://blog.kensho.com/compiled-graphql-as-a-database-query...

If you want to try it out:

  pip install graphql-compiler


I second all the comments about GraphQL not being a graph query language. But we must admit that 90% of the queries we run on a graph database could be modeled with GraphQL (retrieve nodes of a given type plus some of their properties). For the other 10%, Gremlin or SPARQL are the way to go.



