The Future Is Big Graphs: A Community View on Graph Processing Systems (acm.org)
144 points by Anon84 on Sept 12, 2021 | 65 comments



I have spent the last 7 years building ontologies (knowledge bases), small and large. Like with most tech articles, this one is pushing a very specific point of view that is relevant to megacorporations but certainly not to most companies that would benefit from the deployment of graph-structured knowledge bases.

Unless a company has tons of extra cash to burn on technologies like neo4j, Spark, etc., it is better off just building an ontology on top of its existing datastores.

The work on the semantic web in the 2000s added a lot of valuable ideas and I see a tendency to downplay these ideas in favor of new technologies. Storing relationships as RDF triples and building indices that are well-suited to your queries will be good enough for the majority of people who need an ontology.

I have found that graph-based systems don't model time well. When you are building a knowledge base, you often need to understand how knowledge has changed over time. You may want to relate facts across time or understand what the state of knowledge was within a given period of time. This usually takes extra work.

Another thing that I have found these graph-based systems to do poorly is the management of storage classes for different types of data. If you are building a graph knowledge base, not all the information in your knowledge base will be consumed in the same way. If your knowledge base is sufficiently large, you will want to store less frequently accessed data on cheaper storage (e.g. HDD or even object storage like S3) and hotter data on more expensive storage (SSD or even maybe keep it in memory). I have always had to write this code by myself and haven't found a tool that helps to manage this very important concern in ontology building.

That turned into a bit of a ramble, but damn ontologies are cool. :)


>The work on the semantic web in the 2000s added a lot of valuable ideas and I see a tendency to downplay these ideas in favor of new technologies. Storing relationships as RDF triples and building indices that are well-suited to your queries will be good enough for the majority of people who need an ontology.

Is this basically the idea of using RDF to create explicit joins between datasets in other systems? E.g. each tuple has a foreign key reference to a row/doc that represents that node in the graph?

>That turned into a bit of a ramble, but damn ontologies are cool. :)

This is a major hurdle in infosec as enterprises adopt cloud products. The traditional ontologies around the risk profiles and state of operating systems and networks have been quite stable over the past 10-20 years and leverage things like DMTF's CIM and IETF's SNMP. However, what I'm finding is that 'cloud' products and services surface entirely new namespaces and concepts in nearly every instance. Sometimes, as in the case of AWS EC2, it's pretty simple to map these back to an existing schema, but in other cases like say AWS Chime or Glue, it gets really tricky. It feels like I'm looking for a system that can rapidly ingest the service's native information model then adapt it with overlays that transform or aggregate various services in a way that allows consistent data modelling.

Any suggestions there? :)


re: RDF for joins. That's exactly right. You can keep things as light or as heavy as you need them to be regarding when/how/whether you validate foreign key constraints, whether you use a schema registry, etc. The simplicity of the format makes it really easy to build around.
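To make that concrete, here's a rough sketch in Python (the IDs and store names are made up, not anything from a real system): the object of a triple is just a key owned by some other datastore, and the "join" is whatever lookup you bolt on top.

    # Sketch: RDF-style triples as lightweight joins across existing stores.
    # The keys ("users:42", "orders:9001") are hypothetical IDs owned by other systems.
    triples = [
        ("users:42", "placed", "orders:9001"),
        ("orders:9001", "contains", "skus:abc"),
    ]
    def objects(subject, predicate):
        # Follow an edge; whether/when you validate the foreign key is up to you.
        return [o for s, p, o in triples if s == subject and p == predicate]
    for order_id in objects("users:42", "placed"):
        print(order_id)  # resolve "orders:9001" against whatever store owns orders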

My only suggestion re: infosec ontologies is to not push too much functionality into the ontology all at once. I have had ontology projects end in spectacular failure, and all those failures have resulted from putting too much responsibility on the ontology from the very beginning.

The ontology should be as dumb as possible to start with. In your case, the approach I would suggest is:

1. The ontology stores mappings from cloud provider resource schemas to your internal schema. These mappings can be represented in multiple ways - as a URL to a service which implements the mapping, as a JSON object in some DSL which defines the transformations, or even as the code/binary that should be run to perform the transformation.

2. From the ontology, it should be easy to query/update the following things: the current cloud provider schema for a given resource, the internal schema corresponding to a cloud resource, the current mapping to the internal schema, and the number of failed transformations from cloud schema to internal schema in the past {1 hour | 1 day | 1 week | ...}.

3. At first, you build the transformations from the cloud provider schemas to internal schema by hand.

4. Build a worker which alerts you when the cloud provider schemas change (should be easy to build on top of your ontology after step 2 - just check the number of failed transformations).

5. Now write a worker which statically analyzes the Terraform codebase to see when the code related to a particular cloud API you are working with has changed. Represent this in the ontology (e.g. "AWS Chime" - "terraform:vblah.blah.blah" -> "<json resource definition>").

6. Write a worker which monitors updates to "terraform:*" edges and alerts you to update the mapping given the new JSON resource definition.

7. For some services, it will be easy to automate the transformation update. For these things, create a new ontology worker.

I have found that it's better to let workers run idempotently on cronjobs/systemd timers than to implement event-driven workers.
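For what it's worth, "idempotent worker on a timer" can be as small as an upsert loop. A sketch, assuming Postgres with psycopg2 and a hypothetical edges(subject, predicate, object) table with a unique constraint across all three columns, so re-runs never duplicate anything:

    # Idempotent collection worker, meant to be run repeatedly from cron/systemd timers.
    import psycopg2
    def fetch_cloud_schemas():
        # Placeholder for a real collector hitting the cloud provider's describe APIs.
        return [("aws:chime", "has_schema", "schema:chime-v1")]
    def run_once():
        conn = psycopg2.connect("dbname=ontology")
        with conn, conn.cursor() as cur:
            for s, p, o in fetch_cloud_schemas():
                cur.execute(
                    "INSERT INTO edges (subject, predicate, object) "
                    "VALUES (%s, %s, %s) ON CONFLICT DO NOTHING",
                    (s, p, o),
                )
    run_once()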

The whole thing will probably take a few months, but you will have production-ready code at the end of each step and start benefitting from the existence of the ontology right away.

(Sounds like you are already at step 3 or 4?)


This is amazing, thank you! Most of this is a mental experiment at this point, but I might take some of this to start building it. Appreciate the note!


> If your knowledge base is sufficiently large, you will want to store less frequently accessed data on cheaper storage (e.g. HDD or even object storage like S3) and hotter data on more expensive storage (SSD or even maybe keep it in memory). I have always had to write this code by myself and haven't found a tool that helps to manage this very important concern in ontology building.

This is a general concern for a lot of areas, including observability. Lots of data, need to query it, need to store it cheaply with a time component to help differentiate.

Take a look at Thanos and Grafana Labs' Cortex. It’s a mostly generalized architecture.


Thank you. Didn't know about Thanos.


> better off just building an ontology on top of their existing datastores.

I just made a view with three copypasta’d levels of joins on a parent-child relationship table to build up an ancestor_names column to be used in analytics. The analytics team is probably going to just use regexes or if-statements on that column in Tableau to understand our performance with top-level partner names. If it gets slow I’ll make it a materialized view.

This. Is. Fine.

You don’t need a graph database or explicit ontology or even recursive queries to answer graph questions. It’s nice to have one, of course, but not necessary.


Yup, database views are exactly how to do 'ontology-based' inference in SQL, in the rare cases where that makes actual sense. It can do anything that a graph or "semantic" database can do, and with on-par or better performance. Even CTE recursive queries work just as well as a general-purpose graph DB, though the latter sometimes has a more convenient query language.
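For anyone who hasn't seen one, a recursive CTE over a plain parent-child table looks like this (sqlite3 here just so it runs anywhere; the table and column names are made up):

    # Ancestor traversal with a recursive CTE over a parent-child edge table.
    import sqlite3
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE edges (node TEXT, parent TEXT);
        INSERT INTO edges VALUES ('room', 'building'), ('building', 'city'), ('city', 'country');
    """)
    rows = db.execute("""
        WITH RECURSIVE ancestors(node, ancestor) AS (
            SELECT node, parent FROM edges
            UNION ALL
            SELECT a.node, e.parent FROM ancestors a JOIN edges e ON e.node = a.ancestor
        )
        SELECT ancestor FROM ancestors WHERE node = 'room'
    """).fetchall()
    print(rows)  # ancestors of 'room': building, city, country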


Totally. You shouldn't be allowing people to run graph traversals of arbitrary depth in your database anyway. :)


I am honestly intrigued any time I hear anyone mention RDF. You have had success using it? It wasn't exactly clear how you used it from what you said.


I only use RDF triples as a storage format for edges. I don't normally take advantage of triple stores (e.g. Apache Jena), SPARQL query engines, etc.

It's just a simple format and it's easy to build different types of query/analysis capabilities around.

The only thing that's a little awkward is applying types to items in an rdf triple (in a way that's computationally cheap). Typing is a general problem, though.
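One cheap convention I've leaned on (a sketch, not a standard, and the identifiers are invented): encode the type as a prefix on the identifier, CURIE-style, so a type check is a string split rather than another triple lookup.

    # Types as prefixes on subject/object identifiers.
    triples = [
        ("person:alice", "wrote", "doc:design-notes"),
        ("doc:design-notes", "cites", "person:bob"),
    ]
    def node_type(term):
        return term.split(":", 1)[0]
    docs = [o for s, p, o in triples if node_type(o) == "doc"]
    print(docs)  # ['doc:design-notes']

The more standard alternative is explicit rdf:type triples, which cost an extra lookup per check.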


I strongly agree with this - RDF is fine and there are some great concepts in the SemWeb world (IRIs for example), but OWL and SPARQL are confusing and can be off-putting, especially when it is all rolled together. The latest iteration of our triple store actually moves radically towards linked-document or a 'document graph' to make it more understandable while keeping RDF at the base.


What is your triple store?


Can you express graph attributes such as edge weight in RDF?

I am familiar with RDF triples such as "Book", author, "Jane Doe". How would you express "10 minute phone call at time t between persons A and B"?


Not previous commenter, but as with other problems in computing, I find that you need to add a level of indirection.

    "CallSessionX" -> participant -> "person A"
    "CallSessionX" -> participant -> "person B"
    "CallSessionX" -> duration -> "10 minute"
    "CallSessionX" -> at_time -> t
Haven't used SPARQL or Cypher in many years (the latter is used in Neo4j), but with the latter you could do a query along the lines of

   "person A" <- participant <- call_session -> participant -> "person B" 
to find calls between person A and person B; then you can add a variable binding on the exact node you're interested in (e.g. call_session) to extract/aggregate attributes.


You could do it, but a triple wouldn't be the best choice of data structure. I would just add a new numerical column called "weight" to each relation in your RDF store:

    "Harry Potter and the Philosopher's Stone","author","Jane Doe",0.005

    "Harry Potter and the Philosopher's Stone","author","J.K. Rowling",1.0

    ...
You could easily export RDF triples from this if you needed to, using rules like "The author for each book is determined by the '<book>-author-<name>' relation of highest weight".

Edit: Sorry I didn't answer your question about "10 minute phone call at time t between person A and person B".

The best way I can think of to model this with triples is:

    "call:<call_id>","started_at",t
    "call:<call_id>","ended_at",t + <duration>
    "person A","participant","call:<call_id>"
    "person B","participant","call:<call_id>"
But this may be overly complicated depending on how you want to query the data and what you want to do with it.

Edit 2: mhitza already answered it better! https://news.ycombinator.com/item?id=28503097


I've used RDF triples for some specific part of a bigger business data model. It was implemented on a regular database and used to encode things like "Joe Hates Propaganda", which is exactly what was needed, and hence it was a success (meaning that it worked as intended).


Any word from the wise for those who want to get into ontologies?

Furthermore, do you know about any good resources regarding the process of capturing information into models? I've found books about solutions like Ecore & OWL, but I fail to find books about the practice of modelling in general.


I'm not so wise, my friend, but I do have some tips from hard experience:

1. Most people don't understand the importance of ontologies. It is an uphill battle to convince them. Build your ontologies quietly. They are a superpower.

2. A real ontology must extend far beyond the confines of the data store. You must think of it as also including the workers that bring data into the ontology, the workers which verify the consistency of the ontology, and the query engines through which users get data out of the ontology.

3. Start simple. Your ontologies should be as dumb as possible to start out with but they should support extension.

4. Learn one datastore really well and build your ontologies there. Chances are you don't need to know anything about OWL or neo4j or Spark or anything hyped. I use Postgres.

5. Idempotency is very important. This means, for example, that if you run a data collection worker twice with the same arguments, it should not create duplicate data in your ontology.

6. Build for a specific use case. Don't try to build a general purpose ontology. The kind of data you will be storing and the access patterns will determine a lot about the structure of your ontology and you should be accounting for these things.

7. Avoid event-driven architectures (e.g. IFTTT, AWS Lambda, etc.). Put your faith in idempotence and cron (actually, I use systemd but you get the picture).

Do you have an idea what kind of information you would want to build an ontology for?


Thanks for the tips! Yes, would you perhaps be up to grab a cup of coffee and have a chat about ontologies? :)


Definitely. Here is a link to my calendly: https://calendly.com/neeraj-simiotics/bugout-30


The Future is Tiny Graphs! [1]

In the style of "Scalability! At What COST?" I've started a side-project bringing succinct data structures (rank/select [2], Elias-Fano coding) and modern instruction sets (popcount, pdep) to graphs.

For some reason the article doesn't mention the field of graph compression and I think it shows: for example, most if not all popular open source routing engines don't do anything regarding graph compression. It's time to change that! Reach out to me (preferably on the GitHub project) if you want to try some ideas :)

[1] https://tinygraph.org [2] http://www.cs.cmu.edu/~dga/papers/zhou-sea2013.pdf
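If the rank/select primitive in [2] is unfamiliar, here's the idea in miniature, assuming a plain Python model of the bitvector (tinygraph's actual layout will differ): pack bits into 64-bit words and answer rank(i) with popcounts.

    # rank1(i) = number of set bits in positions [0, i), via per-word popcounts.
    WORD = 64
    def build(bits):
        words = []
        for i in range(0, len(bits), WORD):
            chunk = bits[i:i + WORD]
            words.append(int("".join(str(b) for b in reversed(chunk)), 2))
        return words
    def rank1(words, i):
        q, r = divmod(i, WORD)
        total = sum(w.bit_count() for w in words[:q])  # int.bit_count() is a popcount (Python 3.10+)
        if r:
            total += (words[q] & ((1 << r) - 1)).bit_count()
        return total
    bits = [1, 0, 1, 1, 0, 0, 1]
    print(rank1(build(bits), 4))  # 3 ones among the first four positions

A real implementation precomputes block counts so rank is O(1); this sketch only shows where popcount enters.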


We built TerminusDB[1], our OSS knowledge graph, using succinct data structures. As we have an immutable delta-encoding approach to allow domain teams to build and curate data products that they then share, this has been a very fruitful combination. We wrote a technical white paper all about it! [2]

[1] https://github.com/terminusdb/terminusdb [2] https://github.com/terminusdb/terminusdb/blob/dev/docs/white...


Very interesting! I especially like the expressed interest in logic-based approaches for reasoning about graphs. Quoting from the article:

"Logic-based and declarative formalisms. Logic provides a unifying formalism for expressing queries, optimizations, integrity constraints, and integration rules. Starting from Codd's seminal insight relating logical formulae to relational queries, many first order (FO) logic fragments have been used to formally define query languages with desirable properties such as decidable evaluation. Graph query languages are essentially a syntactic variant of FO augmented with recursive capabilities. [...] The influence of logic is pivotal not only to database languages, but also as a foundation for combining logical reasoning with statistical learning in AI. Logical reasoning derives categorical notions about a piece of data by logical deduction."

This interest could be good news for vendors of logic programming languages such as Prolog and its syntactic subset Datalog. These languages enable very convenient reasoning about graphs and allow both storing and querying the database with the same simple language elements that can themselves also be analyzed and reasoned about with the same formalism.

Imagine how convenient it would be to have a large graph database represented as a set of Prolog clauses that can be queried, run and analyzed with an efficient Prolog system! The data itself, queries, and any desired transformation and augmentation could be uniformly expressed in this way.


I know at least one vendor of a Prolog-based graph database [1] that hopes the interest continues to grow.

[1] https://github.com/terminusdb/terminusdb


This has been the semantic web dream for decades now.


Yes indeed, although the semantic web chose a much more convoluted set of technologies without the syntactic and conceptual coherence and simplicity of actual logic programming languages, thus making reasoning about everything extremely complex and inconvenient in comparison. I think this is a severe handicap when using these technologies in practice.

All automated reasoning is by necessity exclusively syntactic in nature since the meaning of processed symbols is inaccessible to a machine, therefore a careful consideration of syntactic aspects is very important to ensure easy analysis and meta-interpretation of logic programs and queries.

As I see it, a good formalism for knowledge representation makes it easy to reason about the knowledge and also about all programs and queries that themselves reason about the knowledge, enabling easy meta-interpretation and coherent extension languages.


well if you think about it, a triple store is basically a database of propositional logic statements.


We usually talk about several paradigms for data management and programming: navigational (e.g. CODASYL, OO databases/languages, most graph DBs), tensor-based (e.g. MOLAP, Matlab, TensorFlow, Pytorch), Map-Reduce (e.g. Hadoop and Spark before they started to disown their MR heritage). Semistructured (e.g. JSON) and dataframes (e.g. Pandas) are also important abstractions. These abstractions can be captured in the relational paradigm — in other words we can think of them as special cases and represent them as views on relational models.

It is clear from the last 50 years of our industry that the relational paradigm (eventually) always wins. It wins because it separates the what from the how and automates away a bunch of tricky programming that we have to do with the more imperative paradigms.

- The OLTP workload was originally supported by navigational systems and then switched to relational systems mostly based on 3rd normal form schemas.

- The OLAP workload was originally supported by multi-dimensional array (tensor) systems, then switched to relational systems based on star and snowflake schemas.

- The Big Data workload was originally supported by map-reduce systems, then switched to relational systems based on star/snowflake + JSON schemas.

I believe that the Graph (via navigational), Reasoning (via procedural), and ML (via dataframe and tensor) workloads will move to relational systems based on graph normal form (GNF) schemas. Those relational systems won't be built like the SQL-based systems of today. They will need new relational language extensions, new semantic optimizers, new optimal join algorithms, materialized views that are maintained efficiently, new data structures, etc. We are working on such a system at RelationalAI. We’re in limited preview now but hope to have an open preview by early next year.


> It is clear from the last 50 years of our industry that the relational paradigm (eventually) always wins

You cannot make a foreign key to link two different tables with RDBMSes; it requires making another table. With graphs, instead, this is easy: just make a link from/to any node. With RDBMSes the schema is fixed. With RDBMSes you cannot have "metaproperties" (properties about properties). With graphs you can easily say {caesar born{source "foobar"} "100BC"}. With RDBMSes this quickly becomes a mess to handle when you have more properties.

These are just some use cases why relational does not always win.


I think graph processing has great potential for use cases where the data model is heterogeneous and the structure of a problem really can be modeled as a graph problem. I think the tooling and the technologies are still not nearly as mature as e.g. relational databases and most developers have very little exposure to graph models, so adoption is still quite low, but will probably rise as the cloud providers offer hosted graph databases. In 2015 I was quite excited for TitanDB and the TinkerPop stack but I think the acquisition of the team behind it (Aurelius) by DataStax kind of sucked the air out of the community edition. As far as I understand Amazon is using many ideas and concepts from TitanDB in their Neptune graph database, but unfortunately they don't seem to care to open-source it. I don't think graph databases & graph processing will become mainstream though, as they're just not a good fit for most problems.


Unfortunately, graph DBs are a solution in search of a problem. The performance of a graph traversal in all current implementations will be at best equivalent to a well-crafted join and at worst much slower.

I’m unclear on where truly heterogeneous data problems would emerge outside of Q&A applications and potentially CRM. Usually the work to acquire a dataset is so large that there is limited reason not to properly model the data.


> The performance of a graph traversal in all current implementations will be at best equivalent to a well crafted join at worst much slower.

A join must be implemented with a loop / scan to find the relevant target rows by foreign key - This is why foreign keys pretty much have to be indexed, to give the scan any hope of acceptable performance. So, remember, a join in a relational database generally means a join PLUS an index lookup.

If it's not clear, indexes are not free - they have storage cost as well as imposing a cost on writes to the table. Indexes also make lookup very very fast, but still not free. On a large table, you may be incurring the cost of 20-30 comparisons to find the desired rows (note that this is not a made up number. Assuming a btree index which reduces lookups to O(log_2(n)), you will need 20 comparisons to find a record in a table with 1,000,000 rows, and 26 comparisons to find a record in a table with 100,000,000 rows.)
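(Those figures are just log2 of the table size:)

    import math
    for rows in (1_000_000, 100_000_000):
        print(rows, round(math.log2(rows), 1))  # ~19.9 and ~26.6 comparisons per btree lookup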

Graph databases (at least neo4j) always talk about index-free adjacency. Now you hopefully understand what this means. The edge between two nodes in a graph database is stored directly with no need for an index to find the destination node - the traversal is index-free. This can be a tremendous advantage, and that's also not made-up. You can subtract the cost of the index lookups I described earlier.


These methods only work for very small graphs though. Secondary indexing structures are inefficient and don't scale at all regardless of the implementation details, which is why even non-graph systems don't use them at large scales on single machines. The scale threshold where using indexes is worse than the alternatives is low. Direct representations without indexes (like your Neo4j example) are even worse due to the latency introduced at each step by very poor locality. There is no possible version of these methods that will work well; this would have been a solved problem 30 years ago otherwise.

Latency is the performance killer for graph traversals, so locality is everything, and preserving locality across recursive traversals is the critical computer science trick that is often missing. The computational cost of a comparison is irrelevant to the performance of graph traversal.

The way most "pure" graph systems implement graph traversals is unusably slow outside of trivial cases because they mitigate latency poorly. In practical graph systems, common traversals are typically materialized ahead of time so that a graph traversal operation is not required to satisfy queries, which defeats the point in many cases.


Very well summarized, particularly about the point of locality.

In terms of the intractability of large graph problems, here’s an example of a scalable solution by Facebook for representing the social graph that provides incredible performance (measured in terms of response times, queries per second, etc.): https://www.usenix.org/system/files/conference/atc13/atc13-b...


Neptune sucked the air out of Blazegraph as well... It is a constant problem. Graph databases will only become more mainstream when we get a real PostgreSQL-level database in terms of reliability, maintenance and trust in its evolution.


PostgreSQL is a fine graph database. It would be nice if it supported the graph-like query language that's now being standardized as an extension to SQL, but other than that it works quite well.


What is that "graph-like query language that's now being standardized as an extension to SQL"? I don't think I've ever heard of it.


It is pretty slow for most graph operations though. It works ok for simple things, especially with the recently added functions.


Well yes, if you need to run actual graph algorithms on network data I agree that pgSQL (or any general purpose database, including "semantics" oriented graph db's) is not the best.


Someone on the internet is wrong! Or maybe just partially correct.

This is very interesting work. There is some selling going on, though, and not enough discussion of the issues that drove the database community toward relational databases and away from "network" databases in the 1970s.

If you think of visualization systems as a microcosm of more general computing, notice people need more than graphs - they need all kinds of tables, charts, maps, and other things. Same with "graph databases".

This area does exist because the database community underinvested in computing on graphs and networks, the same as it did with usability for a long time. It is still catching up but there is a lot of great work.

Another problem is helping users to understand the cost of what they are trying to compute. Select and join? I can estimate that. Find betweenness centrality of a large network? Find a tour that visits all the cycles of a network exactly once? All the edges? All nodes? Oops. What about graph object types? Is it OK to mix them?

Sometimes I wonder about arguments that "graph databases are [more] efficient for linked structures." After 20 or 30 years of intense R&D, relational databases already seem pretty efficient for computing with relationships.


SQL struggles with hierarchical data (trees), which are graphs.


In practice most hierarchies are shallow (or poorly designed).

Country, Province/State, City, Neighborhood/ZIP, Building, Room. 5 jumps to go from planetary to personal. 5 JOINs is not unreasonable.


If you're interested in learning more about graphs, you might find my "Graphs for Data Science" Substack useful: https://graphs4sci.substack.com/ where I work through using graph perspectives to analyze real-world datasets.


It feels like this should have a citation to Scalability! At What COST?

https://www.usenix.org/system/files/conference/hotos15/hotos...

which evaluates distributed graph processing frameworks (circa 2015) on various problems and shows that they can be slower than a single threaded program on a laptop.

They have a data set with 105 M nodes and 3.7B edges that fits in 14 GB, which all fits in memory (at least 128-256 GB of memory should be taken for granted in these types of problems). Maybe with a clever encoding you can store the entire social network of the world, or at least a FB-sized one. (Storing nodes as integers is efficient and useful for many graph algorithms. You can later "join" with the ancillary data using a fast non-graph distributed processing framework).
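The back-of-the-envelope arithmetic for that, assuming a CSR-style layout with 32-bit node IDs (my assumption, not necessarily the paper's exact encoding):

    edges, nodes = 3_700_000_000, 105_000_000
    edge_bytes = edges * 4      # one 32-bit neighbor id per edge
    offset_bytes = nodes * 8    # one 64-bit offset per node
    print((edge_bytes + offset_bytes) / 1e9)  # ~15.6 GB, same ballpark as the 14 GB cited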

This article mentions the challenges but doesn't mention the solution of simply using a single machine for big graphs:

Even seemingly minute HPAD variations, for example the graph's degree distribution, can have significant performance implications.

Graph processing systems rely on complex runtimes that combine software and hardware platforms. It can be a daunting task to capture system-under-test performance—including parallelism, distribution, streaming vs. batch operation

Anyone who has done programming with say MapReduce knows that it can be very elegant and fun, but also limiting. This is even more true of graph processing frameworks, as the article explains.

If you simply put everything in memory and write native code (which isn't hard at all for batch jobs), then you have so much more flexibility with ANY algorithm, and especially graph algorithms. This type of programming is also fun! You don't have to fit your problem into the framework and switch frameworks when it doesn't fit (or more likely just accept 10-1000x performance decrease when it doesn't fit).

---

Abstract:

We offer a new metric for big data platforms, COST, or the Configuration that Outperforms a Single Thread. The COST of a given platform for a given problem is the hardware configuration required before the platform outperforms a competent single-threaded implementation. COST weighs a system’s scalability against the overheads introduced by the system, and indicates the actual performance gains of the system, without rewarding systems that bring substantial but parallelizable overheads.


Anyone have an idea how a graph database like Neo4J is implemented? I imagine it as a unique query language (cypher) sitting on top of a traditional relational database.


Neo4J uses index-free adjacency which means each node is storing the address of where its edges point to.

This raises an interesting question about “big” graphs and distributed systems—-storing the address in memory across machines. So Neo4J, last I looked into it, is very fast for reads using a master-slave setup but may not scale well beyond a single machine.

Other approaches such as RedisGraph use GraphBLAS, but again you’re limited to a matrix representation.

And yet others, like TitanDB (bought by DataStax and community-forked into JanusGraph) use graph abstractions that sit on top of distributed systems (HBase, Cassandra, Dynamo, etc.) and these rely on adjacency lists and large indices. So the idea of “big graph” has been around for at least 6 years in the NoSQL world, but indeed everything is not a graph problem, tempting as that may be to claim.

Good article on native graph representation: https://dzone.com/articles/letter-regarding-native-graph


From Neo4j themselves - edges are referenced by index position in a file.

https://neo4j.com/developer/kb/understanding-data-on-disk/


Neo4J is a native database built on the JVM. It does not use any underlying relational DB. IIRC, graph links are implemented as direct references to other node objects (if they are in memory), which makes traversal way faster than what you would get with a relational DB.



I’d imagine it’s more like an adjacency list structure with various indexes (similar to a regular relational db) to allow lookups based on node properties.


RedisGraph uses GraphBLAS sparse matrices which deliver astounding performance.

It's not as feature-complete as Neo4J but it's a promising start.
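The matrix view is easy to picture: one BFS frontier expansion is a sparse matrix-vector product. A toy sketch with scipy standing in for GraphBLAS (illustration only, nothing RedisGraph-specific):

    import numpy as np
    from scipy.sparse import csr_matrix
    # Adjacency matrix for edges 0->1, 0->2, 2->3
    A = csr_matrix((np.ones(3), ([0, 0, 2], [1, 2, 3])), shape=(4, 4))
    frontier = np.zeros(4)
    frontier[0] = 1                        # start BFS at node 0
    next_frontier = A.T @ frontier         # nodes reachable in one hop
    print(np.nonzero(next_frontier)[0])    # [1 2]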


I don't have a number of course, but I believe that almost all relational db models are more naturally modeled as graphs. In fact, all normalized relational db models that use foreign keys are just implementing a graph - foreign keys represent edges between nodes.

has-many, has-one, and many-to-many are all graph relationships. The relational model + foreign keys is just one way to encode a graph.


Yeah, tables/rows are basically vertices with 1 label & many properties*, and FKs are labeled edges without properties. But, you can easily implement edge properties with many-to-many tables, at the cost of an extra JOIN. In practice, this is a useful and efficient species of graph.

I think many people get baited by Neo4j's marketing without really understanding relational models and RDBMS (I sure did).

* With the (often true) assumption that many vertices will follow the same normalized schema.
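A concrete version of the many-to-many point, with made-up names (sqlite3 so it runs as-is): the edge gets its own table and its own properties, and a one-hop traversal is one extra join.

    import sqlite3
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE follows (              -- the edge, with its own properties
            src INTEGER REFERENCES person(id),
            dst INTEGER REFERENCES person(id),
            since TEXT, weight REAL,
            PRIMARY KEY (src, dst));
        INSERT INTO person VALUES (1, 'alice'), (2, 'bob');
        INSERT INTO follows VALUES (1, 2, '2021-01-01', 0.9);
    """)
    print(db.execute("""
        SELECT p2.name, f.weight FROM person p1
        JOIN follows f ON f.src = p1.id JOIN person p2 ON p2.id = f.dst
        WHERE p1.name = 'alice'
    """).fetchall())  # [('bob', 0.9)]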


Graphs are the least restrictive data model. I think everything is easier to model as a graph, because it's just things (vertices) and how they are related (edges). You mostly get to skip the modelling step.

The benefit of modelling as a relational db is that it forces you to put the data in a form that's easy to work with from a computer perspective.


I’m the creator of thatDot Connect [1], a distributed asynchronous streaming graph interpreter, where we’ve been working on exactly this problem for about 7 years now. Scalable distributed graph processing has so many applications! We were fortunate enough to be funded by DARPA for a while working on automated APT detection for cyber-security. Some of our recent work has shown linear scalability of this approach on large clusters at high throughput. For example, we got to half a million complex graph events ingested per second on a combined read/write workload and could have just kept scaling. We’re in the process of writing up this work and sharing all the details.

[1] https://www.thatdot.com/technology/


Distributing a graph is a really hard problem. Does anyone know what the state of this is?


Dgraph and ArangoDB seem to be doing it well.


We ended up ditching Arango for a new project, despite it having decent enough graph handling.

This is because, unlike many other DBs (namely Mongo, which we're trying desperately to move away from), you can't spin up a new instance and allow the application to idempotently initialize it.

The multi-modal engine of Arango seems great, but the DX surrounding its use is a bit subpar, and would greatly shift a lot of burden to the environment config, which only adds to potential security vulnerabilities.

This is only made worse by the fact it has a severely limited granularity with security permissions.


Security is another big issue with the open graph databases. Some closed ones have ok mechanisms. But we are not there yet.


Out of curiosity, are you self managing the database today or using a service?


Self hosting. Services don't usually have as tight of security as I'd like.


I would argue that it's a common misconception: when you're self-managing every layer, the risk of doing something wrong means you're opening yourself up to a lot of surface area that needs to be handled with expert precision.

Obviously services are not some magic wand either, e.g. the recent Microsoft snafus. But they're a third-tier cloud player.


A previous coworker now works at Katana Graph and I've been keeping an eye on what they claim to be doing. Sounds very interesting.

https://katanagraph.com/


Some slightly different views based on our experiences with startups/techco's/enterprises/govs. A lot of what's there is true, but at the same time, we've been seeing major appetite in 2 areas that lead to quite different directions from the article.

To be clear, I do think the CACM authors are right about a lot of what is commercially + academically popular in big graph land today. Ex: Powering better knowledge graphs, especially for NLP neural network empowered use cases (BERT/GPT). Ex: Enriching inputs into traditional neural network inference calls, like a user 360 before making a recommendation. I expect graph DBs to keep improving to help with these.

At the same time, working with teams from first principles to help skate to where the puck is going, we've come to 2 overall different priorities where those ideas aren't the primary technology enablers for big graph use cases in terms of engineering + big $$$$ enablers. Broadly, it's useful to think through what Data Analysts are doing and what Data Scientists are doing:

1. Graph BI: Similar to Tableau, Looker, and more autoML-y successors, there are a LOT of analysts who are trying to better understand all that data in their data warehouses. Operational data in Splunk/ELK, transactions & business data in SalesForce CRM / RedShift transactions / etc. Graph questions are powerful here. Graphs are all about relationships like user/cohort behavior, so opens up a lot of questions/answers for analysts, especially if made accessible. However, from a data engineering perspective, copying that data into graph DBs as a second system of record is historically a "now you have 2 problems" thing, and part of why most data lake projects fail. So we've been focusing on building graph query pushdown + virtual fabric style projects that expose graph thinking over your existing data infra. The discussion in the CACM article on dynamic graphs touches on this, but that's only one piece of that problem. Getting analyst graph insights over warehouses as ~easy as it is today for basic time questions on events requires a lot more. Surprisingly, we suspect so much so that it may commoditize the particular choice of graph compute engine (vs say connectors/virtualization/UI/automation/AI.)

2. Graph AI: Traditional neural networks have been increasingly dominating due to their strong ability to combine categorical & continuous data around entities of interest (think computer vision realizing some pixels are a cat)... but have struggled to leverage relationship structure: imagine a communication or transaction between people/devices/companies with rich histories & communities. Being able to fuse those matters a LOT in commercial contexts, so we'd been watching graph neural networks for a while. Over the last ~2 years, with the mounting focus by the AI community, key problems around performance, heterogeneous data, etc., have been getting solved, so GNNs are starting to show up in all sorts of places now. Interestingly, most companies using them don't plug graph databases or knowledge graphs or even traditional graph compute engines into them. If you look at how pytorch, dgl, etc. are being built, they largely ignore what traditional graph infra vendors focus on as 1-3 orders of magnitude too slow, and overall, wrong tool for the job: different problem space.
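To make the "wrong tool for the job" point concrete: a textbook-style graph convolution layer is just linear algebra over an adjacency matrix and feature tensors, which is why GNN stacks lean on tensor runtimes rather than graph DB traversals. A numpy sketch (not how pytorch/dgl actually implement it):

    import numpy as np
    A = np.array([[0, 1, 0],
                  [1, 0, 1],
                  [0, 1, 0]], dtype=float)   # tiny 3-node graph
    X = np.random.rand(3, 4)                 # node features
    W = np.random.rand(4, 2)                 # learnable weights
    A_hat = A + np.eye(3)                    # add self-loops
    D_inv_sqrt = np.diag(1 / np.sqrt(A_hat.sum(axis=1)))
    H = np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0)  # one GCN-style layer + ReLU
    print(H.shape)  # (3, 2)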

In both cases, teams building/using graph DB or graph compute engines, like the CACM authors, get enlisted to help as natural stakeholders for pieces of the problem. Likewise, they naturally want to help + grow their scope. Unfortunately, from a perspective of "you're lucky to do even 1-3 things well", they've been struggling to provide the needed end-to-end solutions. There are plenty of critical & fascinating scale, design, + use case challenges to tackle far beyond the core db/compute engine perspective here.

Anyways, this is an awesome time for this stuff so I encourage getting involved! We're actively hiring for help creating these new categories (all levels: fullstack, UI, BI, data, AI) + launching projects with lighthouse design partners, so if either is relevant, please reach out! Likewise, traditional graph DB/compute is important too (see: CACM article), and been a no-brainer for attracting most of the VC attention for the last few years due to being a traditional DB play, and we've enjoyed partnering with friends at them, so I'd recommend looking at those companies as well.



