Ask HN: What is the best software to visualize a graph with a billion nodes?
134 points by throwaway425933 | 109 comments
Currently I am using GraphViz, but I am not happy with the quality of the output (it writes a PostScript file).

I want to be able to zoom in and zoom out. The graph has up to 100B nodes and is a directed cyclic graph.




Visualizing large graphs is a natural desire for people with lots of connected data. But after a fairly small size, there's almost no utility in visualizing graphs. It's much more useful to compute various measures on the graph, and then query the graph using some combination of node/edge values and these computed values. You might subset out the nodes and edges of particular interest if you really want to see them -- or don't visualize at all and just inspect the graph nodes and edges very locally with some kind of tabular data viewer.

It used to be thought that visualizing super large graphs would reveal some kind of macro-scale structural insight, but it turns out that the visual structure ends up becoming dominated by the graph layout algorithm and the need to squash often inherently high-dimensional structures into 2 or 3 dimensions. You end up basically seeing patterns in the artifacts of the algorithm instead of any real structure.

There's a similar, but unrelated desire to overlay sequenced transaction data (like transportation logs) on a geographical map as a kind of visualization, which also almost never reveals any interesting insights. The better technique is almost always a different abstraction like a sequence diagram with the lanes being aggregated locations.

There's a bunch of these kinds of pitfalls in visualization that people who work in the space inevitably end up grinding against for a while before realizing it's pointless or there's a better abstraction.

(source: I used to run an infoviz startup for a few years that dealt with this exact topic)


> But after a fairly small size, there's almost no utility in visualizing graphs.

I want to stress this point and go a bit further. It can be worse, because people have pareidolia[0]: a tendency to see order in disorder, like how you see familiar shapes in the clouds. The danger with large visualizations such as these is that instead of conveying useful information, you counterproductively convince someone that something untrue is true! Here's a relevant 3B1B video where this is discussed; there is real meaning in the data, but the point is that it is also easy to be convinced of things that aren't true[1]. In fact, Grant's teaching style is so good you might even convince yourself of the thing he is disproving as he reveals how we are tricked by the visualization. Remember what the original poster latched onto.

I think it's important to recognize that visualization is a nontrivial exercise. Grant makes an important point at the end, restating how the visualization was an artifact and how, if you dig deep enough into an arbitrary question, you _can_ find value, because at the end of the day there are rules to these things. The same is true of graphs. There will always be value in the graph, but the point of graphing is to highlight the concepts we want to convey. In a way, many people interpret what graphs are doing, and why we use them, backwards. You don't create visualizations and then draw value from them; rather, your plots are a mathematical analysis expressed in a language more natural for humans. This is subtle and might be confusing, because people can often intuit what kind of graph should convey some data without thinking about what the process is doing. So what I'm saying is that you don't want an arbitrary graph; there's a right graph for the job. Read a few blogs on graph sins[2] and this point will become clearer.

At heart, this is not so different from "lies, damned lies, and statistics." People often lie with data without stating anything that is untrue. With graphs, you can lie without stating a word, despite a picture being worth a thousand. So the most important part of being a data scientist is not lying to yourself (which is harder than it sounds).

[0] https://en.wikipedia.org/wiki/Pareidolia

[1] https://www.youtube.com/watch?v=EK32jo7i5LQ

[2] Except this might be hard, because if you Google this you'll have a hard time convincing Google that you don't mean "sine". So instead search "graph deadly sins", "data visualization sins", "data is ugly", and so on. I'll pass you specifically the blog of "Dr. Moron" (Kenneth Moreland) and one discussion of bad plots: https://www.drmoron.org/posts/better-plots/ (Ken is a data visualization expert and both his blogs have a lot on vis). There's also VisLies: https://www.vislies.org/2021/

(Source: started my PhD in viz, and I still have close friends in infoviz and sciviz whose rants about their research I get to hear, and occasionally contribute to)


My use case is that I have a graph of flops, latches, buffers, AND, OR, NOT gates and I want to visualize how data is changing/getting corrupted as it goes through each of them.


It's likely that a better way to do this is not to "eat the elephant", but do it at some medium-scale or subcomponent level.

It sounds like perhaps what you are trying to do is something more like this?

http://visual6502.org/

Check out the visual simulations:

http://visual6502.org/sim/varm/armgl.html

http://visual6502.org/JSSim/index.html

http://visual6502.org/JSSim/expert-6800.html

I will say (and please forgive that digital circuits are not my field), there are almost certainly better techniques and approaches in the field to accomplish what you are trying to do. I would personally move away from the current approach and seek insight from the domain that's able to produce multi-billion-transistor microprocessors.

Perhaps there are tools for large-scale logic circuit simulation?

https://old.reddit.com/r/computerscience/comments/uhappo/bes...


I did that a few years ago. It was a nice visualization up until restoring division. With more components than that, the layout just becomes too cluttered to be meaningful. And it is very difficult to encode into an algorithm all the heuristics one uses when drawing "pretty" circuit diagrams by hand.


Can you recommend any good literature on the subject?


Honestly, I've been away from the field for quite a long time, so I wouldn't be up to date. But if you want a good framing of the field, how it evolved, and how it's different from other kinds of visualization (like scientific visualization), maybe start here [0a][0b]:

0a - https://www.cs.purdue.edu/homes/xmt/classes/slides/CS530/Inf...

0b - https://en.wikipedia.org/wiki/Data_and_information_visualiza...

There used to be a lively research field for information visualization that studied current visualization techniques and proposed new ones to solve specific challenges -- I remember when treemaps were first introduced, for example [1]. Large networks were a pretty big area of research at the time, with all kinds of centrality, clustering, and edge-minimization techniques.

1 - https://www.google.com/search?q=treemap+visualization&tbs=im...

A few teams even tried various kinds of hyperbolic representations [2,3], so that areas under local inspection were magnified under your cursor and the rest of the hairball was pushed off to the edges of the display. But with big graphs you run into quite a few big problems very quickly, like local vs. global visibility, layout challenges, etc.

2 - https://graphics.stanford.edu/papers/webviz/webviz/node2.htm...

3 - https://www.caida.org/catalog/software/walrus/

Not specifically graph related, but the best critical thinker I know of in the space is probably Edward Tufte [4]. I have some problems with a few bits of his thinking, and other than sparklines his contributions are mostly in terms of critically challenging what should be represented, why, how, and with what methods of interaction. Still, his critical analysis has stayed up there as some of the best. He has a book set that's a really great collection of his thoughts.

4 - https://www.edwardtufte.com/tufte/

If you approach this problem critically, you end up at the inevitable conclusion that trying to globally visualize a massive graph in general is basically useless. Sure there are specific topologies that can be abstracted into easier to display graphs, but the general case is not conducive. It's also somewhat surprising at how small a graph can be before visualizing it gets out of hand -- maybe a few dozen nodes and edges.

I remember the U.S. DoE did some really pioneering studies in the field and produced some underappreciated experts like Thomas, Cook and Risch [5,6]. I like Risch's concepts around visualizations as formal metaphors of data. I think he's successful in defining the rigorous atomic components of visualization that you can build up from. Considering OP's request in view of Tufte and Risch, I think that they really need to think about the potential for different metaphors at different levels of detail (since they specify zooming in and out). There may not exist a single metaphor that can visualize certain data at every conceivable scope and detail!

5 - https://ils.unc.edu/courses/2017_fall/inls641_001/books/RD_A...

6 - https://arxiv.org/pdf/0809.0884v1

One interesting artifact from all of this is that most of the research has long ago been captured and commoditized or made open source. There really isn't a market anymore for commercial visualization companies, or grant money for visualization research. D3.js [7] (and the derivatives) more or less took millions upon millions of dollars in R&D and commercial research and boiled it down into a free, open source, library that captured pretty much all of the major findings in one place. It's objectively better than anything that was on the market or in labs at the time I was in the space and it's free.

7 - https://d3js.org/


The one really helpful use for a massive nodegraph with way too much data? Convincing people that something is complicated. Eg: illustrating to non-technical people that your codebase is a massive mess.


Sometimes you can get farther with something like a summary statistics table of the different motifs that show up in a dataset.

Hairballs are not interesting, but the shapes that show up in a graph once you make a few cuts can be fascinating.


I am pretty sour about it and will call out people who post "just another hairball" and act like they've done something special.

I think there is a need for a tool that can extract and tell an interesting story based on a subgraph of a huge graph, but that takes thinking, unlike hairball plotting, AI image generation, and other seductive scourges.

I went to a posthumous art show based on this guy:

https://www.amazon.com/Interlock-Conspiracy-Shadow-Worlds-Lo...

where they showed how he drew 40 pencil drafts of one of his graphs and went from a senseless hairball to something that seems immediately meaningful. Funny, that might have something to do with his mysterious death... Maybe a tool that would help you do that is too dangerous for "them" to let you have!


I think you unknowingly answered your own question. The reason no such tool exists is that this stuff is very hard. Worse, it is something that sounds and looks easy. Terrible and misleading graphs are not the result of maliciousness and cunning deception; rather, the opposite. Bad graphs happen because it is easy to visualize data but hard to create good and meaningful visualizations[0]. Most people mindlessly apply a set of procedures to select the "correct" graph without knowing the reasoning behind those procedures, and the large number of people who learned it that way normalize and perpetuate the myth that visualization is easy, because they do not distinguish the action from the end result. Just as being able to perform all the manual tasks involved in assembling a house (use a screwdriver, hammer a nail, saw, fit pipes together, etc.) does not mean you could actually build one. The reason there are so many terrible graphs is that it is easy to build a shanty, and you rarely see an actual house to tell you what you're missing.

I doubt we'd see such a tool anytime soon. It takes expert experience and skill to make good visualizations and there are no well defined rules. If you see such a tool, I'd be wary of promises that are too big to be kept.

[0] Sometimes people complain about how something has a difficult/steep learning curve. It is important to note that while frustrating, this does not always make the learning curve a bad thing. Often a shallow learning curve can be bad because it convinces one that they have far greater ability than they actually do. We could argue that this is in part due to the improper way we visualize learning curves.


> Funny, that might have something to do with his mysterious death... Maybe a tool that would help you do that is too dangerous for "them" to let you have!

This may be one of the most ridiculous conspiracy theories I have ever heard. Big data (heh) had him killed... ok.


It really feels like an underdefined task. Do you actually need to see those nodes? At that scale, you never want to render 100B of them. Instead you would need some kind of density aggregation when zoomed out, moving to LoD-style k-d tree partitioning when zoomed in. That's almost the territory of rendering engines like Unreal's Nanite. You can create your own renderer for data like this, but game engines are likely your closest inspiration. Then again, unless you already have x/y coordinates ready (based on GraphViz I'm assuming you don't), even laying out the points will be a very heavy task. (The usual iterative force-directed layout would likely take days.)

But if you were my coworker I'd really press on why you want the visualisation and whether you can get your answers some other way. And whether you can create aggregates of your data that reduce it to thousands of groups instead. Your data is a minimum of ~800GB if the graph is a single line (position + 64-bit value encoding each edge, no labels), so you're not doing anything real-time with it anyway.
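To make the density-aggregation idea concrete, here's a minimal sketch (assuming positions have already been laid out and saved; the file name and bin count are placeholders):

    import numpy as np

    xy = np.load("layout.npy")  # hypothetical (n, 2) array of precomputed positions
    counts, xedges, yedges = np.histogram2d(xy[:, 0], xy[:, 1], bins=256)
    # counts is now a 256x256 density image: render it instead of n points,
    # and recompute it over the visible window as the user zooms in.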


Truly, 100B nodes needs some sort of aggregation to have a chance at being useful. On a side project I've worked with normalizing >300GB semi-structured datasets that I could load up into dataframe libraries; I can't imagine working with a _graph_ of that size. I thought I was a genius when I figured out I could rent cloud computing resources with nearly a terabyte of RAM for less than federal minimum wage. At scale you quickly realize that your approach to data analysis is really bound by CPU, not RAM. This is where you'd need to brush off your data structures and algorithms books. OP had better be good at graph algorithms.


1) 100B? Try a thousand. Of course context matters, but I think it is common to overestimate the amount of information that can be visually conveyed at once. But it is also common to make errors in aggregation, or errors in how one interprets aggregation.

2) You may be interested in the large body of open-source HPC visualization work. LLNL and ORNL are the two dominant labs in that space. Your issue might also be I/O, since you can generate data faster than you can visualize it. One paradigm HPC people use is "in situ" visualization, where you visualize at runtime so that you do not hold back computation. At this scale, if you're not massively parallelizing your work, then it isn't the CPU that's the bottleneck but the thing between the chair and keyboard. The downside of in situ is that you have to hope you are visualizing the right data at the right time. But this paradigm includes pushing data to another machine that performs the processing/visualization or even storage (i.e. compute on the fast machine, push data to a machine with lots of memory that handles storage; or, more advanced, one stream to a visualization machine and another to storage). Check out ADIOS2 for the I/O side of this:

https://github.com/ornladios/ADIOS2


You're right, but I think that may be what the OP is actually asking for. They talk of "zooming out", but I don't think they mean literally zooming out to see all 100B nodes individually on screen at once; rather, that some high-level / clustered view is shown to give an overview.

That being the case, I think you're suggesting that this high level summarisation happens as a separate preprocessing step (which I agree with FWIW) whereas I think they're imagining it happening dynamically as part of rendering.


There are 8 million pixels in 4K, so if you're trying to graph 8 million points, you might as well just fill up the screen with a single color and call it a day. If you have 8 billion, well, you can graph about 0.1% of that and fill up every single pixel of the screen, but then you're just looking at noise. To be able to show connections between nodes, you'd need maybe 9 pixels per node, so that's around 900k nodes you might be able to graph on a 4K screen, assuming a maximum of 8 connections per node and connected nodes being adjacent. So now you're at about 0.01% that can be graphed on your display, and that's not even very usable; there'd not be a lot of information you could glean from it. You could go to 81 pixels per node and you'd be able to connect more nodes, and maybe you could make some sense of it that way, but you'd only be graphing 0.001%, and at that point, what's your selection criteria? Your selection criteria for nodes would have more of an impact than how you choose to graph it.


It's unclear to me if you're making the same point I'm about to make. So I guess at best it's another point and at worst another framing?

I think the relationship to a 4k image is a great way to explain why you should never do this. Specifically because we can note how as resolution increases it gets difficult to distinguish the difference. Like the difference between 480p and 720p is quite large but 4k and 8k is... not. A big part of why the high res images even work is because the data being visualized is highly structured and neighboring data strongly relates. So maybe OP's graph contains highly structured graph cliques. But it is likely doubtful. Realistically, OP should be aiming for ways to convey their data with far less than 10k points. Maybe ask yourself a question: can you differentiate a picture of a thousand people from two thousand? Probably not.


> Do you actually need to see those nodes?

Even 8K screens don't have enough pixels to show that many nodes at the same time. So some visual optimization has to happen anyway.


What is the average degree of the 100B nodes in this graph? If it's anything north of like...2 (or maybe 1.0000001, or less, unsure), then this sounds about as intractable as "visualizing Facebook friends" (times 30)

Comparing it to a rendering engine I think is a bit of a cheat unless the points do have some intrinsic 2-D spatial coordinates (and no edges beyond immediate adjacency). You're ultimately viewing a 2-D surface, your brain can kinda infer some 3-D ideas about it, but if the whole volume is filled with something more complex than fog, it gets tricky. 4-D, forget about it. 100-D as many datasets are? lol.

Having worked in a lab where we often wanted to visualize large graphs without them just devolving into a hairball, you'd need to apply some clustering, but the choice of clustering algorithm is extremely impactful to how the whole graph ends up looking, and in some cases it feels like straight deception.


Speaking of Nanite, does anybody know of data visualization tools actually implemented with mesh shaders? I've dabbled with time series data, not graphs, but it feels lonely.


My use case is that I have a graph of flops, latches, buffers, AND, OR, NOT gates and I want to visualize how data is changing/getting corrupted as it goes through each of them.


Ok, so you have nice natural boundaries between systems. If you're dealing with something processor-like, you have really good chokepoints where for example ALU / register / caches connect. The task may be way easier if you deal with one of them at a time. Maybe even abstract anything less interesting (memory/cache?) Would visualising things per-system work better for you, or maybe visualising separate systems getting affected instead of specific nodes?

Having the structure of the device available should also help with the layout - we know you can group the nodes logically into independent boxes instead of trying to auto-layout everything.


As many people already commented, no one actually visualizes graphs of that size at once.

Context: I’m the CTO of a graph visualization company; I’ve been doing this for 10+ years.

Here are my recommendations:

- if you can generate a projection of your graph into millions of nodes, you might be able to get somewhere with Three.js, which is a JS library to generate WebGL graphics. The library is close enough to the metal to allow you to build something large and fast.

- if you can get the data below 1M nodes, your best shot is Ogma (spoiler: my company made it). It scales well thanks to WebGL and allows for complex interactions. It can run a graph layout on the GPU in your browser. See https://doc.linkurious.com/ogma/latest/examples/layout-force...

- If you want to keep your billions of nodes but are OK with not seeing the whole graph at once, my company builds Linkurious. It is an advanced exploration interface for a graph stored in Neo4j (or Amazon Neptune). We believe that local exploration up to 10k nodes on screen is enough, as long as you can run graph queries and full-text search queries against the whole graph with little friction. See https://doc.linkurious.com/user-manual/latest/running-querie...


Just wanted to say, while I'll never use Ogma, it's really fun to play w/. Performant too.


You don't. Generate a hierarchical clustering of the data, then collapse nodes into groups to get under a data-set-size threshold at any given view distance. That gives you full interaction and the ability to do mouseover info on groups, while being able to zoom in and interact with individual nodes if you want.
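A toy sketch of that cut-at-view-distance idea, with scipy's hierarchical clustering standing in for whatever clusterer you'd actually use on a sample of the data (positions and the zoom-to-distance mapping are made up):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    xy = np.random.rand(2000, 2)            # stand-in node positions
    z = linkage(xy, method="ward")          # hierarchical clustering, built once

    def groups_for_zoom(zoom: float):
        # larger zoom -> smaller cut distance -> more, finer groups
        return fcluster(z, t=1.0 / (1.0 + zoom), criterion="distance")

    print(len(set(groups_for_zoom(0.0))), len(set(groups_for_zoom(50.0))))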


This is the way imo. Nobody is consuming 100b nodes in a chart.


This is mostly a data structure problem. I am certain this can be made interactive, but it will require some elbow grease.

If you want it to be interactive, you will need to figure out a few things:

1.) How to format the data so it can be streamed off disk.

2.) How to cull the offscreen bounding boxes quickly.

3.) How to cull tiny bounding boxes quickly.

The central problem is finding a way to group the nodes efficiently into chunks. A 2D approach is probably best. You would then have something that could be rendered efficiently.
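For 2.) and 3.), the per-chunk tests are cheap once nodes are grouped into 2D chunks with precomputed bounding boxes (a rough sketch; names and thresholds are made up):

    from dataclasses import dataclass

    @dataclass
    class Chunk:
        x0: float; y0: float; x1: float; y1: float   # bounding box in layout space
        offset: int                                  # where this chunk's nodes sit on disk

    def visible(c: Chunk, vx0, vy0, vx1, vy1, px_per_unit, min_px=2.0) -> bool:
        offscreen = c.x1 < vx0 or c.x0 > vx1 or c.y1 < vy0 or c.y0 > vy1
        too_small = max(c.x1 - c.x0, c.y1 - c.y0) * px_per_unit < min_px
        return not (offscreen or too_small)
    # Only chunks passing both tests get their nodes streamed in and drawn.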

Other than that, maybe a point cloud renderer? There might be one you can buy off the shelf, or something open source.


First step is to generate a graph distance matrix to use as features.

You can do the hierarchical clustering using HDBSCAN, probably in reasonable time; it's a fast algorithm.

To have any sort of 2d display you need to project the nodes, which might require some form of PCA given the data set size. UMAP might also work.

From there, you can use an R* tree in conjunction with "cut-depth" cluster segmentation tied to zoom level, with additional entity selection based on count and centrality. If you load it into Postgres, PostGIS can do this in one query.

All pretty straightforward stuff.
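A rough sketch of the middle steps with the umap-learn and hdbscan packages (parameters are placeholders, and at this scale you'd work on a sampled or pre-aggregated graph):

    import numpy as np
    import hdbscan
    import umap

    rng = np.random.default_rng(0)
    features = rng.random((5000, 16))   # stand-in for rows of the graph distance matrix

    # Hierarchical density-based clustering.
    labels = hdbscan.HDBSCAN(min_cluster_size=50).fit_predict(features)

    # Project to 2D for display (PCA is the cheaper alternative).
    xy = umap.UMAP(n_components=2).fit_transform(features)
    # xy + labels can then be loaded into PostGIS for the R*-tree / cut-depth part.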


It really depends on what the nodes represent, right?

A 1080p monitor has:

1,920 × 1,080 = 2,073,600 pixels

Each pixel can display 32-bit color, which equates to:

2^32 = 4,294,967,296 colors

So, while each pixel can display one of 4.3 billion colors, the monitor can display combinations of those colors across its 2,073,600 pixels. The total number of possible color combinations on the screen is astronomical.

The actual number of possible combinations is:

4,294,967,296^2,073,600


Aside from tgv's correct point that this is implicitly a recipe for something that isn't useful as a visualization, I think even if we were able to distinguish 4B colors and make sense of each pixel -> color assignment ... the math isn't on your side. You responded to a statement about nobody consuming a graph of 100B nodes. Suppose we don't have any concept of edge weight, and an edge is either present or not, but edges are directed, then you have 100B^2 (i, j) pairs for potential edges, each of which is either present or not (i.e. 10^22 edges, each of which is a bit).

4,294,967,296^2,073,600 is very large but 2^(10^22) is much much larger


That way of looking at it doesn't make sense. When visualizing a graph, you want to see the connections between the nodes; coloring each node individually almost never makes sense; and the eye cannot distinguish 4B colors.


You end up bucketing those colors into differences the human eye can see, so you end up with a much smaller domain.

You do something similar with 100B data points since you're not literally looking at the relation between individual nodes when all 100B are on screen at once.


> Each pixel can display 32-bit color

It's only 24 bits of visible colour.


What decision or downstream process is going to consume the 1B-node graph render? Is producing a render really necessary for that decision, or is rendering the graph waste?

Is there a way you can subsample, simplify, or approximate the graph that'd be good enough?

In some domains, certain problems that are defined on graphs can be simplified by pre-processing the graph to reduce the problem to a simpler one: e.g. maybe trees can be contracted to points, or chains can be replaced with a single edge, and so on. These tricks are sometimes necessary to get scalable solution approaches in industrial applications of optimisation / OR methods to problems defined on graphs. A solution recovered on the simplified graph can be "trivially" extended back to the full original graph, given enough post-processing logic. If such graph simplifications make sense for your domain, can you preprocess and simplify your input graph until you hit a fixed point, then visualise the simplified result? (Maybe it contracts to 1 node!)
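A toy sketch of the "simplify until you hit a fixed point" idea, doing chain contraction only (networkx here purely for illustration; a 100B-node graph would need an out-of-core rewrite):

    import networkx as nx

    def contract_chains(g: nx.DiGraph) -> bool:
        """Replace each node with exactly one predecessor and one successor
        by a direct edge; returns True if anything changed."""
        changed = False
        for n in list(g.nodes):
            preds, succs = list(g.predecessors(n)), list(g.successors(n))
            if (len(preds) == 1 and len(succs) == 1
                    and n not in (preds[0], succs[0]) and preds[0] != succs[0]):
                g.add_edge(preds[0], succs[0])
                g.remove_node(n)
                changed = True
        return changed

    g = nx.gnp_random_graph(1000, 0.002, directed=True)   # stand-in input graph
    while contract_chains(g):                              # iterate to a fixed point
        pass
    print(g)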


> Is producing a render really necessary for that decision, or is rendering the graph waste?

Just to be clear, the OP already has a graph. There are nodes and relationships. The graph can be queried for understanding.

Rendering the graph is tractable for a small graph or a portion of the graph.

Trying to render all the nodes in an enormous graph is almost always an expensive quixotic adventure.


Expensive quixotic adventure.

Perhaps that is the experience he was after for his billion node graph.


Cytoscape JS[1] with canvas rendering. Probably won't be able to do a billion nodes, but the last time I compared graph rendering libraries it was the best one in terms of performance/customizability. If you need even more performance, there's VivaGraphJS[2], which uses webgl to render.

If you want other resources, I also have a GitHub list of Graph-related libraries (visualizations etc.) on GitHub[3].

[1]: https://js.cytoscape.org/

[2]: https://github.com/anvaka/VivaGraphJS

[3]: https://github.com/stars/AlexW00/lists/graph-stuff


We use cytoscape for some of our genetics tools. It works well.

It does tend to "hairball" (technical term) at about 500+ nodes. That's not the tool's fault; large graphs just tend to be difficult.

It’s just hard to imagine visualizing a million plus nodes without doing some clustering first.


Hilbert curves (or similar) are often used for graphing billions of nodes[1]. However this will not by default show the relationships between nodes in a graph. Depending on your data you may be able to write a function to map from your edge list to a node index that hints at proximity.

Note that visualizations are limited by human perception to ~10000 elements, more usefully 1000 elements. You might try a force directed graph, perhaps a hierarchical variant wherein nodes can contain sub-graphs. Unless you have obvious root nodes, this variant would be interesting in that the user could start from an arbitrary set of nodes, giving different insights depending on their starting point.

1 - An excerpt from "Harder Drive", a rather silly implementation of a unix block device using ping latency with any host that will let him. He visualizes the full ipv4 address space in a hilbert curve at this offset: https://youtu.be/JcJSW7Rprio?si=0AlyMgaZjH7dmh5y&t=363
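For the curious, the index-to-coordinate conversion is only a few lines; a sketch of the standard algorithm (mapping your edge list onto the index, as mentioned above, is the domain-specific part):

    def d2xy(order: int, d: int) -> tuple[int, int]:
        """Map a 1-D Hilbert index d to (x, y) on a 2^order x 2^order grid."""
        x = y = 0
        t = d
        s = 1
        while s < (1 << order):
            rx = 1 & (t // 2)
            ry = 1 & (t ^ rx)
            if ry == 0:              # rotate the quadrant
                if rx == 1:
                    x, y = s - 1 - x, s - 1 - y
                x, y = y, x
            x += s * rx
            y += s * ry
            t //= 4
            s *= 2
        return x, y

    # Nodes with nearby indices land in nearby pixels:
    print([d2xy(2, d) for d in range(16)])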


Thank you for the link! I've recently created a small service on this topic: https://reversedns.space/

It visualizes the IPv4 space based on reverse DNS responses.


I might be prematurely classifying your question as an instance of the XY problem, but I worked at a company that tried to create something similar — a graph visualization system that could handle 100B nodes as part of our core product and... well... I would caution you not to do so if your purpose is something along those lines.

There's almost never a use case where a customer wants to see a gigantic graph. Or researchers. Or family members for that matter. People's brains just don't seem to mesh with giant graphs. Tiny graphs, sure. Sub-graphs that display relevant information, sure. The whole thing? Nah. Unless it's for an art project, in which case giant graphs can be pretty cool looking.


I'm reminded of the time back in the aughties when I was asked to help print a ~300,000 page PDF. That's about 30 boxes' worth of paper if you print double-sided. I spent an hour tracing the request back to its source and discovered that they really only wanted some specific pieces of information out of it. I extracted that information from the file and printed maybe 5 pages instead.

In moments like these your job is to not be the monkey's paw. Don't just blithely give them what they asked for. Ask more questions to find out what they're actually trying to accomplish, and help them compose a more specific request that's closer to what they actually want.


Asked to print a 300,000 page PDF you say? Almost sounds like it was meant for this guy:

https://www.psihoyos.com/image/I0000jtF1ui2j79Q


> Don't just blithely give them what they asked for.

Depends on how much they're paying you for it.


Knowing how and when to be consultative is a key soft skill that helps get you access to the higher end of the pay scale. It's how you demonstrate that you're an independent thinker who doesn't need to be micro-managed.


Just in case you are not being sarcastic: there is a thing called ethics, which is the basis of human relations.


Every man has a price though, whether that's a big cheque or a gun to your child's head.


Datashader is good for rendering large amounts of data; I'd start with that.

https://datashader.org/
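A minimal sketch of the Datashader flow, assuming node positions have already been computed (the layout itself is the hard part at this scale):

    import numpy as np
    import pandas as pd
    import datashader as ds
    import datashader.transfer_functions as tf

    # Stand-in for precomputed node positions.
    df = pd.DataFrame(np.random.randn(10_000_000, 2), columns=["x", "y"])

    canvas = ds.Canvas(plot_width=1200, plot_height=800)
    agg = canvas.points(df, "x", "y")                      # per-pixel counts
    tf.shade(agg, how="log").to_pil().save("nodes.png")    # shade and save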


It is somewhat old-school, but Gephi is by far the best graph visualization tool I've used that stays robust and usable at such scales (at least ~10M, but possibly a lot more).


And work is underway to help Gephi handle larger graphs:

https://gephi.wordpress.com/2024/06/13/gephi-week-2024-peek-...


I'm also looking for a graph viewing tool, but my wishlist is different (not all of them are hard requirements):

- Deal with 100k node graphs, preferably larger

- Interactive filtering tools, e.g. filtering by node or edge data, transitive closures, highlighting paths matching a condition. Preferably, filtering would result in minimal re-layout of the graph.

- Does not need very sophisticated layout algorithms if hiding or unranking nodes interactively is easy. E.g. centering on a node could lay out other nodes using the selected node as the root.

- Ability to feed live data externally, add/remove nodes and edges programmatically

- Clusters (nodes would tell which clusters they belong in)

I'm actually thinking of writing that tool some day, but it would of course be nicer if it already existed ;). I'm thinking applications like studying TLA+ state traces, visualizing messaging graphs or debug data in real time, visualizing the dynamic state of a network.

Also if you have tips on applicable Rust crates to help creating that, those are appreciated!


Neo4j has an interactive graph browser built in.


Thanks! I'll give it a go.


Have you tried Gephi?


I have not!

So I gave it a try and it seems quite a capable tool. It doesn't check all my boxes and is more cumbersome to use than I'd wish; e.g. it is able to find the shortest path between nodes, but activating that requires finding and entering the node ids manually.

However, this still seems the best tool I've ever seen for this purpose and it's also highly general. Thanks!


I had this question a few years back while working on a social network graph project and trying to render a multi-million-node graph. I tried Ogma and it worked quite well, but it became too slow when approaching a million nodes. I ended up writing my own renderer in C++ and then Rust. Code here: https://github.com/zdimension/graphrust

Tested it up to 5M nodes, renders above 60fps on my laptop's iGPU and on my Pixel 7 Pro. Turns out, drawing lots of points using shaders is fast.

Though like everybody else here said, you probably don't want to draw that many nodes. Create a lower-LoD version of the graph and render it instead.


As someone who's made graphing libraries for over a decade: Are you sure you want to visualize 1 billion nodes? What's the essential thing you're trying to see?

Visualizations are great at helping humans parse data, but usually they work best at human scales. A billion nodes is at best looking at clouds, rather than nodes, which can be represented otherwise.


I have a slightly different use case. I have a dependency graph of tasks, with each task having some attached info in the form of key-value pairs. I want to be able to easily visualize the complete graph, but then filter out stuff using conditions on the attached info. The filtering should hide those nodes/tasks in the graph and auto-resize the display. What would be a good solution to this?


My library (https://gojs.net) can do that easily. Give it a look, and if you think the price is acceptable for your project, contact us and we can make you a proof-of-concept.


You can visualise a graph with 9 billion nodes on https://www.openstreetmap.org :)

You could copy their design if you know how you want to project your nodes into 2D: essentially, divide the visualisation into a very large number of tiles generated at 18 different zoom levels; the 'slippy map' viewer then loads the tiles corresponding to the chosen field of view.

Then a PostGIS database alongside, letting you run a query to get all the nodes in a given rectangle - such as if you want to find the ID number of a given node.
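The tile lookup itself is trivial once you have 2D coordinates; a sketch assuming layout positions normalized into [0, 1):

    def tile_for(x: float, y: float, zoom: int):
        """Which z/x/y tile (OSM 'slippy map' convention) a node falls in."""
        n = 2 ** zoom                  # tiles per axis at this zoom level
        return zoom, int(x * n), int(y * n)

    print(tile_for(0.3, 0.7, 4))       # -> (4, 4, 11)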


I would guess OSM uses optimizations for Euclidean graphs, where the direct path a->c is never longer than a->b->c. This restriction makes problems like TSP far more tractable, but it does not hold for an arbitrary graph. I don't know if it also makes visualisation easier.


Technically, if you've got a bumpy dirt track a->c and a freeway a->b->c then the travel time on the latter route can be shorter.

Of course, they do get to dodge a major problem: That high-dimensional data is hard to visualise in an understandable way. Everyone knows what a map looks like, nobody knows what a clear visualisation of a set of 100-dimensional vectors looks like.


Most graphs of social networks done over at /r/dataisbeautiful seem to use Gephi.org and Kumu


Oh god I ran into this issue! Fewer nodes, but still.

I created an HTML page that used vis-network to create a force-directed nodegraph. I'd then just open it up and wait for it to settle.

The initial code is here, you should be able to dump it into an LLM to explain: https://github.com/HebeHH/skyrim-alchemy/blob/master/HTMLGra...

I later used d3 to do pretty much the same thing, but with a much larger graph (still only 100,000 nodes). That was pretty fragile though, so I added an `export to svg` button so you could load the graph, wait for it to settle, and then download the full thing. This kept good quality for zooming in and out.

However my nodegraphs were both incredibly messy, with many many connections going everywhere. That meant that I couldn't find a library that could work out how to lay it out properly first time, and needed the force-directed nature to spread them out. For your case of 1 billion nodes, force-directed may not be the way to go.



Repeating what others said here: I doubt anyone actually needs to see 1B (or 100B) nodes to make whatever decision they need to make. They probably need to see the X nodes that matter?

If you're fully "zoomed out", is seeing 1B individual nodes the most useful representation? Wouldn't some form of clustering be more useful? Same at intermediate levels.

D3 has all sorts of graphing tooling and is very powerful. It likely wouldn't handle 1B nodes (even if it did, your browser can't) but it has primitives to build graphs


At that size what you're actually looking for is a game engine with a particle system.


Wow. I had not thought of that. While the idea makes sense, even an MVP for this is going to be daunting.


Sent you an email with some details on a volumetric particle cloud in Blender, and Python to snake a harness onto that MVP using CUDA/other GPU signal tools...


I'd love to see a good solution for this. And it's not just the nodes, it's also the connections between them: https://taoofmac.com/static/graph


You could try a hypertree https://en.wikipedia.org/wiki/Hyperbolic_tree but that's usually for acyclic data.


Sigma.js is pretty good at rendering a ton of nodes and edges. I haven't tried it with a billion nodes though. https://www.sigmajs.org/


It won't work with hundreds of gigabytes of data. That's not its scope.


Out of curiosity: what wisdom do you intend to draw from visualising relations of single gut bacteria? Or is it grains of sand in the sea? How many of them will you zoom into? Maybe clustering would make things feasible.


I haven’t tried that many, but GraphPU was able to render 100s of millions for me in real time.

https://github.com/latentcat/graphpu


I remember Tulip could handle pretty huge ones, though no idea if it can manage billions.

https://tulip.labri.fr/site/


I was working on a node graph for the nftables Bison parser.

A blog post that covers the failures of large SVG viewers on graphs with 10,000+ nodes:

https://egbert.net/blog/articles/comparison-svg-viewers-larg...

More on https://egbert.net/blog/tags/graphviz.html


Thanks to everybody who replied. I will scale down my ambitions for now: how can I visualize a one-billion-node graph? Let's say I want to visualize the transistors in a modern AI chip (around 1B nodes). My original use case was to set a color on various components and visualize them. For example, all flops would have one color and all buffers another color, and then I wanted to visualize their distribution on the semiconductor die.


A couple of years ago, I had a similar issue. I don't have the code any more, but I wrote a converter from my output to a 3D model in a weekend and then threw it into Unreal. Then I put on my VR goggles and walked around the graph. It was much easier to deal with in three dimensions instead of two.

From there I could write better visualizations. I got laid off before the project was completed, though.


Try collapsing cycles into single nodes. In my experience, cycles are extremely low entropy. Those cycle nodes can then be explored in separate diagrams/pages. Explore more dimensions that allow you to collapse nodes. You effectively want to turn your graph into a data cube.
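Collapsing cycles is exactly condensation of strongly connected components, which networkx happens to have built in (a toy sketch):

    import networkx as nx

    g = nx.gnp_random_graph(10_000, 0.0005, directed=True)  # stand-in graph
    condensed = nx.condensation(g)      # one node per SCC, guaranteed acyclic
    print(g, "->", condensed)
    # condensed.nodes[n]["members"] lists the original nodes inside each SCC,
    # which is what you'd show when drilling into a collapsed cycle node.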


It would be a great thing for open source if someone improved the performance of dot/graphviz!


They're not meant for anywhere close to that scale. They deal with text descriptions and render full views at a time. That's never going to be usable with more than hundreds of nodes.


Tacking on a related question - what software should one use to interactively create/update/see a small graph?

Thinking specifically about a graph of knowledge, so will be an iterative process.

Just looking for anything more than a text editor really!


If it's text-heavy I'd recommend Obsidian


So the goal would be to eventually move that graph into running code and query it. But it’s never going to be large - easily fit in memory.

Obsidian is PKM right? Does it have the idea of labels on the edges?


Right, in that case Obsidian might not be the best choice.

I'd look for Knowledge Graph editors then, for RDF or OWL knowledge bases, I don't have any specific recommendation, there are many but most are rather old.

Alternatively, go for a graph database like Neo4j. It's primarily designed as a database and not an editor, but it does have a nice UI to visualize and change things by hand.



dot/vimdot/graphviz promise to do it, but I was not able to get them to work yet.


100B is going to require something custom, and tbh I'd be surprised if you can get any useful information from that. But try Gephi. It can at least go into the millions of nodes. Not sure about billions.


For billions of nodes, there are two options: Graphistry (and it might be less than that, 100M is OK), and Pajek, which is weird, but can handle billions of nodes.

Neo4j, cytoscape, etc will not work.


Forget a billion.

I'm finding even 10's of thousands can be difficult.

Just generally, is there a list of visualization products that is broken down by how many nodes they can handle?


As someone working on such a visualization product ... it's complex ... and often the wrong question.

While you can envision ways of laying out and rendering such large graphs (force-directed layout is frequent, as are hardware-accelerated rendering methods that typically only show nodes with size and color, but little more complex than that), you don't just want to stare at a pretty hairball. Graphs have structure, which the correct layout will emphasize or even make visible. And you want to be able to explore or interact with the data. And there's where this often breaks down.

If you're just interested in part of the data, reduce the graph to that part. Makes layout, rendering, and interaction way easier.

If you have ways of grouping or clustering the data beforehand, reduce the graph to the clusters and then drill down into them.

You might get lucky and your data already has a structure that's well suited for fast layout algorithms, and the same structure makes it easy to figure out which part you want to look at more closely. But in my experience that's rare.

Most requests for large graphs from customers come from requirements of the software (e.g. “should be able to handle 100k nodes and as many edges at 60 fps with a load time of no more than 2 seconds”) written by someone who pulled more or less reasonable maximum numbers from thin air, or from just looking at the amount of data without really having an idea of how to work with it and just wondering whether all that can somehow be turned into pixels. Dedicating less than a pixel on the screen to each node is very frequently not helpful, even though a visualization product may very well advertise that they can handle it. It may make for pretty pictures, but often not very useful ones.

There are a number of posts on the topic, e.g.

https://cambridge-intelligence.com/how-to-fix-hairballs/

https://www.yworks.com/pages/smooth-visualization-of-big-dat...


Maybe you could turn it into a sparse matrix, hit it with a couple different reorderings, do some matvecs, and see if that gives you any insight into it?
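Something like that with scipy, using reverse Cuthill-McKee as one of the "couple different reorderings" and a spy plot instead of a node-link drawing:

    import scipy.sparse as sp
    from scipy.sparse.csgraph import reverse_cuthill_mckee
    import matplotlib.pyplot as plt

    a = sp.random(5000, 5000, density=1e-4, format="csr")  # stand-in adjacency
    perm = reverse_cuthill_mckee(a)
    reordered = a[perm][:, perm]       # apply the permutation to rows and columns
    plt.spy(reordered, markersize=0.2)
    plt.savefig("structure.png")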


There was a product I personally liked called Graphistry, though it isn't free, per se. Its founder @lmeyerov is brilliant in this space.


The amount of information in a graph that big is on the order of 10^21. You can't meaningfully "visualize" it.




Do you just hold this number of nodes in the database, or do you also need to visualize them all in one view?


Visualize them in one view, so I can understand how data is flowing from input to output.


Graphistry


Deck.gl’s PointCloudLayer


Dude is casually asking about software to visualize a graph with a size comparable to the whole internet...


Maybe it represents the connection between transistors in a chip. Could easily be hundreds of billions of nodes, probably a lot of structure to the edges though.


ArangoDB or Neo4j


Unformatted csv and you scroll down through it real fast


Statistics


try dGraph or Aerospike


Your Google Takeout? This sort of thing is why I left /s



