PageRank and similar ranking algorithms on graphs can be used to detect monitoring attempts in P2P botnets [0] (a botmaster can detect when researchers/law enforcement start monitoring a botnet).
For my master's thesis, I evaluated these algorithms and tried to find ways to prevent detection, and came to the conclusion that it is hard unless you deploy about as many sensor nodes as the botnet has active peers. Those algorithms work really well and are hard to work around.
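Roughly, the intuition (heavily simplified, and not the exact metrics from my thesis) is that sensor nodes get announced by many peers but rarely announce valid peers themselves, so their rank ends up far from a typical bot's. A toy sketch with NetworkX:

    # Toy illustration only: sensors accumulate many incoming edges while
    # contributing few outgoing ones, so their PageRank tends to stand out.
    # The threshold factor is arbitrary and purely illustrative.
    import networkx as nx

    def flag_suspicious_peers(peer_graph: nx.DiGraph, factor: float = 5.0):
        """Return peers whose PageRank deviates strongly from the average."""
        ranks = nx.pagerank(peer_graph, alpha=0.85)
        avg = sum(ranks.values()) / len(ranks)
        return {peer: r for peer, r in ranks.items() if r > factor * avg}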
I first heard about graph databases 13 months ago... there were so many questions in my head, I'm sure my colleagues still have PTSD when my name pops up on Slack... :D
If you also have questions about PageRank - one of the most widely used algorithms in graph databases - I've tried to give examples of how it is used across various use cases!
What have I missed? What other questions do you ask yourself about PageRank? What other graph algorithms do you have questions about?
BTW, is there a good "graph layer" for SQLite? I understand that graph databases use specific data structures to optimize for graph queries instead of row-oriented storage, but sometimes you need something in the middle: representing graphs and doing basic queries.
It's nearly a decade old at this point and probably not optimal even for the time, so there may be better approaches, but here's a few basic algorithms on SQLite. PageRank is at the bottom.
I'm interested in this too. My hunch is that SQLite would be a particularly good fit for a whole bunch of queries thanks to this characteristic: https://www.sqlite.org/np1queryprob.html - "Many Small Queries Are Efficient In SQLite"
An algorithm that traverses a graph by performing hundreds of individual SELECT queries to follow a path should work much better against SQLite than against most other relational databases, due to the lack of network overhead in making each of those queries.
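For what it's worth, here is roughly what that pattern looks like (a minimal sketch, assuming an edges(src, dst) table; each hop is one tiny SELECT):

    # BFS that issues one small SELECT per expanded node - cheap in SQLite
    # because a query is a function call into the library, not a network round trip.
    import sqlite3
    from collections import deque

    def reachable_from(conn: sqlite3.Connection, start):
        seen, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for (neighbor,) in conn.execute(
                "SELECT dst FROM edges WHERE src = ?", (node,)
            ):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        return seen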
I was building something like this a few years ago - NetworkXum [1].
But now we just use a pure Graph implementation with NetworkX interface - UKV [2].
Common graph databases are network-based for scaling purposes: you have a database server and you talk to that server for your queries. SQLite is a single-file database - there is no server, so you don't need an extra layer on the database side.
Instead, run the graph algorithms on a stringified JSON stored as text in SQLite. They will be running in some process anyway.
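For example, something along these lines (table and column names are made up, and I'm using NetworkX for the algorithm itself):

    # Pull a JSON adjacency list out of a TEXT column and run the algorithm
    # in-process; SQLite is only used as the storage layer here.
    import json
    import sqlite3
    import networkx as nx

    conn = sqlite3.connect("graphs.db")
    (raw,) = conn.execute(
        "SELECT payload FROM graphs WHERE name = ?", ("web",)
    ).fetchone()

    adjacency = json.loads(raw)  # e.g. {"a": ["b", "c"], "b": ["c"]}
    g = nx.DiGraph()
    g.add_edges_from((src, dst) for src, nbrs in adjacency.items() for dst in nbrs)
    print(nx.pagerank(g))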
Looks like SQLite supports WITH RECURSIVE, which allows some graph-oriented queries: https://www.sqlite.org/lang_with.html - the Mandelbrot query looks interesting :D
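A quick sketch of what that can look like (again assuming a simple edges(src, dst) table) - the recursive CTE does the whole traversal inside SQLite:

    # Transitive reachability with WITH RECURSIVE; UNION (not UNION ALL)
    # deduplicates rows, so cycles in the graph terminate the recursion.
    import sqlite3

    conn = sqlite3.connect("graph.db")
    rows = conn.execute(
        """
        WITH RECURSIVE reach(node) AS (
            SELECT ?                -- start node
            UNION
            SELECT e.dst FROM edges e JOIN reach r ON e.src = r.node
        )
        SELECT node FROM reach
        """,
        ("a",),
    ).fetchall()
    print([node for (node,) in rows])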
echoing _frkl: yes, kuzu aims to fill exactly this space - an easy-to-use DBMS that lets you model your records as a graph and do common querying, transformations, and extractions, all in a high-level graph-specific query language. We are very new and have quite a way to go, but it already implements many Cypher clauses, so many things can already be done.
Does Google even use PageRank anymore? The search results are so bad these days that I'm dubious about how many other websites actually reference the top results they show.
PageRank was never as important as people thought it was.
Think of it this way: a search engine needs a relevance score that connects a query to a document. If the number of documents is vast (e.g. billions and billions) a search engine also benefits from a document-dependent quality score.
The first is more important than the second. You'd rather get a poor quality document that is relevant to the topic than a high quality document which isn't relevant.
It took several years before papers in the literature found PageRank useful in search results; the key thing is that you need a real excess of documents. With millions of documents you are better off without it (being more effective at finding relevant documents improves performance). You really need 100 million or more before you have so many relevant documents for typical queries that filtering on quality doesn't get in the way of relevance.
PageRank can be thought of as simulating a Markov process where a user clicks a random link on a page most of the time but with some probability jumps to an entirely random page. PageRank is proportional to the probability that a user visits the page, or alternatively to how much traffic a page gets.
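In code, that random-surfer model is just a few lines of power iteration (a bare-bones sketch: uniform teleport, and it assumes every page appears as a key in the link map):

    # Power iteration for the random-surfer model: with probability d follow a
    # random outgoing link, otherwise jump to a uniformly random page.
    def pagerank(links: dict[str, list[str]], d: float = 0.85, iters: int = 50):
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iters):
            nxt = {p: (1.0 - d) / len(pages) for p in pages}
            for page, outgoing in links.items():
                for target in outgoing:
                    nxt[target] += d * rank[page] / len(outgoing)
            rank = nxt
        return rank

    print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))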
Google very quickly developed a few ways to sample this directly, such as (1) making Google Analytics almost ubiquitous, (2) making Google Ads almost ubiquitous, and (3) collecting analytics from the Chrome browser.
Google denies using the above for ranking, but they've been known to lie about Google's relevance factors before. Even a small sample from the above 3 could be used to calibrate models based on other info.
There are known link structure metrics besides PageRank that both scale better and are harder to game, though I'm not sure how their effectiveness in ranking compares to PageRank in the un-gamed case.
I used to work on Google web search indexing, almost 20 years ago, and (1) it has been public knowledge since around the time I left that PageRank was just one of a plethora of ranking signals going into the Learn To Rank page ranking ML and (2) I strongly suspect that PageRank itself has been replaced by a somewhat similar reverse link weighting algorithm.
I strongly suspect that something roughly PageRank-like is still there deep in the bowels of ranking, and due to the non-linear nature of ML, its importance probably varies greatly page to page.
For one, the TrustRank paper discusses some variations on link structure ranking, some of which are less computationally intensive than PageRank for incrementally updating. Back 20 years ago, Google News and the more frequently updated web pages would have an incremental estimated PageRank patched in for updates that came faster than PageRank was re-calculated.
With the web the SEO practitioner can affect the inputs to the model and spend money to do so.
With the other examples I am not sure they can get an advantage by gaming it. In addition, the PageRank models and how they weight it are more secret than Google's. Google's algorithm is secret but leaky (the search results are public).
[0]: https://doi.org/10.1145/3098954.3098991