Hacker News new | past | comments | ask | show | jobs | submit login
PageRank algorithm for graph databases (memgraph.com)
241 points by vpavicic on Jan 30, 2023 | hide | past | favorite | 39 comments



PageRank and similar ranking algorithms on graphs can be used to detect monitoring attempts in P2P botnets [0] (a botmaster can detect when researchers/law enforcement start monitoring a botnet).

For my master's thesis, I evaluated these algorithms and tried to find ways to prevent detection and came to the conclusion that it is hard if you don't want to deploy about as much sensor nodes as the botnet has active peers. Those algorithms are working really well and are hard do work around.

[0]: https://doi.org/10.1145/3098954.3098991


Thank you for your use case! I'll take a look at your thesis! :)


The linked paper is not my thesis but one of the foundational works, I based my thesis on. Mine can be found here: https://git.vbrandl.net/vbrandl/masterthesis/raw/branch/mast...


Ahh! Thnx! I was sloppy... just bookmarked it for later, didn't even check!


Great stuff, thanks for sharing.


I first heard about graph databases 13 months ago... there were so many questions in my head, I'm sure my colleagues still have PTSP when my name pops-up on Slack... :D If you also have questions about one of the most used algorithms used in graph databases - PageRank - I've tried to give examples of usage in various use cases!

What have I missed? What other questions do you ask yourself about PageRank? What other graph algorithms do you have questions about?


BTW, is there a good "graph layer" for SQLite? I understand that graph databases use specific data structures to optimize for graph queries instead of row-oriented but sometimes you need something in the middle: representing graphs and doing basic queries.


It's nearly a decade old at this point and probably not optimal even for the time, so there may be better approaches, but here's a few basic algorithms on SQLite. PageRank is at the bottom.

https://github.com/robmccoll/graphdb-testing/blob/master/tes...

The graph is just an indexed edge list with auxiliary tables to support algorithms as needed.


Sounds like EdgeDB [0] might be of interest to you

[0] https://www.edgedb.com


> "That said, EdgeDB is built on top of Postgres."


That made my day, thanks for sharing it


Cozo (cozodb.org) is a new embedded graph database that has many available storage backends, including SQLite.

It’s not a thin layer, though, if that’s what you’re looking for. Cozo has its own query system and uses a datalog-like query language.


I'm interested in this too. My hunch is that SQLite would be a particularly good fit for a whole bunch of queries thanks to this characteristic: https://www.sqlite.org/np1queryprob.html - "Many Small Queries Are Efficient In SQLite"

An algorithm that traverses a graph by performing hundreds of individual SELECT queries to follow a path should work much better against SQLite than against most other relational databases, due to the lack of network overhead in making each of those queries.


This should make it a lot better: https://github.com/abetlen/sqlite3-bfsvtab-ext


I was building something like this a few years ago - NetworkXum [1]. But now we just use a pure Graph implementation with NetworkX interface - UKV [2].

[1]: https://github.com/unum-cloud/networkxum [2]: https://github.com/unum-cloud/ukv


Common graph databases are network-based for scaling purposes. You do have a database server then you talk to the server somehow for the queries. Sqlite is a single-file database. You do not have a server so you do not need an extra layer in database side.

Instead, Run the graph algorithms on a stringifed json stored as a text in sqlite. They will be running in some process in anyway.


SimpleGraph might be worth a look. I've found it useful. https://github.com/dpapathanasiou/simple-graph


looks like SQLite supports with recursive which allows some graph oriented queries: https://www.sqlite.org/lang_with.html the mandlebrot query looks interesting :D

I haven't used it on SQLite but I have done basic graph queries in postgres with `with recursive` some examples can be found here: https://www.alibabacloud.com/blog/postgresql-graph-search-pr...


Not sqlite, but kuzu ( https://github.com/kuzudb/kuzu ) is an interesting project in this space. Fairly new, but already quite impressive IMHO.


Don't know about SQLite, https://github.com/apache/age seems very interesting, but it's just a tiny layer on top of Postgres :)


echoing _frkl: yes, kuzu aims to fill exactly this space. an easy to use dbms that gives you the ability to model your records as a graph, do common querying and transformations, and extractions all in a high-level graph-specific query language. we are very new and have quite a lot to go but it still implements many cypher clauses, so many things can still be done.


Does Google still use PageRank anymore? The search results are so bad these days, I'm dubious on how many other websites reference the top results they show.


PageRank was never as important as people thought it was.

Think of it this way: a search engine needs a relevance score that connects a query to a document. If the number of documents is vast (e.g. billions and billions) a search engine also benefits from a document-dependent quality score.

The first is more important than the second. You'd rather get a poor quality document that is relevant to the topic than a high quality document which isn't relevant.

It took several years before papers in the literature came out that found PageRank useful in search results, the key thing is that you need a real excess of documents. With millions of documents you are better off without it (being more effective at finding relevant documents improves performance), you really need 100 million + to reach the point where you have so many relevant documents for typical queries that filtering on quality doesn't get in the way of relevance.

PageRank can be thought of as simulating a Markov process where a user clicks a random link on a page most of the time but with some probability jumps to an entirely random page. PageRank is proportional to the probability that a user visits the page, or alternately how much traffic a page gets.

Google very quickly developed a few ways to sample this directly, such as (1) making Google analytics almost ubiquitous, (2) making Google ads almost ubiquitous, (3) analytics from the Chrome browser.

Google denies using the above for ranking, but they've been known to lie about Google's relevance factors before. Even a small sample from the above 3 could be used to calibrate models based on other info.


There are known link structure metrics besides PageRank that both scale better and are harder to game, though I'm not sure how their effectiveness in ranking compares to PageRank in the un-gamed case.

I used to work on Google web search indexing, almost 20 years ago, and (1) it has been public knowledge since around the time I left that PageRank was just one of a plethora of ranking signals going into the Learn To Rank page ranking ML and (2) I strongly suspect that PageRank itself has been replaced by a somewhat similar reverse link weighting algorithm.

I strongly suspect that something roughly PageRank-like is still there deep in the bowels of ranking, and due to the non-linear nature of ML, its importance probably varies greatly page to page.


>There are known link structure metrics besides PageRank that both scale better and are harder to game

Any examples?


For one, the TrustRank paper discusses some variations on link structure ranking, some of which are less computationally intensive than PageRank for incrementally updating. Back 20 years ago, Google News and the more frequently updated web pages would have an incremental estimated PageRank patched in for updates that came faster than PageRank was re-calculated.

https://en.wikipedia.org/wiki/TrustRank


Thanks, this is new to me.

paper link:

http://ilpubs.stanford.edu:8090/770/1/2004-52.pdf


Did you see the news that has been published today about Yandex? Maybe there is something that could be interesting to you https://searchengineland.com/yandex-search-ranking-factors-l...


Hi, I need a database PageRank graph to apply a calculation of PageRank algorithm to find the score of these pages in the database.


page rank is just one of many centrality measures of graphs.


Wanting to load Javascript from 18 domains sounds like record spam to me. More than on your typical russian movie streaming backdooring attempts



really nice article.


Thank you, I'm glad you liked it! What other graph algorithms would you like to know more about?


Yea sure please tell me more


PageRank can be gamed, though. See SEO.


The PageRank algorithm can be used for so much more than... well... page ranking :) hence the article :)


Yes, it would have been nice if the article looked into ways in which the algorithm could be gamed, for the proposed applications.


With the web the SEO practitioner can affect the inputs to the model and spend money to do so.

With the other examples I am not sure they can get an advantage by gaming it. In addition the pangrank models and how they weight it are more secret than Google. Google’s algorithm is secret but leaky (the search results are public)




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: