PageRank and similar ranking algorithms on graphs can be used to detect monitoring attempts in P2P botnets [0] (a botmaster can detect when researchers/law enforcement start monitoring a botnet).
For my master's thesis, I evaluated these algorithms and tried to find ways to prevent detection, and came to the conclusion that it is hard unless you deploy about as many sensor nodes as the botnet has active peers. Those algorithms work really well and are hard to work around.
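Roughly, the intuition (heavily simplified, and not the exact metrics from my thesis) is that sensor nodes get announced by many peers but rarely announce valid peers themselves, so their rank ends up far from a typical bot's. A toy sketch with NetworkX:

    # Toy illustration only: sensors accumulate many incoming edges while
    # contributing few outgoing ones, so their PageRank tends to stand out.
    # The threshold factor is arbitrary and purely illustrative.
    import networkx as nx

    def flag_suspicious_peers(peer_graph: nx.DiGraph, factor: float = 5.0):
        """Return peers whose PageRank deviates strongly from the average."""
        ranks = nx.pagerank(peer_graph, alpha=0.85)
        avg = sum(ranks.values()) / len(ranks)
        return {peer: r for peer, r in ranks.items() if r > factor * avg}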
I first heard about graph databases 13 months ago... there were so many questions in my head, I'm sure my colleagues still have PTSD when my name pops up on Slack... :D
If you also have questions about PageRank - one of the most widely used algorithms in graph databases - I've tried to give examples of how it is used across various use cases!
What have I missed? What other questions do you ask yourself about PageRank? What other graph algorithms do you have questions about?
BTW, is there a good "graph layer" for SQLite? I understand that graph databases use specific data structures to optimize for graph queries instead of row-oriented storage, but sometimes you need something in the middle: representing graphs and doing basic queries.
It's nearly a decade old at this point and probably not optimal even for the time, so there may be better approaches, but here's a few basic algorithms on SQLite. PageRank is at the bottom.
I'm interested in this too. My hunch is that SQLite would be a particularly good fit for a whole bunch of queries thanks to this characteristic: https://www.sqlite.org/np1queryprob.html - "Many Small Queries Are Efficient In SQLite"
An algorithm that traverses a graph by performing hundreds of individual SELECT queries to follow a path should work much better against SQLite than against most other relational databases, due to the lack of network overhead in making each of those queries.
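For what it's worth, here is roughly what that pattern looks like (a minimal sketch, assuming an edges(src, dst) table; each hop is one tiny SELECT):

    # BFS that issues one small SELECT per expanded node - cheap in SQLite
    # because a query is a function call into the library, not a network round trip.
    import sqlite3
    from collections import deque

    def reachable_from(conn: sqlite3.Connection, start):
        seen, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for (neighbor,) in conn.execute(
                "SELECT dst FROM edges WHERE src = ?", (node,)
            ):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        return seen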
I was building something like this a few years ago - NetworkXum [1].
But now we just use a pure Graph implementation with NetworkX interface - UKV [2].
Common graph databases are network-based for scaling purposes: you have a database server and you talk to that server for your queries. SQLite is a single-file database - there is no server, so you don't need an extra layer on the database side.
Instead, run the graph algorithms on a stringified JSON stored as text in SQLite. They will be running in some process anyway.
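For example, something along these lines (table and column names are made up, and I'm using NetworkX for the algorithm itself):

    # Pull a JSON adjacency list out of a TEXT column and run the algorithm
    # in-process; SQLite is only used as the storage layer here.
    import json
    import sqlite3
    import networkx as nx

    conn = sqlite3.connect("graphs.db")
    (raw,) = conn.execute(
        "SELECT payload FROM graphs WHERE name = ?", ("web",)
    ).fetchone()

    adjacency = json.loads(raw)  # e.g. {"a": ["b", "c"], "b": ["c"]}
    g = nx.DiGraph()
    g.add_edges_from((src, dst) for src, nbrs in adjacency.items() for dst in nbrs)
    print(nx.pagerank(g))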
Looks like SQLite supports WITH RECURSIVE, which allows some graph-oriented queries: https://www.sqlite.org/lang_with.html - the Mandelbrot query looks interesting :D
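A quick sketch of what that can look like (again assuming a simple edges(src, dst) table) - the recursive CTE does the whole traversal inside SQLite:

    # Transitive reachability with WITH RECURSIVE; UNION (not UNION ALL)
    # deduplicates rows, so cycles in the graph terminate the recursion.
    import sqlite3

    conn = sqlite3.connect("graph.db")
    rows = conn.execute(
        """
        WITH RECURSIVE reach(node) AS (
            SELECT ?                -- start node
            UNION
            SELECT e.dst FROM edges e JOIN reach r ON e.src = r.node
        )
        SELECT node FROM reach
        """,
        ("a",),
    ).fetchall()
    print([node for (node,) in rows])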
echoing _frkl: yes, kuzu aims to fill exactly this space - an easy-to-use DBMS that lets you model your records as a graph and do common querying, transformations, and extractions, all in a high-level graph-specific query language. We are very new and have quite a way to go, but it already implements many Cypher clauses, so many things can already be done.
Does Google even use PageRank anymore? The search results are so bad these days that I'm dubious about how many other websites actually reference the top results they show.
PageRank was never as important as people thought it was.
Think of it this way: a search engine needs a relevance score that connects a query to a document. If the number of documents is vast (e.g. billions and billions) a search engine also benefits from a document-dependent quality score.
The first is more important than the second. You'd rather get a poor quality document that is relevant to the topic than a high quality document which isn't relevant.
It took several years before papers in the literature found PageRank useful in search results; the key thing is that you need a real excess of documents. With millions of documents you are better off without it (being more effective at finding relevant documents improves performance). You really need 100 million or more before you have so many relevant documents for typical queries that filtering on quality doesn't get in the way of relevance.
PageRank can be thought of as simulating a Markov process where a user clicks a random link on a page most of the time but with some probability jumps to an entirely random page. PageRank is proportional to the probability that a user visits the page, or alternatively to how much traffic a page gets.
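In code, that random-surfer model is just a few lines of power iteration (a bare-bones sketch: uniform teleport, and it assumes every page appears as a key in the link map):

    # Power iteration for the random-surfer model: with probability d follow a
    # random outgoing link, otherwise jump to a uniformly random page.
    def pagerank(links: dict[str, list[str]], d: float = 0.85, iters: int = 50):
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iters):
            nxt = {p: (1.0 - d) / len(pages) for p in pages}
            for page, outgoing in links.items():
                for target in outgoing:
                    nxt[target] += d * rank[page] / len(outgoing)
            rank = nxt
        return rank

    print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))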
Google very quickly developed a few ways to sample this directly, such as (1) making Google Analytics almost ubiquitous, (2) making Google Ads almost ubiquitous, and (3) collecting analytics from the Chrome browser.
Google denies using the above for ranking, but they've been known to lie about Google's relevance factors before. Even a small sample from the above 3 could be used to calibrate models based on other info.
There are known link structure metrics besides PageRank that both scale better and are harder to game, though I'm not sure how their effectiveness in ranking compares to PageRank in the un-gamed case.
I used to work on Google web search indexing, almost 20 years ago, and (1) it has been public knowledge since around the time I left that PageRank was just one of a plethora of ranking signals going into the Learn To Rank page ranking ML and (2) I strongly suspect that PageRank itself has been replaced by a somewhat similar reverse link weighting algorithm.
I strongly suspect that something roughly PageRank-like is still there deep in the bowels of ranking, and due to the non-linear nature of ML, its importance probably varies greatly page to page.
For one, the TrustRank paper discusses some variations on link structure ranking, some of which are less computationally intensive than PageRank for incrementally updating. Back 20 years ago, Google News and the more frequently updated web pages would have an incremental estimated PageRank patched in for updates that came faster than PageRank was re-calculated.
With the web the SEO practitioner can affect the inputs to the model and spend money to do so.
With the other examples I am not sure they can get an advantage by gaming it. In addition, the PageRank models and how they weight it are more secret than Google's. Google's algorithm is secret but leaky (the search results are public).
[0]: https://doi.org/10.1145/3098954.3098991