ignoring the problem of robots.txt inaccessibility, is it feasible to have a Kagi-style "private Google" with a more limited number of high-signal-to-noise sites, especially if you drop e-commerce and some other low-SNR feeds?

perhaps one interesting thing is that a decent number of the highest-SNR feeds don't actually need to be crawled at all - Wikipedia, Reddit, etc. are available as dumps and you can ingest their content directly. And the sources I'm most interested in for my hobbies (technical data around cameras, computer parts, aircraft, etc.) tend to be mostly static web-1.0 sites that basically never change. There's some stuff that falls in between - I'm not sure if random other wikis necessarily have takeout-style dumps - but Fandom and a couple of other mega-wikis probably contain a majority of the interesting content, or at least a large enough amount that you could get meaningful results.
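
For the dump route, here's roughly what streaming text out of a Wikipedia pages-articles dump looks like - a minimal Python sketch, where the dump filename and the namespace handling are illustrative assumptions rather than anything official:

    # Minimal sketch: stream (title, wikitext) pairs out of a Wikipedia
    # pages-articles dump without loading the whole file into memory.
    # The filename is illustrative; grab the real dump from dumps.wikimedia.org.
    import bz2
    import xml.etree.ElementTree as ET

    DUMP = "enwiki-latest-pages-articles.xml.bz2"  # assumed local file

    def iter_pages(path):
        """Yield (title, wikitext) for each <page> in a MediaWiki XML dump."""
        title, text = None, ""
        with bz2.open(path, "rb") as fh:
            for _event, elem in ET.iterparse(fh, events=("end",)):
                tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace
                if tag == "title":
                    title = elem.text
                elif tag == "text":
                    text = elem.text or ""
                elif tag == "page":
                    yield title, text
                    elem.clear()  # free finished pages as we stream

    if __name__ == "__main__":
        for i, (title, text) in enumerate(iter_pages(DUMP)):
            print(title, len(text))
            if i >= 9:
                break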

Another interesting one would be if you could get the Internet Archive to give you "slices" of sites in a Google Takeout-style format. They've already scraped a great deal of content, so if I want site X - say, the most recent non-404 versions of all pages in a given domain - it would be fantastic if they could just build that as a zip and dump it over in bulk. In fact, a lot of the best technical content is no longer available on the live web, unfortunately...
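
Short of a bulk takeout, the Wayback Machine's CDX API can at least enumerate what they have under a domain. A small sketch of the "latest non-404 capture of every page" idea - the endpoint and query parameters are the public CDX API as I understand it, so double-check them against the CDX server docs:

    # Sketch: enumerate the latest 200-status capture of every URL the
    # Wayback Machine has under a domain, via the public CDX API.
    import requests

    def latest_captures(domain, limit=5000):
        params = {
            "url": f"{domain}/*",        # everything under the domain
            "output": "json",
            "filter": "statuscode:200",  # skip 404s, redirects, errors
            "fl": "original,timestamp",
            "limit": str(limit),
        }
        rows = requests.get("https://web.archive.org/cdx/search/cdx",
                            params=params, timeout=60).json()
        latest = {}
        for original, timestamp in rows[1:]:  # first row is the header
            if timestamp > latest.get(original, ""):
                latest[original] = timestamp
        for original, timestamp in sorted(latest.items()):
            # The URL you would actually fetch for that capture.
            yield f"https://web.archive.org/web/{timestamp}/{original}"

    if __name__ == "__main__":
        for url in latest_captures("example.com"):
            print(url)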

(did fh-reddit ever update again? or is there a way to get Pushshift to give you a bulk dump of everything? they stopped back in like 2019 and I'm not sure if they ever got back into it; it wasn't on BigQuery last time I checked. Kind of a bummer too.)

I say exclude e-commerce because there's not a lot of informational value in knowing the 27 sites selling a given video card (especially as a few megaretailers crush all the competition anyway), but there is a lot of informational value in, say, having a copy of the sites of Asus, ASRock, Gigabyte, MSI, etc. for searching (though you probably don't want full binaries cached).

But basically I think there's probably like, sub-100 TB of content that would even be useful to me, if stored in some kind of relatively dense representation (Reddit post/comment dumps rather than rendered pages, same for other forum content, etc., stored on a gzip level-5 compressed filesystem or something). That's easily within reach of a small server. I'm not sure PageRank would work as well without all the "noise" linking into it and telling you where the signal is, but I think that's well within typical r/datahoarder-level builds. And you could dynamically augment that from the live internet and the Internet Archive as needed - just treat it as an ever-growing cache and index your hoard.
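
As a toy version of the "dense hoard plus index" idea: stream a Pushshift-style comment dump (zstd-compressed, newline-delimited JSON) into a SQLite FTS5 index. The field names, dump filename, and schema here are assumptions; adjust to whatever your dumps actually contain:

    # Toy "hoard + index": stream a zstd-compressed ndjson Reddit comment
    # dump into a SQLite FTS5 table for local full-text search.
    import io
    import json
    import sqlite3
    import zstandard  # pip install zstandard

    def ingest(dump_path, db_path="hoard.db", batch=10_000):
        db = sqlite3.connect(db_path)
        db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS comments "
                   "USING fts5(subreddit, author, body)")
        dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
        rows = []
        with open(dump_path, "rb") as fh:
            text = io.TextIOWrapper(dctx.stream_reader(fh), encoding="utf-8")
            for line in text:
                c = json.loads(line)
                rows.append((c.get("subreddit"), c.get("author"), c.get("body")))
                if len(rows) >= batch:
                    db.executemany("INSERT INTO comments VALUES (?, ?, ?)", rows)
                    rows.clear()
        if rows:
            db.executemany("INSERT INTO comments VALUES (?, ?, ?)", rows)
        db.commit()
        return db

    if __name__ == "__main__":
        db = ingest("RC_2019-06.zst")  # illustrative dump filename
        query = "SELECT subreddit, author FROM comments WHERE comments MATCH ? LIMIT 10"
        for row in db.execute(query, ("thinkpad",)):
            print(row)

SQLite FTS5 obviously isn't a ranked search engine, but at personal-hoard scale a plain full-text index over the dump text already gets you useful results.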




It sounds like CommonCrawl: https://commoncrawl.org/the-data/get-started/

You can download it, put it into ClickHouse, and get your own professional search engine.

I've made up the term "professional search engine". It's something like Google, but:

- accessible to a few people, not publicly available;
- does not have sophisticated ranking or quorum pruning, and simply gives you all the matched results;
- queries can be performed in SQL, and the results additionally aggregated and analyzed;
- full brute-force search is feasible.
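
A minimal sketch of that pipeline, assuming a WET file (Common Crawl's extracted plain text) already downloaded locally and a ClickHouse server on localhost; the table name, schema, and filename are made up for illustration, and the warcio / clickhouse-connect calls are used as I understand those libraries:

    # Sketch: load one Common Crawl WET file into ClickHouse and
    # brute-force it with SQL.
    import clickhouse_connect                            # pip install clickhouse-connect
    from warcio.archiveiterator import ArchiveIterator   # pip install warcio

    WET_FILE = "CC-MAIN-example.warc.wet.gz"  # assumed local WET segment

    client = clickhouse_connect.get_client(host="localhost")
    client.command("CREATE TABLE IF NOT EXISTS cc_pages "
                   "(url String, body String) ENGINE = MergeTree ORDER BY url")

    rows = []
    with open(WET_FILE, "rb") as fh:
        for record in ArchiveIterator(fh):
            if record.rec_type != "conversion":  # WET plain-text records
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            body = record.content_stream().read().decode("utf-8", "replace")
            rows.append((url, body))
            if len(rows) >= 1000:
                client.insert("cc_pages", rows, column_names=["url", "body"])
                rows.clear()
    if rows:
        client.insert("cc_pages", rows, column_names=["url", "body"])

    # "Gives you all the matched results": brute-force substring scan, no ranking.
    hits = client.query(
        "SELECT url FROM cc_pages WHERE positionCaseInsensitive(body, %(q)s) > 0 LIMIT 50",
        parameters={"q": "thinkpad x230"})
    for (url,) in hits.result_rows:
        print(url)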

PS. Yes, the Reddit dataset stopped updating.


With a source corpus that's high enough quality, you probably don't even need PageRank; regular full-text search like Lucene would be enough.


Sounds crazy hard - a lot of moving parts, both human and technical. But if pulled off right I'd pay for that kind of thing, preferably hosted and looked after for me in a datacenter.


I'd like a browser plugin that lets me vote websites up or down. The votes would go to a central community database. Highly upvoted sites would be crawled, archived, and made available through a specialized search page.


That used to sort of exist: https://en.wikipedia.org/wiki/StumbleUpon


The users of the plugin, if it's at all successful, will include SEO types. Without a way of separating quality input from generic promotional input, this plugin will not scale.


as chalst says, once your site becomes successful, spammers will make accounts and upvote their spam and downvote the good stuff.

your solution reduces to the reputation-network problem: it works if everyone is a good actor, or if known-good actors (people you know personally) can "vouch" for others across the network (perhaps with reductions in vouch-iness as you move outward - a friend of a friend is good, a friend of a friend of a friend is OK, four degrees out maybe not so much).
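
A toy sketch of that distance-decayed vouching idea, with a made-up decay factor, depth cutoff, and graph:

    # Toy distance-decayed vouching: trust in a user is decay**distance from
    # you in the vouch graph. Decay factor, cutoff, and graph are made up.
    from collections import deque

    def trust_scores(vouches, me, decay=0.5, max_depth=4):
        """vouches maps each user to the set of users they vouch for."""
        scores = {me: 1.0}
        queue = deque([(me, 0)])
        while queue:
            user, depth = queue.popleft()
            if depth >= max_depth:
                continue
            for friend in vouches.get(user, ()):
                if friend not in scores:          # shortest vouch chain wins
                    scores[friend] = decay ** (depth + 1)
                    queue.append((friend, depth + 1))
        return scores

    vouches = {
        "me": {"alice"},
        "alice": {"bob"},
        "bob": {"carol"},
        "carol": {"dave"},
    }
    print(trust_scores(vouches, "me"))
    # {'me': 1.0, 'alice': 0.5, 'bob': 0.25, 'carol': 0.125, 'dave': 0.0625}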

But the trivial solution is easily defeated by a Sybil attack, which is one thing crypto was supposed to solve - people would have a good incentive not to forward shit if everyone had to put up a deposit, and if they forwarded spam they'd lose the deposit. But what is the definition of spam, and how can you enforce that without attackers using it to kick legitimate users off the network? It's a tough problem.

https://en.wikipedia.org/wiki/Sybil_attack

there's an old form-reply copypasta about spam filtering ("your clever solution will not work for the following reasons:"), and "it requires us to solve a user or server reputation ranking problem" is one of the main reasons those clever spam-filtering solutions don't work either. Remember that Bitcoin originally evolved from Hashcash, which was meant to solve the spam problem! If I provably spent 10 seconds of CPU time solving this random math problem, and the solution is provably never re-used, then it becomes infeasible for an attacker to send a bunch of junk messages because they'd need a whole lot of CPUs, right? Definitely not something they'd have access to via, say, botnets... ;)
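
For reference, the Hashcash mechanism being poked at there is tiny - a toy Python sketch, with the difficulty and stamp format made up (a real scheme also has to record spent stamps so a solution is provably never re-used):

    # Toy Hashcash-style proof of work: find a nonce so that
    # sha256(message + nonce) has `difficulty` leading zero bits.
    import hashlib
    from itertools import count

    def leading_zero_bits(digest: bytes) -> int:
        bits = 0
        for byte in digest:
            if byte == 0:
                bits += 8
                continue
            bits += 8 - byte.bit_length()
            break
        return bits

    def mint(message: str, difficulty: int = 20) -> str:
        """Burn CPU until the stamp's hash clears the difficulty bar."""
        for nonce in count():
            stamp = f"{message}:{nonce}"
            if leading_zero_bits(hashlib.sha256(stamp.encode()).digest()) >= difficulty:
                return stamp

    def verify(stamp: str, difficulty: int = 20) -> bool:
        # Checking costs one hash; minting cost ~2**difficulty hashes on average.
        return leading_zero_bits(hashlib.sha256(stamp.encode()).digest()) >= difficulty

    stamp = mint("to:alice@example.com")
    print(stamp, verify(stamp))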

the other core problem with a lot of these solutions is that attackers are a lot more willing to spend money to get spam in front of users (people make money running those websites, after all) than actual users are to spend money to make a Facebook post or whatever. 10 cents for a bunch of impressions is cheap, but I'm not spending 10c to post my cat!

centralized authorities are a relatively cheap solution to these complex problems: if you post spam, Facebook decides for itself that it's spam and bans you, done. If your IP or domain sends a lot of spam email, Spamhaus bans you, done. An O(1) (or at least O(N)) solution. And that's kind of the neat thing about Mastodon too - you don't have to moderate every message, you just have to ensure the groups you're federating with are doing a decent job of policing their own shit, and if they're a problem you un-federate.



