ignoring the problem of robots.txt inaccessibility, is it feasible to have a Kagi-style "private Google" with a more limited number of high-signal-to-noise sites, especially if you drop e-commerce and some other low-SNR feeds?

perhaps one interesting thing is that a decent number of the highest-SNR feeds don't actually need to be crawled at all - Wikipedia, Reddit, etc. are available as dumps and you can ingest their content directly. And the sources I'm most interested in for my hobbies (technical data around cameras, computer parts, aircraft, etc.) tend to be mostly static web-1.0 sites that basically never change. There's some stuff that falls in between - I'm not sure if random other wikis necessarily have takeout-style dumps - but Fandom and a couple of other mega-wikis probably contain a majority of the interesting content, or at least a large enough amount that you could get meaningful results.
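
For the dump route, here's roughly what streaming text out of a Wikipedia pages-articles dump looks like - a minimal Python sketch, where the dump filename and the namespace handling are illustrative assumptions rather than anything official:

    # Minimal sketch: stream (title, wikitext) pairs out of a Wikipedia
    # pages-articles dump without loading the whole file into memory.
    # The filename is illustrative; grab the real dump from dumps.wikimedia.org.
    import bz2
    import xml.etree.ElementTree as ET

    DUMP = "enwiki-latest-pages-articles.xml.bz2"  # assumed local file

    def iter_pages(path):
        """Yield (title, wikitext) for each <page> in a MediaWiki XML dump."""
        title, text = None, ""
        with bz2.open(path, "rb") as fh:
            for _event, elem in ET.iterparse(fh, events=("end",)):
                tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace
                if tag == "title":
                    title = elem.text
                elif tag == "text":
                    text = elem.text or ""
                elif tag == "page":
                    yield title, text
                    elem.clear()  # free finished pages as we stream

    if __name__ == "__main__":
        for i, (title, text) in enumerate(iter_pages(DUMP)):
            print(title, len(text))
            if i >= 9:
                break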

Another interesting one would be if you could get the Internet Archive to give you "slices" of sites in a Google Takeout-style format. They've already scraped a great deal of content, so if I want site X - say, the most recent non-404 versions of all pages in a given domain - it would be fantastic if they could just build that as a zip and dump it over in bulk. In fact, a lot of the best technical content is no longer available on the live web, unfortunately...
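
Short of a bulk takeout, the Wayback Machine's CDX API can at least enumerate what they have under a domain. A small sketch of the "latest non-404 capture of every page" idea - the endpoint and query parameters are the public CDX API as I understand it, so double-check them against the CDX server docs:

    # Sketch: enumerate the latest 200-status capture of every URL the
    # Wayback Machine has under a domain, via the public CDX API.
    import requests

    def latest_captures(domain, limit=5000):
        params = {
            "url": f"{domain}/*",        # everything under the domain
            "output": "json",
            "filter": "statuscode:200",  # skip 404s, redirects, errors
            "fl": "original,timestamp",
            "limit": str(limit),
        }
        rows = requests.get("https://web.archive.org/cdx/search/cdx",
                            params=params, timeout=60).json()
        latest = {}
        for original, timestamp in rows[1:]:  # first row is the header
            if timestamp > latest.get(original, ""):
                latest[original] = timestamp
        for original, timestamp in sorted(latest.items()):
            # The URL you would actually fetch for that capture.
            yield f"https://web.archive.org/web/{timestamp}/{original}"

    if __name__ == "__main__":
        for url in latest_captures("example.com"):
            print(url)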

(did fh-reddit ever update again? or is there a way to get Pushshift to give you a bulk dump of everything? they stopped back in like 2019 and I'm not sure if they ever got back into it; it wasn't on BigQuery last time I checked. Kind of a bummer too.)

I say exclude e-commerce because there's not a lot of informational value in knowing the 27 sites selling a given video card (especially as a few megaretailers crush all the competition anyway), but there is a lot of informational value in, say, having a copy of the sites of Asus, ASRock, Gigabyte, MSI, etc. for searching (though you probably don't want full binaries cached).

But basically I think there's probably like, sub-100 TB of content that would even be useful to me, if stored in some kind of relatively dense representation (Reddit post/comment dumps rather than rendered pages, same for other forum content, etc., stored on a gzip level-5 compressed filesystem or something). That's easily within reach of a small server. I'm not sure PageRank would work as well without all the "noise" linking into it and telling you where the signal is, but I think that's well within typical r/datahoarder-level builds. And you could dynamically augment that from the live internet and the Internet Archive as needed - just treat it as an ever-growing cache and index your hoard.
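
As a toy version of the "dense hoard plus index" idea: stream a Pushshift-style comment dump (zstd-compressed, newline-delimited JSON) into a SQLite FTS5 index. The field names, dump filename, and schema here are assumptions; adjust to whatever your dumps actually contain:

    # Toy "hoard + index": stream a zstd-compressed ndjson Reddit comment
    # dump into a SQLite FTS5 table for local full-text search.
    import io
    import json
    import sqlite3
    import zstandard  # pip install zstandard

    def ingest(dump_path, db_path="hoard.db", batch=10_000):
        db = sqlite3.connect(db_path)
        db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS comments "
                   "USING fts5(subreddit, author, body)")
        dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
        rows = []
        with open(dump_path, "rb") as fh:
            text = io.TextIOWrapper(dctx.stream_reader(fh), encoding="utf-8")
            for line in text:
                c = json.loads(line)
                rows.append((c.get("subreddit"), c.get("author"), c.get("body")))
                if len(rows) >= batch:
                    db.executemany("INSERT INTO comments VALUES (?, ?, ?)", rows)
                    rows.clear()
        if rows:
            db.executemany("INSERT INTO comments VALUES (?, ?, ?)", rows)
        db.commit()
        return db

    if __name__ == "__main__":
        db = ingest("RC_2019-06.zst")  # illustrative dump filename
        query = "SELECT subreddit, author FROM comments WHERE comments MATCH ? LIMIT 10"
        for row in db.execute(query, ("thinkpad",)):
            print(row)

SQLite FTS5 obviously isn't a ranked search engine, but at personal-hoard scale a plain full-text index over the dump text already gets you useful results.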




It sounds like CommonCrawl: https://commoncrawl.org/the-data/get-started/

You can download it, put it into ClickHouse, and get your own professional search engine.

I've made up the term "professional search engine". It's something like Google, but:

- accessible to a few people, not publicly available;
- does not have sophisticated ranking or quorum pruning, and simply gives you all the matched results;
- queries can be performed in SQL, and the results additionally aggregated and analyzed;
- full brute-force search is feasible.
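
A minimal sketch of that pipeline, assuming a WET file (Common Crawl's extracted plain text) already downloaded locally and a ClickHouse server on localhost; the table name, schema, and filename are made up for illustration, and the warcio / clickhouse-connect calls are used as I understand those libraries:

    # Sketch: load one Common Crawl WET file into ClickHouse and
    # brute-force it with SQL.
    import clickhouse_connect                            # pip install clickhouse-connect
    from warcio.archiveiterator import ArchiveIterator   # pip install warcio

    WET_FILE = "CC-MAIN-example.warc.wet.gz"  # assumed local WET segment

    client = clickhouse_connect.get_client(host="localhost")
    client.command("CREATE TABLE IF NOT EXISTS cc_pages "
                   "(url String, body String) ENGINE = MergeTree ORDER BY url")

    rows = []
    with open(WET_FILE, "rb") as fh:
        for record in ArchiveIterator(fh):
            if record.rec_type != "conversion":  # WET plain-text records
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            body = record.content_stream().read().decode("utf-8", "replace")
            rows.append((url, body))
            if len(rows) >= 1000:
                client.insert("cc_pages", rows, column_names=["url", "body"])
                rows.clear()
    if rows:
        client.insert("cc_pages", rows, column_names=["url", "body"])

    # "Gives you all the matched results": brute-force substring scan, no ranking.
    hits = client.query(
        "SELECT url FROM cc_pages WHERE positionCaseInsensitive(body, %(q)s) > 0 LIMIT 50",
        parameters={"q": "thinkpad x230"})
    for (url,) in hits.result_rows:
        print(url)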

PS. Yes, the Reddit dataset stopped updating.


With a source corpus that's high enough quality, you probably don't even need PageRank; regular full-text search like Lucene would be enough.


Sounds crazy hard - a lot of moving parts, both human and technical. But if pulled off right I'd pay for that kind of thing, preferably hosted and looked after for me in a datacenter.


I'd like a browser plugin that lets me vote websites up or down. The votes would go to a central community database. Highly upvoted sites would be crawled, archived, and made available through a specialized search page.


That used to sort of exist: https://en.wikipedia.org/wiki/StumbleUpon


The users of the plugin, if it's at all successful, will include SEO types. Without a way of separating quality input from generic promotional input, this plugin will not scale.


as chalst says, once your site becomes successful, spammers will make accounts and upvote their spam and downvote the good stuff.

your solution reduces to the reputation-network problem: it works if everyone is a good actor, or if known-good actors (people you know personally) can "vouch" for others across the network (perhaps with reductions in vouch-iness as you move outward - a friend of a friend is good, a friend of a friend of a friend is OK, four degrees out maybe not so much).
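
A toy sketch of that distance-decayed vouching idea, with a made-up decay factor, depth cutoff, and graph:

    # Toy distance-decayed vouching: trust in a user is decay**distance from
    # you in the vouch graph. Decay factor, cutoff, and graph are made up.
    from collections import deque

    def trust_scores(vouches, me, decay=0.5, max_depth=4):
        """vouches maps each user to the set of users they vouch for."""
        scores = {me: 1.0}
        queue = deque([(me, 0)])
        while queue:
            user, depth = queue.popleft()
            if depth >= max_depth:
                continue
            for friend in vouches.get(user, ()):
                if friend not in scores:          # shortest vouch chain wins
                    scores[friend] = decay ** (depth + 1)
                    queue.append((friend, depth + 1))
        return scores

    vouches = {
        "me": {"alice"},
        "alice": {"bob"},
        "bob": {"carol"},
        "carol": {"dave"},
    }
    print(trust_scores(vouches, "me"))
    # {'me': 1.0, 'alice': 0.5, 'bob': 0.25, 'carol': 0.125, 'dave': 0.0625}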

But the trivial solution is easily defeated by a Sybil attack, which is one thing crypto was supposed to solve - people would have a good incentive not to forward shit if everyone had to put up a deposit, and if they forwarded spam they'd lose the deposit. But what is the definition of spam, and how can you enforce that without attackers using it to kick legitimate users off the network? It's a tough problem.

https://en.wikipedia.org/wiki/Sybil_attack

there's an old form-reply copypasta about spam filtering ("your clever solution will not work for the following reasons:"), and "it requires us to solve a user or server reputation ranking problem" is one of the main reasons those clever spam-filtering solutions don't work either. Remember that Bitcoin originally evolved from Hashcash, which was meant to solve the spam problem! If I provably spent 10 seconds of CPU time solving this random math problem, and the solution is provably never re-used, then it becomes infeasible for an attacker to send a bunch of junk messages because they'd need a whole lot of CPUs, right? Definitely not something they'd have access to via, say, botnets... ;)
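
For reference, the Hashcash mechanism being poked at there is tiny - a toy Python sketch, with the difficulty and stamp format made up (a real scheme also has to record spent stamps so a solution is provably never re-used):

    # Toy Hashcash-style proof of work: find a nonce so that
    # sha256(message + nonce) has `difficulty` leading zero bits.
    import hashlib
    from itertools import count

    def leading_zero_bits(digest: bytes) -> int:
        bits = 0
        for byte in digest:
            if byte == 0:
                bits += 8
                continue
            bits += 8 - byte.bit_length()
            break
        return bits

    def mint(message: str, difficulty: int = 20) -> str:
        """Burn CPU until the stamp's hash clears the difficulty bar."""
        for nonce in count():
            stamp = f"{message}:{nonce}"
            if leading_zero_bits(hashlib.sha256(stamp.encode()).digest()) >= difficulty:
                return stamp

    def verify(stamp: str, difficulty: int = 20) -> bool:
        # Checking costs one hash; minting cost ~2**difficulty hashes on average.
        return leading_zero_bits(hashlib.sha256(stamp.encode()).digest()) >= difficulty

    stamp = mint("to:alice@example.com")
    print(stamp, verify(stamp))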

the other core problem with a lot of these solutions is that attackers are a lot more willing to spend money to get spam in front of users (people make money running those websites, after all) than actual users are to spend money to make a Facebook post or whatever. 10 cents for a bunch of impressions is cheap, but I'm not spending 10c to post my cat!

centralized authorities are a relatively cheap solution to these complex problems: if you post spam, Facebook decides for itself that it's spam and bans you, done. If your IP or domain sends a lot of spam email, Spamhaus bans you, done. An O(1) (or at least O(N)) solution. And that's kind of the neat thing about Mastodon too - you don't have to moderate every message, you just have to ensure the groups you're federating with are doing a decent job of policing their own shit, and if they're a problem you un-federate.



