
I have written this before but I’ll put it here again. What I would like to see is a federated search engine, based on ActivityPub, that works like Mastodon. Don’t like the results from one source? Just remove them from your sources, or lower their ranking. Similar to YaCy, but you can work with the protocol to connect or build whatever type of index you want, using whatever technology you like, and communicate over an existing standard. Want to build the world’s best index of Pokémon sites? Then go do it. Want to build a search engine using Idris or ATS? Sure! I did note the professors are on Mastodon, so perhaps this may actually happen.

One of these days I’ll actually implement the above, assuming nobody else does. I figured if I can at least get the basics done and a reference implementation that’s easy to run, it could prove the concept. If anyone is interested in this, do email me; the address is in my bio.

What I worry about for this project is that it becomes another island which prohibits remixing of results, like Google and Bing, and that its own index and ranking algorithms become gamed.

I wish the creators the best of luck though. I am also hoping for some more blogs and papers about the internals of the engine. So little information is published in the space that anything is welcome, especially if it’s deeply technical.




> Don’t like the results from one source? Just remove them from your sources, or lower their ranking.

That's basically Usenet killfiles and, yes, I think they're totally due for a comeback in one form or another. Usenet may have had its issues towards the end (although it still exists), but killfiles weren't one of its problems. With the simplest ones you could just discard sources you didn't want to read anymore, but with the more advanced ones you could assign weights/rankings based on various factors (keywords, usernames, whether or not you had participated in a discussion, etc.).
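To make the killfile idea concrete for search, here is a minimal sketch of merging results from several sources with per-source weights. Everything in it is an assumption for illustration: the `Result` shape, the source names, and the idea that each source reports a relevance score between 0 and 1. A weight of 0 acts as a plain killfile entry; anything between 0 and 1 just lowers a source's ranking.

```python
from dataclasses import dataclass


@dataclass
class Result:
    url: str
    source: str   # hypothetical name of the index that returned this hit
    score: float  # relevance as reported by the source, assumed to be in [0, 1]


def merge_results(results, source_weights, default_weight=1.0):
    """Rescale each result by its source weight, drop killfiled sources, sort by score."""
    merged = []
    for r in results:
        weight = source_weights.get(r.source, default_weight)
        if weight <= 0:
            continue  # weight 0 means "never show me this source again"
        merged.append(Result(r.url, r.source, r.score * weight))
    return sorted(merged, key=lambda r: r.score, reverse=True)


results = [
    Result("https://a.example/1", "index-a", 0.9),
    Result("https://b.example/1", "index-b", 0.8),
    Result("https://c.example/1", "index-c", 0.7),
]
# index-a is down-ranked, index-b is killfiled, index-c keeps the default weight.
weights = {"index-a": 0.5, "index-b": 0.0}
ranked = merge_results(results, weights)
for r in ranked:
    print(r.source, r.url, round(r.score, 2))
```

Scores from different sources are not really comparable without some normalization, so a real remixer would likely need more than a multiplier; this only shows where the user's preferences would plug in.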


We like federated search, we like decentralized search, and even P2P search; we are trying to find a good mix, and decided to get started rather than wait! Exciting times.


What are the benefits from this?

I'm not trying to be dismissive; it's just that my feeling from working on search.marginalia.nu is that nearly every aspect of search benefits from locality. Not only is the full crawl set instrumental in determining both domain rankings and term-level relevance signals such as anchor tag keywords, but the way an inverted index is typically set up is also extremely disk-cache friendly: the access pattern for checking the first document warms up the cache for the other queries. That discount obviously only exists when it's the same cache.
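For readers who haven't met the term, an inverted index maps each term to the list of documents containing it. The toy in-memory sketch below is not how search.marginalia.nu or any production engine lays things out on disk; it only shows the structure, and the sample documents are made up. The relevant point is that a query reads whole posting lists, so it helps a lot when those lists live together on one machine.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to a sorted list of the ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_all(index, query):
    """Return ids of documents that contain every query term (an AND query)."""
    terms = query.lower().split()
    if not terms:
        return []
    # Start from the shortest posting list so the working set stays small.
    lists = sorted((index.get(t, []) for t in terms), key=len)
    hits = set(lists[0])
    for plist in lists[1:]:
        hits &= set(plist)
    return sorted(hits)


docs = {
    1: "federated search engine",
    2: "search engine ranking",
    3: "pokemon index",
}
index = build_index(docs)
print(search_all(index, "search engine"))
print(search_all(index, "federated search"))
```

In a federated setup each posting list for a term would be split across nodes, so the intersection step turns into network round trips instead of reads from a warm cache.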


You could get people creating indexes with love such as your own. Marginalia could become the de facto index for long-form content. However, you probably aren't that interested in running the best Pokémon website, so someone else could do that.

Enough people add domain-specific search endpoints, with perhaps a taxonomy to say "hey, send those sorts of queries over here", and you have a compelling engine that self-heals should someone stop running things or start spamming.


Yes, that is an advantage.

You can also integrate search results for which you cannot have the index, like social media APIs; that's another reason.

You could also mix and match search results from various topic-oriented indices. That's a research question, whether that is really better than building one unified one. But we think it is the way to bring index fragments to the edge, with the obvious privacy advantages.


I would love to be able to run a node that mirrors part or all of an index like this, and to let people query it - a bit like https://torrents-csv.ml/#/

Good luck! I'll be watching your progress and cheering you all on!


What's the point of a federated search engine? At the end of the day most nodes will end up implementing the same regulations/censorship, with development driven primarily by a few. It's like Ethereum vs Ethereum Classic all over again. If the EU or the developers' respective governments demand a censorship or forgetting feature to be implemented, it's not like the federated nature would matter. An open source search index is useful, and a search engine that can be easily self-hosted is also useful. But building a search engine as a federated system is a gimmick with no significant value.

Do you see any major Mastodon nodes interfacing with Truth Social or Gab? I certainly don't. If federation barely works for a social media app, I fail to see how it would even matter for a search engine.


At least one of the partners (https://openwebsearch.eu/partners/radboud-university/) does research on "federated search systems", so there's hope!


Isn't searx [1] what you're describing? I was running an instance for a while, and it's basically a meta search engine that has support for all kinds of providers.

There are also some web extensions available so that you can fill it with more data.

[1] https://searx.github.io/searx/


I'd say it rather looks like Seeks, unfortunately defunct: https://en.wikipedia.org/wiki/Seeks

> a decentralized p2p websearch and collaborative tool.

> It relies on a distributed collaborative filter[6] to let users personalize and share their preferred results on a search.


Searx is half of it where it calls out to other searches but does not provide its own index as far as I can see. It also does not remix the results.


If it is about a decentralized index there's also YaCy.net [1] but I don't know how actively maintained the project is.

To me it seemed more suited to an enterprise-grade use case (e.g. for building a search for your own file servers or Confluence), so I only tested it out a little. It's a huge Java project, which is why I decided to go with searx back then, because YaCy was pretty hard to set up.

[1] https://github.com/yacy


One of the things I wonder here is if it would be easier to just start by crawling known RSS feeds and then exposing a JSON API for the data and making the whole thing open source. Then keeping a public list of indexes and who crawls what. Eventually moving into crawling other sources but first primarily addressing the majority of useful content that's easily parseable.
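As a rough idea of how small the first step could be, here is a sketch that turns an RSS 2.0 feed into JSON records using only the Python standard library. The feed is an inline sample rather than a real fetch, and the output field names (`title`, `url`, `summary`, `documents`) are made up for illustration, not any existing API.

```python
import json
import xml.etree.ElementTree as ET

# Inline sample so the sketch runs without network access.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>https://blog.example/first</link>
      <description>Hello world</description>
    </item>
    <item>
      <title>Second post</title>
      <link>https://blog.example/second</link>
      <description>More words</description>
    </item>
  </channel>
</rss>"""


def feed_to_records(xml_text):
    """Extract one record per <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    records = []
    for item in root.iter("item"):
        records.append({
            "title": item.findtext("title", default=""),
            "url": item.findtext("link", default=""),
            "summary": item.findtext("description", default=""),
        })
    return records


records = feed_to_records(SAMPLE_FEED)
# This is the body a hypothetical JSON endpoint could serve.
payload = json.dumps({"documents": records}, indent=2)
print(payload)
```

A real crawler would also need to fetch feeds politely, handle Atom as well as RSS, and deal with malformed XML, none of which is covered here.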


That's probably the easiest way I know to get good content into a search engine. Annoyingly, however, it does not contain all the content available.


What benefit does federation bring here? Unless it is very simple to set up, most communities are non-technical and probably won't be able to set up their own crawler. I would think just a search engine that lets you customize the ranking algorithm, and maybe hook into whatever ontology they've developed and rank results accordingly, would be sufficient.


> most communities are non-technical and probably won't be able to set up their own crawler

They can use a solution which already integrates the search. Forums and CMSes are a good target for that. Then you can say "I'd like my search to look at widgetlovers.com too" - and you get their sitemap + featured external links, because they run FooPress that supports it.

Kind of the same as the sitemaps we already produce for Google.


It can be very simple to set up. Think single binary to run, or lambda to deploy (yes, this is possible) with the URL back to it.

I imagine a binary, with a simple Admin UI allowing you to crawl some domains recursively would be enough to index your own website, and then have those results shared.

Where I could see this being really useful is letting someone who knows everything about Pokémon provide the index for searching Pokémon information. Then when they federate, they provide a taxonomy saying "for queries that have these words, call me". Suddenly you have a very high-value search source for Pokémon.

Throw in some zero-click information boxes and you have added a lot of value.
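A minimal sketch of that "for queries that have these words, call me" routing, assuming each node advertises a plain set of keywords. The endpoint URLs and keyword sets are invented, and nothing here is part of ActivityPub or any existing protocol.

```python
def route_query(query, taxonomy, fallback):
    """Pick the nodes whose advertised keywords overlap the query terms."""
    terms = set(query.lower().split())
    matches = [node for node, keywords in taxonomy.items() if terms & keywords]
    # No specialist claimed the query, so send it to the general-purpose nodes.
    return matches or list(fallback)


# Hypothetical nodes and the keywords they announced when federating.
taxonomy = {
    "https://pokemon-index.example/search": {"pokemon", "pikachu", "pokedex"},
    "https://longform.example/search": {"essay", "blog", "longform"},
}
fallback = ["https://general.example/search"]

print(route_query("pikachu evolution", taxonomy, fallback))
print(route_query("weather tomorrow", taxonomy, fallback))
```

Exact keyword overlap is the crudest possible matcher; it is also trivially spammable, since a node can claim any words it likes, so in practice this would have to be combined with the per-source weights or removal mentioned earlier in the thread.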


ActivityPub is not well suited for this application. It's for publishing activities made by actors — hence the name. You'll want to invent your own federation protocol specifically for federated search.


Last time I checked there was something in there for search...

Even so, you could base it on ActivityPub, I suspect. It would need to be extended for sure to implement the sorts of things I believe would be required.


They've listed "DECENTRALISED SEARCH" as an ongoing project/goal.



