EU Open Web Search project kicked off (openwebsearch.eu)
404 points by ZacnyLos on Sept 20, 2022 | 310 comments



Interesting to see the number of negative comments.

Most of the negativity seems to come from the following points:

- An EU-funded project cannot succeed in tech because previous EU-funded projects have failed in tech in the past (and government-funded projects in tech are generally suspect)

- Search is hard, and therefore it will fail

- The project is underfunded

Even if the points above are all valid (though the obvious US-funded startup launched by a bunch of uni students seems to have fared pretty well), it seems we are missing the point of this project.

The proposal is to contribute to the creation of open building blocks necessary to enable others (including private US companies) to make better search products.

Better search products are needed.

Shall we remind HN of some of the intense conversations that happened here about Google failing us:

- Google Search is Dying [1]

- Every Google result now looks like an ad [2]

- Google no longer producing high quality search results in significant categories [3]

So while, yes, search is a hard topic, we should welcome initiatives aiming to improve the ground infrastructure needed to lower the barrier to entry, and hope they will allow many companies to build better search products (or inspire other initiatives to contribute in similar, even more successful ways).

1: https://news.ycombinator.com/item?id=30347719

2: https://news.ycombinator.com/item?id=22107823

3: https://news.ycombinator.com/item?id=29772136


One reason search is hard is that people are very motivated to game your system. That makes being transparent about how it works a fool's errand. It also isn't clear how the economics work: improving search relevance by x% is tremendously socially valuable, but probably makes Google's bottom line go up by a thousandth of x, and they have a very direct understanding of the connection. Without that money, how will this project get the resources to succeed, given that its adversaries are learning from fighting the people with a lot more?


This sentiment is common, but it is largely based on Google (and similar commercial actors). I'm not convinced it generalizes as much as it seems. It is uniquely profitable to manipulate Google because Google's view of the world informs not only its search engine but also its advertising business.

> improving search relevance by x% is tremendously socially valuable, but probably makes Google's bottom line go up by a thousandth of x, and they have a very direct understanding of the connection.

I'm not sure this is correct. In the absence of real competition, the most profitable state ought to be exactly what we have now: search results that are ambiguously bad, so you need to click and skim through a few results to maybe find what you want, multiplying the number of ad impressions.


It's nothing to do with Google advertising. It was a problem before Google had advertising. Being the number 1 non-ad result for a search will get you many more hits than being number 2. If you don't appear on the first page, you may as well not exist.

Google's priority with ads is to make sure ad buyers are happy with the results they are getting. This means making sure the people who click the ads go on to buy the product being advertised. That's what ad buyers measure.


Being result number 1 rather than result number 5 matters a lot to companies that are selling things to people who are searching. The value of the ads comes from that property, not the other way around.


Indeed, and that's why search relevance is important.

We can talk about the Importance Of This until we die of old age. Google failed us, Google is Borg, Yandex is FSB, Bing is ... Bing, etc. However, the fact that there is a problem to be solved and It Is Important doesn't mean that the EU will solve it. If anything, it's just the set-up for another political play that will have damaging consequences to the internet as a whole, just like GDPR, EU's poster-child "internet project".

They made GDPR just strong enough legislatively to be annoying, but not strong enough to actually change anything. Companies can still store EU citizens' data anywhere they want and do whatever they want. There's no insight into this. It is an unenforceable law and the only artifact of it existing is that I have to have "I don't care about cookies" installed, so that Avast Antivirus can eventually decide to silently collect and sell on my data.

For all intents and purposes, OpenWebSearch is most likely not meant to succeed at anything either, and is just going to be a political stepping stone towards legislation that will be awkward and will make the internet worse for all users. The EU has a long history of creating, or hanging onto, laws that betray a misunderstanding of how the digital age works, how the internet works, and how data can be copied or moved around for free.

Here's an example. Every country in Europe has some sort of legal construct that will prevent you from secretly recording a conversation you're having and putting it on the internet. Take Austria: it accomplishes this by preventing you from publishing the recording on the internet. It does not, however, prevent you from making the recording. There are laws against secretly recording a conversation you are not a part of, but no such law for when you are part of the conversation. So you can record and upload, just not publish.

However, you can get on a train to a different country - one whose laws prevent recording but not publishing - and publish from there. Then you're in the clear. Or you can just use a VPN so that it looks like you uploaded and published from that other country. Or you can upload to YouTube, which will not show where the thing was published from, and claim you did so during your tourist visit to Vietnam or wherever. And if someone brings a civil lawsuit? Good luck trawling the Vietnamese legal system for clues about that specific issue, and good luck finding one of the virtually nonexistent legal experts who speak both Vietnamese and your particular local European language. The costs would be on the order of tens of thousands at least, which is out of reach for anyone but the wealthiest EU citizens.

Want more? Everyone knows about the egregious copyright-related laws.

So are the DNS blocks of shunned sites. Or the recent Austrian move to block Cloudflare IPs, which literally broke the internet.

The fact that such laws are still in place from before the internet - or are even still being put out - and are effectively unenforceable while making the internet worse for everyone makes it clear that the governments in place simply misunderstand how the internet works.

None of that will stop politicians from coming up with BS excuses for breaking the world with fever-dream laws in order to push their latest political agenda. The real question is: if such clearly unfit legislation is being put in place for political folly in industries we understand, how much of that is happening in industries we don't understand? Health care, food, agriculture, civil engineering, patent law, and so on. My guess? You can tell what my guess is.

As for Open Web Search, what everyone should really be asking themselves is: ok, so what's the scam that's going to be pulled here?


> As for Open Web Search, what everyone should really be asking themselves is: ok, so what's the scam that's going to be pulled here?

As I'm one of the people working within the Open Web Search project, please allow me to feel strongly about your statement. I've been involved in campaigning for this project since around 2014. This project did not originate as a 4d-chess move in some political game. It exists because of the hard work of a group of people, some of whom are researchers, some of whom work at smaller search engines, and some of whom are involved in civil rights organisations. Currently still sitting in the kick-off meeting, I can tell you that we are actively discussing how to get this project to produce useful results for a European open web index. Getting the EU to draft problematic legislation is neither in our power nor in our interest.


"One reason democracy is hard is that people are very motivated to game your system. That makes being transparent about how it works a fool's errand."

That's a major fallacy you're opening with. If something is important to us (as a society), we will find ways to make it serve us well even when it's under attack. The rest of your post even makes a similar point: the social value of well-working search is greater than the economic value for even the biggest search monopolist on the planet. So why not socialize it?


Well, on one side we have zero examples of open engines actually working well; on the other we have a long, nuanced history of SEO and search engines fighting an unending battle, with the SEO side trying to circumvent every measure the search engines take to stop them from poisoning the results - and the engines barely edging out a win.

Being idealistic about it won't change the outcome.


Do we have any example of an open engine working badly?

As far as I know, this hasn't really been tried, definitely not with the kinds of resources this project will have available.


Do we have any examples of search engines not run by ad companies having problems fighting SEO?


AltaVista started as a demo of the capabilities of DEC Alpha CPUs, and it was initially amazing. It got taken over by SEO rubbish, and beaten by Google, which invented PageRank to get good results again.

PageRank was initially much less prone to SEO shenanigans because it relied on signals from other sites (incoming links) to decide how important a result was. Of course, as Google became more popular, people started sharing links on other pages and so on to cheat the PageRank algorithm. And Google have been caught in a fight with SEO ever since.
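
To make the mechanism concrete, here's a toy power-iteration sketch in Python (a made-up three-page graph; real PageRank adds handling for dangling pages, personalization, and web scale):

    # Toy PageRank by power iteration: a page's score is a damped sum
    # of the scores of the pages linking to it, each divided by the
    # linking page's out-degree.
    def pagerank(links, damping=0.85, iters=50):
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iters):
            new = {p: (1.0 - damping) / len(pages) for p in pages}
            for p, outs in links.items():
                for q in outs:
                    new[q] += damping * rank[p] / len(outs)
            rank = new
        return rank

    # "b" wins: both other pages link to it.
    print(pagerank({"a": ["b"], "b": ["c"], "c": ["a", "b"]}))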


Democracy has some intentional opacity though. Consider the impact of making your personal voting choice public.

AFAIK, nobody has ever tried that because of the dangers of vote buying and coercion. Essentially, gaming the system.

(I don't think this proves anything! Simply wanted to suggest that comparing search to democracy doesn't significantly change the analysis wrt opacity.)


I like the idea of opaque citizens and open government/search company though.


Exactly, I think the failure of such a project is embedded in the incentives. OpenStreetMap is good but you’re not going to convince Google to make it the default on Android.

The correct strategy isn’t to hand this over to a giant bureaucracy, but to create an atmosphere where we can have a dozen alternatives to Google.


This does seem to be the goal of the project. They mention that they want to make a centralized control panel and crawler, where you can control how and when your website is crawled.

But at the end of the day the resulting dataset would be sold to third parties so they can rank the results appropriately. Which to me seems to be the only sane way forward. Only a government could run something on the scale of the Google crawler and succeed at doing so. And then everyone can build search engines on top of that.


> And then everyone can build search engines on top of that.

That bit troubles me. If the index is maintained by a government agency, and every search engine is using the same index, then that's a massive censorship avenue. I wonder how "open" Open Web Index is going to be.


If the index is censored, nothing stops you from adding to the index yourself.

The vast majority of the net is, after all, now indexed, so you can run your own indexer to cover whatever they didn't cover.


The only way you can beat Google is to create a search engine in a small niche where they can’t compete, then eventually over the years expand into general search once they can’t catch up (the innovator’s dilemma, basically).

The only search product that comes close to this is Amazon search or booking.com (or maybe YouTube).

All multi-billion-dollar businesses. And when smaller ones emerge, like the flight-scanner one, Google buys them.

I don’t see Google going anywhere soon. They have a lot of faults but they’re still too good.


In Austria there is "geizhals.at" [1], which is a price comparison website for electronics, and they have the best parametric search engine I have ever seen. It's 1000x better than Amazon search.

However, there's no way Geizhals will ever expand into general search. The reason why they are so much better than the competition is that they are focussing on a small, profitable niche and presumably use manual data entry to ensure they have the best data.

[1]: there's a UK version too "skinflint.co.uk"


Just opened skinflint.co.uk to have a look. I was greeted by "Do Not Track detected only storing necessary information". That's wonderful!

I don't think I've seen a website that does that before and I think it's worth praising.


Geizhals is my go-to for anything computer related. Somewhat amusing that such a good site comes out of this little country.


This is HN and we're not even pretending to be optimistic about beating incumbents, but it goes much deeper than that - there is also a profound sense of defeatism directed towards European entrepreneurs as well as the world at large. It is an indictment of the current nihilistic zeitgeist.


Well, it is a forum run by a Silicon Valley VC firm … so of course there will be a bit of “only SV can do anything interesting in tech” gatekeeping going on.


Idk, even people here in SV have given up.


> The only search product that comes close to this is Amazon search or booking.com (or maybe YouTube).

What is so great about those search examples?


Absolutely nothing, Amazon search is horrendous. Search by price is completely broken (and has been for years). Filter by exact phrase is non-existent. No way to search products by size, and on and on.


Exactly, neither Google nor Apple will open the gates to their fenced garden. It will always be more convenient for users to use their solution.


Speaking of Google and ads: I witnessed a new low today.

My 3-going-on-4-year-old was watching a cartoon on YouTube: Curious George.

An ad popped up promoting some show, featuring foul language and sexual intercourse.

Yagoddabekidding.


That's why I don't let my kids watch YouTube on their own... That said, as a European I'm more triggered by all the gore and violence than by some naked skin (although I also wouldn't want sub-12-year-olds encountering explicitly sexual stuff). The latter seems to just lead to questions and adult conversations, but the normalised violence the kids seem to act out.


I second the violence thing. It has somehow become the new normal in the teenager world, which is very disturbing to me. In my family we strictly use only Firefox with uBlock for YouTube (and the rest of the internet too). The difference is night and day. I always recommend it to friends of mine and every time I get positive feedback after this.


Once or twice I didn't use Firefox with uBlock Origin on YouTube, and since I am pretty allergic to any ads, as they insult my intelligence, I just closed the whole page before getting to the content.

How otherwise smart people can accept being treated like brainless idiots by such services is beyond me. And how they can teach their kids that it's fine is something I'll never understand... but hey, it's your kids; everybody sees the quality of their parenting first hand (and then complains about how kids are unruly and eat junk these days... gee, I wonder where they took their inspiration and who built their character).


YouTube Kids doesn't appear to have any ads, so it might be worth checking out.


I have YouTube Premium. Every time my kids get exposed to even age-appropriate ads, it generates attraction to crap. Even in magazines for kids. So I just keep ads out of their system and they seem happier and more internally satisfied.


Advertisements are essentially artificial reminders of how unsatisfying and difficult life is without $PRODUCT.


Pretty much a form of psychological abuse and manipulation.


I've recently been searching for some how-to videos and the number of ads is insane - but somehow they don't show when I watch via the video search results in DuckDuckGo.


Same here; not using YouTube's built-in search function eliminates most of the ad and recommendation problems. I use You.com since they additionally show Reddit and TikTok results, which I sometimes find more useful than YT videos.


I'm pretty sure YouTube doesn't play ads in embeds (as much?)


people are primarily motivated by envy even if they don't realize it...


One more reason to use an ad blocker for YouTube. I even use one on my (Android) "smart" TV. The open source app I use is SmartYouTubeTV. Occasionally it gets broken, but then a fix is usually an update away. It skips not just ads, but intros, self-promotion, "please subscribe to this channel" etc. (assuming creators tag their content properly). Basically it is YouTube Premium for free.

An interesting thing I noticed is that initially, when I used the app (with the same YouTube profile), I would get worse video suggestions. These days they are the same in the YouTube app as in SmartYouTubeTV.

Another interesting observation is that if the app breaks and I start watching YouTube in their app, the frequency of ads is quite reasonable in the beginning. A few times I even thought: hey, if they show me an ad every couple of videos, that's not that bad, right? I might switch back to YouTube. Then, as the amount of content watched increases, the number of ads increases to the point of making it unwatchable (4 ad breaks in a 20-minute video, sometimes even double 40-second ads one can't skip). This makes me go back.


I'm curious here and would love to hear thoughts on:

A) YouTube's terms of service clearly state "not for kids under 13"[1]

B) YouTube has produced a product for the younger age market [2]

C) Folks in this thread are reasonably complaining that full YouTube isn't appropriate for children.

Am I missing something here? (Like maybe your child was using YouTube Kids and still got the unacceptable content?)

[1] https://blog.youtube/news-and-events/children-youtube/#:~:te.... "While we permit users between the ages of 13 and 17 to register for an account with parental permission, we do not allow children under the age of 13 to create an account"

[2] https://www.youtubekids.com/


I'll put in a recommendation for the YouTube Kids app, where this doesn't happen.


Unfortunately, voice search is also something that doesn't happen on YouTube Kids for Android TV. That greatly limits its usability, which is unfortunate, since there are many good things about it.

I'm using it in Japanese. Using the on-screen keyboard in Japanese mode, you can enter hiragana only. It doesn't do any word recognition to turn the input into kanji or katakana, which results in poor search results.

Nobody uses onscreen keyboards on Android TV for anything beyond entering a Wi-Fi password; it's a nonstarter.

YouTube Kids for Android TV ... ironically, just a toy for now ... with a well-deserved, accurate 1.4 star rating in the Play Store.


See? Now you know about that show and you will watch it. Ad was successfully delivered. Engagement goes up.


What did you expect? Google is an ad company, first and foremost. Everything else is just a funnel.


> What did you expect?

Something like, perhaps, the same standards and sense as in traditional broadcasting.

> Everything else is just a funnel

A funnel promoting pick up trucks and financial services to toddlers? Do they count that as an "impression" in the statistics that they feed the client? It seems like borderline fraud.


You can pay for YouTube premium to get rid of ads, maybe just do that? Or don't let your 3 year old kid watch YouTube in the first place, and show them something from traditional broadcasting. Before complaining about broadcasting standards, maybe first up your parenting standards.


> You can pay for YouTube premium to get rid of ads, maybe just do that?

"My /own/ children don't see inappropriate ads; therefore there isn't any problem."

> Before complaining about broadcasting standards, maybe first up your parenting standards.

"Before whining for ice cream, maybe first eat your dinner!"


Does YouTube Premium offer higher standards? The post you're replying to isn't asking for fewer ads.


> An ad popped up promoting some show, featuring foul language and sexual intercourse.

YouTube Premium has no ads.


If only there was a way to get rid of ads...


What if I don't want to get rid of ads?


That's not right. Use uBlock Origin.


PyPy was funded by the EU. I would be very happy if this project is as successful as PyPy.


The EU providing tiny amounts of seed funding to an existing project to bootstrap a small proof of concept is entirely different from the EU trying to essentially create a tech giant.


To be perfectly clear, OWS is an example of the former. No one is trying to build an EU search giant.


a tech giant which is also a public utility a level or two lower than running water...


"Funded by" is very distinct from what's happening here.

This is another Gaia-X. Remember the EU's big-tech cloud killer?


This is not a Gaia-X, it is an exploratory project, showing a possible way forward and setting first steps.


> An EU-funded project cannot succeed in tech because previous EU-funded projects have failed in tech in the past

I think the fundamental problem here is that the people who are interested in grants and are capable of writing grant proposals are different from the people who are interested in building things. There's very little overlap. So the money goes to the people capable of writing proposals, and the people doing the work do it for free in an obscure corner of the internet.

It's sad really, but I suspect it's a side effect of the huge bureaucratic machine that is the EU. One way to make this better would be to simplify access to grants so that technical people can apply without needing a class in "EU funding speak".


If these thinkers can find a way to remove the incentives of spammers, misinformation peddlers, and other stakeholders to pollute results, that would be a great achievement. It could be seen as a big economic game; simulating these actors might allow coming up with rules to balance the game and minimize pollution.


It might be possible using moderation and abuse reporting, or registering with the search engine to aid moderation, and banning abusers. This will probably also take down right-wing websites, porn and online gambling.


> An EU-funded project cannot succeed in tech

Not sure about software, but all (well, Apple is getting there) phones now use the same connector to charge because of the EU.


We have Estonia, which is like a laboratory for govtech, complete with human test subjects. They are making the automation-focused law sausage.


Wasn't China's unified standard from about 2007 much more influential?


Which standard?


Some standard about unified chargers.


The new iPhone 14 still has a Lightning connector.


2024.


Any new open-source search option is good, but I also wish more attention was given to prior open projects like GigaBlast[0]/KBlast[1] crawlers, etc.

It hasn't escaped the wider world that quality open-source search is desirable, and it's hard to think what this new EU project brings to the table that isn't already available if others want to contribute to existing efforts. I wish the EU project the best of luck of course!

[0] https://github.com/gigablast/open-source-search-engine

[1] https://github.com/fossabot/kblast


> An EU-funded project cannot succeed in tech because previous EU-funded projects have failed in tech in the past (and government-funded projects in tech are generally suspect)

I'm not suspicious of all government-funded projects (back in the day my own PhD was government-funded!) but I can't help being suspicious of claims such as:

"an open European infrastructure for internet search, based on European values and jurisdiction"

and

"The project will be contributing to Europe’s digital sovereignty"

Q1: Who defines "European values"? Is that done by Qualified Majority voting or would - for instance - Hungary have a veto on any proposed definition?

Q2: Which treaties regulate "digital sovereignty"? Recalling that the 27 member states each "remain sovereign and independent"[0], is that digital sovereignty being handled in BRU, in the 27 states, or a mixture of both?

[0] https://op.europa.eu/webpub/com/eu-what-it-is/en/


>"an open European infrastructure for internet search, based on European values and jurisdiction"

jurisdiction = GDPR will not be cheated.

European = We mistrust America, because they do crappy stuff that meant we had to pass GDPR.

>Europe’s digital sovereignty

Means - Europe will not be ruled by interests outside Europe, the interests inside Europe can fight it out via EU procedures.


> jurisdiction = GDPR will not be cheated

Are there (m)any global search engines with European users who are "cheating" GDPR? Which ones?

> Europe will not be ruled by interests outside Europe

That's a very bold claim, especially considering the geopolitical situation right now.

> the interests inside Europe can fight it out via EU procedures

Would that be the interests of the estimated 25k-30k lobbyists who work in Brussels on behalf of their corporate paymasters?


Are you reading my post as a statement that these will be the results, instead of my providing an interpretation of what the original EU text means?

>Would that be the interests of the estimated 25k-30k lobbyists who work in Brussels on behalf of their corporate paymasters?

probably, as well as the various governments that exist in the EU.


>European = We mistrust America, because they do crappy stuff that meant we had to pass GDPR.

But we still mistrust our citizens; that's why we introduce Chat Control 2.0, so we have "better than China" control over private communication.

European values ;)


Remember that the EU is not just one person. There are many people, parties and organisations who are actively fighting against legislation like this.


You mean begging, not fighting, right? But why should I trust Europe more than, for example, Google (at least they can protect their "trade secrets"/data)?

That's why people trust evil companies more than stupid[1] governments.

[1] never attribute to malice that which is adequately explained by stupidity.


> An EU-funded project cannot succeed in tech because previous EU-funded projects have failed in tech in the past (and government-funded projects in tech are generally suspect)

Do we distinguish direct EU funding from funding by the governments of EU countries? If not, ASML would like a word, and I'm sure people from other member countries can come up with other examples.


and government-funded projects in tech are generally suspect

Like the Internet, or the WWW?


or computers themselves


You forgot people not trusting the government to curate information.


Wasn’t Google financed by DARPA? It was always planned big. The student story is bogus.


I have written this before but I’ll put it here again. What I would like to see is a federated search engine, based on ActivityPub, that works like Mastodon. Don’t like the results from one source? Just remove it from your sources, or lower its ranking. Similar to YaCy, but you can work with the protocol to connect or build whatever type of index you want, using whatever technology you like, and communicate over an existing standard. Want to build the world’s best index of Pokémon sites? Then go do it. Want to build a search engine using Idris or ATS? Sure! I did note the professors are on Mastodon, so perhaps this may actually happen.

One of these days I’ll actually implement the above, assuming nobody else does. I figured if I can at least get the basics done and a reference implementation that’s easy to run, it could prove the concept. If anyone is interested in this, do email me at the address in my bio.

What I worry about for this project is that it becomes another island which prohibits remixing of results, like Google and Bing, and that its own index and ranking algorithms become gamed.

I wish the creators the best of luck though. I am also hoping for some more blogs and papers about the internals of the engine. So little information is published in this space that anything is welcome, especially if it’s deeply technical.


> Don’t like the results from one source? Just remove them from your sources, or lower their ranking.

That's basically Usenet killfiles and, yes, I think they're totally due for a comeback in one form or another. Usenet may have had its issues towards the end (although it still exists), but killfiles weren't one of its problems. With the simplest ones you could just discard sources you didn't want to read anymore, but the more advanced ones could assign weights/rankings based on various factors (keywords / usernames / whether or not you participated in a discussion / etc.).
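
To make that concrete, a minimal killfile-style scorer could look something like this (the rule weights, author names and keywords are invented for illustration):

    # Toy killfile-style scoring: weight posts by author and keyword,
    # drop anything that scores below zero.
    RULES = {
        "author": {"known_spammer": -100, "favourite_poster": 10},
        "keyword": {"pokemon": 5, "crypto": -20},
    }

    def score(post):
        s = RULES["author"].get(post["author"], 0)
        s += sum(w for kw, w in RULES["keyword"].items()
                 if kw in post["text"].lower())
        return s

    posts = [
        {"author": "favourite_poster", "text": "My Pokemon site"},
        {"author": "known_spammer", "text": "Buy crypto now"},
    ]
    for p in posts:
        if score(p) >= 0:
            print(p["author"], score(p))  # only favourite_poster survives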


We like federated search, we like decentralized search, and even P2P search; we are trying to find a good mix, and decided to get started rather than wait! Exciting times.


What are the benefits of this?

I'm not trying to be dismissive; it's just that my feeling from working on search.marginalia.nu is that nearly every aspect of search benefits from locality. Not only is the full crawl-set instrumental in determining both domain rankings and term-level relevance signals such as anchor-tag keywords, but the way an inverted index is typically set up is extremely disk-cache friendly: the access pattern for checking the first document warms up the cache for the other queries. That discount obviously only exists when it's the same cache.


You could get people creating indexes with love, such as your own. Marginalia could become the de-facto index for long-form content. However, you probably aren't that interested in running the best Pokémon search, so someone else could do that.

If enough people add domain-specific search endpoints, with perhaps a taxonomy to say "hey, send those sorts of queries over here", you have a compelling engine that self-heals should someone stop running things, or start spamming.


Yes, that is an advantage.

You can also integrate search results for which you cannot have the index, like social media APIs - another reason.

You could also mix and match search results from various topic-oriented indices. That's a research question, whether that is really better than building one unified one. But we think it is the way to bring index fragments to the edge, with the obvious privacy advantages.


I would love to be able to run a node that mirrors part or all of an index like this, and to let people query it - a bit like https://torrents-csv.ml/#/

Good luck! I'll be watching your progress and cheering you all on!


What's the point of a federated search engine? At the end of the day most nodes will end up implementing the same regulations/censorship, with development driven primarily by a few. It's like Ethereum vs Ethereum Classic all over again. If the EU or the developers' respective governments demand that a censorship or forgetting feature be implemented, it's not like the federated nature would matter. An open source search index is useful; a search engine that can be easily self-hosted is also useful. But building a search engine as a federated system is a gimmick with no significant value.

Do you see any major Mastodon nodes interfacing with Truth Social or Gab? I certainly don't. If federation barely works for a social media app, I fail to see how it would even matter for a search engine.


At least one of the partners (https://openwebsearch.eu/partners/radboud-university/) does research on "federated search systems", so there's hope!


Isn't searx what you're describing? I was running an instance for a while, and it's basically a meta search engine that has support for all kinds of providers.

There are also some web extensions available so that you can fill it with more data.

[1] https://searx.github.io/searx/


I'd say it rather looks like Seeks, unfortunately defunct: https://en.wikipedia.org/wiki/Seeks

> a decentralized p2p websearch and collaborative tool.

> It relies on a distributed collaborative filter[6] to let users personalize and share their preferred results on a search.


Searx is half of it: it calls out to other searches but does not provide its own index, as far as I can see. It also does not remix the results.


If it is about a decentralized index, there's also YaCy [1], but I don't know how actively maintained the project is.

For me it seemed more suited to an enterprise use case (e.g. building a search for your own file servers or Confluence), so I only tested it out a little. It's a huge Java project; that's why I decided to go with searx back then... YaCy was pretty hard to set up.

[1] https://github.com/yacy


One of the things I wonder here is if it would be easier to just start by crawling known RSS feeds, exposing a JSON API for the data, and making the whole thing open source. Then keep a public list of indexes and who crawls what, and eventually move on to crawling other sources, but first address the majority of useful content that's easily parseable.
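
A sketch of the crawling half of that idea, using only the Python standard library (the feed URL is a placeholder; a real crawler would also need politeness delays, retries and deduplication):

    # Fetch one RSS feed and emit its items as JSON.
    import json
    import urllib.request
    import xml.etree.ElementTree as ET

    FEED_URL = "https://example.com/feed.xml"  # placeholder feed

    with urllib.request.urlopen(FEED_URL) as resp:
        root = ET.fromstring(resp.read())

    items = [{"title": item.findtext("title"),
              "link": item.findtext("link"),
              "published": item.findtext("pubDate")}
             for item in root.iter("item")]
    print(json.dumps(items, indent=2))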


That's probably the easiest way I know to get good content into a search engine. Annoyingly, however, it does not cover all the content available.


What benefit does federation bring here? Unless it is very simple to set up, most communities are non-technical and probably won't be able to set up their own crawler. I would think just a search engine that lets you customize the ranking algorithm, and maybe hook into whatever ontology they've developed and ranking it accordingly would be sufficient.


> most communities are non-technical and probably won't be able to set up their own crawler

They can use a solution which already integrates the search. Forums and CMSes are a good target for that. Then you can say "I'd like my search to look at widgetlovers.com too" - and you get their sitemap + featured external links, because they run FooPress, which supports it.

Kind of the same as the sitemaps we already produce for Google.


It can be very simple to set up. Think a single binary to run, or a lambda to deploy (yes, this is possible) with the URL pointing back to it.

I imagine a binary, with a simple Admin UI allowing you to crawl some domains recursively would be enough to index your own website, and then have those results shared.

Where I could see this being really useful is letting someone who knows everything about Pokémon provide the index for searching Pokémon information. Then, when they federate, they provide a taxonomy saying "for queries that have these words, call me". Suddenly you have a very high-value search source for Pokémon (see the sketch below).

Throw in some zero-click info boxes and you have added a lot of value.
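
A rough sketch of that routing step (the endpoints and taxonomy terms below are made up for illustration):

    # Route a query to a specialist index if its declared taxonomy
    # matches, else fall back to a general index.
    FEDERATED = [
        {"endpoint": "https://poke.example/search",
         "terms": {"pokemon", "pikachu", "charizard"}},
        {"endpoint": "https://longform.example/search",
         "terms": {"essay", "longread"}},
    ]
    GENERAL = "https://general.example/search"

    def route(query):
        words = set(query.lower().split())
        for idx in FEDERATED:
            if words & idx["terms"]:
                return idx["endpoint"]
        return GENERAL

    print(route("best pikachu moveset"))  # -> https://poke.example/search
    print(route("weather in vienna"))     # -> https://general.example/search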


ActivityPub is not well suited for this application. It's for publishing activities made by actors — hence the name. You'll want to invent your own federation protocol specifically for federated search.


Last time I checked there was something in there for search...

Even so, you could base it on ActivityPub, I suspect. It would need to be extended, for sure, to implement the sorts of things I believe would be required.


They've listed "DECENTRALISED SEARCH" as an ongoing project/goal.


Seriously... I really wanted to like this project, but it seems everything the EU touches these days gets worse.

From the webpage that half of the time shows "Resource Limit exceeded" to the technology stack diagram at the bottom of this page, https://openwebsearch.eu/the-project/, being completely unreadable due to bad scaling.

It is very disappointing, really. Another example off the top of my head: here in Poland we have ID cards (as does every other EU country). Those ID cards have to be renewed every now and then (10-15 years). In recent years an online system for government services was implemented, including renewal of those cards. One could take a photo with a mobile phone, submit an application, and pick up the card from a gov office a few weeks later. Unfortunately, the EU made a law that ID card applications have to be accompanied by biometrics (fingerprints), so this system has been thrown away. One has to physically go to the gov office, scan their fingerprints, apply for a new ID card and then go again to pick it up...

Ok, so what happens in 10 years' time? They should have the fingerprints already, right? No. They take the fingerprints, store them only until one picks up the ID card, and then delete them. There is no fingerprint database; they are not stored anywhere. The fingerprints are used only to ensure that the same person who submitted the application picks up the document. It makes zero sense, other than to break the previous online system. Thanks, EU.


And why the hell would the government keep the fingerprints? So that once in 10 years I save an hour? The benefit is minor compared to the bad things that can be done with a fingerprint DB (mass surveillance...).


I'm not complaining that they don't store the fingerprints. I'm complaining that they broke a perfectly good system while pretending that taking fingerprints improves security, when it does nothing. The fact that they collect the fingerprints only to delete them when you collect the document (or 6 weeks past your collection date if you don't show up) demonstrates how bullshit this "rule" is.

Our government (Polish - and those of other EU states, as I understand it) is not allowed to store everyone's fingerprints unless they are a criminal. I, as well as other people here, have pretty strong feelings against that.


> And why the hell the government will keep the fingerprints?

The government already does... just saying.


They just explained that it doesn't (assuming we trust them, of course).


> they are not stored anywhere

They are. They are stored on the card itself, and they are NOT stored in government databases. This is a good thing...


Yes, you're right, they are stored on the card itself and nowhere else, so in theory that allows authenticating the person against the document via a fingerprint. I didn't know that. Still, I think having them there doesn't provide sufficient extra value to offset having to go to the gov office twice.

Also, using fingerprints to routinely authenticate people presents a whole new lot of problems. No one voted for a party that proposed such an idea. The EU simply decided to mandate this, and it has to be implemented, no questions asked. What about people who have trouble using the fingerprint scanners governments use? I used to play bass guitar when I was a teenager. I can't wait to find out how well (or badly) this tech will work with the thick skin on my fingers.


I mean, do people really want their government to store their biometrics? What could ever go wrong with that?


and they are NOT stored in government databases

...that we know of.

Yet.


Intentional or not, at least the EU prevented an insane system where you can apply for an ID card without actually being physically present.


If you're an adult you already have an ID card... I'm talking about renewing one, not applying for the first time. One applies for one in person when one turns 18 (with a birth certificate, a passport if a citizen of another country, or a piece of paper from the border guards if a refugee, etc.).

Also, one needs an ID card plus 2FA authentication (usually connected to a bank account, a physical smart card, or a mobile phone) to log in to the government services portal in the first place. This portal is seen as a huge accomplishment (in comparison to the inconvenience of having to do every little thing in person, queue for hours, etc.). It is not just taxes; it is the health service, local councils, national (and health) insurance, building permits - basically almost everything one can do in person can be done via this portal, except renewing an ID card, since the stupid EU rule came into force...


You still have to apparently go in person to pick it up, how is that an insane system? Now you have to go twice.

edit: I mean you can renew a US passport purely by mail without ever interacting with anyone and I haven't heard of massive issues caused by that.


so your complaints are: the webpage announcing this news isn't perfect, and the (well-known for being incompetent, corrupt and anti-EU) Polish government has implemented an EU policy poorly?


haha, very funny. I think you are 100% correct if we invert every statement you made. Out of curiosity, how would you implement this "policy" so that it wouldn't be "implemented poorly"?

As for being corrupt and anti-EU, I have to say that in the last decade at least, no group of politicians has been more incompetent, corrupt and anti-EU than the EU Commission itself. From the botched/corrupt "green new deal" that resulted in complete dependency on Russian hydrocarbons and another war in Europe, through the "pay Turkey for the problem to go away" Mediterranean refugee "solution", to complete ineptitude in the first 6 months of the pandemic and basically leaving Italy on its own, culminating in the illegal withholding of funds from member states that elected parties opposed to the current option in Brussels.

However, what truly destroys the EU is not even the above, but the lack of respect for the rule of law among the top officials. They have their goals and, no matter what, they will do anything to reach them. For example, they want more integration and a federal state. They proposed it fairly, some years ago, as an EU constitution, and it was demolished in referendums. Instead of giving up, hearing the democratic choice and going in the direction the sovereign (the people) told them to, they proceeded to implement it another way, over people's heads (it was supposed to be implemented in the Treaty of Lisbon). However, those treaties have to be unanimous, and some countries didn't want to be essentially ruled by the biggest countries, so it got watered down back then. Then they realised it is impossible to implement this goal in accordance with the rule of law, so what they are trying to do now is twofold. First, throw out unanimous voting in favor of majority voting, so that smaller countries' objections can be disregarded. Second, bully countries that disagree by illegally withholding funds.

We're very near the end of the EU, and it does make me sad, because I still believe in the ideas that led to it in the first place: a free market for goods, travel and work, and shared values and work towards common goals between member states - not bullying each other or selling another member state's security for financial gain.

At least in my generation, 10 years ago I would guess 95% of people considered themselves very pro-EU; now, unless the current political class GTFO promptly, I don't see the EU being a thing in the next 10 years.


Did you notice that no one chimed in to say "yeah, we have this same problem in [my EU country] too!"? The reason for that is that - as usual when people are criticising the EU - it is in fact your government's fault, and your politicians/news media/nationalist friends are using the EU as a scapegoat for their own (nation's) incompetence/corruption.


I'm a bit skeptical EU-funding a bunch of professors is the way a search engine will be built.

The primary goal for academics is to publish new findings, while what you need to build a search engine is rock solid CS and information retrieval basics. Academically, it's not very exciting. Most of it was hashed out in the 1980s or earlier.


>I'm a bit skeptical EU-funding a bunch of professors is the way a search engine will be built.

Heh, so, funny story...

>A second grant—the DARPA-NSF grant most closely associated with Google’s origin—was part of a coordinated effort to build a massive digital library using the internet as its backbone. Both grants funded research by two graduate students who were making rapid advances in web-page ranking, as well as tracking (and making sense of) user queries: future Google cofounders Sergey Brin and Larry Page.

>The research by Brin and Page under these grants became the heart of Google: people using search functions to find precisely what they wanted inside a very large data set.

https://qz.com/1145669/googles-true-origin-partly-lies-in-ci...


Splitting when the project looks like it's gonna make money is the American way. (Thanks for the public funds. khaaa chingg)


They did a lot more good by making a company rather than sticking around in academia and publishing a few extra papers.


I see this a lot in a project I'm involved with, where the majority of contributors are from academia. The incentive is mostly to push novel things, often with questionable practicability, in order to write a paper about them.

But nobody will implement the 'boring' features needed to make the thing generally useful.


username checks out


not that I support this approach, but the return on public investment there is clearly huge


The best remote+paying job that I've ever seen online was from a DARPA project (the Memex project, about search engines): $180K-$250K+. This was ~5 years ago.

Curious what the salaries will be on this one.


Marginalia, I know you are working on a (fantastic!) search engine of your own https://search.marginalia.nu/

I salute your efforts and endorse your search engine. I also recognize that you know what it takes to build a search engine.

I don't think you have deep familiarity with EU academia.

- The primary goal for academia is to influence society. Publishing is a route to that.

- Being head of the EU search engine project would confer high academic status.

- There are hundreds of articles which you could publish on this project.

- Rock-solid CS: would someone like Knuth count as "rock-solid"? Who is better at CS, the person who can implement quicksort because they practiced leetcode, or the person who invented quicksort?

- Information retrieval basics: again, these basics were probably developed in academia.

The skills you say are basic to this endeavour are more prevalent in top-quality professors and post-docs than they are in industry.


On top of those skills, what you actually need most of all is software engineering and architecture experience. I don't think this is common in academia at all - not in professors, not in PhD students. You need practical experience building complex software at a large scale, and across that, you need to implement these CS fundamentals.

This requires far more CS than you'll find in your usual software development effort, for sure, and many CS professors absolutely fit that bill. However, to the same degree it also demands far more on the software engineering side. People out of academia, from every time I've seen them build software, have not been all too impressive on that side of things.

Web search has an incredible demand for being well rounded, beyond anything else I've encountered. CS isn't the hard part bottle-necking everything else, it's just one of the many hard parts.


One could have been skeptical about the US funding a bunch of university students to build a search engine a few years ago.


Search was quite broken back then. It got reasonably good at some point, now I’d argue it’s the content as a whole that’s gotten worse.

I’m just skeptical if the EU bureaucrats will put the money in the right place, and if this is even the right approach.


The project can fail, you are correct, but it does not take anything away from any other projects; it is just another initiative trying to contribute in the space.

The parent commenter's own search project, Marginalia Search [1], could even benefit from it, or maybe even collaborate with it.

It is not a winner-take-all situation, and we need various open initiatives in this space to get out of the current conundrum we are in with Google's stranglehold on search.

1: https://search.marginalia.nu


Taxes take away from funds for personal projects, and universities already get a lot of funding in Europe to do research; maybe they should focus on creating a better environment for students to become researchers or entrepreneurs instead.

Overall I think there are better ways to improve search from an EU perspective by doing what they are supposed to be doing:

- create a fair environment for companies to compete in, e.g., take a look at how Google, Apple or Meta's assets are set up to make it harder for competitors, and break that up

- improve standards in education - it doesn't really make sense for all member countries to each think up and maintain a good CS curriculum, and they all seem to be pretty bad at it

- make it easier to build something and get funded, and reward creating prosperity, don't tax it to death

Just tested marginalia's random mode btw. Pretty cool, reminds me of the internet when I was a kid

(edit: formatting)


I don't think the US funded Google?


Google started out of a CIA-funded Stanford project.


CIA? Other people are saying DARPA and NASA got together to fund the NSF which funded the PhDs of the founders of Google, but even that's a bit too indirect IMO. Where does the CIA fit in?


https://qz.com/1145669/googles-true-origin-partly-lies-in-ci...

Enjoy! It's a great story.

(Plus: for those who might not know, DARPA is US defense research, heavily influenced by the intelligence services' needs. Which is not necessarily bad! Just good to understand where and how Google originated. And wrt DARPA, they funded the creation of the internet itself, for whatever that matters.

In Europe, things often go slightly differently. The Web is a result of CERN, which is also a project partner of OpenWebSearch.EU. Why? Well, better search can also be beneficial for better science, not just for end users wanting to find their way or buy something.)


Thanks :)


..correct me if I'm wrong, but Google was started by a couple of postdoctoral researchers, no?


Who deliberately did not stay in academia to do it. More to the point, a successful team building a product like a search engine requires roles that academia doesn't really have.

Who is doing product management?

Who is doing product marketing?

etc

This is all applied engineering at this point, not R&D. How does it at all fit into academia's strong suit?


I think that maybe the point is this is not being tackled as a purely economic endeavour (or if it is, it’s in the “indirect” manner), as such, I suspect roles like “product marketing” are probably unnecessary, at least for now.

Also, tell me you wouldn’t love to work on a large project that wouldn’t be subject to the arbitrary whims and promises of the marketing department.


> the point is this is not being tackled as a purely economic endeavour (or if it is, it’s in the “indirect” manner), as such, I suspect roles like “product marketing” are probably unnecessary, at least for now

Which results in an interesting engine nobody uses. Products that start with the tech and then think of selling it fall on their faces for a reason.


Google started with tech and disdain for online advertising. They didn't start with the ads.


and were forced to pivot because it didn't work to do it that way, empirically disproving your thesis?


They weren't forced to pivot. Gates explained to them how much money was to be made, and they changed their minds. For some time they were very excited about how unobtrusive and helpful the ads were. Then they realised there was even more money, and the rest is history.

If Google proves anything, it's that greed is real.


> 14 European research and computing centers

> 7 countries.

> 25+ people.

There are literally dozens of them!

https://openwebsearch.eu/partners/


I don't think the number of people or even the size of the budget is wrong. A small team can be incredibly powerful and productive if you have the right people. In fact, I think far more often search engines fail from trying to start too big than too small.

The problem is that you need people who actually know how to architect complex software systems much more than you need revolutionary new algorithms. For that, professors are the wrong people. A professor on the team, sure, that might be helpful. Not half a Manhattan project's worth.


> For that, professors are the wrong people.

Have no fear; all of the actual work will be done by PhD students straight out of undergrad, and most of the actual leadership will be done by a string of recent PhD grads who need results in 6 months because they'll be on the job market full time for the 6 months after that ;-)


It happens all the time in Europe. Collaboration between public and private companies is pretty much a pipe dream in the EU. Some company that actually works on building search technology would achieve way more than a bunch of professors.

I disagree on the budget though. It is basically pocket change.


Arguably the biggest unsolved problem in search is how to make a profit (or even break even). This can be approached in two ways: you can either try to find some way of making search more profitable, or you can find a way to make search cheaper. I think the latter is a lot more plausible than the former.

A shoestring budget keeps the costs down by design and by necessity. A large budget virtually ensures the search engine becomes so expensive to operate it will never break even.


Why not offer a paid tier? Seems to work for Kagi. Information elites will soon flock to paid search engines, which won’t be much more expensive than a streaming subscription. I pay for Netflix and am willing to pay for a search engine that offers as good a search service as the video streaming offered by Netflix.


> Arguably the biggest unsolved problem in search is how to make a profit

And the EU just solved that problem.


Did they though?


> I'm a bit skeptical EU-funding a bunch of professors is the way a search engine will be built.

Worked for Google.


The real game-changer in search would be if companies would agree to publish indexes of their own sites in an open standard to a place that everyone could access. This would undercut the monopoly power that large search engines have and allow everyone to focus on innovating the best way to search that content vs. having to spend so much time and money to crawl and index it.


There are already sitemaps, and pages use structured data like HTML5/ARIA roles, RDF or JSON-LD to provide some semantic annotations.

I'd rather that web robots use this information to build useful indexes than to have to worry about generating yet another feed in the hopes that it helps people find my content in a search engine.

Besides, a web robot can determine how much other sites link to my content and help determine its overall ranking in results. Adding another type of index file to my site will do nothing to determine how it relates to other sites.
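
For what it's worth, extracting embedded JSON-LD is not much code. A standard-library-only sketch of how a robot might pull those blocks out of a page:

    # Collect <script type="application/ld+json"> blocks from HTML.
    import json
    from html.parser import HTMLParser

    class JsonLd(HTMLParser):
        def __init__(self):
            super().__init__()
            self.inside, self.buf, self.blocks = False, [], []
        def handle_starttag(self, tag, attrs):
            if tag == "script" and ("type", "application/ld+json") in attrs:
                self.inside = True
        def handle_data(self, data):
            if self.inside:
                self.buf.append(data)
        def handle_endtag(self, tag):
            if tag == "script" and self.inside:
                self.blocks.append(json.loads("".join(self.buf)))
                self.inside, self.buf = False, []

    page = ('<html><script type="application/ld+json">'
            '{"@type": "Article", "headline": "Hello"}</script></html>')
    p = JsonLd()
    p.feed(page)
    print(p.blocks)  # -> [{'@type': 'Article', 'headline': 'Hello'}]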


The structured data on sites, unfortunately, still requires a crawler to index that content, which serves as a barrier for search engine startups. At a minimum, adding some metadata content to XML sitemaps would go a long way to solving some of this problem (title, meta description, content summary, even structured data to the sitemaps).
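
As a sketch of what such an extension could look like, here is a standard sitemap <url> entry with extra metadata fields bolted on (the "meta" namespace is invented for illustration; no such extension exists in the sitemaps.org protocol):

    # Build one sitemap <url> entry with extra (made-up) metadata tags.
    import xml.etree.ElementTree as ET

    SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
    META = "{https://example.org/sitemap-metadata}"  # hypothetical namespace
    ET.register_namespace("", SM[1:-1])
    ET.register_namespace("meta", META[1:-1])

    urlset = ET.Element(SM + "urlset")
    url = ET.SubElement(urlset, SM + "url")
    ET.SubElement(url, SM + "loc").text = "https://example.com/post/1"
    ET.SubElement(url, SM + "lastmod").text = "2022-09-20"
    ET.SubElement(url, META + "title").text = "Example post"
    ET.SubElement(url, META + "summary").text = "One-paragraph summary a ranker could index."

    print(ET.tostring(urlset, encoding="unicode"))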


Deep down in my soul, the long-locked-away SEO of my money-hustling youth just grinned in anticipation.

We have had embedded metadata in websites for decades. In the beginning, search engines even used it. Until someone started stuffing unrelated keywords into it to rank higher.


> The structured data on sites, unfortunately, still requires a crawler to index that content

How is that any different from requiring a crawler to index XML sitemaps?

> At a minimum, adding some metadata content to XML sitemaps

The purpose of a sitemap is to tell a web robot what resources there are, with some minimal metadata about page titles and last modified date.

Google has some extensions for identifying images and videos.

But that adds more work for site maintainers, who have to duplicate work.


What's the problem with using any of the many free web crawlers (or crawler libraries) available to crawl a website (even if solely based on the pages advertised by sitemap.xml / robots.txt-announced sitemaps), and then extracting structured data from those pages?

I don't see this as a barrier unique to startups.


It's easy to do for small sets of sites, but try doing this at web-scale and you quickly run into a large financial barrier. It's not about technical feasibility as much as it is cost.


A standard for this already exists [1], but it does not solve the problems of:

1. Implementation (sites do not need to have a sitemap, and those that have one may not keep it accurate)

2. Discoverability (finding sites in the first place; you'd need a centralised directory of all sites, or resort to crawling, in which case sitemaps are not needed)

3. Ranking (the biggest problem in creating a search engine)

[1] https://www.sitemaps.org/protocol.html


The sitemaps standard (if this is the basis) would need to be expanded to support additional metadata / structured data to support this idea.

1. This would be up to sites, to your point; the major question would be how best to create incentives.

2. This is solvable via a number of approaches, but the search engines themselves would be mostly responsible for finding the right approach for their business. I know how I would do it.

3. Indeed, which would be the main point of this decentralization, to let search engines focus on their hardest problem.

Edit: wouldn't Kagi benefit from not having to worry about crawling / indexing sites?


> wouldn't Kagi benefit from not having to worry about crawling / indexing sites?

It would, but sitemaps do not provide that function, as we discussed above. However, if EU Open Web Search succeeds, that is something we could probably use to some extent.


Or use to extend


One problem with that is now you have to trust the websites to give an accurate index of their content.


Anyone who thinks this will work has never tried to index a site. A huge amount of effort is spent trying to figure out if the site is serving different content to users vs crawlers, or if the site is coded to appear visually different to humans vs machines. If you ask sites to index themselves you will get lies only.


I index sites all the time and I think it could work. There will be other problems, of course, but we already are partly there with XML sitemaps. Relying on the large search engines to enforce “honesty” from websites puts them into a mediator role that has a number of negative effects both for search in general and, increasingly, society at large.


Relying on sites to be honest about themselves is even less likely to work. There are monetary incentives for many of them not to be. Many sites already host dishonest and clickbait content with extreme levels of SEO. The cost of dishonesty decreases if you can directly modify the index.


I think that is primarily a symptom of the fact that we have a bottleneck on search interface providers. If it were easier / cheaper for new search engines / rankers to exist in the market, they could fairly easily filter out unscrupulous domains.


I've run a web-scale search engine and I don't think it will work.

Not only are some sites malicious -- mostly unimportant ones -- but many good sites are simply incompetent.


Indeed


I suspect you underestimate how much of the power of search engines is being able to interpret search queries and figure out what a user is really looking for. Even if there were a public, standardised up-to-date high performance full-text index of the entire web freely available I'm willing to bet Google search would be a useful value-add in its ability to answer natural language queries.


> you underestimate how much of the power of search engines is being able to interpret search queries and figure out what a user is really looking for.

So you mean that a search engine is supposed to ignore what you are asking for and instead give you what it thinks you really meant?


If you search for [what year was queen Elizabeth born], Google doesn't return pages that have that phrase in it, rather it returns pages that have an answer to that question. You could call that "ignoring" if you like, but it's what 99% of users expect.


I run an SEO platform SaaS, so I'm familiar. :)


I resort to "natural language queries" only in desperation, when queries that are lists of search-terms have failed.

Actually, they aren't really natural language queries. They are just ordered lists of search-terms. Goo provides no mechanism for saying "This is an english-language question". And even if Goo could parse my natural language, and rephrase it as something like "Are you looking for a list of books published by Douglas Hofstadter?", when you turn that into a query on the index, it stops having anything to do with natural language.


A "dumb" search that just took a search phrase like "books published by Douglas Hofstadter" is going to return pages that have that phrase in it, or something close to that phrase. Google will prioritise results that actually contain such lists, regardless of whether the page contains a phrase like that (e.g. the word "published" is basically ignored by Google). That's all I meant.


We will explore that idea in the project; I also think it may help (though it is vulnerable to web-index spam by adversarial parties).


That is indeed the biggest problem but maybe something that can be more effectively dealt with downstream by the content rankers and potentially even the user base / custom search algorithm builders. Brave's Goggles project is a good early prototype of this concept.


I'm pretty sure we tried that way back in the day with <meta name="keywords" content="spam spam spam spam">. People would stuff that with every word in the English language. Older search engines that used those keywords returned some pretty awful results. You simply can't trust sites, which have a strong incentive to get to the top of SEO rankings, not to lie. In fact, given that at least one of your competitors will stuff their keywords to get to the top, you'll have to do it too. It would become an arms race over who can stuff the most garbage into their indexes to "win". It just doesn't work.

All search engines that attempt to be useful will have to filter out the junk. You just have to trust that the search engine you are using isn't withholding results from you that it considers "bad" (eg: "misinformation" (i.e. stuff somebody disagrees with)).

And to me, that is the crux of the debate really. Nobody wants spam for search results--everybody agrees with that and there is no real debate about filtering that crap out. The argument really is should a very large company that has a huge market share get to decide what constitutes "fact" and what is "misinformation". Based on 2.5 years of experience so far, what was once deemed "misinformation" has a sneaky way of becoming "factual information". Labeling and hiding "misinformation" because it goes against some narrative pushed by incredibly powerful entities is very scary and there was a hell of a lot of exactly that going on during this covid crap.

I used to fall on the side of "private companies can do whatever they want" but now I'm not so sure. Companies like FB, Twitter or Google play a huge role in shaping politics and society. I'm no longer convinced it is okay to let them play the role of "fact checker" or anything like that. Filtering spam is one thing, but hiding "misinformation" is entirely different.


Your last point is also the one (aside from the economics) I am the most interested in.

I think we live in a world now where we are so used to a few tech giants mediating everything for us that we can't even imagine other solutions to this problem, but it's also how we got to this point in the first place.


>You simply can't trust sites, who have a strong incentive to get to the top of SEO rankings

Why is it not enough to punish sites that abuse the keywords?


Who is the one who punishes the abusers? How can you scale the solution to deal with billions of pages?


The users punish.

You need a trustworthy core by which you can judge the vote of new users. You can incorporate them until somebody complains about a result that is out of place.

This doesn't have to fully scale. There are many pages without monetary value that won't be manipulated. The tags are an additional signal that can be used where they work. If they don't work, they can be ignored.

But it will scale because there are far more consumers than producers.


People would abuse that for SEO purposes within seconds.


The market need would then be shifted to the best search interfaces instead of who has the most money to build the biggest index. A much better focus, IMO.


I believe that is precisely what the project is aiming to do, and to turn it into a public resource.


I’d rather see them publish a federated search of their own content.


Your comment prompted me to check out Searchcode, looks very interesting. How would the federated search model work in this example? Instead of you having to index the various code repositories, they would index themselves and make their search of those indexes available via a federated API?


I don't see any mention of Quaero, the EU search engine that was supposed to compete with Google [0, 1]. How is this time different?

[0] https://en.wikipedia.org/wiki/Quaero

[1] https://www.dw.com/en/germany-pulls-away-from-quaero-search-...


For starters: the objective is to create the index, not the engine; that's quite a different ambition.

We are very aware of the Quaero/Theseus history :-)


What is the difference?


Supposedly the project is about just building the platform/infrastructure (which is what the index is) upon which search engines can be built.

These search engines will then have the freedom to define their own search product experience, business model, even ranking of results.


So something even more vaguely defined and detached from real use cases than last time? Great.


It is very precisely defined; your not understanding what they are building does not mean it is not worth the effort.


The above actually defines the scope very well. There is a lot more to be built on top of it, but that is not what the project is trying to solve.


Is there any discussion on how this work will differ from Common Crawl?


This was the previous legislature's project. The new legislature brings CHANGE. They are not the same...


> A new EU project OpenWebSearch.eu … [in which] … the key idea is to separate index construction from the search engines themselves, where the most expensive step to create index shards can be carried out on large clusters while the search engine itself can be operated locally. …[including] an Open-Web-Search Engine Hub, [where anyone can] share their specifications of search engines and pre-computed, regularly updated search indices. … that would enable a new future of human-centric search without privacy concerns.

So... who's going to create the index? Indexing the web is expensive, and it's offset by the ads the indexer runs on their search website, as Google, Bing, Brave, and others do.
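
To make the quoted architecture concrete: a toy sketch of a search engine "operated locally" over pre-computed index shards. The JSON shard format is invented for illustration; the project has not published a shard specification.

    import json
    from collections import defaultdict

    # Invented shard format: {"term": [[doc_id, score], ...]}.
    def load_shards(paths):
        index = defaultdict(list)
        for path in paths:
            with open(path) as f:
                for term, postings in json.load(f).items():
                    index[term].extend(postings)
        return index

    def search(index, query):
        scores = defaultdict(float)
        for term in query.lower().split():
            for doc_id, score in index.get(term, []):
                scores[doc_id] += score
        return sorted(scores, key=scores.get, reverse=True)

    # A local engine downloads only the shards it cares about and queries
    # them in-process -- the query itself never leaves the machine.
    index = load_shards(["shard-00.json", "shard-01.json"])
    print(search(index, "open web search")[:10])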


Well, easy: Europeans are going to fund it with their taxes, public search labs are going to build it, and private companies will squeeze the concept for cash for as long as possible while taking EU taxpayer money as "research grants" and claiming operational costs as "research tax credits", which will set them up as special partners for the next EU cash grab (I mean quadrennial plan), since they will have successfully run a public-private partnership (into the ground)

t. was involved in one of those plans as part of the research team in a public lab


I wonder how privacy will be ensured when your query hits the map-reduce infrastructure running on these clusters.

Regarding privacy, the bar is significantly higher than what Google has to deal with. This will come at some cost in quality and/or speed.


Every individual website has an incentive to create indices of their own content, and hosting providers could provide it as a service. Not hard to envision. Search Engines could download these indices periodically to build the meta-search.


Also not hard to envision websites being incentivised to lie in their indexes.


Lie how? Meta indexes can be picky about the sites they list, simple as that. One could imagine personal meta search engines which only list the sites you care about. Bad sites simply don't get listed.


The point is that if a popular search engine returned pages primarily based on whether particular search terms appeared in their self-published indexes (rather than using a crawler that parsed the page content to build its own index), web sites could easily publish indexes that just listed popular terms users are known to search for, regardless of whether the site had anything to do with those search terms. Pretty much what used to happen 10-15 years ago when everyone believed meta tags were the way to achieve SEO.


Someone who's snagging an EU grant, that's who.


> Someone who's snagging an EU grant, that's who.

Bullseye.


What does "based on European values and jurisdiction" refer to? I'd love to be pleasantly surprise, but this sounds like it's ripe for centralized censorship.


By US values, profit is weighted heavily over anything else. Chinese and Russian values (or lack thereof) are focused on controlling the narrative. European values would contrast as rules (as opposed to vague guidelines) with legal ramifications, a lot of open source software, an actual effort towards honesty (mileage may vary), and probably Vogon poetry.


> Chinese and Russian values (or lack thereof) are focused on controlling the narrative

I think it's reasonable to label China and Russia as control freaks, but the US and Europe aren't much different at this point either, given the creepy focus on trying to shut down whatever they want by calling it "misinformation" and pressuring social media to comply.


But you do get your day in court.


"Right to be forgotten" is an example of a European value.


Um, bad example. It was a misconceived bit of legislation. There's no logic behind that claimed right. What happened to my right to remember what I choose? Why shouldn't I publish my memories on a website?

Anyway, it's not a "European value"; it's a piece of legislation, so it falls under the "European jurisdiction" category, not the "European values" category.

That complying with "European values" is part of the project goals is unfortunate. Who decides what are European values? There's no Declaration of European Values. We don't all have the same values. There are just European laws, and then a whole bunch of opinions that not all Europeans agree about.


people really, truly, go out of their way to criticise the EU. there are things about the EU that deserve criticism, but this is reaching to the point of deliberate misunderstanding. no one is saying that a person who wants to be forgotten cannot be talked about on your website, or that you have to somehow deliberately delete your memories of that person. you are not a corporation. the right to be forgotten is about the right for you to be forgotten by a corporation. the right for you to not be in their database anymore


> the right for you to not be in their database anymore

Let's switch that around: you're referring to their right not to be listed in my database any more.

I don't see what difference it makes whether I'm incorporated or not; I'm not OK with the idea that certain information cannot be shared. Like, if the information is false, that's one thing; but a legal requirement that some true information must be hidden, that seems like pure badness.


your argument is predicated on the assumption that there’s no difference between you and a large group of people. this is very clearly a false assumption


It could be fun to have a conversation about this topic, but when you start with calling the opinions of others a "deliberate misunderstanding", you've poisoned the well.

Still, it's a good illustration that it's a somewhat unique European value. It's definitely not an American value.


it’s such an overpowering misunderstanding, combined with a topic people have very little objectivity about, that it is not actually believable.

do you also think you’re not allowed to remember people’s names without permission because of GDPR?


Given the history of the 20th century, this kind of comment promoting European values and jurisdiction seems..... dicey. Companies' ethical records, as shitty as they are, have nothing on the mass destruction, genocide and stupidity of governments.


On first glance, I see the word "unbiased" immediately followed by "based on European values". Now, I'm no expert, but to me, that seems pretty biased.


"being unbiased" can be fairly honestly called a "value". Europeans apparently have chosen it as one of their own values.

If you are arguing that being truly unbiased cannot ever be realized, I counter that neither truth nor justice can ever be truly realized, but this should not stop anyone from taking them as their values.

     signed by: an expert


It's not that there's anything wrong with aspiring to be unbiased; the problem arises when someone really thinks they are unbiased, or claims to be unbiased. You get this a lot in the MSM; the Guardian constantly exhorts its readers to subscribe, to support unbiased journalism.

I prefer reporting that wears its biases on its sleeve. Regrettably The Guardian squeezed out most of its most interesting writers, apparently by repeatedly spiking their stories.

Yeah, Truth and Justice are platonic abstracts; nobody thinks they exist in the real world. But people do believe that unbiased reporting is possible. It isn't.


biased on European values



Which now shows:

> Resource Limit Is Reached

> The website is temporarily unable to service your request as it exceeded resource limit. Please try again later.

Original URL might be more resilient...


Hmm. I can access the page without that message. In any case the Internet Archive seems to have it:

https://web.archive.org/web/20220920183027/https://openwebse...


I wish the project wouldn't try to moonshot a thing we already had. I want the good old Google search where you type something and all relevant web results pop out (without creepy ads, trackers, personalisation, or shopping links to get you to spend). I understand it is very expensive to crawl the web; that's where an org like the EU can come in. I wish they bootstrapped a simple, open index that any startup can use to provide clean search fronts. Crawling the web and indexing content can be quite expensive. If the EU bootstrapped that infrastructure and then, say, augmented it with federated crawling and indexing (sort of like the Condor distributed-computing project), where people, universities, etc. in the EU could contribute their spare computer time to crawling and indexing the web for the EU index, it would make search better for everyone. Heck, I'd even put solar panels on my bike shed and hook them up to a PC to crawl the web while the sun is shining.

I'm dreaming too much, I need coffee..


But the web is different; google much less so. There is no "old google search", because almost everything to search through is noise and spam.

PageRank made user preferences, as signalled by hyperlinks, the key signal of quality. Who really hyperlinks any more? And how many hyperlinks, by percentage, are non-automated?

The web this works for is one of forums & personal websites. It's hard to say that today there are any properties of websites that are a reliable signal of quality.

Hence the proliferation of voting sites (such as HN, etc.) which are little more than search engines augmented with reliable signals of user preferences.


>But the web is different; google much less so. There is no "old google search", because almost everything to search through is noise and spam

the commenter is not referring to the search results, they're referring to the interface not having all the wank that google packs in to optimise for profit

>Who really hyperlinks any more?

who doesn't?


Correct me if I am wrong: the purpose is to create an index database upon which custom search engines can be built? I.e., the EU will crawl all pages on the web?


The index is just the first step according to news articles:

> Once the index has been created, the next step is to develop search applications.

> The team at TU Graz will be particularly active here in the CoDiS Lab and will work on the conception and user-centric aspects of the search applications. This includes, for example, research into new search paradigms that enable searchers to have a say in how the search takes place. The idea is that there are different search algorithms or that you can influence the behavior of the search algorithms. For example, you could search specifically for scientific documents or for documents with arguments, include search terms that have already been used, or include documents from the intranet in the search.

https://www.krone.at/2791083


This is an excellent idea to disintermediate Google. A search tool for the commons, without the spyware and opportunities for search manipulation, is very important, and the more actors work on it, the better. Perhaps cooperation with other tech-savvy democracies could occur. India is full of skilled programmers. Unlike Russia or China, it is easy to see Europe cooperating with it.


Hmm. India also has a government that is borderline fascist, promotes ethnic and religious hatred, and doesn't seem to believe in internet freedom (they keep shutting it down). And it's not the lack of skilled programmers that makes Western governments wary of cooperating with China and Russia. China and Russia have totalitarian governments that are terrified of the internet, and would like it to go away. I wouldn't want to include them in a project to make the internet work better.


What personally worries me the most is the "Meet the partners" page.

I didn't look at all of them, but I did check who the partner from Slovenia is.

It's the most privacy-invading ISP/mobile company in our country, which sells statistics about its users (anonymized, yes, but walking a thin line).

I just hope it won't go down that drain.

Regarding all the negativity about the government project.

If we can run CERN we can surely do a web search project.



I suspect search engines are an outdated concept for at least the largest of sites, who will generally, but not always, have better ways to directly search their own content.

The remainder of the search problem seems to just be collecting relevant trafficked sites for listing in results. Today Google et al seem to be doing this BY HAND. And it's not even obfuscated.

Recently, for the first time in my life, the wizard behind the curtain seems to have been exposed. I feel strongly that one could probably start a small index that catered to a fairly large audience.

And honestly, for other queries, just tell the user to search that site directly. I think you could even market it to users as not a technical limitation, but behavior that should be considered fuddy-duddy.

Like, really, you're going to search me? You know they have their own search right?

Even Yellow Pages faded into obscurity eventually.


> the largest of sites, who will generally, but not always, have better ways to directly search their own content.

I have the exact opposite experience.

To wit: searching HN via the algolia link at the bottom is way worse than searching on Google with a site:ycombinator.com restrict.

Same thing for YouTube, where the search engine is tuned for maximizing watch time and strictly not to return what you're looking for.


I would agree; I've almost never seen a site that provides a better way to directly search its own content than the major search engines provide. Wikipedia might come closest because their search compares the user input to article titles, and that's often helpful (mainly because the article titles are chosen to describe the article content, and aren't clickbait). But pretty much anything else uses a search approach that is far inferior to what Google or Bing provide, and it won't find what you want.


Looks like it's Northern EU only.

No research institutes from {France, Italy, Spain, Greece, Portugal, etc ...} involved.


Slovenia, Czech Republic. But yes, I think there was a competing proposal from Italy/Spain. Not enough budget for two projects in this area, unfortunately, as they were good too.


Some of you might be interested in the privacy focussed "French Google" Qwant. https://en.wikipedia.org/wiki/Qwant Plot twist: It's funded by Huawei.


Good to see more efforts from EU to negate big tech domination. Federation is the way forward.


Before Corona, I would have really welcomed this announcement. But to be honest, when someone says that the project will be "based on European values," I think it's stillborn.

Who is defining those "European values"? Is it the European Commission?

As a liberal person who still dreams of mature citizens who form their own opinions well-informed from a rich debate, I now see the "European values" quite critically.

In Germany, if you criticize the Corona measures you are called a Nazi; if you criticize the Ukraine war, you are a Russian troll. Are those then the "European values"?

The European Commission has already established some projects to weight the information according to its will (SocialTruth, PROVENANCE, EUNOMIA, etc.).

When a government agency talks about truth, then all the hairs on the back of my neck stand up.

When governments speak of disinformation, it is usually only in the sense that the information does not correspond to their interests. I had to learn that the so-called fact checkers don't check facts; they just sell a counter-opinion as a "fact".

So for a European search engine, a filter is then placed upstream that filters out all disinformation?

In the past, that was called censorship. Today, it's more like citizen service.


> If you criticize the Corona measure you are a Nazi, if you criticize the Ukraine war you are a Russian troll?

These notions are propagated by the US to the rest of the world. While EU politicians seem to go along, Europeans do not tolerate it much, and that is reflected in society.

If the EU or more gov-sponsored search engines pop up, I have no doubt they will want to control them in their own way; I'm okay with that. Right now our only option is US-controlled entities that answer to US governments and rules.

At least here we can point fingers and hold politicians accountable if they try to influence things the wrong way. We can't do that with US companies.


> At least here we can point fingers and hold politicians accountable if they try to influence things the wrong way. We can't do that with US companies.

Germany is currently governed by a chancellor who has in fact been convicted of lying, has selective memory lapses as a co-responsible party in one of the biggest banking scandals, and was also instrumental in the political decisions to push ahead with dependence on Russian gas.

So sorry if I doubt that politicians can be made accountable for anything.


I couldn't work out what "building blocks" they're going to produce. Are they going to produce a public index, that search engines can use? A set of APIs?

> the researchers will develop the core of a European Open Web Index (OWI) as a basis for a new Internet Search in Europe.

That sounds as if building an index is at least within scope. But isn't the design of the index one of the key differentiators between different search engines? Perhaps the OWI will just produce tools for building transparent, privacy-preserving search engines, rather than crawling the web itself. It's not clear (to me) from reading the site.


The project is just starting, so not all your questions are answerable today, but we will definitely produce an open web index, already by the end of the first year, with improvements in years two and three.

We will further deliver components for building search engines on top of this index. The project's vision is that there will be many different search engines, not just 4 worldwide. Hoping to lead the way!


A crawler should see web pages as a human would, read them as a human would, and extract keywords based on the real displayed content. After that, the search could be as simple as looking up a book in your library catalog. Think liberally about the search key, e.g. "<40% of the screen is display ads". If the search engine gets popular, you can then worry about scaling up to serve the results faster.

Oh, and PageRank can be done using real semantic links (actual human comments) rather than easily spammed hyperlinks. And you can group all similar pages and serve those as a single expandable/refinable entry.
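
A hedged sketch of "index what the reader actually sees", assuming a headless browser such as Playwright is acceptable (rendering every page like this is exactly what makes crawling expensive at web scale):

    from collections import Counter
    from playwright.sync_api import sync_playwright  # pip install playwright

    STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

    def visible_keywords(url, n=20):
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.goto(url)
            # Only text that actually renders, after JavaScript runs --
            # roughly what a human reader would see.
            text = page.inner_text("body")
            browser.close()
        words = [w for w in text.lower().split()
                 if w.isalpha() and w not in STOPWORDS]
        return Counter(words).most_common(n)

    print(visible_keywords("https://example.com/"))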


Yes, telling spam (spam blogs, copy-pasted versions of Stackoverflow, etc) from real organic content is easy for a human, so I can only assume it would be pretty easy for a computer too. At least until this becomes a practice that spam site makers have to consider and adapt.

For example, given 10 similar articles where one has a well-known domain but few links and the others are all cross-linked only among a known spam-net, then just show the well known page (stackoverflow) and not the useless copycats.

It seems doable for a computer to also navigate the page like a human and try to negotiate various cookiewalls and similar. If a page requires jumping through hoops to enter, rank it poorly. There should be a single decline/reject if there is a button at all, and the content ranked should be the content shown after rejecting. If it shows two buttons "Accept" and "Options" then just don't index that page.
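
One possible way to encode the "prefer the well-known original" heuristic; the domain names and the link-diversity signal are invented stand-ins for real reputation data:

    # Toy canonicalization over a cluster of near-duplicate pages: keep the
    # page whose inbound links come from the most distinct domains, on the
    # assumption that spam-nets cross-link among only a few related domains.
    def canonical(cluster, inlinks):
        """cluster: list of URLs; inlinks: URL -> set of linking domains."""
        return max(cluster, key=lambda url: len(inlinks.get(url, set())))

    cluster = ["https://stackoverflow.com/q/1",
               "https://copycat-a.example/q/1",
               "https://copycat-b.example/q/1"]
    inlinks = {
        "https://stackoverflow.com/q/1": {"blog-a.example", "news-b.example"},
        "https://copycat-a.example/q/1": {"copycat-b.example"},
        "https://copycat-b.example/q/1": {"copycat-a.example"},
    }
    print(canonical(cluster, inlinks))  # -> the stackoverflow.com URL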


> If a page requires jumping through hoops to enter, rank it poorly.

Oh, how I wish that search engines would simply give up on encountering a paywall/cookiewall, and refuse to index the page.


I really hope this project is successful - the heart seems to be in the right place!


> Free, open and unbiased access to information

How can something be unbiased yet somehow stop showing illegal search results?


Search is way more than just indexing.

I'd really like to see them match the 20+ years of search quality fine-tuning that Google built into their search engine.

Not that Google is as good as it used to be, but still, catching up with them is way more complicated than just building a big crawl-and-index piece of infrastructure.

And all of that on a government-funded shoestring budget.

Mmmh.

Good luck to them, but I'm not holding my breath.


I'd like to see them match the 10 years of fine-tuning from 2012-era Google.

My understanding is Google used to keep a lot of data around for "long tail" queries, and stopped doing that at some point. This seems to be the issue behind search declining outside of "food near me" type queries (which they absolutely excel at).


I agree that we are sadly ending the chapter of free search, but I'm quite sure this is just another roll call for some professors to make a quick buck, like it always is.

Search needs help, it's true, but 8 million isn't even going to move the needle; the established engines have that much money invested just in maintaining their network switches.


Galileo, anyone? The EU is the trash bin of politicians who failed in their respective country. Not elected and saturated with piles of money. It is under control of the American Empire, thus designed to fail with a push of a button if necessary (€ is the key). Wake up!


> Unbiased

One of the tech leads' feeds [0] looks quite biased towards Ukraine, though. I hope it doesn't interfere with the search engine.

[0] https://twitter.com/mgrani


There is no bias. Russia is a terrorist state that has brutally and illegally murdered civilians in a country they chose to invade. Hopefully it will be their continued downfall.


All countries are terrorist states if we look at it this way.


Seems like a very interesting idea. So many times I've wanted some kind of advanced Google query language. (I know about allinurl and such, but that's not enough. Google is tuned for the average user, which is good for Google, but not for any non-average query.)


Is there a privacy policy in English? The only one I could find[1] is in German.

[1] https://openwebsearch.eu/privacy-policy/


It will be interesting to see what the index contains, and how it is structured.

What made Google such a game changer was that they based their index not just on the contents, but on how pages linked to each other.


That's the marketing story. I think it's because they didn't clutter their homepage like AltaVista did.


I was a heavy AltaVista user back in the day, and I can tell you Google results really were of higher quality; it wasn't only their homepage (even if that was also quite refreshing).


Nope. It's actually how it worked. They patented it and published papers about it.


openwebsearch without open code and open issues, really?

A tiny image with a few nodes and stakeholders, including third party indices and monetary services ("Third Party Enrichment Services"). In which decade are they living? https://openwebsearch.eu/the-project/


With a budget of 8.5M EUR/USD. Alphabet spends 200B per year; if 40% of that were spent on search, their budget would be ten thousand times larger.


It's definitely a comparative underdog regardless, but if you think Alphabet spends anywhere near 40% on search you're out of your mind. I'd be shocked if their spend is double-digits. I'd be unsurprised if it's <1%.


I doubt 40% is spent on search. Seeing how bad Google has gotten, it seems more likely there is just a skeleton crew keeping the lights on


The https://memex.marginalia.nu/ search engine runs basically on one desktop computer (if I recall, the creator made a post on this a week or so ago).

How much of what google spends on "search" is strictly for search, vs business goals related to search?

How much of that google spend is salary? What are those salaries? How do they compare with EU post-doc salaries?


I would be shocked if Alphabet spent >5% on search. But even 1% would dwarf this project.


It doesn't sound open and unbiased to me, because it will allow the government to censor on an index level.


We need to develop a social aspect to search where results are also moderated and curated by humans in some kind of way.


And when that curation produces results you find abhorrent? What then? Because I guarantee it would; a metaphysical certitude.


Do you hang out with abhorrent people?


Blekko (2007-2015) did that -- apparently it wasn't enough for commercial success.


I hope it will have proper regex capability for narrowing the search results to more relevant content.


Oh cool, but do you mean the "EU Open Web Search Data Collection Program"?


Will it censor the so called "hate speech"?


I have never seen “hate speech” that wasn’t hate speech.


Would it be hard to expand it into a global endeavour?


Governments would never agree on what to show or hide.


so it began, that sern starts to gather market share.

--

I doubt this will take off. I mean, they invested more in funding and marketing than in starting to build something. They should've started with code (AGPLv3, of course) and invited more and more people. At the moment this is more buzzword-bingo bullshit than anything else. It's basically always the same problem: instead of focusing on the product, they focus more on the message.


???

Where is the EU-owned and -operated 2nm foundry?


GAIA-X, is that you?


"unbiased...based on European values" - will it fly?


European values are inherently unbiased. What's the problem? o.O


There's no such thing as an unbiased yet useful search engine. The very act of choosing which article to put at the top is a value judgment, therefore a bias. The only way to make things "unbiased" would be to rank the matches randomly, and that would be useless.


This is just an index on which different search engines with different priorities can be built.


> The only way to make things "unbiased" would be to rank the matches randomly

Well, I would rather like to have visibility into the ranking algorithm, and to be able to control the resultset ordering.


So would the SEO people. But even disregarding that problem, to have a halfway decent search ranking you're going to need deep learning, and you'll get a model that is largely opaque.


Heh! You've mistaken me for a marketer, I think.

When I said I wanted control over the ranking, I meant as a user. Tell me what parameters I can rank on; give me a UI that makes it reasonably easy to express a ranking. That's it.

I hope UI design and stuff is orthogonal to the construction of an open index. I think this could be very interesting.
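
The kind of user control meant here could be as simple as exposing per-document ranking signals and letting the user weight them. A minimal sketch; the signal names are invented for illustration:

    # Hypothetical per-document signals an open index could expose.
    docs = [
        {"url": "https://a.example", "text_match": 0.9, "freshness": 0.2, "inlinks": 0.7},
        {"url": "https://b.example", "text_match": 0.6, "freshness": 0.9, "inlinks": 0.1},
    ]

    def rank(docs, weights):
        """Order results by a user-supplied weighting of ranking signals."""
        def score(d):
            return sum(w * d.get(signal, 0.0) for signal, w in weights.items())
        return sorted(docs, key=score, reverse=True)

    # "I care about fresh pages, not popularity":
    for d in rank(docs, {"text_match": 1.0, "freshness": 2.0, "inlinks": 0.0}):
        print(d["url"])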


Well, there are topics that are not legal, or not in line with EU values, to mention. Once you start removing content, it is no longer unbiased.


This is just a short reply to a blog which mentions that the project started...

The actual website of the project (with some concrete info) can be found here: https://openwebsearch.eu/


Changed now. Thanks!


the EU loves taxing productive companies and wasting said money in stillborn projects that nevertheless promise a kind of bright socialist federalist Europe in their bureaucratic minds


At least we have some of the most livable countries on Earth to show for it. I'll take taxes over any trickle-down economics, and don't let me stop you from looking up the definition of socialist, because you are using it wrong.

Besides, it's an 8.5 million EUR project; that's literally nothing, it's payroll for a few people. The money is being invested in people who then spend most of it, so it's a triple investment.


Who said anything about socialism? It’s a geopolitical tool to weaken American and Russian influence from Google and Yandex.


Isn't it lovely?!


On a side note - how does one get involved with the project should they wish to do so?


I am fine as long as they pay for these self-centered utopias with their own money


I'm fine with paying 2 cents for this.


I’ve already caught their crawler ignoring robots.txt directives on one of my sites, aggressively indexing explicitly excluded information.


That cannot be true, as the project has yet to start. But anyone can start a crawler, so you may have encountered other people's software. We wouldn't be so unknowledgeable as to ignore robots.txt ;-)


It was a crawler with the user agent "hgf AlphaXCrawl/0.1 (+https://www.fim.uni-passau.de/data-science/forschung/open-se...)", operated by the University of Passau and the Open Search Foundation, both named on your landing page. It would be a mighty big coincidence if this wasn't connected to this endeavor, especially when it confirms being an experimental crawler of said project at the UA URL.


Out of curiosity, what's the url for your website, and from what IP or host do their crawlers connect?


The main connecting IP was 195.113.175.41.


Wouldn't it be impossible to know if it ignored robots.txt?

Just because it crawled it doesn't mean it stored it.


Storage or not is entirely irrelevant to robots.txt directives. It guides automated access. It must be parsed first and excluded URLs must not be accessed at all.
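
For reference, honoring robots.txt before any fetch takes only a few lines with Python's standard library (example.com stands in for a real site; the user agent string is the one reported above):

    import urllib.robotparser

    AGENT = "hgf AlphaXCrawl/0.1"

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # fetch and parse the directives first...

    url = "https://example.com/private/page.html"
    if rp.can_fetch(AGENT, url):
        pass  # ...and only then may the crawler request the URL
    else:
        print("excluded by robots.txt; must not be accessed")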



