EU Open Web Search project kicked off (openwebsearch.eu)
404 points by ZacnyLos on Sept 20, 2022 | 310 comments



Interesting to see the number of negative comments.

Most of the negativity seems to come from the following points:

- An EU-funded project cannot succeed in tech because previous EU-funded projects have failed in tech in the past (and government-funded projects in tech are generally suspect)

- Search is hard, and therefore it will fail

- The project is underfunded

Even if the points above are all valid (though the obvious US-funded startup launched by a bunch of uni students seems to have fared pretty well), it seems we are missing the point of this project.

The proposal is to contribute to the creation of open building blocks necessary to enable others (including private US companies) to make better search products.

Better search products are needed.

Shall we remind HN of some of the intense conversations that happened here about Google failing us:

- Google Search is Dying [1]

- Every Google result now looks like an ad [2]

- Google no longer producing high quality search results in significant categories [3]

So while, yes, search is a hard topic, we should welcome initiatives aiming to improve the ground infrastructure needed to lower the barrier to entry, and hope they will allow many companies to build better search products (or inspire other initiatives to contribute in similar, even more successful ways).

1: https://news.ycombinator.com/item?id=30347719

2: https://news.ycombinator.com/item?id=22107823

3: https://news.ycombinator.com/item?id=29772136


One reason search is hard is that people are very motivated to game your system. That makes being transparent about how it works a fool's errand. It also isn't clear how the economics work: improving search relevance by x% is tremendously socially valuable, but probably makes Google's bottom line go up by a thousandth of x, and they have a very direct understanding of the connection. Without that money, how will this project get the resources to succeed, given that its adversaries are learning from fighting the people with a lot more?


This sentiment is common, but it is largely based on Google (and similar commercial actors). I'm not convinced it generalizes as much as it seems. It is uniquely profitable to manipulate Google because Google's view of the world informs not only its search engine but also its advertising business.

> improving search relevance by x% is tremendously socially valuable, but probably makes Google's bottom line go up by a thousandth of x, and they have a very direct understanding of the connection.

I'm not sure this is correct. In the absence of real competition, the most profitable state ought to be exactly what we have now: search results that are ambiguously bad, so you need to click and skim through a few results to maybe find what you want, multiplying the number of ad impressions.


It's nothing to do with Google advertising. It was a problem before Google had advertising. Being the number 1 non-ad result for a search will get you many more hits than being number 2. If you don't appear on the first page, you may as well not exist.

Google's priority with ads is to make sure ad buyers are happy with the results they are getting. This means making sure the people who click the ads go on to buy the product being advertised. That's what ad buyers measure.


Being result number 1 rather than result number 5 matters a lot to companies that are selling things to people who are searching. The value of the ads comes from that property, not the other way around.


Indeed, and that's why search relevance is important.

We can talk about the Importance Of This until we die of old age. Google failed us, Google is Borg, Yandex is FSB, Bing is ... Bing, etc. However, the fact that there is a problem to be solved and It Is Important doesn't mean that the EU will solve it. If anything, it's just the set-up for another political play that will have damaging consequences to the internet as a whole, just like GDPR, EU's poster-child "internet project".

They made GDPR just strong enough legislatively to be annoying, but not strong enough to actually change anything. Companies can still store EU citizens' data anywhere they want and do whatever they want. There's no insight into this. It is an unenforceable law and the only artifact of it existing is that I have to have "I don't care about cookies" installed, so that Avast Antivirus can eventually decide to silently collect and sell on my data.

For all intents and purposes, OpenWebSearch is most likely not meant to succeed at anything either, and is just going to be a political stepping stone towards legislation that will be awkward and will make the internet worse for all users. The EU has a long history of creating, or hanging onto, laws that betray a misunderstanding of how the digital age works, how the internet works, and how data can be copied or moved around for free.

Here's an example. Every country in Europe has some sort of legal construct that will prevent you from secretly recording a conversation you're having and putting it on the internet. Take Austria: it accomplishes this by preventing you from publishing the recording on the internet. It does not, however, prevent you from making the recording. There are laws against secretly recording a conversation you are not a part of, but no such law for when you are part of the conversation. So you can record and upload, just not publish.

However, you can get on a train to a different country - one whose laws prevent recording but not publishing - and publish from there. Then you're in the clear. Or you can just use a VPN so that it looks like you uploaded and published from that other country. Or you can upload to YouTube, which will not show where the thing was published from, and claim you did so during your tourist visit to Vietnam or wherever. And if someone brings a civil lawsuit? Good luck trawling the Vietnamese legal system for clues about that specific issue, and good luck finding one of the virtually nonexistent legal experts who speak both Vietnamese and your particular local European language. The costs would be on the order of tens of thousands at least, which is out of reach for anyone but the wealthiest EU citizens.

Want more? Everyone knows about the egregious copyright-related laws.

So are the DNS blocks of shunned sites. Or the recent Austrian move to block Cloudflare IPs, which literally broke the internet.

The fact that such laws are still in place from before the internet - or are even still being put out - and are effectively unenforceable while making the internet worse for everyone makes it clear that the governments in place simply misunderstand how the internet works.

None of that will stop politicians from coming up with BS excuses for breaking the world with fever-dream laws in order to push their latest political agenda. The real question is: if such clearly unfit legislation is being put in place for political folly in industries we understand, how much of that is happening in industries we don't understand? Health care, food, agriculture, civil engineering, patent law, and so on. My guess? You can tell what my guess is.

As for Open Web Search, what everyone should really be asking themselves is: ok, so what's the scam that's going to be pulled here?


> As for Open Web Search, what everyone should really be asking themselves is: ok, so what's the scam that's going to be pulled here?

As I'm one of the people working within the Open Web Search project, please allow me to feel strongly about your statement. I've been involved in campaigning for this project since around 2014. This project did not originate as a 4d-chess move in some political game. It exists because of the hard work of a group of people, some of whom are researchers, some of whom work at smaller search engines, and some of whom are involved in civil rights organisations. Currently still sitting in the kick-off meeting, I can tell you that we are actively discussing how to get this project to produce useful results for a European open web index. Getting the EU to draft problematic legislation is neither in our power nor in our interest.


"One reason democracy is hard is that people are very motivated to game your system. That makes being transparent about how it works a fool's errand."

That's a major fallacy you're opening with. If something is important to us (as a society), we will find ways to make it serve us well even when it's under attack. The rest of your post even makes a similar point: the social value of well-working search is greater than the economic value for even the biggest search monopolist on the planet. So why not socialize it?


Well, on one side we have zero examples of open engines actually working well; on the other we have a long, nuanced history of SEO and search engines fighting an unending battle, with the SEO side trying to circumvent every measure the search engines take to stop them from poisoning the results - and the engines barely edging out a win.

Being idealistic about it won't change the outcome.


Do we have any example of an open engine working badly?

As far as I know, this hasn't really been tried, definitely not with the kinds of resources this project will have available.


Do we have any examples of search engines not run by ad companies having problems fighting SEO?


AltaVista started as a demo of the capabilities of DEC Alpha CPUs, and it was initially amazing. It got taken over by SEO rubbish, and beaten by Google, which invented PageRank to get good results again.

PageRank was initially much less prone to SEO shenanigans because it relied on signals from other sites (incoming links) to decide how important a result was. Of course, as Google became more popular, people started sharing links on other pages and so on to cheat the PageRank algorithm. And Google have been caught in a fight with SEO ever since.
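
To make the mechanism concrete, here's a toy power-iteration sketch in Python (a made-up three-page graph; real PageRank adds handling for dangling pages, personalization, and web scale):

    # Toy PageRank by power iteration: a page's score is a damped sum
    # of the scores of the pages linking to it, each divided by the
    # linking page's out-degree.
    def pagerank(links, damping=0.85, iters=50):
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iters):
            new = {p: (1.0 - damping) / len(pages) for p in pages}
            for p, outs in links.items():
                for q in outs:
                    new[q] += damping * rank[p] / len(outs)
            rank = new
        return rank

    # "b" wins: both other pages link to it.
    print(pagerank({"a": ["b"], "b": ["c"], "c": ["a", "b"]}))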


Democracy has some intentional opacity though. Consider the impact of making your personal voting choice public.

AFAIK, nobody has ever tried that because of the dangers of vote buying and coercion. Essentially, gaming the system.

(I don't think this proves anything! Simply wanted to suggest that comparing search to democracy doesn't significantly change the analysis wrt opacity.)


I like the idea of opaque citizens and open government/search company though.


Exactly, I think the failure of such a project is embedded in the incentives. OpenStreetMap is good but you’re not going to convince Google to make it the default on Android.

The correct strategy isn’t to hand this over to a giant bureaucracy, but to create an atmosphere where we can have a dozen alternatives to Google.


This does seem to be the goal of the project. They mention that they want to make a centralized control panel and crawler, where you can control how and when your website is crawled.

But at the end of the day the resulting dataset would be sold to third parties so they can rank the results appropriately. Which to me seems to be the only sane way forward. Only a government could run something on the scale of the Google crawler and succeed at doing so. And then everyone can build search engines on top of that.


> And then everyone can build search engines on top of that.

That bit troubles me. If the index is maintained by a government agency, and every search engine is using the same index, then that's a massive censorship avenue. I wonder how "open" Open Web Index is going to be.


If the index is censored, nothing stops you from adding to the index yourself.

The vast majority of the net is, after all, now indexed, so you can run your own indexer to cover whatever they didn't cover.


The only way you can beat Google is to create a search engine in a small niche where they can’t compete, then eventually over the years expand into general search once they can’t catch up (the innovator’s dilemma, basically).

The only search product that comes close to this is Amazon search or booking.com (or maybe YouTube).

All multi-billion-dollar businesses. And when smaller ones emerge, like the flight-scanner one, Google buys them.

I don’t see Google going anywhere soon. They have a lot of faults but they’re still too good.


In Austria there is "geizhals.at" [1], which is a price comparison website for electronics, and they have the best parametric search engine I have ever seen. It's 1000x better than Amazon search.

However, there's no way Geizhals will ever expand into general search. The reason why they are so much better than the competition is that they are focussing on a small, profitable niche and presumably use manual data entry to ensure they have the best data.

[1]: there's a UK version too "skinflint.co.uk"


Just opened skinflint.co.uk to have a look. I was greeted by "Do Not Track detected only storing necessary information". That's wonderful!

I don't think I've seen a website that does that before and I think it's worth praising.


Geizhals is my go-to for anything computer related. Somewhat amusing that such a good site comes out of this little country.


This is HN and we're not even pretending to be optimistic about beating incumbents, but it goes much deeper than that - there is also a profound sense of defeatism directed towards European entrepreneurs as well as the world at large. It is an indictment of the current nihilistic zeitgeist.


Well, it is a forum run by a Silicon Valley VC firm … so of course there will be a bit of “only SV can do anything interesting in tech” gatekeeping going on.


Idk, even people here in SV have given up.


> The only search product that comes close to this is Amazon search or booking.com (or maybe YouTube).

What is so great about those search examples?


Absolutely nothing, Amazon search is horrendous. Search by price is completely broken (and has been for years). Filter by exact phrase is non-existent. No way to search products by size, and on and on.


Exactly, neither Google nor Apple will open the gates to their fenced garden. It will always be more convenient for users to use their solution.


Speaking of Google and ads: I witnessed a new low today.

My 3-going-on-4-year-old was watching a cartoon on YouTube: Curious George.

An ad popped up promoting some show, featuring foul language and sexual intercourse.

Yagoddabekidding.


That's why I don't let my kids watch YouTube on their own... That said, as a European I'm more triggered by all the gore and violence than by some naked skin (although I also wouldn't want sub-12-year-olds encountering explicitly sexual stuff). The latter seems to just lead to questions and adult conversations, but the normalised violence the kids seem to act out.


I second the violence thing. It has somehow become the new normal in the teenager world, which is very disturbing to me. In my family we strictly use only Firefox with uBlock for YouTube (and the rest of the internet too). The difference is night and day. I always recommend it to friends of mine and every time I get positive feedback after this.


Once or twice I didn't use Firefox with uBlock Origin on YouTube, and since I am pretty allergic to any ads, as they insult my intelligence, I just closed the whole page before getting to the content.

How otherwise smart people can accept being treated like brainless idiots by such services is beyond me. And how they can teach their kids that it's fine is something I'll never understand... but hey, it's your kids; everybody sees the quality of their parenting first hand (and then complains about how kids are unruly and eat junk these days... gee, I wonder where they took their inspiration and who built their character).


YouTube Kids doesn't appear to have any ads, so it might be worth checking out.


I have YouTube Premium. Every time my kids get exposed to even age-appropriate ads, it generates attraction to crap. Even in magazines for kids. So I just keep ads out of their system and they seem happier and more internally satisfied.


Advertisements are essentially artificial reminders of how unsatisfying and difficult life is without $PRODUCT.


Pretty much a form of psychological abuse and manipulation.


I've recently been searching for some how-to videos and the number of ads is insane - but somehow they don't show when I watch via the video search results in DuckDuckGo.


Same here; not using YouTube's built-in search function eliminates most of the ad and recommendation problems. I use You.com since they additionally show Reddit and TikTok results, which I sometimes find more useful than YT videos.


I'm pretty sure YouTube doesn't play ads in embeds (as much?)


people are primarily motivated by envy even if they don't realize it...


One more reason to use an ad blocker for YouTube. I even use one on my (Android) "smart" TV. The open source app I use is SmartYouTubeTV. Occasionally it gets broken, but then a fix is usually an update away. It skips not just ads, but intros, self-promotion, "please subscribe to this channel" etc. (assuming creators tag their content properly). Basically it is YouTube Premium for free.

An interesting thing I noticed is that initially, when I used the app (with the same YouTube profile), I would get worse video suggestions. These days they are the same in the YouTube app as in SmartYouTubeTV.

Another interesting observation is that if the app breaks and I start watching YouTube in their app, the frequency of ads is quite reasonable in the beginning. A few times I even thought: hey, if they show me an ad every couple of videos, that's not that bad, right? I might switch back to YouTube. Then, as the amount of content watched increases, the number of ads increases to the point of making it unwatchable (4 ad breaks in a 20-minute video, sometimes even double 40-second ads one can't skip). This makes me go back.


I'm curious here and would love to hear thoughts on:

A) YouTube's terms of service clearly state "not for kids under 13"[1]

B) YouTube has produced a product for the younger age market [2]

C) Folks in this thread are reasonably complaining that full YouTube isn't appropriate for children.

Am I missing something here? (Like maybe your child was using YouTube Kids and still got the unacceptable content?)

[1] https://blog.youtube/news-and-events/children-youtube/#:~:te.... "While we permit users between the ages of 13 and 17 to register for an account with parental permission, we do not allow children under the age of 13 to create an account"

[2] https://www.youtubekids.com/


I'll put in a recommendation for the YouTube Kids app, where this doesn't happen.


Unfortunately, voice search is also something that doesn't happen on YouTube Kids for Android TV. That greatly limits its usability, which is unfortunate, since there are many good things about it.

I'm using it in Japanese. Using the on-screen keyboard in Japanese mode, you can enter hiragana only. It doesn't do any word recognition to turn the input into kanji or katakana, which results in poor search results.

Nobody uses onscreen keyboards on Android TV for anything beyond entering a Wi-Fi password; it's a nonstarter.

YouTube Kids for Android TV ... ironically, just a toy for now ... with a well-deserved, accurate 1.4 star rating in the Play Store.


See? Now you know about that show and you will watch it. Ad was successfully delivered. Engagement goes up.


What did you expect? Google is an ad company, first and foremost. Everything else is just a funnel.


> What did you expect?

Something like, perhaps, the same standards and sense as in traditional broadcasting.

> Everything else is just a funnel

A funnel promoting pick up trucks and financial services to toddlers? Do they count that as an "impression" in the statistics that they feed the client? It seems like borderline fraud.


You can pay for YouTube premium to get rid of ads, maybe just do that? Or don't let your 3 year old kid watch YouTube in the first place, and show them something from traditional broadcasting. Before complaining about broadcasting standards, maybe first up your parenting standards.


> You can pay for YouTube premium to get rid of ads, maybe just do that?

"My /own/ children don't see inappropriate ads; therefore there isn't any problem."

> Before complaining about broadcasting standards, maybe first up your parenting standards.

"Before whining for ice cream, maybe first eat your dinner!"


Does YouTube Premium offer higher standards? The post you're replying to isn't asking for fewer ads.


> An ad popped up promoting some show, featuring foul language and sexual intercourse.

YouTube Premium has no ads.


If only there was a way to get rid of ads...


What if I don't want to get rid of ads?


That's not right. Use uBlock Origin.


PyPy was funded by the EU. I would be very happy if this project is as successful as PyPy.


The EU providing tiny amounts of seed funding to an existing project to bootstrap a small proof of concept is entirely different from the EU trying to essentially create a tech giant.


To be perfectly clear, OWS is an example of the former. No one is trying to build an EU search giant.


a tech giant which is also a public utility a level or two lower than running water...


"Funded by" is very distinct from what's happening here.

This is another Gaia-X. Remember the EU's big-tech cloud killer?


This is not a Gaia-X, it is an exploratory project, showing a possible way forward and setting first steps.


> An EU-funded project cannot succeed in tech because previous EU-funded projects have failed in tech in the past

I think the fundamental problem here is that the people who are interested in grants and are capable of writing grant proposals are different from the people who are interested in building things. There's very little overlap. So the money goes to the people capable of writing proposals, and the people doing the work do it for free in an obscure corner of the internet.

It's sad really, but I suspect it's a side effect of the huge bureaucratic machine that is the EU. One way to make this better would be to simplify access to grants so that technical people can apply without needing a class in "EU funding speak".


If these thinkers can find a way to remove the incentives of spammers, misinformation peddlers, and other stakeholders to pollute results, that would be a great achievement. It could be seen as a big economic game; simulating these actors might allow coming up with rules to balance the game and minimize pollution.


It might be possible using moderation and abuse reporting, or registering with the search engine to aid moderation, and banning abusers. This will probably also take down right-wing websites, porn and online gambling.


> An EU-funded project cannot succeed in tech

Not sure about software, but all (well, Apple is getting there) phones now use the same connector to charge because of the EU.


We have Estonia, which is like a laboratory for govtech, complete with human test subjects. They are making the automation-focused law sausage.


Wasn't China's unified standard from about 2007 much more influential?


Which standard?


Some standard about unified chargers.


The new iPhone 14 still has a Lightning connector.


2024.


Any new open-source search option is good, but I also wish more attention was given to prior open projects like GigaBlast[0]/KBlast[1] crawlers, etc.

It hasn't escaped the wider world that quality open-source search is desirable, and it's hard to think what this new EU project brings to the table that isn't already available if others want to contribute to existing efforts. I wish the EU project the best of luck of course!

[0] https://github.com/gigablast/open-source-search-engine

[1] https://github.com/fossabot/kblast


> An EU-funded project cannot succeed in tech because previous EU-funded projects have failed in tech in the past (and government-funded projects in tech are generally suspect)

I'm not suspicious of all government-funded projects (back in the day my own PhD was government-funded!) but I can't help being suspicious of claims such as:

"an open European infrastructure for internet search, based on European values and jurisdiction"

and

"The project will be contributing to Europe’s digital sovereignty"

Q1: Who defines "European values"? Is that done by Qualified Majority voting or would - for instance - Hungary have a veto on any proposed definition?

Q2: Which treaties regulate "digital sovereignty"? Recalling that the 27 member states each "remain sovereign and independent"[0], is that digital sovereignty being handled in BRU, in the 27 states, or a mixture of both?

[0] https://op.europa.eu/webpub/com/eu-what-it-is/en/


>"an open European infrastructure for internet search, based on European values and jurisdiction"

jurisdiction = GDPR will not be cheated.

European = We mistrust America, because they do crappy stuff that meant we had to pass GDPR.

>Europe’s digital sovereignty

Means - Europe will not be ruled by interests outside Europe, the interests inside Europe can fight it out via EU procedures.


> jurisdiction = GDPR will not be cheated

Are there (m)any global search engines with European users who are "cheating" GDPR? Which ones?

> Europe will not be ruled by interests outside Europe

That's a very bold claim, especially considering the geopolitical situation right now.

> the interests inside Europe can fight it out via EU procedures

Would that be the interests of the estimated 25k-30k lobbyists who work in Brussels on behalf of their corporate paymasters?


Are you reading my post as a statement that these will be the results, instead of my providing an interpretation of what the original EU text means?

>Would that be the interests of the estimated 25k-30k lobbyists who work in Brussels on behalf of their corporate paymasters?

probably, as well as the various governments that exist in the EU.


>European = We mistrust America, because they do crappy stuff that meant we had to pass GDPR.

But we still mistrust our citizens; that's why we introduce Chat Control 2.0, so we have "better than China" control over private communication.

European values ;)


Remember that the EU is not just one person. There are many people, parties and organisations who are actively fighting against legislation like this.


You mean begging, not fighting, right? But why should I trust Europe more than, for example, Google (at least they can protect their "trade secrets"/data)?

That's why people trust evil companies more than stupid[1] governments.

[1] never attribute to malice that which is adequately explained by stupidity.


> An EU-funded project cannot succeed in tech because previous EU-funded projects have failed in tech in the past (and government-funded projects in tech are generally suspect)

Do we distinguish direct EU funding from funding by the governments of EU countries? If not, ASML would like a word, and I'm sure people from other member countries can come up with other examples.


and government-funded projects in tech are generally suspect

Like the Internet, or the WWW?


or computers themselves


You forgot people not trusting the government to curate information.


Wasn’t Google financed by DARPA? It was always planned big. The student story is bogus.


I have written this before but I’ll put it here again. What I would like to see is a federated search engine, based on ActivityPub, that works like Mastodon. Don’t like the results from one source? Just remove it from your sources, or lower its ranking. Similar to YaCy, but you can work with the protocol to connect or build whatever type of index you want, using whatever technology you like, and communicate over an existing standard. Want to build the world’s best index of Pokémon sites? Then go do it. Want to build a search engine using Idris or ATS? Sure! I did note the professors are on Mastodon, so perhaps this may actually happen.

One of these days I’ll actually implement the above, assuming nobody else does. I figured if I can at least get the basics done and a reference implementation that’s easy to run, it could prove the concept. If anyone is interested in this, do email me at the address in my bio.

What I worry about for this project is that it becomes another island which prohibits remixing of results, like Google and Bing, and that its own index and ranking algorithms become gamed.

I wish the creators the best of luck though. I am also hoping for some more blogs and papers about the internals of the engine. So little information is published in this space that anything is welcome, especially if it’s deeply technical.


> Don’t like the results from one source? Just remove them from your sources, or lower their ranking.

That's basically Usenet killfiles and, yes, I think they're totally due for a comeback in one form or another. Usenet may have had its issues towards the end (although it still exists), but killfiles weren't one of its problems. With the simplest ones you could just discard sources you didn't want to read anymore, but the more advanced ones could assign weights/rankings based on various factors (keywords / usernames / whether or not you participated in a discussion / etc.).
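
To make that concrete, a minimal killfile-style scorer could look something like this (the rule weights, author names and keywords are invented for illustration):

    # Toy killfile-style scoring: weight posts by author and keyword,
    # drop anything that scores below zero.
    RULES = {
        "author": {"known_spammer": -100, "favourite_poster": 10},
        "keyword": {"pokemon": 5, "crypto": -20},
    }

    def score(post):
        s = RULES["author"].get(post["author"], 0)
        s += sum(w for kw, w in RULES["keyword"].items()
                 if kw in post["text"].lower())
        return s

    posts = [
        {"author": "favourite_poster", "text": "My Pokemon site"},
        {"author": "known_spammer", "text": "Buy crypto now"},
    ]
    for p in posts:
        if score(p) >= 0:
            print(p["author"], score(p))  # only favourite_poster survives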


We like federated search, we like decentralized search, and even P2P search; we are trying to find a good mix, and decided to get started rather than wait! Exciting times.


What are the benefits of this?

I'm not trying to be dismissive; it's just that my feeling from working on search.marginalia.nu is that nearly every aspect of search benefits from locality. Not only is the full crawl-set instrumental in determining both domain rankings and term-level relevance signals such as anchor-tag keywords, but the way an inverted index is typically set up is extremely disk-cache friendly: the access pattern for checking the first document warms up the cache for the other queries. That discount obviously only exists when it's the same cache.


You could get people creating indexes with love, such as your own. Marginalia could become the de-facto index for long-form content. However, you probably aren't that interested in running the best Pokémon search, so someone else could do that.

If enough people add domain-specific search endpoints, with perhaps a taxonomy to say "hey, send those sorts of queries over here", you have a compelling engine that self-heals should someone stop running things, or start spamming.


Yes, that is an advantage.

You can also integrate search results for which you cannot have the index, like social media APIs - another reason.

You could also mix and match search results from various topic-oriented indices. That's a research question, whether that is really better than building one unified one. But we think it is the way to bring index fragments to the edge, with the obvious privacy advantages.


I would love to be able to run a node that mirrors part or all of an index like this, and to let people query it - a bit like https://torrents-csv.ml/#/

Good luck! I'll be watching your progress and cheering you all on!


What's the point of a federated search engine? At the end of the day most nodes will end up implementing the same regulations/censorship, with development driven primarily by a few. It's like Ethereum vs Ethereum Classic all over again. If the EU or the developers' respective governments demand that a censorship or forgetting feature be implemented, it's not like the federated nature would matter. An open source search index is useful; a search engine that can be easily self-hosted is also useful. But building a search engine as a federated system is a gimmick with no significant value.

Do you see any major Mastodon nodes interfacing with Truth Social or Gab? I certainly don't. If federation barely works for a social media app, I fail to see how it would even matter for a search engine.


At least one of the partners (https://openwebsearch.eu/partners/radboud-university/) does research on "federated search systems", so there's hope!


Isn't searx what you're describing? I was running an instance for a while, and it's basically a meta search engine that has support for all kinds of providers.

There are also some web extensions available so that you can fill it with more data.

[1] https://searx.github.io/searx/


I'd say it rather looks like Seeks, unfortunately defunct: https://en.wikipedia.org/wiki/Seeks

> a decentralized p2p websearch and collaborative tool.

> It relies on a distributed collaborative filter[6] to let users personalize and share their preferred results on a search.


Searx is half of it: it calls out to other searches but does not provide its own index, as far as I can see. It also does not remix the results.


If it is about a decentralized index, there's also YaCy [1], but I don't know how actively maintained the project is.

For me it seemed more suited to an enterprise use case (e.g. building a search for your own file servers or Confluence), so I only tested it out a little. It's a huge Java project; that's why I decided to go with searx back then... YaCy was pretty hard to set up.

[1] https://github.com/yacy


One of the things I wonder here is if it would be easier to just start by crawling known RSS feeds, exposing a JSON API for the data, and making the whole thing open source. Then keep a public list of indexes and who crawls what, and eventually move on to crawling other sources, but first address the majority of useful content that's easily parseable.
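
A sketch of the crawling half of that idea, using only the Python standard library (the feed URL is a placeholder; a real crawler would also need politeness delays, retries and deduplication):

    # Fetch one RSS feed and emit its items as JSON.
    import json
    import urllib.request
    import xml.etree.ElementTree as ET

    FEED_URL = "https://example.com/feed.xml"  # placeholder feed

    with urllib.request.urlopen(FEED_URL) as resp:
        root = ET.fromstring(resp.read())

    items = [{"title": item.findtext("title"),
              "link": item.findtext("link"),
              "published": item.findtext("pubDate")}
             for item in root.iter("item")]
    print(json.dumps(items, indent=2))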


That's probably the easiest way I know to get good content into a search engine. Annoyingly, however, it does not cover all the content available.


What benefit does federation bring here? Unless it is very simple to set up, most communities are non-technical and probably won't be able to set up their own crawler. I would think just a search engine that lets you customize the ranking algorithm, and maybe hook into whatever ontology they've developed and ranking it accordingly would be sufficient.


> most communities are non-technical and probably won't be able to set up their own crawler

They can use a solution which already integrates the search. Forums and CMSes are a good target for that. Then you can say "I'd like my search to look at widgetlovers.com too" - and you get their sitemap + featured external links, because they run FooPress, which supports it.

Kind of the same as the sitemaps we already produce for Google.


It can be very simple to set up. Think a single binary to run, or a lambda to deploy (yes, this is possible) with the URL pointing back to it.

I imagine a binary, with a simple Admin UI allowing you to crawl some domains recursively would be enough to index your own website, and then have those results shared.

Where I could see this being really useful is letting someone who knows everything about Pokémon provide the index for searching Pokémon information. Then, when they federate, they provide a taxonomy saying "for queries that have these words, call me". Suddenly you have a very high-value search source for Pokémon (see the sketch below).

Throw in some zero-click info boxes and you have added a lot of value.
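
A rough sketch of that routing step (the endpoints and taxonomy terms below are made up for illustration):

    # Route a query to a specialist index if its declared taxonomy
    # matches, else fall back to a general index.
    FEDERATED = [
        {"endpoint": "https://poke.example/search",
         "terms": {"pokemon", "pikachu", "charizard"}},
        {"endpoint": "https://longform.example/search",
         "terms": {"essay", "longread"}},
    ]
    GENERAL = "https://general.example/search"

    def route(query):
        words = set(query.lower().split())
        for idx in FEDERATED:
            if words & idx["terms"]:
                return idx["endpoint"]
        return GENERAL

    print(route("best pikachu moveset"))  # -> https://poke.example/search
    print(route("weather in vienna"))     # -> https://general.example/search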


ActivityPub is not well suited for this application. It's for publishing activities made by actors — hence the name. You'll want to invent your own federation protocol specifically for federated search.


Last time I checked there was something in there for search...

Even so, you could base it on ActivityPub, I suspect. It would need to be extended, for sure, to implement the sorts of things I believe would be required.


They've listed "DECENTRALISED SEARCH" as an ongoing project/goal.


Seriously... I really wanted to like this project, but it seems everything the EU touches these days gets worse.

From the webpage that half of the time shows "Resource Limit exceeded" to the technology stack diagram at the bottom of this page, https://openwebsearch.eu/the-project/, being completely unreadable due to bad scaling.

It is very disappointing, really. Another example off the top of my head: here in Poland we have ID cards (as does every other EU country). Those ID cards have to be renewed every now and then (10-15 years). In recent years an online system for government services was implemented, including renewal of those cards. One could take a photo with a mobile phone, submit an application, and pick up the card from a gov office a few weeks later. Unfortunately, the EU made a law that ID card applications have to be accompanied by biometrics (fingerprints), so this system has been thrown away. One has to physically go to the gov office, scan their fingerprints, apply for a new ID card and then go again to pick it up...

Ok, so what happens in 10 years' time? They should have the fingerprints already, right? No. They take the fingerprints, store them only until one picks up the ID card, and then delete them. There is no fingerprint database; they are not stored anywhere. The fingerprints are used only to ensure that the same person who submitted the application picks up the document. It makes zero sense, other than to break the previous online system. Thanks, EU.


And why the hell would the government keep the fingerprints? So that once in 10 years I save an hour? The benefit is minor compared to the bad things that can be done with a fingerprint DB (mass surveillance...).


I'm not complaining that they don't store the fingerprints. I'm complaining that they broke a perfectly good system while pretending that taking fingerprints improves security, when it does nothing. The fact that they collect the fingerprints only to delete them when you collect the document (or 6 weeks past your collection date if you don't show up) demonstrates how bullshit this "rule" is.

Our government (Polish - and those of other EU states, as I understand it) is not allowed to store everyone's fingerprints unless they are a criminal. I, as well as other people here, have pretty strong feelings against that.


> And why the hell the government will keep the fingerprints?

The government already does... just saying.


They just explained that it doesn't (assuming we trust them, of course).


> they are not stored anywhere

They are. They are stored on the card itself, and they are NOT stored in government databases. This is a good thing...


Yes, you're right, they are stored on the card itself and nowhere else, so in theory that allows authenticating the person against the document via a fingerprint. I didn't know that. Still, I think having them there doesn't provide sufficient extra value to offset having to go to the gov office twice.

Also, using fingerprints to routinely authenticate people presents a whole new lot of problems. No one voted for a party that proposed such an idea. The EU simply decided to mandate this, and it has to be implemented, no questions asked. What about people who have trouble using the fingerprint scanners governments use? I used to play bass guitar when I was a teenager. I can't wait to find out how well (or badly) this tech will work with the thick skin on my fingers.


I mean, do people really want their government to store their biometrics? What could ever go wrong with that?


and they are NOT stored in government databases

...that we know of.

Yet.


Intentional or not, at least the EU prevented an insane system where you can apply for an ID card without actually being physically present.


If you're an adult you already have an ID card... I'm talking about renewing one, not applying for the first time. One applies for one in person when one turns 18 (with a birth certificate, a passport if a citizen of another country, or a piece of paper from the border guards if a refugee, etc.).

Also, one needs an ID card plus 2FA authentication (usually connected to a bank account, a physical smart card, or a mobile phone) to log in to the government services portal in the first place. This portal is seen as a huge accomplishment (in comparison to the inconvenience of having to do every little thing in person, queue for hours, etc.). It is not just taxes; it is the health service, local councils, national (and health) insurance, building permits - basically almost everything one can do in person can be done via this portal, except renewing an ID card, since the stupid EU rule came into force...


You still have to apparently go in person to pick it up, how is that an insane system? Now you have to go twice.

edit: I mean you can renew a US passport purely by mail without ever interacting with anyone and I haven't heard of massive issues caused by that.


so your complaints are: the webpage announcing this news isn't perfect, and the (well-known for being incompetent, corrupt and anti-EU) Polish government has implemented an EU policy poorly?


haha, very funny. I think you are 100% correct if we invert every statement you made. Out of curiosity, how would you implement this "policy" so that it wouldn't be "implemented poorly"?

As for being corrupt and anti-EU, I have to say that in the last decade at least, no group of politicians has been more incompetent, corrupt and anti-EU than the EU Commission itself. From the botched/corrupt "green new deal" that resulted in complete dependency on Russian hydrocarbons and another war in Europe, through the "pay Turkey for the problem to go away" Mediterranean refugee "solution", to complete ineptitude in the first 6 months of the pandemic and basically leaving Italy on its own, culminating in the illegal withholding of funds from member states that elected parties opposed to the current option in Brussels.

However, what truly destroys the EU is not even the above, but the lack of respect for the rule of law among the top officials. They have their goals and, no matter what, they will do anything to reach them. For example, they want more integration and a federal state. They proposed it fairly, some years ago, as an EU constitution, and it was demolished in referendums. Instead of giving up, hearing the democratic choice and going in the direction the sovereign (the people) told them to, they proceeded to implement it another way, over people's heads (it was supposed to be implemented in the Treaty of Lisbon). However, those treaties have to be unanimous, and some countries didn't want to be essentially ruled by the biggest countries, so it got watered down back then. Then they realised it is impossible to implement this goal in accordance with the rule of law, so what they are trying to do now is twofold. First, throw out unanimous voting in favor of majority voting, so that smaller countries' objections can be disregarded. Second, bully countries that disagree by illegally withholding funds.

We're very near the end of the EU, and it does make me sad, because I still believe in the ideas that led to it in the first place: a free market for goods, travel and work, and shared values and work towards common goals between member states - not bullying each other or selling another member state's security for financial gain.

At least in my generation, 10 years ago I would guess 95% of people considered themselves very pro-EU; now, unless the current political class GTFO promptly, I don't see the EU being a thing in the next 10 years.


Did you notice that no one chimed in to say "yeah, we have this same problem in [my EU country] too!"? The reason for that is that - as usual when people are criticising the EU - it is in fact your government's fault, and your politicians/news media/nationalist friends are using the EU as a scapegoat for their own (nation's) incompetence/corruption.


I'm a bit skeptical EU-funding a bunch of professors is the way a search engine will be built.

The primary goal for academics is to publish new findings, while what you need to build a search engine is rock solid CS and information retrieval basics. Academically, it's not very exciting. Most of it was hashed out in the 1980s or earlier.


>I'm a bit skeptical EU-funding a bunch of professors is the way a search engine will be built.

Heh, so, funny story...

>A second grant—the DARPA-NSF grant most closely associated with Google’s origin—was part of a coordinated effort to build a massive digital library using the internet as its backbone. Both grants funded research by two graduate students who were making rapid advances in web-page ranking, as well as tracking (and making sense of) user queries: future Google cofounders Sergey Brin and Larry Page.

>The research by Brin and Page under these grants became the heart of Google: people using search functions to find precisely what they wanted inside a very large data set.

https://qz.com/1145669/googles-true-origin-partly-lies-in-ci...


Splitting when the project looks like it's gonna make money is the American way. (Thanks for the public funds. khaaa chingg)


They did a lot more good by making a company rather than sticking around in academia and publishing a few extra papers.


I see this a lot in a project I'm involved with, where the majority of contributors are from academia. The incentive is mostly to push novel things, often with questionable practicability, in order to write a paper about them.

But nobody will implement the 'boring' features needed to make the thing generally useful.


username checks out


not that I support this approach, but the return on public investment there is clearly huge


The best remote+paying job that I've ever seen online was from a DARPA project (the Memex project, about search engines): $180K-$250K+. This was ~5 years ago.

Curious what the salaries will be on this one.


Marginalia, I know you are working on a (fantastic!) search engine of your own https://search.marginalia.nu/

I salute your efforts and endorse your search engine. I also recognize that you know what it takes to build a search engine.

I don't think you have deep familiarity with EU academia.

- The primary goal for academia is to influence society. Publishing is a route to that.

- Being head of the EU search engine project would confer high academic status.

- There are hundreds of articles which you could publish on this project.

- Rock-solid CS: would someone like Knuth count as "rock-solid"? Who is better at CS, the person who can implement quicksort because they practiced leetcode, or the person who invented quicksort?

- Information retrieval basics: again, these basics were probably developed in academia.

The skills you say are basic to this endeavour are more prevalent in top-quality professors and post-docs than they are in industry.


On top of those skills, what you actually need most of all is software engineering and architecture experience. I don't think this is common in academia at all - not in professors, not in PhD students. You need practical experience building complex software at a large scale, and across that, you need to implement these CS fundamentals.

This requires far more CS than you'll find in your usual software development effort, for sure, and many CS professors absolutely fit that bill. However, to the same degree it also demands far more on the software engineering side. People out of academia, from every time I've seen them build software, have not been all too impressive on that side of things.

Web search has an incredible demand for being well rounded, beyond anything else I've encountered. CS isn't the hard part bottle-necking everything else, it's just one of the many hard parts.


One could have been skeptical about the US funding a bunch of university students to build a search engine a few years ago.


Search was quite broken back then. It got reasonably good at some point, now I’d argue it’s the content as a whole that’s gotten worse.

I’m just skeptical if the EU bureaucrats will put the money in the right place, and if this is even the right approach.


The project can fail, you are correct, but it does not take anything away from any other projects; it is just another initiative trying to contribute in the space.

The parent commenter's own search project, Marginalia Search [1], could even benefit from it, or maybe even collaborate with it.

It is not a winner-take-all situation, and we need various open initiatives in this space to get out of the current conundrum we are in with Google's stranglehold on search.

1: https://search.marginalia.nu


Taxes take away from funds for personal projects, and universities already get a lot of funding in Europe to do research; maybe they should focus on creating a better environment for students to become researchers or entrepreneurs instead.

Overall I think there are better ways to improve search from an EU perspective by doing what they are supposed to be doing:

- create a fair environment for companies to compete in, e.g., take a look at how Google, Apple or Meta's assets are set up to make it harder for competitors, and break that up

- improve standards in education - it doesn't really make sense for all member countries to each think up and maintain a good CS curriculum, and they all seem to be pretty bad at it

- make it easier to build something and get funded, and reward creating prosperity, don't tax it to death

Just tested marginalia's random mode btw. Pretty cool, reminds me of the internet when I was a kid

(edit: formatting)


I don't think the US funded Google?


Google started out of a CIA-funded Stanford project.


CIA? Other people are saying DARPA and NASA got together to fund the NSF which funded the PhDs of the founders of Google, but even that's a bit too indirect IMO. Where does the CIA fit in?


https://qz.com/1145669/googles-true-origin-partly-lies-in-ci...

Enjoy! It's a great story.

(Plus: for those who might not know, DARPA is US defense research, heavily influenced by the intelligence services' needs. Which is not necessarily bad! Just good to understand where and how Google originated. And wrt DARPA, they funded the creation of the internet itself, for whatever that matters.

In Europe, things often go slightly differently. The Web is a result of CERN, which is also a project partner of OpenWebSearch.EU. Why? Well, better search can also be beneficial for better science, not just for end users wanting to find their way or buy something.)


Thanks :)


..correct me if I'm wrong, but Google was started by a couple of postdoctoral researchers, no?


Who deliberately did not stay in academia to do it. More to the point, a successful team building a product like a search engine requires roles that academia doesn't really have.

Who is doing product management?

Who is doing product marketing?

etc

This is all applied engineering at this point, not R&D. How does it at all fit into academia's strong suit?


I think that maybe the point is this is not being tackled as a purely economic endeavour (or if it is, it’s in the “indirect” manner), as such, I suspect roles like “product marketing” are probably unnecessary, at least for now.

Also, tell me you wouldn’t love to work on a large project that wouldn’t be subject to the arbitrary whims and promises of the marketing department.


> the point is this is not being tackled as a purely economic endeavour (or if it is, it’s in the “indirect” manner), as such, I suspect roles like “product marketing” are probably unnecessary, at least for now

Which results in an interesting engine nobody uses. Products that start with the tech and then think of selling it fall on their faces for a reason.


Google started with tech and disdain for online advertising. They didn't start with the ads.


and were forced to pivot because it didn't work to do it that way, empirically disproving your thesis?


They weren't forced to pivot. Gates explained to them how much money was to be made, and they changed their minds. For some time they were very excited about how unobtrusive and helpful the ads were. Then they realised there was even more money, and the rest is history.

If Google proves anything, it's that greed is real.


> 14 European research and computing centers

> 7 countries.

> 25+ people.

There are literally dozens of them!

https://openwebsearch.eu/partners/


I don't think the number of people or even the size of the budget is wrong. A small team can be incredibly powerful and productive if you have the right people. In fact, I think far more often search engines fail from trying to start too big than too small.

The problem is that you need people who actually know how to architect complex software systems much more than you need revolutionary new algorithms. For that, professors are the wrong people. A professor on the team, sure, that might be helpful. Not half a Manhattan project's worth.


> For that, professors are the wrong people.

Have no fear; all of the actual work will be done by PhD students straight out of undergrad, and most of the actual leadership will be done by a string of recent PhD grads who need results in 6 months because they'll be on the job market full time for the 6 months after that ;-)


It happens all the time in Europe. Collaboration between public and private companies is pretty much a pipe dream in the EU. Some company that actually works on building search technology would achieve way more than a bunch of professors.

I disagree on the budget though. It is basically pocket change.


Arguably the biggest unsolved problem in search is how to make a profit (or even break even). This can be approached in two ways: you can either try to find some way of making search more profitable, or you can find a way to make search cheaper. I think the latter is a lot more plausible than the former.

A shoestring budget keeps the costs down by design and by necessity. A large budget virtually ensures the search engine becomes so expensive to operate it will never break even.


Why not offer a paid tier? Seems to work for Kagi. Information elites will soon flock to paid search engines, which won’t be much more expensive than a streaming subscription. I pay for Netflix and am willing to pay for a search engine that offers as good a search service as the video streaming offered by Netflix.


> Arguably the biggest unsolved problem in search is how to make a profit

And the EU just solved that problem.


Did they though?


> I'm a bit skeptical EU-funding a bunch of professors is the way a search engine will be built.

Worked for Google.


The real game-changer in search would be if companies would agree to publish indexes of their own sites in an open standard to a place that everyone could access. This would undercut the monopoly power that large search engines have and allow everyone to focus on innovating the best way to search that content vs. having to spend so much time and money to crawl and index it.


There are already sitemaps, and pages use structured data like HTML5/ARIA roles, RDF or JSON-LD to provide some semantic annotations.

I'd rather that web robots use this information to build useful indexes than to have to worry about generating yet another feed in the hopes that it helps people find my content in a search engine.

Besides, a web robot can determine how much other sites link to my content and help determine its overall ranking in results. Adding another type of index file to my site will do nothing to determine how it relates to other sites.
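
For what it's worth, extracting embedded JSON-LD is not much code. A standard-library-only sketch of how a robot might pull those blocks out of a page:

    # Collect <script type="application/ld+json"> blocks from HTML.
    import json
    from html.parser import HTMLParser

    class JsonLd(HTMLParser):
        def __init__(self):
            super().__init__()
            self.inside, self.buf, self.blocks = False, [], []
        def handle_starttag(self, tag, attrs):
            if tag == "script" and ("type", "application/ld+json") in attrs:
                self.inside = True
        def handle_data(self, data):
            if self.inside:
                self.buf.append(data)
        def handle_endtag(self, tag):
            if tag == "script" and self.inside:
                self.blocks.append(json.loads("".join(self.buf)))
                self.inside, self.buf = False, []

    page = ('<html><script type="application/ld+json">'
            '{"@type": "Article", "headline": "Hello"}</script></html>')
    p = JsonLd()
    p.feed(page)
    print(p.blocks)  # -> [{'@type': 'Article', 'headline': 'Hello'}]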


The structured data on sites, unfortunately, still requires a crawler to index that content, which serves as a barrier for search engine startups. At a minimum, adding some metadata content to XML sitemaps would go a long way to solving some of this problem (title, meta description, content summary, even structured data to the sitemaps).
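
As a sketch of what such an extension could look like, here is a standard sitemap <url> entry with extra metadata fields bolted on (the "meta" namespace is invented for illustration; no such extension exists in the sitemaps.org protocol):

    # Build one sitemap <url> entry with extra (made-up) metadata tags.
    import xml.etree.ElementTree as ET

    SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
    META = "{https://example.org/sitemap-metadata}"  # hypothetical namespace
    ET.register_namespace("", SM[1:-1])
    ET.register_namespace("meta", META[1:-1])

    urlset = ET.Element(SM + "urlset")
    url = ET.SubElement(urlset, SM + "url")
    ET.SubElement(url, SM + "loc").text = "https://example.com/post/1"
    ET.SubElement(url, SM + "lastmod").text = "2022-09-20"
    ET.SubElement(url, META + "title").text = "Example post"
    ET.SubElement(url, META + "summary").text = "One-paragraph summary a ranker could index."

    print(ET.tostring(urlset, encoding="unicode"))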


Deep down in my soul, the long-locked-away SEO of my money-hustling youth just grinned in anticipation.

We have had embedded metadata in websites for decades. In the beginning, search engines even used it. Until someone started stuffing unrelated keywords into it to rank higher.


> The structured data on sites, unfortunately, still requires a crawler to index that content

How is that any different from requiring a crawler to index XML sitemaps?

> At a minimum, adding some metadata content to XML sitemaps

The purpose of a sitemap is to tell a web robot what resources there are, with some minimal metadata about page titles and last modified date.

Google has some extensions for identifying images and videos.

But that adds more work for site maintainers, who have to duplicate work.


What's the problem with using any of the many free web crawlers (or crawler libraries) available to crawl a website (even if solely based on the pages advertised by sitemap.xml / robots.txt-announced sitemaps), and then extracting structured data from those pages?

I don't see this as a barrier unique to startups.


It's easy to do for small sets of sites, but try doing this at web-scale and you quickly run into a large financial barrier. It's not about technical feasibility as much as it is cost.


A standard for this already exists [1], but it does not solve the problems of:

1. Implementation (sites do not need to have a sitemap, and those that have one may not keep it accurate)

2. Discoverability (finding sites in the first place; you'd need a centralised directory of all sites, or resort to crawling, in which case sitemaps are not needed)

3. Ranking (the biggest problem in creating a search engine)

[1] https://www.sitemaps.org/protocol.html


The sitemaps standard (if this is the basis) would need to be expanded to support additional metadata / structured data to support this idea.

1. This would be up to sites, to your point; the major question would be how best to create incentives.

2. This is solvable via a number of approaches, but the search engines themselves would be mostly responsible for finding the right approach for their business. I know how I would do it.

3. Indeed, which would be the main point of this decentralization, to let search engines focus on their hardest problem.

Edit: wouldn't Kagi benefit from not having to worry about crawling / indexing sites?


> wouldn't Kagi benefit from not having to worry about crawling / indexing sites?

It would, but sitemaps do not provide that function, as we discussed above. However, if EU Open Web Search succeeds, that is something we could probably use to some extent.


Or use to extend


One problem with that is now you have to trust the websites to give an accurate index of their content.


Anyone who thinks this will work has never tried to index a site. A huge amount of effort is spent trying to figure out if the site is serving different content to users vs crawlers, or if the site is coded to appear visually different to humans vs machines. If you ask sites to index themselves you will get lies only.


I index sites all the time and I think it could work. There will be other problems, of course, but we already are partly there with XML sitemaps. Relying on the large search engines to enforce “honesty” from websites puts them into a mediator role that has a number of negative effects both for search in general and, increasingly, society at large.


Relying on sites to be honest about themselves is even less likely to work. There are monetary incentives for many of them not to be. Many sites already host dishonest and clickbait content with extreme levels of SEO. The cost of dishonesty decreases if you can directly modify the index.


I think that is primarily a symptom of the fact that we have a bottleneck on search interface providers. If it were easier / cheaper for new search engines / rankers to exist in the market, they could fairly easily filter out unscrupulous domains.


I've run a web-scale search engine and I don't think it will work.

Not only are some sites malicious -- mostly unimportant ones -- but many good sites are simply incompetent.


Indeed


I suspect you underestimate how much of the power of search engines is being able to interpret search queries and figure out what a user is really looking for. Even if there were a public, standardised up-to-date high performance full-text index of the entire web freely available I'm willing to bet Google search would be a useful value-add in its ability to answer natural language queries.


> you underestimate how much of the power of search engines is being able to interpret search queries and figure out what a user is really looking for.

So you mean that a search engine is supposed to ignore what you are asking for and instead give you what it thinks you really meant?


If you search for [what year was queen Elizabeth born], Google doesn't return pages that have that phrase in it, rather it returns pages that have an answer to that question. You could call that "ignoring" if you like, but it's what 99% of users expect.


I run an SEO platform SaaS, so I'm familiar. :)


I resort to "natural language queries" only in desperation, when queries that are lists of search-terms have failed.

Actually, they aren't really natural language queries. They are just ordered lists of search-terms. Goo provides no mechanism for saying "This is an english-language question". And even if Goo could parse my natural language, and rephrase it as something like "Are you looking for a list of books published by Douglas Hofstadter?", when you turn that into a query on the index, it stops having anything to do with natural language.


A "dumb" search that just took a search phrase like "books published by Douglas Hofstadter" is going to return pages that have that phrase in it, or something close to that phrase. Google will prioritise results that actually contain such lists, regardless of whether the page contains a phrase like that (e.g. the word "published" is basically ignored by Google). That's all I meant.


We will explore that idea in the project; I also think it may help (though it is vulnerable to web-index spam by adversarial parties).


That is indeed the biggest problem but maybe something that can be more effectively dealt with downstream by the content rankers and potentially even the user base / custom search algorithm builders. Brave's Goggles project is a good early prototype of this concept.


I'm pretty sure we tried that way back in the day with <meta name="keywords" content="spam spam spam spam">. People would stuff that with every word in the English language. Older search engines that used those keywords returned some pretty awful results. You simply can't trust sites, which have a strong incentive to get to the top of SEO rankings, not to lie. In fact, given that at least one of your competitors will stuff their keywords to get to the top, you'll have to do it too. It would become an arms race over who can stuff the most garbage into their indexes to "win". It just doesn't work.

All search engines that attempt to be useful will have to filter out the junk. You just have to trust that the search engine you are using isn't withholding results from you that it considers "bad" (eg: "misinformation" (i.e. stuff somebody disagrees with)).

And to me, that is the crux of the debate really. Nobody wants spam for search results--everybody agrees with that and there is no real debate about filtering that crap out. The argument really is should a very large company that has a huge market share get to decide what constitutes "fact" and what is "misinformation". Based on 2.5 years of experience so far, what was once deemed "misinformation" has a sneaky way of becoming "factual information". Labeling and hiding "misinformation" because it goes against some narrative pushed by incredibly powerful entities is very scary and there was a hell of a lot of exactly that going on during this covid crap.

I used to fall on the side of "private companies can do whatever they want" but now I'm not so sure. Companies like FB, Twitter or Google play a huge role in shaping politics and society. I'm no longer convinced it is okay to let them play the role of "fact checker" or anything like that. Filtering spam is one thing, but hiding "misinformation" is entirely different.


Your last point is also the one (aside from the economics) I am the most interested in.

I think we live in a world now where we are so used to a few tech giants mediating everything for us that we can't even imagine other solutions to this problem, but it's also how we got to this point in the first place.


>You simply can't trust sites, who have a strong incentive to get to the top of SEO rankings

Why is it not enough to punish sites that abuse the keywords?


Who is the one who punishes the abusers? How can you scale the solution to deal with billions of pages?


The users punish.

You need a trustworthy core by which you can judge the vote of new users. You can incorporate them until somebody complains about a result that is out of place.

This doesn't have to fully scale. There are many pages without monetary value that won't be manipulated. The tags are an additional signal that can be used where they work. If they don't work, they can be ignored.

But it will scale because there are far more consumers than producers.


People would abuse that for SEO purposes within seconds.


The market need would then be shifted to the best search interfaces instead of who has the most money to build the biggest index. A much better focus, IMO.


I believe that is precisely what the project is aiming to do, and to turn it into a public resource.


I’d rather see them publish a federated search of their own content.


Your comment prompted me to check out Searchcode, looks very interesting. How would the federated search model work in this example? Instead of you having to index the various code repositories, they would index themselves and make their search of those indexes available via a federated API?


I don't see any mention of Quaero, the EU search engine that was supposed to compete with Google [0, 1]. How is this time different?

[0] https://en.wikipedia.org/wiki/Quaero

[1] https://www.dw.com/en/germany-pulls-away-from-quaero-search-...


For starters: the objective is to create the index, not the engine; that's quite a different ambition.

We are very aware of the Quaero/Theseus history :-)


What is the difference?


Supposedly the project is about just building the platform/infrastructure (which is what the index is) upon which search engines can be built.

These search engines will then have the freedom to define their own search product experience, business model, even ranking of results.


So something even more vaguely defined and detached from real use cases than last time? Great.


It is very precisely defined; your not understanding what they are building does not mean it is not worth the effort.


The above actually defines the scope very well. There is a lot more to be built on top of it, but that is not what the project is trying to solve.


Is there any discussion on how this work will differ from Common Crawl?


This was the previous legislature's project. The new legislature brings CHANGE. They are not the same...


> A new EU project OpenWebSearch.eu … [in which] … the key idea is to separate index construction from the search engines themselves, where the most expensive step to create index shards can be carried out on large clusters while the search engine itself can be operated locally. …[including] an Open-Web-Search Engine Hub, [where anyone can] share their specifications of search engines and pre-computed, regularly updated search indices. … that would enable a new future of human-centric search without privacy concerns.

So... who's going to create the index? Indexing the web is expensive, and it's offset by the ads the indexer runs on their search website, as Google, Bing, Brave, and others do.
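
To make the quoted architecture concrete: a toy sketch of a search engine "operated locally" over pre-computed index shards. The JSON shard format is invented for illustration; the project has not published a shard specification.

    import json
    from collections import defaultdict

    # Invented shard format: {"term": [[doc_id, score], ...]}.
    def load_shards(paths):
        index = defaultdict(list)
        for path in paths:
            with open(path) as f:
                for term, postings in json.load(f).items():
                    index[term].extend(postings)
        return index

    def search(index, query):
        scores = defaultdict(float)
        for term in query.lower().split():
            for doc_id, score in index.get(term, []):
                scores[doc_id] += score
        return sorted(scores, key=scores.get, reverse=True)

    # A local engine downloads only the shards it cares about and queries
    # them in-process -- the query itself never leaves the machine.
    index = load_shards(["shard-00.json", "shard-01.json"])
    print(search(index, "open web search")[:10])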


Well, easy: Europeans are going to fund it with their taxes, public search labs are going to build it, and private companies will squeeze the concept for cash for as long as possible while taking EU taxpayer money as "research grants" and claiming operational costs as "research tax credits", which will set them up as special partners for the next EU cash grab (I mean quadrennial plan), since they will have successfully run a public-private partnership (into the ground)

t. was involved in one of those plans as part of the research team in a public lab


I wonder how privacy will be ensured when your query hits the map-reduce infrastructure running on these clusters.

Regarding privacy, the bar is significantly higher than what Google has to deal with. This will come at some cost in quality and/or speed.


Every individual website has an incentive to create indices of their own content, and hosting providers could provide it as a service. Not hard to envision. Search Engines could download these indices periodically to build the meta-search.


Also not hard to envision websites being incentivised to lie in their indexes.


Lie how? Meta indexes can be picky about the sites they list, simple as that. One could imagine personal meta search engines which only list the sites you care about. Bad sites simply don't get listed.


The point is that if a popular search engine returned pages primarily based on whether particular search terms appeared in their self-published indexes (rather than using a crawler that parsed the page content to build its own index), web sites could easily publish indexes that just listed popular terms users are known to search for, regardless of whether the site had anything to do with those search terms. Pretty much what used to happen 10-15 years ago when everyone believed meta tags were the way to achieve SEO.


Someone who's snagging an EU grant, that's who.


> Someone who's snagging an EU grant, that's who.

Bullseye.


What does "based on European values and jurisdiction" refer to? I'd love to be pleasantly surprise, but this sounds like it's ripe for centralized censorship.


By US values, profit is weighted heavily over anything else. Chinese and Russian values (or lack thereof) are focused on controlling the narrative. European values would contrast as rules (as opposed to vague guidelines) with legal ramifications, a lot of open source software, an actual effort towards honesty (mileage may vary), and probably Vogon poetry.


> Chinese and Russian values (or lack thereof) are focused on controlling the narrative

I think it's reasonable to label China and Russia as control freaks, but the US and Europe aren't much different at this point either, given the creepy focus on trying to shut down whatever they want by calling it "misinformation" and pressuring social media to comply.


But you do get your day in court.


"Right to be forgotten" is an example of a European value.


Um, bad example. It was a misconceived bit of legislation. There's no logic behind that claimed right. What happened to my right to remember what I choose? Why shouldn't I publish my memories on a website?

Anyway, it's not a "European value"; it's a piece of legislation, so it falls under the "European jurisdiction" category, not the "European values" category.

That complying with "European values" is part of the project goals is unfortunate. Who decides what are European values? There's no Declaration of European Values. We don't all have the same values. There are just European laws, and then a whole bunch of opinions that not all Europeans agree about.


people really, truly, go out of their way to criticise the EU. there are things about the EU that deserve criticism, but this is reaching to the point of deliberate misunderstanding. no one is saying that a person who wants to be forgotten cannot be talked about on your website, or that you have to somehow deliberately delete your memories of that person. you are not a corporation. the right to be forgotten is about the right for you to be forgotten by a corporation. the right for you to not be in their database anymore


> the right for you to not be in their database anymore

Let's switch that around: you're referring to their right not to be listed in my database any more.

I don't see what difference it makes whether I'm incorporated or not; I'm not OK with the idea that certain information cannot be shared. Like, if the information is false, that's one thing; but a legal requirement that some true information must be hidden, that seems like pure badness.


your argument is predicated on the assumption that there’s no difference between you and a large group of people. this is very clearly a false assumption


It could be fun to have a conversation about this topic, but when you start with calling the opinions of others a "deliberate misunderstanding", you've poisoned the well.

Still, it's a good illustration that it's a somewhat unique European value. It's definitely not an American value.


it’s such an overpowering misunderstanding, combined with a topic people have very little objectivity about, that it is not actually believable.

do you also think you’re not allowed to remember people’s names without permission because of GDPR?


Given the history of the 20th century, this kind of comment promoting European values and jurisdiction seems..... dicey. Companies' ethical records, as shitty as they are, have nothing on the mass destruction, genocide and stupidity of governments.


On first glance, I see the word "unbiased" immediately followed by "based on European values". Now, I'm no expert, but to me, that seems pretty biased.


"being unbiased" can be fairly honestly called a "value". Europeans apparently have chosen it as one of their own values.

If you are arguing that being truly unbiased cannot ever be realized, I counter that neither truth nor justice can ever be truly realized, but this should not stop anyone from taking them as their values.

     signed by: an expert


It's not that there's anything wrong with aspiring to be unbiased; the problem arises when someone really thinks they are unbiased, or claims to be unbiased. You get this a lot in the MSM; the Guardian constantly exhorts its readers to subscribe, to support unbiased journalism.

I prefer reporting that wears its biases on its sleeve. Regrettably The Guardian squeezed out most of its most interesting writers, apparently by repeatedly spiking their stories.

Yeah, Truth and Justice are platonic abstracts; nobody thinks they exist in the real world. But people do believe that unbiased reporting is possible. It isn't.


biased on European values



Which now shows:

> Resource Limit Is Reached

> The website is temporarily unable to service your request as it exceeded resource limit. Please try again later.

Original URL might be more resilient...


Hmm. I can access the page without that message. In any case the Internet Archive seems to have it:

https://web.archive.org/web/20220920183027/https://openwebse...


I wish the project wouldn't try to moonshot a thing we already had. I want the good old Google search where you type something and all relevant web results pop out (without creepy ads, trackers, personalisation, or shopping links to get you to spend). I understand it is very expensive to crawl the web; that's where an org like the EU can come in. I wish they bootstrapped a simple, open index that any startup can use to provide clean search fronts. Crawling the web and indexing content can be quite expensive. If the EU bootstrapped that infrastructure and then, say, augmented it with federated crawling and indexing (sort of like the Condor distributed-computing project), where people, universities, etc. in the EU could contribute their spare computer time to crawling and indexing the web for the EU index, it would make search better for everyone. Heck, I'd even put solar panels on my bike shed and hook them up to a PC to crawl the web while the sun is shining.

I'm dreaming too much, I need coffee..


But the web is different; google much less so. There is no "old google search", because almost everything to search through is noise and spam.

PageRank made user preferences, as signalled by hyperlinks, the key signal of quality. Who really hyperlinks any more? And how many hyperlinks, by percentage, are non-automated?

The web this works for is one of forums & personal websites. It's hard to say that today there are any properties of websites that are a reliable signal of quality.

Hence the proliferation of voting sites (such as HN, etc.) which are little more than search engines augmented with reliable signals of user preferences.


>But the web is different; google much less so. There is no "old google search", because almost everything to search through is noise and spam

the commenter is not referring to the search results, they're referring to the interface not having all the wank that google packs in to optimise for profit

>Who really hyperlinks any more?

who doesn't?


Correct me if I am wrong: the purpose is to create an index database upon which custom search engines can be built? I.e., the EU will crawl all pages on the web?


The index is just the first step according to news articles:

> Once the index has been created, the next step is to develop search applications.

> The team at TU Graz will be particularly active here in the CoDiS Lab and will work on the conception and user-centric aspects of the search applications. This includes, for example, research into new search paradigms that enable searchers to have a say in how the search takes place. The idea is that there are different search algorithms or that you can influence the behavior of the search algorithms. For example, you could search specifically for scientific documents or for documents with arguments, include search terms that have already been used, or include documents from the intranet in the search.

https://www.krone.at/2791083


This is an excellent idea to disintermediate Google. A search tool for the commons, without the spyware and opportunities for search manipulation, is very important, and the more actors work on it, the better. Perhaps cooperation with other tech-savvy democracies could occur. India is full of skilled programmers. Unlike Russia or China, it is easy to see Europe cooperating with it.


Hmm. India also has a government that is borderline fascist, promotes ethnic and religious hatred, and doesn't seem to believe in internet freedom (they keep shutting it down). And it's not the lack of skilled programmers that makes Western governments wary of cooperating with China and Russia. China and Russia have totalitarian governments that are terrified of the internet, and would like it to go away. I wouldn't want to include them in a project to make the internet work better.


What personally worries me the most is the "Meet the partners" page.

I didn't look at all of them, but I did check who the partner from Slovenia is.

It's the most privacy-invading ISP/mobile company in our country, which sells statistics about its users (anonymized, yes, but walking a thin line).

I just hope it won't go down that drain.

Regarding all the negativity about the government project.

If we can run CERN we can surely do a web search project.



I suspect search engines are an outdated concept for at least the largest of sites, who will generally, but not always, have better ways to directly search their own content.

The remainder of the search problem seems to just be collecting relevant trafficked sites for listing in results. Today Google et al seem to be doing this BY HAND. And it's not even obfuscated.

Recently, for the first time in my life, the wizard behind the curtain seems to have been exposed. I feel strongly that one could probably start a small index that catered to a fairly large audience.

And honestly, for other queries, just tell the user to search that site directly. I think you could even market it to users as not a technical limitation, but behavior that should be considered fuddy-duddy.

Like, really, you're going to search me? You know they have their own search right?

Even Yellow Pages faded into obscurity eventually.


> the largest of sites, who will generally, but not always, have better ways to directly search their own content.

I have the exact opposite experience.

To wit: searching HN via the algolia link at the bottom is way worse than searching on Google with a site:ycombinator.com restrict.

Same thing for YouTube, where the search engine is tuned for maximizing watch time and strictly not to return what you're looking for.


I would agree; I've almost never seen a site that provides a better way to directly search its own content than the major search engines provide. Wikipedia might come closest because their search compares the user input to article titles, and that's often helpful (mainly because the article titles are chosen to describe the article content, and aren't clickbait). But pretty much anything else uses a search approach that is far inferior to what Google or Bing provide, and it won't find what you want.


Looks like it's Northern EU only.

No research institutes from {France, Italy, Spain, Greece, Portugal, etc ...} involved.


Slovenia, Czech Republic. But yes, I think there was a competing proposal from Italy/Spain. Not enough budget for two projects in this area, unfortunately, as they were good too.


Some of you might be interested in the privacy focussed "French Google" Qwant. https://en.wikipedia.org/wiki/Qwant Plot twist: It's funded by Huawei.


Good to see more efforts from EU to negate big tech domination. Federation is the way forward.


Before Corona, I would have really welcomed this announcement. But to be honest, when someone says that the project will be "based on European values," I think it's stillborn.

Who is defining those "European values"? Is it the European Commission?

As a liberal person who still dreams of mature citizens who form their own opinions well-informed from a rich debate, I now see the "European values" quite critically.

In Germany, if you criticize the Corona measures you are called a Nazi; if you criticize the Ukraine war, you are a Russian troll. Are those then the "European values"?

The European Commission has already established some projects to weight the information according to its will (SocialTruth, PROVENANCE, EUNOMIA, etc.).

When a government agency talks about truth, then all the hairs on the back of my neck stand up.

When governments speak of disinformation, it is usually only in the sense that the information does not correspond to their interests. I had to learn that the so-called fact checkers don't check facts; they just sell a counter-opinion as a "fact".

So for a European search engine, a filter is then placed upstream that filters out all disinformation?

In the past, that was called censorship. Today, it's more like citizen service.


> If you criticize the Corona measure you are a Nazi, if you criticize the Ukraine war you are a Russian troll?

These notions are propagated by the US to the rest of the world. While EU politicians seem to go along, Europeans do not tolerate it much, and that is reflected in society.

If the EU or more gov-sponsored search engines pop up, I have no doubt they will want to control them in their own way; I'm okay with that. Right now our only option is US-controlled entities that answer to US governments and rules.

At least here we can point fingers and hold politicians accountable if they try to influence things the wrong way. We can't do that with US companies.


> At least here we can point fingers and hold politicians accountable if they try to influence things the wrong way. We can't do that with US companies.

Germany is currently governed by a chancellor who has in fact been convicted of lying, has selective memory lapses as a co-responsible party in one of the biggest banking scandals, and was also instrumental in the political decisions to push ahead with dependence on Russian gas.

So sorry if I doubt that politicians can be made accountable for anything.


I couldn't work out what "building blocks" they're going to produce. Are they going to produce a public index, that search engines can use? A set of APIs?

> the researchers will develop the core of a European Open Web Index (OWI) as a basis for a new Internet Search in Europe.

That sounds as if building an index is at least within scope. But isn't the design of the index one of the key differentiators between different search engines? Perhaps the OWI will just produce tools for building transparent, privacy-preserving search engines, rather than crawling the web itself. It's not clear (to me) from reading the site.


The project is just starting, so not all your questions are answerable today, but we will definitely produce an open web index, already by the end of the first year, with improvements in years two and three.

We will further deliver components for building search engines on top of this index. The project's vision is that there will be many different search engines, not just 4 worldwide. Hoping to lead the way!


A crawler should see web pages as a human would, read them as a human would, and extract keywords based on the real displayed content. After that, the search could be as simple as looking up a book in your library catalog. Think liberally about the search key, e.g. "<40% of the screen is display ads". If the search engine gets popular, you can then worry about scaling up to serve the results faster.

Oh, and PageRank can be done using real semantic links (actual human comments) rather than easily spammed hyperlinks. And you can group all similar pages and serve those as a single expandable/refinable entry.
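
A hedged sketch of "index what the reader actually sees", assuming a headless browser such as Playwright is acceptable (rendering every page like this is exactly what makes crawling expensive at web scale):

    from collections import Counter
    from playwright.sync_api import sync_playwright  # pip install playwright

    STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

    def visible_keywords(url, n=20):
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.goto(url)
            # Only text that actually renders, after JavaScript runs --
            # roughly what a human reader would see.
            text = page.inner_text("body")
            browser.close()
        words = [w for w in text.lower().split()
                 if w.isalpha() and w not in STOPWORDS]
        return Counter(words).most_common(n)

    print(visible_keywords("https://example.com/"))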


Yes, telling spam (spam blogs, copy-pasted versions of Stackoverflow, etc) from real organic content is easy for a human, so I can only assume it would be pretty easy for a computer too. At least until this becomes a practice that spam site makers have to consider and adapt.

For example, given 10 similar articles where one has a well-known domain but few links and the others are all cross-linked only among a known spam-net, then just show the well known page (stackoverflow) and not the useless copycats.

It seems doable for a computer to also navigate the page like a human and try to negotiate various cookiewalls and similar. If a page requires jumping through hoops to enter, rank it poorly. There should be a single decline/reject if there is a button at all, and the content ranked should be the content shown after rejecting. If it shows two buttons "Accept" and "Options" then just don't index that page.
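
One possible way to encode the "prefer the well-known original" heuristic; the domain names and the link-diversity signal are invented stand-ins for real reputation data:

    # Toy canonicalization over a cluster of near-duplicate pages: keep the
    # page whose inbound links come from the most distinct domains, on the
    # assumption that spam-nets cross-link among only a few related domains.
    def canonical(cluster, inlinks):
        """cluster: list of URLs; inlinks: URL -> set of linking domains."""
        return max(cluster, key=lambda url: len(inlinks.get(url, set())))

    cluster = ["https://stackoverflow.com/q/1",
               "https://copycat-a.example/q/1",
               "https://copycat-b.example/q/1"]
    inlinks = {
        "https://stackoverflow.com/q/1": {"blog-a.example", "news-b.example"},
        "https://copycat-a.example/q/1": {"copycat-b.example"},
        "https://copycat-b.example/q/1": {"copycat-a.example"},
    }
    print(canonical(cluster, inlinks))  # -> the stackoverflow.com URL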


> If a page requires jumping through hoops to enter, rank it poorly.

Oh, how I wish that search engines would simply give up on encountering a paywall/cookiewall, and refuse to index the page.


I really hope this project is successful - the heart seems to be in the right place!


> Free, open and unbiased access to information

How can something be unbiased yet somehow stop showing illegal search results?


Search is way more than just indexing.

I'd really like to see them match the 20+ years of search quality fine-tuning that Google built into their search engine.

Not that Google is as good as it used to be, but still, catching up with them is way more complicated than just building a big crawl-and-index piece of infrastructure.

And all of that on a government-funded shoestring budget.

Mmmh.

Good luck to them, but I'm not holding my breath.


I'd like to see them match the 10 years of fine-tuning from 2012-era Google.

My understanding is Google used to keep a lot of data around for "long tail" queries, and stopped doing that at some point. This seems to be the issue behind search declining outside of "food near me" type queries (which they absolutely excel at).


I agree that we are sadly ending the chapter of free search, but I'm quite sure this is just another roll call for some professors to make a quick buck, like it always is.

Search needs help, it's true, but 8 million isn't even going to move the needle; the established engines have that much money invested just in maintaining their network switches.


Galileo, anyone? The EU is the trash bin of politicians who failed in their respective country. Not elected and saturated with piles of money. It is under control of the American Empire, thus designed to fail with a push of a button if necessary (€ is the key). Wake up!


> Unbiased

One of the tech leads' feeds [0] looks quite biased towards Ukraine, though. I hope it doesn't interfere with the search engine.

[0] https://twitter.com/mgrani


There is no bias. Russia is a terrorist state that has brutally and illegally murdered civilians in a country they chose to invade. Hopefully it will be their continued downfall.


All countries are terrorist states if we look at it this way.


Seems like a very interesting idea. So many times I've wanted some kind of advanced Google query language. (I know about allinurl and such, but that's not enough. Google is tuned for the average user, which is good for Google, but not for any non-average query.)


Is there a privacy policy in English? The only one I could find[1] is in German.

[1] https://openwebsearch.eu/privacy-policy/


It will be interesting to see what the index contains, and how it is structured.

What made Google such a game changer was that they based their index not just on the contents, but on how pages linked to each other.


That's the marketing story. I think it's because they didn't clutter their homepage like AltaVista did.


I was a heavy AltaVista user back in the day, and I can tell you Google results really were of higher quality; it wasn't only their homepage (even if that was also quite refreshing).


Nope. It's actually how it worked. They patented it and published papers about it.


openwebsearch without open code and open issues, really?

A tiny image with a few nodes and stakeholders, including third party indices and monetary services ("Third Party Enrichment Services"). In which decade are they living? https://openwebsearch.eu/the-project/


With a budget of 8.5M EUR/USD. Alphabet spends 200B per year; if 40% of that were spent on search, their budget would be ten thousand times larger.


It's definitely a comparative underdog regardless, but if you think Alphabet spends anywhere near 40% on search you're out of your mind. I'd be shocked if their spend is double-digits. I'd be unsurprised if it's <1%.


I doubt 40% is spent on search. Seeing how bad Google has gotten, it seems more likely there is just a skeleton crew keeping the lights on


The https://memex.marginalia.nu/ search engine runs basically on one desktop computer (if I recall, the creator made a post on this a week or so ago).

How much of what google spends on "search" is strictly for search, vs business goals related to search?

How much of that google spend is salary? What are those salaries? How do they compare with EU post-doc salaries?


I would be shocked if Alphabet spent >5% on search. But even 1% would dwarf this project.


It doesn't sound open and unbiased to me, because it will allow the government to censor on an index level.


We need to develop a social aspect to search where results are also moderated and curated by humans in some kind of way.


And when that curation produces results you find abhorrent? What then? Because I guarantee it would; a metaphysical certitude.


Do you hang out with abhorrent people?


Blekko (2007-2015) did that -- apparently it wasn't enough for commercial success.


I hope it will have proper regex capability for narrowing the search results to more relevant content.


Oh cool, but do you mean the "EU Open Web Search Data Collection Program"?


Will it censor the so called "hate speech"?


I have never seen “hate speech” that wasn’t hate speech.


Would it be hard to expand it into a global endeavour?


Governments would never agree on what to show or hide.


so it began, that sern starts to gather market share.

--

I doubt this will take off. I mean, they invested more in funding and marketing than in starting to build something. They should've started with code (AGPLv3, of course) and invited more and more people. At the moment this is more buzzword-bingo bullshit than anything else. It's basically always the same problem: instead of focusing on the product, they focus more on the message.


???

Where is the EU-owned and -operated 2nm foundry?


GAIA-X, is that you?


"unbiased...based on European values" - will it fly?


European values are inherently unbiased. What's the problem? o.O


There's no such thing as an unbiased yet useful search engine. The very act of choosing which article to put at the top is a value judgment, therefore a bias. The only way to make things "unbiased" would be to rank the matches randomly, and that would be useless.


This is just an index on which different search engines with different priorities can be built.


> The only way to make things "unbiased" would be to rank the matches randomly

Well, I would rather like to have visibility into the ranking algorithm, and to be able to control the resultset ordering.


So would the SEO people. But even disregarding that problem, to have a halfway decent search ranking you're going to need deep learning, and you'll get a model that is largely opaque.


Heh! You've mistaken me for a marketer, I think.

When I said I wanted control over the ranking, I meant as a user. Tell me what parameters I can rank on; give me a UI that makes it reasonably easy to express a ranking. That's it.

I hope UI design and stuff is orthogonal to the construction of an open index. I think this could be very interesting.
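
The kind of user control meant here could be as simple as exposing per-document ranking signals and letting the user weight them. A minimal sketch; the signal names are invented for illustration:

    # Hypothetical per-document signals an open index could expose.
    docs = [
        {"url": "https://a.example", "text_match": 0.9, "freshness": 0.2, "inlinks": 0.7},
        {"url": "https://b.example", "text_match": 0.6, "freshness": 0.9, "inlinks": 0.1},
    ]

    def rank(docs, weights):
        """Order results by a user-supplied weighting of ranking signals."""
        def score(d):
            return sum(w * d.get(signal, 0.0) for signal, w in weights.items())
        return sorted(docs, key=score, reverse=True)

    # "I care about fresh pages, not popularity":
    for d in rank(docs, {"text_match": 1.0, "freshness": 2.0, "inlinks": 0.0}):
        print(d["url"])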


Well, there are topics that are not legal, or not in line with EU values, to mention. Once you start removing content, it is no longer unbiased.


This is just a short reply to a blog which mentions that the project started...

The actual website of the project (with some concrete info) can be found here: https://openwebsearch.eu/


Changed now. Thanks!


the EU loves taxing productive companies and wasting said money in stillborn projects that nevertheless promise a kind of bright socialist federalist Europe in their bureaucratic minds


At least we have some of the most livable countries on Earth to show for it. I'll take taxes over any trickle-down economics, and don't let me stop you from looking up the definition of socialist, because you are using it wrong.

Besides, it's an 8.5 million EUR project; that's literally nothing, it's payroll for a few people. The money is being invested in people who then spend most of it, so it's a triple investment.


Who said anything about socialism? It’s a geopolitical tool to weaken American and Russian influence from Google and Yandex.


Isn't it lovely?!


On a side note - how does one get involved with the project should they wish to do so?


I am fine as long as they pay for these self-centered utopias with their own money


I'm fine with paying 2 cents for this.


I’ve already caught their crawler ignoring robots.txt directives on one of my sites, aggressively indexing explicitly excluded information.


That cannot be true, as the project has yet to start. But anyone can start a crawler, so you may have encountered other people's software. We wouldn't be so unknowledgeable as to ignore robots.txt ;-)


It was a crawler with the user agent "hgf AlphaXCrawl/0.1 (+https://www.fim.uni-passau.de/data-science/forschung/open-se...)", operated by the University of Passau and the Open Search Foundation, both named on your landing page. It would be a mighty big coincidence if this wasn't connected to this endeavor, especially when it confirms being an experimental crawler of said project at the UA URL.


Out of curiosity, what's the url for your website, and from what IP or host do their crawlers connect?


The main connecting IP was 195.113.175.41.


Wouldn't it be impossible to know if it ignored robots.txt?

Just because it crawled it doesn't mean it stored it.


Storage or not is entirely irrelevant to robots.txt directives. It guides automated access. It must be parsed first and excluded URLs must not be accessed at all.
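
For reference, honoring robots.txt before any fetch takes only a few lines with Python's standard library (example.com stands in for a real site; the user agent string is the one reported above):

    import urllib.robotparser

    AGENT = "hgf AlphaXCrawl/0.1"

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # fetch and parse the directives first...

    url = "https://example.com/private/page.html"
    if rp.can_fetch(AGENT, url):
        pass  # ...and only then may the crawler request the URL
    else:
        print("excluded by robots.txt; must not be accessed")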



