Why Is the Web So Monotonous? Google (reasonablypolymorphic.com)
303 points by amf12 on Aug 4, 2022 | 246 comments



This is a clear demonstration of users' expectations of search having exceeded what's possible today.

From the article:

> Lets look at some examples. One of my favorite places in the world is Koh Lanta, Thailand. When traveling, I’m always on the lookout for places that give off the Koh Lanta vibe. What does that mean? Hard to say, exactly, but having tourist amenities without being touristy. Charming, slow, cheap. I don’t know exactly; if I did, it’d be easier to find. Anyway, forgetting that Google is bad at long tails, I search for what is the koh lanta of croatia? and get:

This is a near impossible query for human beings, let alone for a computer, given the state of AI at this point in time.


GPT-3 prompt:

I really love the vibe of Koh Lanta in Thailand. It has tourist amenities without really being touristy. I find it charming, slow and cheap. Is there a place in Croatia like it?

Completion:

There is no exact place in Croatia that has the same vibe as Koh Lanta in Thailand. However, there are places in Croatia that have a similar charm and slower pace. These places include Hvar Island, Korcula Island, and Istria. While these places do have some tourist amenities, they are not as touristy as Koh Lanta and still offer a more authentic Croatian experience.

Another:

There is no exact equivalent to Koh Lanta in Croatia, but the small town of Nin on the Dalmatian coast comes close. Nin is a charming, slow-paced town with beautiful beaches and a relaxed atmosphere. It is also quite inexpensive by Croatian standards.


The next problem is verifying all of that information.

It will require dozens of auxiliary Google searches and reading through countless articles and message boards.

GPT-3 is a great research tool as a brainstorming engine, but sadly cannot currently be trusted to output truthful information, even with conditioning.

It takes someone with prior knowledge or subsequent research in order to make use of the result. I'd say this isn't what the author was intending.


Even if you got the info straight from another human being I can't imagine anyone would take it at face value and just schedule their trip.


I wouldn't buy the tickets sight unseen, but I would trust you that it's actually a real place, actually in Croatia, and not being conflated with similar-sounding places.

Humans, unless they're disturbed or malicious, tend not to hallucinate things out of whole cloth.


So... we’ve got ML models to the point of being able to hallucinate?


Yup: prompt a large language model to tell you about itself and it will sometimes weave weird tales. For example, I just had the following interaction with GPT-2 Large[0]:

Me: What is your name? Model: I'm Mark A. <continue> Dreyer, and I'm with the <continue> Federal Communications Commission (FCC)

Obviously, it's not actually a guy named Mark A. Dreyer and, as far as I can tell, there is no Mark A. Dreyer in a prominent role at the FCC. That, as far as I'm concerned, is essentially a hallucination.

If you think about how they work, this isn't totally surprising. The model has been programmed to take the input, turn it into a series of tokens, and then produce another series of tokens that are especially likely to follow the input ones. That's it.

To the extent that this corresponds to reality, it's only because reality is overrepresented in the training data, and so token sequences corresponding to reality are often more likely under its model. This also explains how it goes off the rails: Mark is a very common name (in the top 25 for men for decades) and once you generate a stereotypical middle-aged white guy name, it's not a huge leap to him being a bureaucrat, etc. I forgot to delete the huggingface boilerplate, which talks about advances in technology (etc), which probably sent it down the FCC route too.
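
That token-in, token-out loop is easy to see for yourself. Here's a minimal sketch, assuming the Hugging Face transformers library is installed (the model name and prompt are just examples):

  from transformers import AutoModelForCausalLM, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("gpt2-large")
  model = AutoModelForCausalLM.from_pretrained("gpt2-large")

  # Step 1: turn the input into a series of tokens.
  inputs = tok("What is your name? I'm", return_tensors="pt")

  # Step 2: sample a continuation of tokens that are likely under the model.
  out = model.generate(**inputs, max_new_tokens=25, do_sample=True, top_k=50)
  print(tok.decode(out[0], skip_special_tokens=True))

Nothing in that loop checks the continuation against reality; "Mark A. Dreyer of the FCC" only has to be statistically plausible as text.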


That’s totally wild. The visual ML output is very obviously like tripping. But what you’re describing is very similar to how someone once described being high on acid to me, where no sentence could be completed without it turning into another and another and another until it was all just gibberish. It’s an interesting feat to be at this stage simply through backpropagating neural nets, but there’s a huge gulf between where we are and creating __understanding__ (he said, trying to sound smart)


Hallucination in context means something rather different.

Something like coming to accept some data-set local optima as a global truth and deriving beliefs without underlying basis.


"Yes"?

The whole point of these models is to make up something that seems "realistic", but not regurgitate the original contents. If that's not hallucination I don't know what is.


Google's answer box also can't be trusted to output truthful information.


Fun one today: https://www.google.com/search?q=is+fortigate+open+source

Prompt: "Is Fortigate open source?"

Article: "Based on the parameters we compared, it seems that pfSense is the more favorable solution because it is open source and also offers great features."

Google: "It is open source!"


Yes, these are forever spreading disinformation on various topics in my social group. I am forever telling people to actually click and read the article rather than trust the automatic tl;dr by Google.


I believe this is the point that @drivebycomment was making, that answering such a query is not possible right now. The query needs to be reframed by using keywords like the GPT-3 prompt did, and even further refined than that, and you still have to finish the last mile of research.


Google simply returns human-written results, thus delegating the trust problem. You can track an author's record and judge for yourself whether they're trustworthy. Doing the same thing for GPT-3 is much harder. How are you going to debug the model? There are probably only a handful of people capable of debugging the model, even with all necessary access granted. You can think of it as autonomous driving, which has already reached a good level for 99% of use cases (at least for Waymo and Cruise), but no one thinks robotaxis will mass deploy anytime soon.


Could this be addressed by an ensemble of GPT-3s fine tuned with different data?


Beat me to it; I got a very similar response by telling GPT-3 to behave as a 'highly advanced search engine'. It also went for Hvar and Korcula.

query: what is the koh lanta of croatia?

response:

Koh Lanta is an island in Thailand that's known for its beautiful beaches and clear waters. It's a popular destination for tourists, and it's also become a popular spot for digital nomads in recent years.

Croatia doesn't have an island that's quite like Koh Lanta, but there are a few options that come close. The island of Hvar is a good option, as it has beautiful beaches and a laid-back vibe. The island of Korcula is another option, and it's known for its medieval villages and stunning scenery.


That proves GPT-3 is good at constructing sentences based on a prompt, not that it actually knows how to vibe-check tourist destinations.


I have been to Hvar. It's not cheap; it's the opposite... It's where the super rich people hang out.

Anecdote: When I told a random person I got there by plane, he automatically assumed I owned a plane.


GPT-3 and DALL-E scare me. This is too good.

I wonder how long it will take before we have a GPT-3 bot commenting here on HN


I get sent GPT-3 blogspam as "guest posts" from wannabe "contributors" every day, and I can say that the moment you expect GPT-3 to write anything remotely technical, it turns to shit. At the same time, anything people have GPT-3 write is frequently full of algorithmically detectable plagiarism.

Needless to say, all of these people get told to fuck off and get a job.


GPT-3-authored content already made the top of HN not long ago

https://www.theverge.com/2020/8/16/21371049/gpt3-hacker-news...


The author of that post was called out by the HN mod for engaging in vote manipulation in the past, and for lying about several aspects of the GPT-3 post [1]. I wouldn't put so much faith in that author

[1]: https://news.ycombinator.com/item?id=24164749


How do you know there's not one already?


Whistleblowing is usually a way to redirect attention.


how do you know I'm not one? I'm appalled I'm not getting recognized


>It has tourist amenities without really being touristy

All the results it gave are flooded with tourists, and some of them are among the most expensive you can get in Croatia. :shrug:

Unsure of what to think of the results, as they seem to have very little relevance to the input description other than being in Croatia and being tourist spots.


I'd be super interested to see if these other places actually fit the descriptions it gives, or if it essentially just generated a likely pattern of text to follow the first part saying there's no exact match, and then filled in a random known Croatian location. I honestly don't have any intuition of which of those is more likely!


Hvar is filled with 25-year-old tourists who drink a lot and maybe hope to get into modeling as a side hustle. It does not fit the prompt at all. Korcula is a slow, family-oriented island. Maybe a better fit? But it really just seems like GPT has picked up on a list of the most popular destinations in Croatia.


Istria is also very touristy.


Can anyone familiar with Croatia say whether GPT-3 is reasonable here? It's great at saying plausible things, but it will freely make things up if it doesn't know.


I think it really depends on what you're after and the micro location where you spend your day and night. Hvar and its town are definitely not laid back and slow; they're just very expensive, busy and loud. Korcula is fine, but I would suggest something like Vis island or, for someone appreciating very small islands, Susak or Ilovik. Having been to all of those places many times in my life, I think it mostly depends who you're with and where your apartment is. Every island can be very diverse in itself as well.

Let me explain: if you're placed in a town, it will usually be very loud on weekends and nights due to open-air concerts/performances, and sometimes even during the weekdays. So if you desire somewhere you'll get a nice sleep, that can actually be hard to come by, because every café is open the whole night, every day, entertaining a few drunks, blasting music, and not letting hundreds of people sleep with open windows (nights are hot as well). Actually it was much more bearable with covid restrictions in place, because the cafés weren't allowed to play music after IIRC 10PM. A lot of islands also have no police or very minimal presence, so there is a lot of lawlessness and locals making their own "rules".

I could write pages of this stuff, but let me write more positive things. The best way to enjoy the islands in Croatia is actually by boat, rented or otherwise (they are not expensive). There are beautiful quiet coves where you can spend your days and nights, only going for resupply in towns. You anchor in a quiet cove, and in the evening you take your tender boat to the town if you want to walk around it. Next day you sail to another island and repeat. That experience is actually awesome.

If a boat is out of the question and being in some apartment is the option, I would just advise not to be in the town center or near it, or anywhere near a popular local beach. Mostly there will be a café with blaring music, and nobody likes to sleep with earplugs. This advice holds for any island, though it's true that a lot of those islands are quite different from each other.


> While these places do have some tourist amenities, they are not as touristy as Koh Lanta

This shows, if you had any doubt about it, that it did not actually understand the question.


If search engines ever become like this there will be debates about copyright just like we have for copilot.


Maybe. If this were a couple of decades ago and this search were performed, the searcher might simply see zero results. This used to be an acceptable response from a search engine: unless there's some website that specifically contains the search term being sought, simply return zero results. Not thousands or tens of thousands of results.

What we get instead are incorrect results which are being presented as reasonable answers to the search query in the name of advertising revenue.


I remember, in the 90's, I typed "hexdit" in altavista and spent the time visiting exactly 100 websites to find exactly the hex editor I was searching for. I remember the number exactly because it was in the last position (10) of the last page and there were 10 pages.


Ugh. I distinctly remember a dinner party in 1993-94 where we talked about how amazing the web was and how impossible it was to find anything. We discussed registered keywords a la AOL, and I’m sure someone suggested search engines would eventually solve it, but I certainly wouldn’t have wagered significant money on that based on Altavista.

When I discovered Google (I think it was running at beta.stanford.edu at the time) it was just short of a miracle.


This is a big part of the problem. The incentives.

There is no blank space to fill with something useful.

Instead everything is filled with SEO driven, ad riddled, bullshit.

And creating something useful is more cost and effort than creating more SEO driven crap.

And even if you do create something useful, you’d then need to focus on SEO in order to compete.

There’s no way this problem isn’t solvable. But google seems to lack any motivation to solve it.


Around 20 years ago I was trying to intersect data sets on two machines without transferring the entire set across the wire. I was slowly, painfully trying to invent Bloom Filters, because none of my searches turned up Bloom Filters. The company folded before I could well and truly embarrass myself.
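
For anyone who hasn't met them: a Bloom filter lets machine A ship a compact bit array instead of its whole key set, and machine B can then test membership locally with no false negatives, only occasional false positives. A toy sketch in Python (the size and hashing scheme are arbitrary choices):

  import hashlib

  class BloomFilter:
      def __init__(self, size_bits=1 << 20, num_hashes=5):
          self.size = size_bits
          self.num_hashes = num_hashes
          self.bits = bytearray(size_bits // 8)

      def _positions(self, item):
          # Derive k positions from salted SHA-256 digests.
          for i in range(self.num_hashes):
              h = hashlib.sha256(f"{i}:{item}".encode()).digest()
              yield int.from_bytes(h[:8], "big") % self.size

      def add(self, item):
          for p in self._positions(item):
              self.bits[p // 8] |= 1 << (p % 8)

      def __contains__(self, item):
          return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

  # Machine A builds a filter over its keys and ships the ~128 KiB filter
  # across the wire instead of the full data set.
  bf = BloomFilter()
  for key in ("alice", "bob", "carol"):
      bf.add(key)

  # Machine B tests its own keys: matches are intersection candidates to
  # verify, since false positives are possible but false negatives are not.
  print([k for k in ("bob", "dave") if k in bf])  # ['bob']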


>having tourist amenities without being touristy. Charming, slow, cheap.

Ugh, this reminds me of the people that move from high tax places to low tax places but want all the benefits of high tax places at low cost.

Now, back to the main point, if this was searchable it wouldn't exist.

The number one complaint people have today is that popular places are too busy: too many tourists come, prices go up, and it's overrun. So what do you think is going to happen to some place that 300+ million wealthy people around the world can search in a few seconds and find out is cheap and empty? The answer is "become busy and expensive".

The only reason places like this still exist is Google can't find them.


I'm glad I wasn't the only one to cringe at the lack of self-awareness on the part of the author at the inherent paradox of wanting something highly ranked by google but also quiet, cheap, and slow.

Here's a tip for the blog author. Learn a foreign language. Then search google in that foreign language. You might be surprised. Be the change you want to see.


I learned a foreign language (English), and I search Google in that foreign language. My browser's language is set to that foreign language. My search parameters are set to only show results in English. I use Google.com, not my localized version of Google. And yet, about half the time when I search using keywords (as opposed to long expressions), there are two or three links in my language (French) in the first-page results.

YouTube is worse: I have two YT accounts in English where I only watch videos in English, and YouTube keeps recommending me videos in French that are not at all relevant to my interests (even though most recommendations in English are perfectly relevant). It's as if their AI just saw my IP and decided to recommend to me the most popular French videos of the moment.


A VPN is what you need. Or perhaps simply changing your location would work, when I visited France I didn’t get any French-language results.


> A VPN is what you need.

What I needed was to use a different search engine. DuckDuckGo gives me a toggle to include or not French results.


ah! you had the solution all along!


Here's another tip. There are quiet, cheap, slow places you can go that don't require learning another language, or even going to another country. That's not what people really want though.


"You're not in traffic, you are traffic".


I saw this happen in Ko Phangan. By pure luck I was there for the first really big Full Moon Party. Watching the beautiful sleepy little town be engulfed by 30,000 locusts was an experience I’ll never forget.

I passed back through a few months later and barely recognised the place. Wooden chalets had given way to hotel rooms, and beachside stalls to night clubs.


I have no idea what Koh Lanta is like, but the world has 1.11 million km of ice-free coastline. I'm also not sure what tourist amenities are, but you can probably build them anywhere you want.


Build with what money?

The whole point of even having tourist amenities is that there's a community that's attempting to draw tourist money to their location, and probably create a thriving economy.

An unknown tourist attraction with amenities is a failed location where the amenities are unmaintained/dilapidated, the local economy has nosedived, the young have moved out and the population is collapsing.


I'm not disagreeing with the overall point, but when you say:

> there's a community that that's attempting to draw tourist money to their location, and probably create a thriving economy

Usually that's not how it works (at least from first-hand experience). It's about a local business/mafia trying to draw tourist money while exploiting cheap labor from the local community. Let's not even get started on the scheming to privatize public beaches, environmental pollution (noise, lights, trash) and other shenanigans the tourist industry is creating to the detriment of local communities.


The comment I replied to suggested that if tourists could find good places they'd become too busy. My reply was that if one spot becomes busy, it means places like that spot are profitable, and people will build more places on the million-kilometres of global coastline.

In the context of the discussion, I don't understand your question. The money comes from the hypothetical tourists who're hypothetically making all the usual spots too busy. What point are you making?


Eh, I dunno. I feel like if there had been a human somewhere blogging or writing on a forum "I was thrilled to find this village in Croatia which really reminded me of my trip to Koh Lanta, one of my favorite places" -- then Google of 15 years ago would have found it and put it on the first page. Even if there are ten such things on the web today, I don't think it's gonna wind up in my google search. But am I overly optimistic about historical Google?


IIRC Google of 15 years ago would have prioritised high PageRank sites containing the strings "Ko Lanta" and "Croatia" which would probably have been the Ko Lanta listings page of popular travel booking websites (with "Croatia" referenced somewhere in the sidebar).

It was less likely to ignore the absence of a particular keyword but certainly wasn't better at NLP


I think that's definitely the Google of 10 years ago; 15 years ago might still have been like the parent commenter suggested. The reason being that the person's blog with travel information would be likely to have a higher PageRank than the popular travel booking sites, which might also not be SEO-optimized enough to hit that particular search.


> then Google of 15 years ago would have found it and put it on the first page.

I don't think the Google of 15 years ago would do a good job on the current age of the internet. The size of the web has increased enormously, maybe 100x, with most of the growth coming from unstructured formats like video and images, and its signal-to-noise ratio has deteriorated considerably. SEO has become more sophisticated than ever. A large fraction of useful information is now locked in unindexed walled gardens.


Surely significantly more than 100x. I'd hypothesize 10,000x at minimum from 15 years ago, at least if you measure total bits. Not sure how I could confirm that conjecture though.


However, the amount of actual, genuine content (excluding SEO listicles and so on) has definitely grown less than 100x.

In fact it could be less than 1x in many cases. People who would have blogged about their Croatia trip might now instead just post a 10-word tweet, an Instagram story (of similarly low quality and probably not publicly indexable), or a YouTube video (so non-text content)


> But am I overly optimistic about historical Google?

Well, I was about to post a similar reply. It would have taken some messing around with the search string, but I'm confident I would be able to find something like this on Google 18 years ago, and I'm certain that if it exists, I can't find it now (unless it's on some big name site).


I think you're overly optimistic about web of 15 years ago. You are more likely to end up with "CatLady05's archive Aug 2007 - list of beaches in Thailand (I'd like to visit some day!) - (blah blah) - summer trip to Venice - ..."

And then you learn that on a good day you could see Croatia from Venice, or something like that.

(I just completely made that up, don't quote me on that.)


It used to be possible to craft searches that'd find that kind of thing with Google. Not in natural language, but you could guess enough unusual words that might show up on a page with that info to zero in on it.

They're apparently paying a lot less attention to certain things (like link text - "what might someone use to describe a link to this resource, but which might not appear on the resource itself?" used to be a very fruitful way to search), and freely substituting words or dropping terms that are merely uncommon in hits (but not totally absent from 100% of results) has made this kind of thing impossible.


It's absurdly common for Google to completely ignore the first term I include in a search. It's not like I typed it just for fun!


Indeed. It was so much easier to find the stuff you were looking for a decade ago. Now it is mostly spam or (what really grinds my gears) AI-written posts that you only realise are all bollocks after reading for a while.

That being said, I remember my cousin had a blog where he was writing utter nonsense about trending topics, and he was actually able to pull enough money from AdSense to afford a pub crawl once a month.

Now of course all of this is streamlined - one button website generators with AI content and posts for given topic and self optimising for engagement and advert clicks etc.

But I see Google has thrown in the towel and no longer cares for search results.


It is not only that. This specific search is impossible, because if the search worked, those "places out of the beaten path" the author is searching for would immediately become popular and they wouldn't be true results anymore. Their not being "indexed" by Google (or the internet in general) is what makes them the way they are.


It's not difficult to find info on Ko Lanta or Croatian resorts on Google. It's not invisibility that keeps them relatively quiet, it's the sheer amount of competition.

The reason even a fairly advanced NLP system is going to struggle to find what the author considers to be the "Ko Lanta of Croatia" is that it can't read the author's mind about the qualities associated with Ko Lanta, so it's going to highlight islands, not "small, quiet resort villages with $10-$20 rooms and hammocks" (not really Croatia's "vibe", though I suspect the OP would probably like most Croatian islands anyway...). The same is going to be true even if you put in strings that match comparisons humans directly make themselves, like "Venice of Asia".


Some off the beaten path places stay off the beaten path because they're genuinely hard to get to.

https://en.wikivoyage.org/wiki/Previously_Off_the_beaten_pat...


Sort of like how I hope no one ever learns about my city/state and it can keep being pleasant and affordable.


https://www.google.com/search?q=fres%20small%20too%20many%20...

Once I lived in the city, it was too big and noisy

So I moved to the country to stop and smell the rosies

All my city friends joined me and put up nice new housies

Now it's too big and noisy, think I'll move to the country


I know that search engines are optimized to facilitate commerce, and yes, depending on the extent to which you're bought into their ecosystem, Google probably knows a lot about you, like, for instance, whether or not you've been to Koh Lanta before as a tourist, but good lord what a self-centered way (for the author of this blog post, not the parent comment) to think about this.

Like, Koh Lanta is surely not just a vacation destination, it's also a place where people live their lives, go to work and school, are homeless, deal with illness. The expectation that Google will assume that by "koh lanta of croatia" you're really asking "vacation destinations in Croatia that are similar to Koh Lanta, Thailand" really rubs me the wrong way


Can you think of a better question??


Yeah, unless there's a blogger who has the same train of thought as OP, this is not going to be findable by "Koh Lanta vibe". Quantifying it as charming, slow and cheap does lead to results: for `croatia off the beaten path slow cheap`, search result #5 is https://www.theguardian.com/travel/2021/jul/17/croatia-10-be... which has this gem

> Places where things move at a less hurried pace, where Croatian life can be savoured, where you get a flavour of what the Dalmatians call fjaka – the art of doing nothing. These islands and mainland destinations are what you want in a post-lockdown escape: peace, beauty and the chance to discover why Croatia is such an enticing country.

Now I know I want to search for fjaka -- if I want to search further because this article might just be what OP wanted.

The success for search always has been finding the keywords to search for. You can see here it took me just a few minutes.


I actually built this exact thing for myself when I was looking for property to buy. I built a VAE DNN to identify sandy white beaches, added in a bunch of data about economics and such, and built a custom McPain index (the inverse of the distance to the nearest chain restaurant (McDonalds, Burger King, Starbucks, etc) * 1000000)[1].

Fed a bunch of countries' coastlines into it, and popped out ranked beachfronts to investigate more.

1. I once went to Ao Nang, Thailand a few years after the tsunami and found a really cool chill cheap beach town and loved it. Went back a few years later and right on the main street there was a McDonalds. Across from that there was a Burger king, ~100M up the road was another McDonalds. I will never return.
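
The McPain index as described is simple enough to reproduce. A sketch under obvious assumptions (made-up restaurant coordinates, plain haversine distance; a real version would pull locations from a places API):

  from math import radians, sin, cos, asin, sqrt

  # Hypothetical chain-restaurant coordinates (lat, lon).
  CHAINS = [(8.0336, 98.8156), (7.8804, 98.3923)]

  def haversine_km(a, b):
      # Great-circle distance between two (lat, lon) points in km.
      lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
      h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
      return 2 * 6371 * asin(sqrt(h))

  def mcpain_index(beach):
      # Inverse of the distance to the nearest chain restaurant, * 1e6.
      # A high score means a McDonald's is close by: avoid.
      nearest_km = min(haversine_km(beach, c) for c in CHAINS)
      return 1_000_000 / nearest_km

  print(mcpain_index((7.6244, 99.0739)))  # roughly a Koh Lanta beach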


Sounds super useful!

Where did you get the data from? Do you still have the code somewhere?


OP is asking Google to search their feelings; it's a ridiculous expectation for sure.


There's some mix of people in here who weren't around for or forgot about when Google handled stuff like this brilliantly. In fact, one of the best ways to get search traffic was to write a quality article tackling questions like this. Google was thirsty for answers to the ever-growing number of queries it had never seen before and would shower traffic on anyone who answered them even if it was a bad answer. A good answer had staying power.


I was there for it.

It worked for a select group of toy problem domains and the answer for most high-concept searches would be noise or nothing.

The author never manages to convince me that the information they're looking for is on the Internet for Google to index in the first place. The author then asserts "don't just hit me with garbage," which is a fine assertion, except what they're interpreting as 'garbage' is the information other people making the search could actually use. Google A/B tested the hell out of "say nothing" vs. "guess something close to what the requester might care about" and the latter won out every time.


I was there too and I agree with this take.

If this data is there now, it is most likely in a Facebook post which Google isn't seeing.


What won was what _appeared_ to be something useful, at least at first glance.


The sobering realization is that a lot of the quality phpbb forums which would answer these questions, and link-juice quality posts, got shuttered a long time ago.

Google could find them on Archive but we both know that's not the business.


And HN just destroyed this search. When I tried the same search, HN is on the front page now.


I asked GPT-3 "What is a place in Croatia similar to Koh Lanta?" and the answer was "A place in Croatia similar to Koh Lanta is Hvar."


>This is a near impossible query for human beings

As is every query that we need computers for (e.g., a SELECT query that goes over millions of records).

On the other hand, expecting a search engine to find pages which at least mention both Croatia and Koh Lanta is a very, very, very low bar.

And it's going to be better than what Google is giving today.


So true. People see sci-fi representations of technology in movies and TV shows and extrapolate that to the real world, never realizing that satisfying the prerequisites for any technology to become a reality is easier said than done. It's always handwavy too: the computer will figure it out. Let the computer do it.


People want their entire worldview spoonfed to them.

It sounds like this person wants to have a philosophical discussion with a friend about this type of subjective question rather than simply receive a factual search engine response.


Part of the problem is the way that data is presented. Google indexes HTML, but wouldn't it be better if sites provided a standard interface to their [shareable] data, so that data quality and searches could improve?


They tried JSON-LD
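
For anyone who hasn't seen it: JSON-LD is schema.org vocabulary embedded in the page as structured data. A minimal illustration, generated with Python here for consistency (the field choices are illustrative, not a requirement of the standard):

  import json

  jsonld = {
      "@context": "https://schema.org",
      "@type": "TouristDestination",
      "name": "Koh Lanta",
      "touristType": "budget traveller",
      "geo": {"@type": "GeoCoordinates", "latitude": 7.65, "longitude": 99.04},
  }
  print('<script type="application/ld+json">\n'
        + json.dumps(jsonld, indent=2)
        + "\n</script>")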


"Semantic web".

Tried and failed.

How would that prevent spam?


Vectoring Words (Word Embeddings) - Computerphile

https://youtu.be/gQddtTdmG_8?t=835
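
The video's idea, condensed: words become vectors, and "vibe" similarity becomes a distance you can compute instead of a keyword you have to guess. A minimal sketch, assuming the gensim library and its downloadable GloVe vectors:

  import gensim.downloader as api

  # Downloads ~66 MB of pretrained GloVe vectors on first use.
  vecs = api.load("glove-wiki-gigaword-50")

  print(vecs.similarity("beach", "island"))     # high: related concepts
  print(vecs.similarity("beach", "keyboard"))   # low: unrelated
  print(vecs.most_similar("thailand", topn=3))  # nearest words in vector space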


That’s easy, just ask DALL-E and do an image search


Yet Google encourages this type of search, where Google operates as some sort of oracle.

What seems like an endless army of Google proponents (perhaps they are simply "status quo" proponents) discourages anyone from even attempting to think about how the web could be organised without the need for a Google.

In a library, I can do searches restricted to certain subject areas. I might only be searching certain databases that pertain to certain subjects.

Whether it could ever be feasible to search the www by "subject area" is left as a question for the reader. In any event, if it were I would search for "koh lanta" in a relevant subject area, e.g., "travel reviews". One can search for pages that contain "koh lanta" on www sites using domain names registered from the .hr registry, e.g.,

  "koh lanta" site:.hr

What if all travel sites registered names from some registry, .travelreviews, and one could search pages that contain "koh lanta" on www sites using registered .travelreviews domains?

Even if I could just obtain a list of travel review websites, I could index those sites and then search that index for reviews that contain the terms (a) "croatia", or names of various destinations in Croatia, and (b) "koh lanta". If I was really patient I could use a script that simply performed the same search on each site, either using the site's "search" option or a search engine.
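
That patient script is trivial to sketch. A toy version in Python (the site list is hypothetical, and it leans on an engine's site: operator rather than each site's own search form):

  from urllib.parse import quote

  # Hypothetical hand-curated list of travel-review sites.
  TRAVEL_REVIEW_SITES = ["example-travel-blog.hr", "example-reviews.com"]

  for site in TRAVEL_REVIEW_SITES:
      # One scoped query per site.
      q = quote(f'"koh lanta" croatia site:{site}')
      print(f"https://html.duckduckgo.com/html/?q={q}")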

I am still not sure this search even makes sense. It seems to rely on an assumption that some traveller will compare Koh Lanta with some location in Croatia. What could make more sense is to define the specific characteristics of Koh Lanta that one wants to find in a Croatian destination, then search for those. Then I might search not only travel reviews but sites that describe characteristics of Croatian travel destinations.

In any event, it is this sort of categorisation of sites that is generally missing from the www. I believe it is feasible, but "tech" companies like Google and its followers are not interested in promoting such facilities, preferring instead to pursue data collection about www users and programmatic online advertising. If Google is the "front page" of the www, then it can portray the www in the way it sees as most beneficial to Google: for example, as an endless sea of disorganised information that is impossible to utilise without Google's assistance.

I certainly do not need to search billions of web pages to find the "koh lanta of croatia". I only need to search sites with travel reviews. But I do not get to limit a search like that. An advertising company gets to decide what is "relevant" to the terms I input. _Popularity_ (potential value for advertising to a wide audience) dictates what I can and cannot see of the www through Google.

About 10 years ago I started a proof of concept that aimed to categorise a web (cf. "the" web) via non-ICANN-issued domain names issued by a non-commercial domain name registry not governed by ICANN. The idea is that the FQDN can contain 1. a subject matter description (subdomain), 2. a trademark (domain) and 3. a Nice trademark class (TLD). Searches can then be done largely based on FQDN instead of heuristics such as "page rank", popularity or other metrics designed for advertising purposes.

This new web has value to me because it is constructed from legitimate organisations that have invested in trademarks, i.e., they have a registered address and they pay their legal bills. Registering a mark suggests they have an existing or future brand to protect, a business that needs a mark to protect consumers from becoming confused about the origin of a product or service. (How many of those Chinese sellers on Amazon today have trademarks?) Ideally, this system filters out those who are not legitimate businesses. The garbage at the top of Google would not be possible. To game this system would require obtaining trademarks, not simply gaming ICANN's system of registries and registrars. (And we know ICANN itself is not a trusted steward of DNS, but a means for a select few to make huge profits from it.)

This prototype web was not meant as a replacement for "the" web, but as an experiment to separate out the commercial entities on the www from the non-commercial ones. Google and its ilk want a www where there is no distinction between commercial and non-commercial www use. All www use is surveilled for commercial purposes.

This mixing of commercial with non-commercial poses, to me, one of the biggest threats of the www we started using in the early 1990s being wiped out. In the early days, there was this idea that TLDs would represent different categories of websites. For example, ".org" would be non-commercial websites, ".edu" would be educational institutions, and so on. Things have changed. The root zone has exploded in size with countless "gTLDs", most of them purely commercial. History is showing that it is infeasible to demand that TLDs enforce some sort of rules over the contents of websites that use them. One needs "Google" to help figure out what websites are worth visiting.

The prototype web in theory allows one to find legitimate brands on the www by searching registered marks or subject matter descriptions within Nice trademark classes that are contained in FQDNs, not the contents of web pages. It is the antithesis of what one sees on Amazon today, what with a gazillion Chinese knock-offs, including Amazon's own "brands". There is no "SEO" on this new web because the contents of web pages, e.g., "backlinks", are not the basis for search.

But do not worry, the prototype web does not exist and would never work. It is a terrible idea. Google will never die. See you in the Metaverse.


>I am still not sure this search even makes sense. It seems to rely on an assumption that some traveller will compare Koh Lanta with some location in Croatia

Well exactly. If anyone ever mentioned Koh Lanta in their review of Croatian destinations, we'd hope that it would be at the top of the results.

Or that there would be no results.

Getting results that don't even mention Croatia is not the intended outcome.


Trademarks are incredibly easy and cheap to get. Spammers already pay for domains and servers. Adding the cost of a trademark doesn't make things much more difficult I'm afraid.

Also which country's trademark office would be the gate keeper? There's one in every country. I assume you mean the USA trademark database in this case, but that only covers US trademarks.


"Trademarks are incredibly easy and cheap to get. Spammers already pay for domains and servers. Adding the cost of a trademark doesn't make things much more difficult I'm afraid."

This is irrelevant to the purpose of the "prototype web". It is a different problem. The purpose of the prototype web is not "to make things much more difficult" for "spammers". Its purpose is to _separate_ the commercial web from the non-commercial web and to _categorise_ the commercial web in a way that makes web search easier.

Given that trademarks are allegedly "easy and cheap to get", akrymski could share with us how many he has registered. Surely it would be equal to the number of domain names he has registered, since the UDRP favours trademark registrants and trademark holders have the additional option of using the ACPA. It would make sense to have an "easy and cheap" trademark for each domain name, registered in every class in every territory.^1 How much would that cost?

1. Because domain names have no such limitations.

"Also which country's trademark office would be the gate keeper? There's one in every country. I assume you mean the USA trademark database in this case, but that only covers US trademarks."

It depends on the ISO-3166-1 country code TLD. If the country code is US then the USPTO would be the applicable office.


"The idea is that the FQDN can contain 1. a subject matter description (subdomain) 2. a trademark (domain) and 3. a Nice trademark class (TLD)."

Forgot about 4. an ISO-3166 country code. Thus, the FQDN becomes 1. subject matter description (subdomain), 2. trademark (subdomain) 3. Nice TM class (domain) and 4. ISO-3166 country code (TLD).


[flagged]


>The correct result is to return pages containing "Croatia" AND "Koh Lanta".

But that isn't the correct result. "Koh Lanta" is being used by the OP to describe the vibe of a place "having tourist amenities without being touristy." It isn't a common English idiom, but one they made up based on their personal experiences as a tourist in Thailand. Results which simply contain the text "Croatia" and "Koh Lanta" aren't going to reflect what they want.


I think it's deeper than misaligned expectations. Consider this line:

> Google has gotten exceedingly good at organizing everyday life. It reliably gets me news, recipes...

What I mainly hear on this site and elsewhere is people complaining about how the news is partisan rubbish, fake, and serves the "elites". And recipes. Seriously? I had a half dozen people berate me earlier for suggesting that cooking your own meals should be an enjoyable part of life.

Nobody wants news and recipes.

Here's the problem: "The Internet" was a military project that got loose. "The Web" was a solution for research scientists to exchange papers. That's all. Driven by a massively profitable industry, a solution looking for a problem expanded in expectations and took over the consciousness of generations of an entire society.

"Organising everyday life" is an inadequate appraisal. Most businesses had to be dragged screaming and kicking online. We still possess a vague, poorly thought-through ideal that somehow a connected, "technological" society is a better one, and a thirst for convenience and speed. But let's be honest, we don't really know what the net is for. It has no telos or guiding design principle other than what we overlay on it.

Military tools in the hands of "the people" tend towards insurgency and revolution. Wasn't the Internet's "killer app" the Arab Spring? The brightest moment of the Web was in its formative years, as an explosion of ordinary speech and new political power. Much since then has been a reaction to try to dampen it, domesticating netizens, or even putting the genie back in the bottle.

Maybe a reason to be optimistic, excited even, is that we're still in the infancy of the network. It's still pregnant with unimaginable possibility.


There's a lot of Google hate on HN, and much of it is deserved, but I think this particular trend of blaming Google for the prevalence of useless clickbait articles is a bit much.

The real answer to this question is that walled-garden social media took over everything. The much-pined-for "Old Google" worked because people used to actually create content on their own sites, and not just post it on walled garden monolithic social sites like Facebook or Twitter.


The reason why there's more content on walled garden social media instead of personal websites is in no small part Google's fault and not just because of Facebook and Twitter.

Google released several updates (such as "Panda") that greatly lower the rank of domains that aren't linked to by major domains they deem to be trusted, such as those with edu or gov TLDs or whatever whitelist of domains they decided to add. There have also been updates that specifically lower the rank of websites running forum or blog software. This was probably done to fight spammers, but it has effectively killed off the long tail of topics.

This is admittedly speculation since the inner workings of Google search are not released to the public. But these findings have been corroborated by others on the internet.

I used to have a blog and several forums that used to rank highly for some niche keywords and over the years the search rankings started to drop off the front page while a single reddit comment with no content mentioning the same keywords would be #1.

I'd say that Google is very much to blame for this situation as they took the easy way out to fight spammers.


> This is admittedly speculation since the inner workings of Google search are not released to the public.

No, no, Eric Schmidt openly said that "Brands is how you sort out the 'mess'" (in search). And he started prioritizing big brands in search results and disappearing everyone else.

Panda even wiped out individual webmasters or small software houses - they were adding small back links to the websites that they built for their clients, per google's OWN recommendations, for years.

Then Google turned around and suddenly penalized all of those legitimate links while 'sorting out the spam'. Millions of small businesses, developers and software houses found themselves with zero traffic within a day - hard-earned traffic coming from legitimate business clients, whose sites were affected similarly.

This caused the rise of internet marketplaces, from Amazon to Elance and Upwork, because small businesses and individuals were now totally invisible in search results and 'brands' dominated instead. A side effect was the killing off of independent content in search results, forcing everyone to post on social networks for visibility and forcing content creators out of their own blogs. This exacerbated the damage, because now everyone had to obey individual corporations' unaccountable algorithms.

So yes, Google has created this mess to a very large degree. Their nonchalant, uncaring attitude towards their users and customers, which plagues all of their products, crippled search as well. By saying "Brands is how you sort out the mess" and enforcing it top-down, one single CEO single-handedly decided the fate of millions of small businesses, blogs and professionals, in a totally unaccountable way, with no input from anyone affected.


https://www.wired.com/2008/10/google-ceo-call/

Speaking to a group of magazine executives at the Google headquarters, Google CEO Eric Schmidt said yesterday that the Internet is becoming a breeding ground for false information, reports Ad age. But trusted brands help weed through the disinformation:

"Brands are the solution, not the problem... Brands are how you sort out the cesspool."

https://battellemedia.com/archives/2008/10/cesspool_brands_e...

"Brand affinity is clearly hard wired….It is so fundamental to human existence that it’s not going away. It must have a genetic component."


I disagree. The fact that Google, along with a few other websites, has such power over the shape of web traffic absolutely re-shapes the Internet to fit whatever standards make a website rank well. It's in many ways almost exactly the same process as natural selection. Whatever works gets traffic and gets copied by other websites; whatever doesn't fades away.


I think the analogy to natural selection is apt. So is it really Google's fault? Seems like a consequence of a fitness function. And even if there were 4 competitive search engines in the world, each spammy site would optimize for one (or would even have 4 versions that each optimized for a single search engine).


If nature was a conscious entity that optimized for some goal, I think we'd say it would be at fault.

I'm sure it's not Google's intention to ruin the web, it's just something they happily accept as a by-product of their optimization strategy towards maximum value extraction via ads. Like some chicken farmers happily accepting creating chicken hell because it lowers cost. I don't think they set out to torture chickens and go "hey, I'll start a chicken farm as a cover story".


If you had multiple search engines, say four of them like in your example, then the rewards for successfully specializing in either one of them would be a fourth what it is now, meanwhile you would have to put in four times the work to maintain four versions of the site, you would have to conduct four times as much research to maintain your advantage. The economies tip to make specialization less rewarding.

Just as in nature, monoculture rewards specialization, whereas ecological diversity rewards generalization. Not that specialization doesn't exist in diverse ecosystems, it's just not as devastating.


Yes, but we can build our own query tools. For example, I'm using `$ so graphql vs sql` and similar terminal commands more and more these days.


Oh yeah, this is primarily a problem for Google. They're sort of stuck being chased by the shadow they cast.

I've had very few problems cutting through the SEO spam running an independent search engine, but then I don't shape traffic like they do.


Sorry, unrelated, but your comment hit a pet peeve of mine.

I love GraphQL, and I think it has a number of features that make it better and more usable for most APIs than a RESTful interface.

The thing I despise about GraphQL is they put "QL" in the name, so legions of software developers think that it is somehow comparable to SQL, or somehow is a generic query language for data.

GraphQL has absolutely nothing to do with SQL. Comparing "GraphQL vs SQL" is like comparing "HTML vs Java".


> somehow is a generic query language for data.

In what sense is GraphQL not a generic query language for data?


GraphQL is a format and contract for exposing APIs, just like REST is a structure for exposing APIs, just like SOAP is a format and structure for exposing APIs, just like gRPC is a format for exposing APIs.

Those are all the right analogies for comparing GraphQL. GraphQL doesn't say anything about how the underlying data is queried - at the end of the day it's just a bunch of resolver functions that you can implement however you want.
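
To make the "just resolver functions" point concrete, here's a minimal sketch assuming the ariadne Python library (any GraphQL server library looks much the same):

  # pip install ariadne
  from ariadne import QueryType, make_executable_schema, graphql_sync

  type_defs = """
      type Query {
          place(name: String!): String
      }
  """

  query = QueryType()

  @query.field("place")
  def resolve_place(_, info, name):
      # The resolver is an ordinary function: it could run SQL, call a
      # service, or read a dict. GraphQL says nothing about the backend.
      return {"Hvar": "busy party island"}.get(name, "unknown")

  schema = make_executable_schema(type_defs, query)
  ok, result = graphql_sync(schema, {"query": '{ place(name: "Hvar") }'})
  print(result["data"])  # {'place': 'busy party island'}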


Because it's not? It's got no "where" clauses or transforms built in, and anything of the sort has to be custom built. You can't query any data in the database, only what the API chooses to expose. It's anything but generic. Even pagination isn't standardized.


I disagree, because it's 100% within Google's power to purge junk websites; see Panda in 2011.

But they didn’t want to do that anymore. Why? Because of ads money.

To drive the point home, it is not that difficult for Google to have a junk score and simply phase out the junkiest of junks. But they didn’t even want to do that.


Feels like they'd be much more liable to get in trouble for anti-competitive behavior today than they were in 2011, given that they are in a very different position now compared to then.


Have you heard of the concept of antitrust regulation? This is the real thing.


If Google purged sites at their own discretion to such a degree, then HN would just be complaining about that instead.


More to this, no algorithm can accurately detect spammy websites. There is bound to be some inaccuracy and bias. Cue posts on HN lamenting that "Google purged my website".


Purging junk sites doesn't do anything for the creation of new sites and articles from real human beings sharing their lives and experiences because they want to and not as an excuse to embed affiliate links. All those real humans only post in the walled gardens.


I do wonder how things would be different if facebook posts (instagram, tiktok, etc) were exposed to Google... I'm not sure they'd be better though.


> What I dream of is Google circa 2006. A time where a search engine searched what you asked for. A time before aggressive SEO. A time before social media, when the only people on the internet had a reason to be there. A time before sticky headers and full-screen modal pop-ups asking you to subscribe to a newsletter before reading the article. A time before click-bait and subscription-only websites which tease you with a paragraph before blurring out the rest of the content.

Well, I built sort of exactly this: https://search.marginalia.nu/

It's not great, but it sure has its moments.

Should be noted, regarding Koh Lanta, that travel is one of the most aggressively SEO-spammed topics, along with pharma and online casinos. It's extremely difficult to cut through the noise and reach any sort of signal.


Social media existed way before 2006, it just wasn't so popular.

But this was indeed a great time. I remember being on a social network ca. 1999 that people used as a diary, where others could comment, give advice and meet people with similar interests. There was a timeline and posts from people you "followed"; I think it didn't have likes or ratings, though. Most people were super friendly and we were doing meetups, so pretty much everyone knew each other. At its peak there were about 20k people, and at that point the owners couldn't cope with it. With that number of people you'll find some nasty ones who will ruin it for everyone else, and that's what happened. The owners couldn't afford to host it, so they were doing crowdfunding to cover the server costs and maintenance. Some bitter people didn't like that and started making unfounded accusations of theft, or threatened to report everything to the tax man and the police. So the owners eventually closed it.


[Quoting your quote] - A time before aggressive SEO? No, just that Google was able to defeat it with PageRank (possibly one of their few innovative contributions to the search arena). Mathematically, PageRank did very well at weeding out the SEO'd websites of the time.

Point is, I don't think SEO at the time was any more or less "aggressive", per se. Just different. And clearly, the game has since escalated altogether.
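
For anyone who never looked under the hood: PageRank scores a page by the stationary probability that a random surfer lands on it, which is driven by who links to you rather than what you stuff into your own page. A toy power-iteration sketch:

  import numpy as np

  # Toy link graph: adj[i][j] = 1 means page i links to page j.
  adj = np.array([
      [0, 1, 1, 0],
      [0, 0, 1, 0],
      [1, 0, 0, 0],
      [0, 0, 1, 0],
  ], dtype=float)

  def pagerank(adj, damping=0.85, iters=50):
      n = adj.shape[0]
      out = adj.sum(axis=1, keepdims=True)
      # Column-stochastic transition matrix; dangling pages (no outlinks)
      # would jump uniformly. None exist in this toy graph.
      trans = np.where(out > 0, adj / np.where(out == 0, 1, out), 1 / n).T
      rank = np.full(n, 1 / n)
      for _ in range(iters):
          rank = (1 - damping) / n + damping * trans @ rank
      return rank

  print(pagerank(adj))  # page 2, the most linked-to, ranks highest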


> Well, I built sort of exactly this: https://search.marginalia.nu/

What's the elevator pitch? What do you actually do that is special?

From the about section:

> This is an independent DIY search engine that focuses on non-commercial content, and attempts to show you sites you perhaps weren't aware of in favor of the sort of sites you probably already knew existed.

> The software for this search engine is all custom-built, and all crawling and indexing is done in-house. The project is open source. Feel free to poke about in the source code or contribute to the development!

But that doesn't really give me much info.


The point is to surface human websites and discriminate against content mills, click funnels and e-commerce.

For one I have a sort of budget for how much javascript I will tolerate. Some is fine, like your standard wordpress config probably will fly, but not much more. I also do a personalized pagerank biased toward the blogosphere. The likelihood your website shows up in the results is directly determined by whether real humans link to the website.

It's not a Google replacement by any measure, it's meant as a complement, but if you are looking for more in-depth content on a topic than you can find elsewhere, it's often a good starting point.
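
The crawler itself is its own codebase and isn't shown here, but a "JavaScript budget" is easy to sketch as a crawl-time heuristic in Python (the thresholds are entirely made up):

  import requests
  from bs4 import BeautifulSoup

  JS_BUDGET = 50_000  # hypothetical byte budget

  def within_js_budget(url):
      html = requests.get(url, timeout=10).text
      soup = BeautifulSoup(html, "html.parser")
      inline = sum(len(s.get_text()) for s in soup.find_all("script") if not s.get("src"))
      external = len(soup.find_all("script", src=True))
      # Crude proxy: inline script bytes plus a flat penalty per external script.
      return inline + external * 10_000 <= JS_BUDGET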


I like the idea of a search engine that acts as a sort of filter for known dark patterns along with returning useful results. Limiting on javascript quantity seems like a sledgehammer approach, but I like the spirit of it all the same.


> For one I have a sort of budget for how much javascript I will tolerate. Some is fine, like your standard wordpress config probably will fly, but not much more.

What about sites built with React, etc? Like would you be filtering out cool things like Ableton's "Learning Synths"[1] or do you mostly mean third party scripts?

[1] https://learningsynths.ableton.com


Yeah, those sites go out the window entirely. It's a search engine for documents, not applications.


Why should it matter if a website is marked up on the server side or client side to a search engine?


I don't know why it matters, but enforcing a JS budget turns out to rather effectively reduce the number of problem websites to nearly none.

Do babies go out with the bathwater? Almost certainly. But when you're running a small scale search engine, you're never going to index everything anyway, so that's entirely fine.


Why should I download 50M of Javascript implementing a buggy, bespoke browsing engine to render perhaps 1k of text, when I'm using a browser that is already capable of rendering 1k of text? For example, I used to enjoy browsing the C2 wiki, but recently they decided they needed to experiment with their own Javascript-based browser, thus rendering their site unreadable unless you are using the very latest web browser (which needs to be updated every 20 minutes, mind you).


Thanks for the link! I’ve tried it on a few queries and I really really like the results so far! Cheers!


I know it’s of low value to ask, but who really gives a shit?


Sigh, I guess I should’ve clarified I meant “who cares whether or not there’s an elevator pitch, this is just something that someone thought we all needed and would be cool to build”.

FWIW I think it is cool and much needed.


I do!

(also low value, perhaps :)


Neat, I wonder if applying something like GPT-3 to understand what each page gives might help with that a little. Kind of like reversing the SEO spam. I wonder if that's something that ML tools could detect given enough input?


My guess is that doing exactly that is most of the source of Google's problems right now; meanwhile the spammers do the exact same thing, except trying to teach an ML model to successfully pose as authentic content, and what you get is the glorious mess we are in. (Even though it's just the latest battle in a long war against spam.)


Too much / too little, both are problematic I guess. This ends up becoming a war of algorithmic generation vs understanding, and devolves to effectively answering the age old question "is this spam"?

I was going to suggest that perhaps adding info about whether the site had ads, etc. might help the usability. Then I tried marginalia ... :) Perhaps we need to invest in similar detection for other antipatterns (listicles, blogspam etc.) - this is mostly what I was talking about. Less the understanding for indexing, and more the classification for user safety and authenticity of results.

But again, this is very cool.


Yeah maybe.

I don't know how useful these "hobo signs" actually are, but they did feel like a natural response to a lot of web sites justifying their use of trackers and cookies and affiliate links and whatnot with "everyone is doing this!"


Impressive! I'm genuinely curious: Where do you host the index and the DB, and how do you finance it? Hosting an index of that size on AWS is not cheap, and maintaining the server infrastructure yourself seems quite some effort.


It's literally hosted off a beefy PC in my living room. I could definitely not afford to host this on AWS or rented servers.

About 5k USD total investment including a small UPS; operational costs are $40/mo. It doesn't actually require all that much maintenance, at least not compared to the work I've put into building a search engine (including crawler and index) from scratch.


Brilliant! This is fantastic. I'm in love with it already :)


I was thinking about this recently, because at my business we're working on improving our SEO.

The big problem is that Google is the major way people discover goods and services. If you have a site that sells garden hoses, the best way to get eyeballs is to write blog posts that a potential user of garden hoses might search for. "How to install a garden hose." "What is the best garden hose."

This kind of content marketing can drive huge traffic and thus huge sales.

But only results on the first page of Google matter, and the farther down the first page you are, the less traffic you get.

Now, the 15 companies that produce garden hoses are fighting against each other to get the 10 slots in Google's first page for any given keyword.

With AI-generated articles, it's become a race to the bottom. Google rewards AI-generated, keyword-optimized blog spam. If you want to be one of the 15 companies that makes it into those 10 slots, you better believe you need to write AI-generated blogspam too. And when everyone has to do that in order to compete, all that's left is the AI-generated blogspam that has infested the modern web.

You almost have no choice as a business. Either write AI-generated blogspam that Google loves, or your competition will, and bury you in the search results.


This is a good summary of the situation.


> Google search results are significantly less helpful than they used to be.

> Anyway, forgetting that Google is bad at long tails, I search for `what is the koh lanta of croatia?`

Well, in at least one significant case Google searches are more helpful than they used to be, in that they now take into account natural language queries like "what is..." and try pretty hard to DTRT. Contrast that to "the internet of yore" that the author yearns for, where search engines - including Google - treated all words as keywords, and tended to weight them in the order they were given, so that the query would mostly look for pages containing the word "what", and "is" and "the", and then hope to find a subset containing "koh" and "lanta", and finally, almost incidentally, rank those by whether they might seem relevant to "croatia".

Incidentally, those old-style keyword searches may still provide better results in some cases. Try searching `croatia place charming slow cheap`, and see how that compares.


> and try pretty hard to DTRT

TRTD works fine though.


It's not just google.

In the past, if Mr Joe Blogs was into a passion, he would write about it on his own shoddy blog, with gif links to other blogs in a blog ring of nerdy stuff he cared about.

Nowadays, Mr Joe Blogs writes on Medium, shares on Facebook, and potentially has an outdated instagram account with 4 pictures of his passion. All buried by the algorithm, never to be found, because his passion "isn't sexy".

The closed web and their algorithms are just as much to blame.


>Nowadays, Mr Joe Blogs writes on Medium, shares on Facebook, and potentially has an outdated instagram account with 4 pictures of his passion. All buried by the algorithm, never to be found, because his passion "isn't sexy".

From what I've observed, the opposite tends to be true more often than not. Plenty of people have been able to create communities around niche interests because platforms bring something independent blogs never really did - an audience. Joe Blogs probably started a subreddit and facebook group for his passion and gets more people reading his content in a week than likely happened to stumble across his blog in a year.


Yeah but you get no independence that way. Slip up and post the wrong "badspeak" phrase of the month? Shadow banned. As long as content is sitting underneath a huge lingering banhammer, the content will never be as raw and authentic as content was back in 2006. This reinforces the echo chamber issue and also makes the internet feel contrived and soulless.


Getting banned isn't an issue for most people, at least not under the politically driven terms you're describing (just look at the kind of extremist content that flies on Twitter), but even though banning exists (and it did on the "old" web as well) there's still far more authentic content out there than ever. It just happens that the web is more complex than it was in 2006, and far more culturally diverse, and most people can be authentic on the internet now without also being edgelords or pushing the bounds of TOS.


SEO person here. If you look at any topic or vertical with a lot of competition, there is bound to be monotony. That's because SEO people are trained to study "search intent", meaning, what do people think about when they search? What questions/desires can we, as publishers, help them fulfill?

It's a good theory, but becomes circular when you consider that one of the primary ways to determine search intent is . . . seeing what does well on Google.

So if you search for, say, "customer feedback", you see a lot of general guides about strategy, offering general definitions. That's likely because the first couple pages that ranked did it, and then more players saw that and said "oh, search intent for 'customer feedback' is [whatever they are doing]."

When you see all the listicles for certain queries, that's what you're seeing. Yes, it probably reflects some desire from people to see a list of options. But it's also just SEO people saying "look, the search intent is X." It leads to everyone copying each other and reduces the incentive to try and be innovative/think outside the box with content.


> Why don’t other search engines compete on search results? It can’t be hard to do better than Google for the long tail.

Search is ridiculously hard. Everyone underestimates the forces you are up against with the SEO/spam industry. Microsoft has invested in Bing for 13 years and has no market share, and the #1 search term is "google."


It seems like we've all individually been able to find spam, and I'm guessing Google engineers that use Google for searching have the same experience. Is it just a matter of them allowing us to help train their algorithms? I'd be more than willing to report sites if I had the option.


Be careful what you wish for. AI capable of writing weird garbage nonsense content that looks truthful is coming down the pipeline, with a reward function built in for user engagement. In a world of effortless surreal content, authoritatively truthful answers will become more valuable. We're going to need encyclopedias again!


Google is not the front page of my internet.

I recently decided to create my own landing page, stored locally on disk, with a funny gif and favicon.ico, and a shortlist of my most used work and personal links (including this site!).

No ads, no tracking. Shockingly, I had to install browser extensions to override the new tab page in both Edge and Chrome. This used to be a built-in setting! Shady guys. Very shady.

Although... Not as shady as the fact that Chrome by default sends your entire browsing history to Google for analysis.


> Not as shady as the fact that Chrome by default sends your entire browsing history to Google for analysis.

Citation needed. Are you saying that, as I click around the site here, Google knows what comments threads I'm interested in?

Edit: I think they're talking about the History sync, which does indeed by default send Google every link you visit, but only if you explicitly enable it. We should get access to this data and leak porn preferences of every senator. That ought to get the laws changed quickly.


Yes. And I can prove it to you.

Settings > Sync and Google Services > Other Google Services > Make Searches and Browsing Better.

Underneath the setting it explains, "Send URLs of pages you visit to Google".


we need to expand wiretapping laws to this stuff, the disclosure should be upfront and default off


Indeed. When I discovered this setting existed, and was enabled on all my devices, it made the hairs on my neck stand on end.

I think today is the day I try moving back to Firefox again. I hope it's gotten snappier.


It's on by default, and history sync is a different setting from what I described. In fact, while checking the setting to describe it in my other reply, I discovered I had it enabled on my personal PC, although I would swear I disabled it a while back.

I would 1000000% never enable this setting. And I suspect that Goog is pulling a Zuck with "accidentally" re-enabling it and hoping nobody notices.


I'd like to set something like this up on an Android tablet.

I have Termux installed and there are a few options for running a simple webserver (e.g., via Python). I've yet to look into hosting my own homepage and directory there, though ... it would be nice and useful to do this.
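
For the webserver part, Python's built-in http.server module is already enough to serve a homepage directory (~/www below is just an example path):

    # Minimal local homepage server; visit http://127.0.0.1:8080 in the browser.
    # ~/www is an example path; binding to 127.0.0.1 keeps it off the network.
    import functools, os
    from http.server import HTTPServer, SimpleHTTPRequestHandler

    handler = functools.partial(SimpleHTTPRequestHandler,
                                directory=os.path.expanduser("~/www"))
    HTTPServer(("127.0.0.1", 8080), handler).serve_forever()

Or just run `python3 -m http.server` from inside the directory itself.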


Oh no my search history


I miss yahoo of 1996 when they were a table of contents. Practically unmaintainable which is why they gave up, but it was nice to follow a few links and find something interesting on weird subjects. No search at all in those days, but it was better for finding things.



Wow, it actually lets me discover interesting things.


Is this still active? Seems like a great idea, but the page says it's an archive and the search doesn't work. :(


They've apparently moved to https://curlie.org

Their discussion forum is https://www.resource-zone.com/forum/


[flagged]


Maybe I'm just lucky, but every link I tried was not only live but updated recently.

Though I'm still disappointed they are not updating.


“For example, have you ever noticed that the main content of most websites is something like 70% down the page? Every recipe site I’ve ever seen is like this — nobody cares about how this recipe was originally your great-grandmother’”

If anyone is curious, this is at least in part due to copyright law. A recipe is a set of facts, which is not copyrightable. But recipes with substantial literary expression (making them unique) can be copyrighted. Content creators don't want their stuff copied, so voilà…that's how you get a long-winded story before the goods.

https://copyrightalliance.org/are-recipes-cookbooks-protecte....


I see this claim repeated but is there any evidence for it? It’s not like writing a long story adds any protection for the recipe. Seems the author’s explanation is simpler (though even simpler is that the audience of those long-winded sites appreciates the content, or the authors enjoy creating it).


Surely they could put that crap after and achieve the same status?


I always thought it was more for the SEO than the copyright.


The About page of Read Something Interesting (https://readsomethinginteresting.com/about) conveys this well.


I've personally thought it's a side effect of bringing the masses to the internet.


Probably true. The early Internet had a great filter. You needed to have money, technical know-how, and a great deal of openness and vision to do anything on the Internet, especially publishing anything there. Nowadays, there is no more filter, neither for access to the Internet, nor for publishing on it.

Also, getting people to pay you money is now a proven concept on the Internet. Whereas, in the early days, not only were there few options for taking payments or for monetizing content with ads, but it also wouldn’t have been worth the effort because there just wasn’t a big enough target audience.


Among other things, Linux on the Web can be used to usher in a new paradigm for accessing online resources.

The current document-centric approach has an intensely graphical (attention grabbing) aspect which very much encourages the least common denominator types of results that the author abhors.

Those who want to be more thoughtful about these things need better toolsets to allow them to focus the way that they go about their online lives (including general search).

Google is indeed more than search. They are also the machine learning framework, Tensorflow. Linux on the Web is in as good a position as anything else to start putting the JavaScript implementation (Tensorflow.js) to very good use.

We don't need new search engines as much as we need new search engine interfaces.


this page makes a point that I wish more site developers understood: no results for my search is a much more meaningful result than changing my search to provide results I didn't ask for.


> As it happens, Google rewards websites which use keywords in their url, title, headings, and first 100 words. Just by eyeballing, we can see that this particular website is targeting the keywords “water”, “system”, “irrigation”, and “garden”. Pages like these are hyper-optimized to come up for particular searches. The stupid expository stuff exists only to pack “important keywords” into the first 100 words.

Another explanation is that the type of company that buys thousands of links is also the type that pays rock bottom for their writers. I don't think you need to necessarily write trite crap in order to do well in SEO.


FTA:

> Why don’t other search engines compete on search results? It can’t be hard to do better than Google for the long tail.

I think that's the "citation needed" that the author's missing. It's not like Google is unaware of this problem, and a huge piece of their research spend goes into improving search.

It's entirely possible nobody's doing better than Google because nobody knows how. The entire paradigm Google's leading in might be a saddle-point with significant activation energy needed to escape it while still having something as usually-useful as Google.


Except that “improving search” here means maximizing Google profits, from ads in particular, as opposed to what users would consider an improvement to search.


There are probably thousands or tens of thousands of people working on search. Some of them are interested in and interface with the business side, but I'm sure most of them have objectives just related to search. It's possible they've found and not implemented massive improvements because those are getting sat on by higher-ups who are more aware of the business and the objectives of the ads teams. If that were the case, though, you'd probably hear about it and see people leaving in droves to start competitors.


You wouldn’t hear about it at all because of NDAs. And people wouldn’t be “leaving in droves”, because they are already used to it.


Does this mean that pages with excessively complicated cookie acceptance/rejection also get ranked higher because they have a longer dwell time? Talk about a dark pattern.


90% of the content writers I have talked to during product validation use Google to research when writing financial content.

They mirrored the sentiment of the author that search results have gotten significantly worse and do not provide much auxiliary insights. The noise/hit ratio is really high.

Interestingly, Google Search works really well for Stack Overflow.


> Interestingly, Google Search works really well for Stack Overflow.

Maybe it's because people who use these kinds of websites are generally less tolerant of BS websites than the average person, so Google tends to derank the crappy ones just because we always click away instantly.


Maybe it's because Google Search is maintained by software engineers and they actually dog food their stuff in this context?


For decades now, SEOptimizers have constantly found new tricks to get their websites ranked higher than they should, while search engines have constantly updated their algorithms to make them harder to game like that.

These two things have co-evolved in response to each other, and what this article laments is the result of exactly that process.

But the interaction between SEO and Google search is way more complex than the article implies. In particular, the influence goes both ways, and it's not a clear one-way cause and effect relationship. Also playing into this are a lot of other factors, like the increasing commercialization of the web, how the online advertising business works, Google's dominant position in both web search and web advertising, etc...

I really don't think the solution to this can be as simple as "change how Google rewards keywords". It's not that simple.


We all have to get better at using search engines to get past the SEO

Google "reminds me of Koh Lanta" turns up Boracay, Philippines in the top three links

https://www.google.com/search?q=%22reminds+me+of+Koh-Lanta%2...

No one has ever used this phrase about anything in Croatia

https://www.google.com/search?q=%22reminds+me+of+Koh-Lanta%2...

Translating "reminds me of Koh Lanta" to Croatian doesn't help

https://www.google.com/search?q=%22podsje%C4%87a+me+na+Koh-L...


Google's "It looks like there aren't many great matches for your search" response probably means you're onto something interesting :)


I get that a lot. They have a really interesting animated little fishing character. It appears to be SVG animated by CSS. I need to learn how to do that.


> thus changing the problem of trustworthiness to one of authority

I wish Google would just penalize Quora. I can't ask any question on Google without seeing a bunch of Quora fake experts on the first page providing vacant answers to any question.


Even worse when it's CS related and you get fed sheer nonsense from a "Dev Guru 25YOE" like damn idk what the target demographic is for the SWE side of Quora but I sure hope to never be part of it.


When the web was younger than it is now, everything on it was a novelty. It was interesting to see because pages would appear that had never existed before. Now it's all corporate influencer recycling with ads slapped on it.


I would also ask, why is HN so monotonous?

If you compare 'new' vs what goes on the wall, you'll see a non-trivial amount of Big Tech bias and very good posts (often criticisms) effectively getting censored. Is the paycheck that good?


> and very good posts (often criticisms) effectively getting censored

I browse new occasionally and I encounter a lot of lowbrow substack spam to be honest. Rants against big tech make you a contrarian; the problem is you have to be contrarian and right to be interesting.

I don't think HN has a big tech bias, most criticism of big tech is just plain awful. If anything in today's discourse there is a stupendous anti big-anything bias (for context, I'm not paid by big tech, I work for a small German company on the other side of the pond and find much of SV culture annoying personally)


It shouldn't be a surprise that most of the content posted to a forum owned by a Silicon Valley startup incubator is related to "Big Tech."


Because to avoid trolling and arguments HN filters out controversy and most politics.

Yea, it's more monotonous, but it's far less hate filled than most forums and keeps on target far more often.


Exactly! The top "comment" right now is: "this is a clear demonstration of the user's expectation of search having exceeded what's possible today".


This comment seems pretty off-topic from the article linked to, can you provide a bit more context on how it's related other than the concept of monotony?


Cancel culture fear => Peer pressure hypersensibility => Monotonicity


For those who agree that the internet is getting worse, I recommend joining the http://www.indieweb.org movement.


The author seems to only have a superficial understanding of SEO. Yes, keywords matter, as well as inbound and outbound links, but, these days Google is also quite sophisticated about surfacing results that meet the criteria for Good Page Experience and pass the Core Web Vitals. That means responsive web sites, with highly optimized images, font delivery, minimal on-load jitter that load super fast, even on old mobile devices. All of that is good for the end user.


True, but by Google's own posts about initiatives like Core Web Vitals and mobile-first indexing, these are used as tie-breaking factors, not as primary ranking factors. If you have 2 pages judged to have equally relevant content per all of Google's other ranking signals, the one with faster Core Web Vitals will appear first. Pages with amazing Core Web Vitals but poor content ranking are not ranking more highly than relevant pages with poor web vitals. Example.com has amazing Core Web Vitals.


I do work in this area and my observation is that they're used as a lot more than tie-breaking factors. Yes, you can't have an empty page or one filled with spam; I never said that you could.


You don't need such a complex analysis, it's actually much simpler: you are mostly shown websites that try to make money from you, because those are the ones optimizing like hell to be at the front of the search results.

Any other website, where people just share information freely, where you might get different opinions, experiences, hell, _anything that is different from just trying to sell you something_ (even if it is ads), will be very far in the back.


> Mass-appeal queries are, almost by definition, not particularly hard to find. If I need a bus schedule, I know to talk to my local transit authority. If I’m looking to keep up with the Kardashians, I’m not going to have any problems (at least, no search problems.)

In identifying a 95/5 reward for supporting "mass-appeal queries", is much of the answer right there?

And will the average general-purpose alternative search engine escape similar incentives?


I saved all kinds of sites on ZIP drives in the mid-90s.

It's amazing how much stuff/information there used to be, instead of whatever it is we have today on the WWW.


We're moving towards the imaginary internet.

https://semiosis.github.io/about/

https://semiosis.github.io/posts/imaginary-internet-survival...

Better start encoding your personal mythology into code right now.


I have resorted to using Russian (Yandex) and Chinese (Baidu) search engines in the last few years. It is interesting that non commercial sources have more of a voice on these search engines than on Google today.

Google is good at many things, but their leadership is questionable now, unlike in the past.

I would pay a subscription fee for a service that provided parity with Google and customer service that answers the phone.


> It is interesting that non commercial sources have more of a voice on these search engines than on Google today.

I sort of agree, I also sometimes use Yandex and am usually positively surprised.

I can immediately see two reasons:

A) SEO efforts are targeted at Google/Bing, not at other search engines that don't focus on the target market/language/region. Of course that argument can be twisted around: these search engines don't focus on Western languages, so they could in theory be more susceptible to SEO trickery. It's interesting that they aren't.

B) Ad money. Google is strongly incentivized to prefer search results that show Google ads. For foreign search engines there is no money to be made, so they can disregard that signal.


Have you tried this one? https://neeva.com/

No idea how good it is, I remember it was posted here a while ago


Have not. I will try it now though.


Aren't those heavily censored?


Yandex censors the results mostly for Russian IPs as far as I can tell, but they stay neutral for the outside users.


Nope, just yandexed for “bucha massacre”, 2nd place is already Rashist propaganda.


Sounds like a use-case for search but - should it ever re-materialize [1]

A bazaar of boutique search indexes provided by People seems like it would be nice to have. Better than Google constantly trying to sell me shit.

[1] https://paste.sr.ht/~sircmpwn/048293268d4ed4254659c3cd6abe67...


> Google rewards websites which use keywords in their url, title, headings, and first 100 words

can't say whether this is true for sites but they definitely encourage this behavior in ads -- the ads dashboard will shame you ('poor quality ad! we may not show this' kind of prompts) if you don't obey silly rules about 'how many keywords' appear in your title.


> None of this is hard to do. It just requires someone to get started.

I'm not an expert in this area but that seems very optimistic. Creating an algorithm that gives better quality results than google for certain kinds of queries probably isn't too difficult. But scaling it up to indexing the entire web, continuously, would require a LOT of resources.


It's not all Google. The entire medium has shifted for the average user. Overfitted, even.

The prior barrier to entry for quality content was that you had to be tech savvy. Now anyone and their mom can publish shit content, play the SEO game, and spam-dunk the Internet.

Maybe the solution to this is really some kind of web3 variant which will put up some hurdles again.


As an aside, 15 years ago I spent a week in Koh Lanta. It was a lovely time and, having returned to briefly visit other parts of Thailand in the years since, I am pleasantly surprised to learn that Koh Lanta still has some of the charm from that long ago. I too look for Koh Lanta in every beach town I have visited since then.


Isn't this just getting older? You could just as well ask why is life so monotonous. There's a finite number of experiences, practically speaking, and as you go through each, subsequent versions lack novelty. This is also why time seems to speed up as we age.


Hey yeah! Imagine like two small guys in a garage could like make an indexer for the web and call it a search engine and just make it work the best, and gain market dominance and become a little bit slower to innovate than the author's “back in 2006” nostalgia…


> My last point: websites are penalized for even linking to low-ranking pages!

No they aren't.


That was my impression also. My understanding is that your page's "quality" is 1 plus the sum, over all pages that link to your page, of each linking page's quality divided by its number of outgoing links. So the quality of the pages linking to you matters, but the quality of the pages you link to doesn't; only their number does, since it determines how much quality each outgoing link passes along.
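
That recurrence is essentially the original PageRank (the real formula adds a damping factor). A toy power-iteration sketch, where the link graph and damping value are illustrative and dangling pages with no outgoing links are assumed away:

    # Toy PageRank: quality flows in from pages that link to you, split across
    # each linker's outgoing links. The quality of pages you link TO never enters.
    def pagerank(links, d=0.85, iters=50):
        pages = list(links)
        pr = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iters):
            new = {}
            for p in pages:
                inbound = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
                new[p] = (1 - d) / len(pages) + d * inbound
            pr = new
        return pr

    # Hypothetical link graph: page -> set of pages it links to
    print(pagerank({"a": {"b"}, "b": {"a", "c"}, "c": {"a"}}))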


It’s why recipe sites have a dumb story.

In the old days, you might search for something simple and the answer would be right in the listing, so you wouldn't have to click. Those are now gone too.


"If I’m looking to keep up with the Kardashians, I’m not going to have any problems (at least, no search problems.)" Touché :]


Google, facebook, walled gardens in general, and the fact that ~100% of users neither know nor care how much better things could be.


Is there a search engine out there that implements the good old page rank algorithm?

The patent expired a while ago now.


It wouldn't work if you did. Back in 2000 everyone created their own web site with a list of links to others they found interesting. Now nobody is creating that input, and so you don't get any data to rank by.


That's true but we now also have things like Reddit providing those links.


What are your best links to the dark, beautiful corners of the internet?


I disagree that it's Google's fault. I think it went from a diverse main street to a Walmart. Likely an inevitable progression of any economy if not deliberately prevented. It was likely quickened by Google, though.

"They paved paradise to put up a parking lot"


I'm glad we have GPT-3 and DALL-E 2 and all the improvements in generative ML. It means the internet will switch from wide discovery mode to narrow subscription and recommendation. This is the only way to deal with generated search spam.


I think eventually more people will realize that the modern web is terrible and they will get fed up with it. People are already building alternatives. There are several situations that could cause Google to lose a foothold as the search behemoth and drive consumers to find alternate ways to find products.

Possible catalysts include: regulation of search or advertising negatively affecting google, an unknown search competitor entering the market, another of the behemoths competing (amazon with search, bing getting their act together, etc.), antitrust suits, insert black swan event here.

I think the pendulum is also swinging towards smaller, local communities. Maybe that's just my bubble. Eventually everyone will get sick of it though, and something will change. I don't think we can predict what will replace it, but I hope it's not another big, centralized source. We've seen the negatives of that.

I don't know how to build it, but to me the ideal Google search killer would be:

1) Decentralized somehow. Hosting search indexes collectively in order to reduce the need for a single entity to host the data. Crypto has this possibility maybe. But I don't know how well it could be implemented, and whether you could have search still be "free" like Google is. How do you solve for that problem?

2) Filters out the SEO spam. Yes, this is a huge problem. My idea to fix it is manual curation. Not scalable. How do you solve that? I think something like StumbleUpon. A curated list of sites that is searchable in its own index. Perhaps there could be a trusted network of curated indexes. Members only to host the curated indexes to keep out the black hats, but free to search for the public at large.

I also think we need to educate everyday users to host their own sites on their own hardware. Making it super simple to spin up a web host in a VM that is running on a laptop in the bedroom. Or off a Raspberry Pi Zero. Or whatever is cheap and available. Turn it off when you need to and the website goes offline. Yes, but that is okay in the interim to get Joe Regular hosting a website independently. Sure it's not okay for a business, but for your hobby site about bonsai trees, why not? Of course this is against most ISPs' ToS so how do you solve for that? How do you give Joe Regular the ability to deploy a simple, secure website easily? How do you incentivize them to get out of the walled gardens? Give them a garden of their own.

That was what made the early web (as I knew it in the 90s and 00s) awesome. You were figuring it out on your own. Google/Bing/DuckDuckGo can stick around and be the digital yellow pages. They are good at that.

We need to build a new web for ourselves.


Richard, CEO of you.com here. Love your two points. That's what we're doing!

For 1) we have an open app platform that allows everybody to collaborate on that first entry of the web

For 2) We let people decide the sources they want to see and if they don't like a source or app, they can downvote it and see it less in their ranker.

We will try to decentralize more over time, though it's still a big effort to make it fast enough.


...and purple? Huh? Why is everything purple? Seriously


“Google is awful for the long tail.”

Exactly right. Big opportunity.


> like it consists of ten big sites

Optimistic. Which are those?


>websites are penalized for even linking to low-ranking pages!

So the top hundred sites link to each other and everything else gets ignored.

This is the equivalent of social media's echo chambers.


I would like to see a source for that claim though. Seems odd to me. Also most pages are low ranking. Even on say HN which is probably high ranking, the page for this comment is likely to be low ranking, because it is new and nothing is linking to it (other than the one link from another leaf-like page).


This gets even more effective once you layer in content-based ranking adjustments -- where defined types of content or viewpoints get automatically downranked. You downrank the content directly, and then the rest of the web is highly penalized for even linking to the "disapproved" content, and so that content effectively disappears from the internet, except in tiny dark corners that aren't found via search.


RSS or similar is the solution.


This SEO blog ecosystem isn't going to die anytime soon but alternatives will exist.

In the next 5 years, "Knowledge Networks" are going to become curated high-density graphs of domain-specific knowledge. Some will be paywalled, some will have ads, but you'll know how to find the information you're looking for.


Projecting.


I dig your layout


[flagged]


> if your goal is to maximize the next few quarters results and don't care if you crash the company.

Also makes sense if you are optimizing a metric and only considering the data, not where it is leading you.


I had a scenario today where I looked something up and the search result page yielded zero results. However, there were 3-4 ads above and below the text that said there were zero results.


It could be a situation where Google doesn't mean to do it, but also isn't incentivized to fix it. The people who make the most money off Google have the most time and motivation to find ways to manipulate it to protect and increase their profit. Anyone at Google who wants to fix it is incentivized to come up with new features for promotion rather than bugfix.


It's not technology. It's about culture (which is linked to the social imaginary, a kind of group consciousness operating system, hence a kind of immaterial technology itself).

Psychologically, at the core of the problem, this is what happens:

Cancel culture fear => Peer pressure hypersensibility => Monotonicity



