In search of the least viewed article on Wikipedia (colinmorris.github.io)
213 points by ot on May 27, 2022 | 68 comments



> there’s unfortunately no easy way to sort out the least viewed pages, short of a very slow linear search for the needle in the haystack

This sentence made me wonder why he didn't actually just go do it; 6M pages isn't really all that big of a data set. It turns out to be a problem of how the data is arranged. The raw files are divided by year/month/day/hour[0], and each of those hours is a gzipped file of about 500MB in size, where pages are listed like this:

en.m Alcibiades_(character) 1 0
en.m Alcibiades_DeBlanc 2 0
en.m Alcibiades_the_Schoolboy 1 0
en.m Alcide_De_Gasperi 2 0
en.m Alcide_Herveaux 1 0
en.m Alcide_Laurin 1 0
en.m Alcide_de_Gasperi 1 0
en.m Alcides_Escobar 3 0
en.m Alcimus_(mythology) 1 0

with en.m (mobile) getting a separate listing from the desktop en, and all the other language codes getting their own listings too. So just collating the data would be a huge job.

The API also doesn't offer a specific list of all pages by time, so you'd have to go and make a separate call for each of the 6M pages for a given year and then collate that data too.

[0]for example, https://dumps.wikimedia.org/other/pageviews/2019/2019-01/
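For what it's worth, the collation step itself can be sketched in a few lines. This is a minimal illustration, not a tested pipeline: it assumes each line has the four space-separated fields shown in the listing above (project code, title, view count, and a trailing field), and the `en` and `de` rows in the sample are made up for the example.

```python
from collections import Counter

def aggregate_views(lines, projects=("en", "en.m")):
    """Sum view counts per title across the given project codes."""
    totals = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        project, title, views, _trailing = parts
        if project in projects:
            totals[title] += int(views)
    return totals

# Sample rows: the en.m lines come from the listing above,
# the "en" and "de" lines are hypothetical.
sample = """\
en.m Alcibiades_DeBlanc 2 0
en.m Alcide_De_Gasperi 2 0
en Alcide_De_Gasperi 5 0
de Alcide_De_Gasperi 7 0
""".splitlines()

totals = aggregate_views(sample)
least_viewed = sorted(totals.items(), key=lambda kv: kv[1])
```

On real data you would stream each compressed file (for example with `gzip.open(path, "rt")`) and keep adding into the same counter. One caveat: a page with no views in a given file presumably has no line in it at all, so finding the true minimum would also need the full list of titles.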


MapReduce and dump everything into something like DuckDB.


Should be titled differently, all of the least viewed articles are now going to be viewed, lots. I'm really having trouble myself not visiting them. Mind you it's time some obscure moth species articles got some improvement :)


This is the interesting number paradox: the least interesting number in the world ceases to be boring once it has the interesting property of being the least interesting attached to it.

https://en.wikipedia.org/wiki/Interesting_number_paradox


Doesn't that just mean the least interesting number definitely exists but is simply unknowable?


Off the top of my head, I can think of two ways to resolve the paradox:

1. Reject the intuitive definition of "interesting"ness as being ill-formed. We need an objective, non-self-referential definition of the word interesting.

2. Accept that there are interesting or uninteresting numbers, but their status as objects of interest is tied to a specific time. For example, 1729 might have been an uninteresting number until the conversation between Ramanujan and Hardy. Several of the examples on the Wikipedia page have such temporal limits on when they were / were not interesting.


I would instead say it has more to do with “interestingness” being an impermanent quality.

As an odd extension of this idea (I don’t know of an English equivalent), if this were Spanish we would probably use “estar” instead of “ser” to affix the description.


As someone who lives and breathes social media and Google trends, "interestingness" is something that has both trend and stochastic event-driven temporality.

Example: https://trends.google.com/trends/explore?date=2022-01-01%202...

There was interest in "Ukraine" at a certain background level until events in January 2022 began an increase of interest. Then the invasion in February drove it through the roof. Interest in Google Trends peaked specifically on 24 Feb 2022. However, after the initial invasion, interest steadily and rapidly waned. By 24 Mar, interest had waned to 10% of its invasion peak. By 24 April, 5% of that peak. And 3% by 24 May.

So "Ukraine" interest is still 3-4 times its pre-war baseline (2021 trend line), but it is 1/25th to 1/33rd of peak interest near the start of the invasion.

And, due to a specific awards show incident, interest in Ukraine was eclipsed by searches on "Will Smith" centered around 27-28 Mar 2022. That interest did not peak as high as the peak interest in "Ukraine," and was even more ephemeral. By 02 Apr 2022 interest in Will Smith fell below interest in "Ukraine." And now interest in that individual has basically returned to its pre-event baseline.

Interest in topics is not only temporal, it is also geographical and social. Interest may rise (or fall) in certain countries or states or even cities. Or its locus may be based on virtual communities or other social reference groups. "Marvel" or "Star Wars" fans might be hyped by certain news, and interest can spike amongst them, while drawing only peripheral interest in other social reference groups or amongst the general public and popular culture.


No, interesting is a value judgement which is easy until it gets hard. An exact quantitative measure of interest is impossible, as what is interesting is, to a degree, a matter of opinion.


Also a paradox of any value system based on scarcity and obscurity of what is ultimately a zero-marginal-cost production function.

Cult followings of music, art, or fashion, mass-produced luxury fashion labels (where market growth is the kiss of death, though a new brand can, of course, be easily launched, and is), a secret vacation spot (where travel costs are largely the same as to any other point on Earth), etc., etc.


It's not as if 'not being visited' is some special protected status we need to preserve for posterity.


[flagged]


I’m sorry but I cannot follow what you are trying to say. It sounds like you didn’t like what softgrow said but I don’t see anything wrong with it. Mind explaining again?


That comment looks like it was generated by AI.


I noticed moths came up a couple times, a brief guess why: Wikipedia says it's "one of most speciose orders" (besides flies and beetles). But maybe it has the most pages because they are so easy to catch with a light in your backyard that it's far easier to name them all than something like parasitoid wasps [1]?

Although this same paper says that "more species of beetles (>350,000) have been described than any other order of animal, insect or otherwise"

[1] https://www.biorxiv.org/content/10.1101/274431v1.full.pdf

Just speculation...


Mother here. The examples listed in the article aren't of much use, as there's not much there you can use to identify species. If you're interested in moths, you'll visit a specialist web site, and if you aren't, you won't visit the Wikipedia pages either. The web site I use is https://www.ukmoths.org.uk, which has photographs of each (UK) species, as well as its range, food plant, and flying time.

Regarding the number of beetle species, JBS Haldane (probably) said that "God is incredibly fond of beetles" (or words to that effect).


As an aside, is mother actually the term? My first readthrough of your comment had me envisioning an incredibly knowledgeable parent with a cool hobby before I realized what you meant!


According to Wiktionary, it's been used since at least 2001. I was similarly confused.


https://en.wiktionary.org/wiki/moth-er is filed with hyphen ("moth-er"); hyphenless spelling ("mother") is listed as an alternative form.

https://en.wiktionary.org/wiki/mother does not mention that fact, though.

Nevertheless, what a nice new ambiguous word to spice up the context!


It had me rereading the parent to see when parents or kids had been mentioned.


So Wikipedia, just like the real world, is filled with staggering quantities of different types of bugs.


My brain parsed that as if you were talking about software bugs for a second.


Interestingly, two of these — one of the disambiguation pages, and one of the least-viewed non-disambiguation pages — were created by the same user, Carlossuarez46, and are both about a location in Iran.

Furthermore, the user Ruigeroeland has contributed to three of the insect pages.

So I guess these users have the distinction of having contributed to multiple of the least-seen articles on Wikipedia. (They probably also contributed to many widely-seen articles!)


This is addressed at the end of the article. Many are probably made with automated tools:

> For example, the 12-word stub Pottallinda (5 views last year) was created on 18 January 2011 by User:Ser Amantio di Nicolao, who happens to be the most active editor in all of Wikipedia (as measured by number of edits). Within 60 seconds of creating this page, the same editor also created Polmalagama, Polommana, Polpitiya, Polwatta, and dozens of other substantially identical articles.


Yeah, he’s likely using AutoWikiBrowser for this, it’s a great tool for automating MediaWiki tasks.


Since they have links to all of these least viewed articles they will soon cease to be the least viewed articles.



Also seems related to the interesting number paradox:

"the smallest uninteresting number is itself interesting because it is the smallest uninteresting number, thus producing a contradiction."

https://en.wikipedia.org/wiki/Interesting_number_paradox

Incidentally, the last line of that article is:

> The mathematician and philosopher Alex Bellos suggested in 2014 that a candidate for the lowest uninteresting number would be 247 because it was, at the time, "the lowest number not to have its own page on Wikipedia".


Perhaps resolvable by moving the self-reference to the definition, e.g. "the smallest number not interesting for any other reason than its uninterestingness"


Note that page view data is also available at https://pageviews.wmcloud.org/?project=en.wikipedia.org&plat...

(I didn't see this linked in the post. Apologies if I missed it)


And a quick way to get there is the Page information link in the sidebar. On the info page for each page[0] is a count and plot of views in the past 30 days, and at the bottom is a link to external tools like the one above.

[0] example: https://en.wikipedia.org/w/index.php?title=Weimer_Township&a...


Great sleuthing. Is there a convenient alternative algorithm that might be used for random article that would also continue to work fine as more articles are added or removed?


You can make a binary tree where each node counts the total (recursive) number of children. Updating these counts is O(log N).

To insert a node, generate a random bit string: descend the tree and when the counts are equal then take the branch corresponding to that position in the bit string. When the counts are unequal, take the branch with the smallest count.

To remove a node just remove it from the tree and update the counts up along its path to the root. This assumes that articles are added as often as they are removed.

To sample, just generate another random bit string and traverse the tree according to it.

But I agree with the sibling comment that the current non-uniform algorithm is fine for a cute “Random article” button.
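To make the counting idea concrete, here is a rough sketch of a closely related variant rather than the exact scheme above. Every node holds one article and the size of its subtree. Insertion descends into the smaller subtree (random bit on ties), as described. Sampling, however, draws a uniform rank and follows the counts, which is a swapped-in step that stays exactly uniform even if the tree drifts out of balance. Removal is left out for brevity, and all names are hypothetical.

```python
import random

class Node:
    def __init__(self, item):
        self.item = item
        self.left = None
        self.right = None
        self.size = 1  # number of nodes in this subtree, including self

def size(node):
    return node.size if node else 0

def insert(node, item, rng=random):
    """Insert item, always descending into the smaller subtree."""
    if node is None:
        return Node(item)
    node.size += 1
    ls, rs = size(node.left), size(node.right)
    # Ties are broken with a random bit, as in the description above.
    go_left = ls < rs or (ls == rs and rng.getrandbits(1) == 0)
    if go_left:
        node.left = insert(node.left, item, rng)
    else:
        node.right = insert(node.right, item, rng)
    return node

def sample(node, rng=random):
    """Return a uniformly random item by descending with a random rank."""
    k = rng.randrange(node.size)
    while True:
        ls = size(node.left)
        if k < ls:
            node = node.left
        elif k == ls:
            return node.item
        else:
            k -= ls + 1
            node = node.right

def depth(node):
    return 0 if node is None else 1 + max(depth(node.left), depth(node.right))

root = None
for i in range(100):
    root = insert(root, f"Article_{i}")
```

Because each insert goes to the smaller side, sibling subtrees never differ by more than one node, so both insertion and sampling touch only about log2(N) nodes.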


If the articles have some reasonably dense small IDs associated with them, then an easy algorithm is to simply pick a random ID between 0 and the max, check if it's a still-existing article, and repeat if not. There are plenty of distributed queue designs capable of distributing IDs to servers so they can be given to new articles, and re-using the id from a deleted article is fine to keep things denser.
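A minimal sketch of that retry loop, assuming the live articles sit in a dict keyed by integer ID (the titles and the `random_article` helper are made up for illustration):

```python
import random

def random_article(articles_by_id, max_id, rng=random):
    """Pick a uniformly random live article by retrying on missing IDs."""
    if not articles_by_id:
        raise ValueError("no articles to sample from")
    while True:
        candidate = rng.randint(0, max_id)  # inclusive on both ends
        if candidate in articles_by_id:
            return articles_by_id[candidate]

# Hypothetical ID space with holes left by deleted articles.
articles = {0: "Moth_A", 1: "Moth_B", 4: "Moth_C", 7: "Moth_D"}
pick = random_article(articles, max_id=7)
```

Every live ID is equally likely to be the first hit, so the result is uniform. The expected number of tries is (max_id + 1) divided by the number of live articles, which is why keeping the ID space dense matters.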


The database itself may internally store a denser identifier than an autoincrementing primary key, for example ctid in Postgres or, under certain conditions, rowid in SQLite. These should be fully dense after a vacuum, and rejection sampling can be used to paper over tombstones generated between vacuums.


Seems like it would be pretty easy to just maintain an alternative index of articles, with an integer ID counting from 1 to N. Then just pick a random number from 1 to N.

It could even be maintained entirely outside the Wikimedia servers, relying on database dumps.
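One way such a side index could stay gap-free under deletions is the swap-with-last trick. This is only a sketch with hypothetical names, and it numbers positions from 0 to N-1 rather than 1 to N:

```python
import random

class DenseIndex:
    """Gap-free list of titles supporting O(1) add, remove, and sample."""

    def __init__(self):
        self.titles = []     # position -> title, always contiguous
        self.position = {}   # title -> position in self.titles

    def add(self, title):
        if title in self.position:
            return
        self.position[title] = len(self.titles)
        self.titles.append(title)

    def remove(self, title):
        pos = self.position.pop(title)
        last = self.titles.pop()
        if last != title:
            # Move the former last title into the hole left behind.
            self.titles[pos] = last
            self.position[last] = pos

    def sample(self, rng=random):
        return self.titles[rng.randrange(len(self.titles))]

index = DenseIndex()
for t in ["A", "B", "C", "D"]:
    index.add(t)
index.remove("B")
```

After the removal the list is `["A", "D", "C"]`: positions are no longer stable identifiers, but every remaining title is exactly one random integer away.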


I disagree that anything is wrong with the current algorithm.

However there was a brief time period where randomPage used elasticsearch to get the random article instead.


To get a more equal spacing between the articles maybe one can make use of quasirandom numbers instead of random numbers: https://en.wikipedia.org/wiki/Low-discrepancy_sequence
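As a rough illustration of the difference, here is a sketch comparing gaps from a simple low-discrepancy sequence (multiples of the golden ratio's fractional part, taken modulo 1) against gaps from ordinary pseudorandom numbers. The choice of this particular sequence is an assumption for the example; the comment above does not name one.

```python
import math
import random

GOLDEN_FRACTION = (math.sqrt(5) - 1) / 2  # fractional part of the golden ratio

def gap_ratio(points):
    """Ratio of the largest to the smallest gap between sorted neighbors."""
    ordered = sorted(points)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return max(gaps) / min(gaps)

n = 10_000
# Quasirandom: the i-th article gets frac(i * golden fraction).
quasi = [(i * GOLDEN_FRACTION) % 1.0 for i in range(1, n + 1)]
# Pseudorandom baseline, seeded so the comparison is repeatable.
rng = random.Random(0)
uniform = [rng.random() for _ in range(n)]

quasi_ratio = gap_ratio(quasi)
uniform_ratio = gap_ratio(uniform)
```

The quasirandom gaps stay within a small constant factor of each other, while the pseudorandom ones differ by orders of magnitude. The catch is that this only holds while values are handed out in sequence; deleting articles would reopen uneven gaps.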


For the smallest change with the fairest result, I would reseed the random IDs on a schedule.


Added is easy. It’s removed that is the tricky part, though perhaps you could reuse IDs somehow.

Maybe a hashing function could work.


Why was the random function done this way? Is it not possible to index into a database table at a random position?


I assume it's for performance reasons, with a "select next greater than X" being faster than "get a random row". I'm not sure if databases even have the concept of returning a random row, or how performant it is.
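For anyone who wants to see the "next greater than X" lookup in miniature, here is a sketch using a sorted list in place of a database index. The page_random values and titles are invented, and the wrap-around to the first entry when X exceeds every value is an assumption about how the edge case might be handled.

```python
import bisect
import random

def pick_by_page_random(sorted_values, titles, x):
    """Return the title with the smallest page_random >= x, wrapping around."""
    i = bisect.bisect_left(sorted_values, x)
    return titles[i % len(titles)]

# Hypothetical page_random values, sorted ascending, with matching titles.
values = [0.10, 0.11, 0.50, 0.90]
titles = ["A", "B", "C", "D"]

counts = {t: 0 for t in titles}
rng = random.Random(0)
for _ in range(100_000):
    counts[pick_by_page_random(values, titles, rng.random())] += 1
```

Each page is chosen with probability equal to the gap between it and the next lower value, so "B", sitting only 0.01 above "A", is picked far less often than the others. The lookup itself is a single index seek, which fits the performance guess above.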


I can't think of a reason why getting an article with ID of randomIntBetween(0, count(articles)) would be slow (maybe UUID primary keys?)


Even if you used UUIDs, couldn't you just "select count(id) from articles" to get the number of articles, generate a random number between 0 and the count, then "select id from articles limit 1 offset {index}"?
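The count-then-offset idea does work. Here is a small sketch against an in-memory SQLite database, where the table layout and the fake UUID-like keys are made up for the example:

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id TEXT PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [(f"uuid-{i:04d}", f"Article_{i}") for i in range(1000)],
)

def random_article_id(conn, rng=random):
    """Count the rows, then fetch the row at a random offset."""
    (count,) = conn.execute("SELECT COUNT(*) FROM articles").fetchone()
    offset = rng.randrange(count)  # 0 <= offset < count
    row = conn.execute(
        "SELECT id FROM articles ORDER BY id LIMIT 1 OFFSET ?", (offset,)
    ).fetchone()
    return row[0]

picked = random_article_id(conn)
```

The likely drawback is cost: in most databases OFFSET steps over the skipped rows rather than jumping to them, so an average lookup walks about half the table, whereas a comparison against an indexed random column is a single seek.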


I don't find the random gap argument convincing. There are so many articles that I think the gaps are going to be very small and close enough in size as to not matter (I have not done any analysis to back this up).

Sure there will of course be some unlucky articles, but does that actually matter?


No need to speculate on how big or small the gaps may be, the article looks at the actual random gap values used on Wikipedia.

Quote: "The least viewed article in the sample, Erygia sigillata, has a page_random value of 0.500764585777. The article Katherine Hanley is right on its tail with a value of 0.500764582314, which is just 0.000000003 less, or 3e-9 in scientific notation. This is 98% smaller than the average random gap. In other words, Erygia sigillata is an extremely unlucky article as far as the “Random article” button is concerned! It’s 50 times less likely to be landed on than an average article."
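The arithmetic in that quote is easy to check. In this sketch the two page_random values are copied from the quote, while the article count of 6.5 million is an assumption taken from the rough figure used elsewhere in this thread:

```python
high = 0.500764585777   # page_random of Erygia sigillata, from the quote
low = 0.500764582314    # page_random of Katherine Hanley, from the quote
gap = high - low        # width of the interval that lands on Erygia sigillata

article_count = 6_500_000          # assumption: approximate number of articles
average_gap = 1 / article_count    # average spacing if values are uniform on [0, 1)
ratio = average_gap / gap          # how many times smaller than average the gap is

print(f"gap = {gap:.3e}, average = {average_gap:.3e}, ratio = {ratio:.1f}")
```

The exact multiple depends on the article count you plug in, but it lands in the same ballpark as the article's "50 times less likely".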


If you randomly draw z numbers from a bucket of numbers from 0 to 10000 times x, the gaps will differ because there is no mechanism that would make some number with close neighbors less or more likely to be drawn than any other remaining number.

It’s only once your supply of numbers runs low that differences will start to equalize, reaching 1 when you have exhausted your supply.

Besides, the author isn’t really making an argument. They are giving you actual data showing the differences to the next lowest number. It’s hard to argue with that.


Right, they said the absolute worst case outlier was a gap about 50x smaller than average. Which I guess depends on your views, but I would say that's not bad.

> the gaps will differ because there is no mechanism that would make some number with close neighbors less or more likely to be drawn than any other remaining number.

When a new article is inserted, there is a higher probability it will be inserted in a large gap than a small gap, so it should balance out.

I suppose you're right, I am responding to an implied criticism of the randomness method that the author didn't make. They just offered it as an explanation.


Think of it this way. The article presents a random-gap list with only five entries, and it looks terrible; the most common page occurs 9 times as often as the least common.

Now think about a random-gap list with five million entries. With that many entries, will the gaps balance out? In the best case, five of them end up in the range from 0 to 1 millionth, five of them end up in the range from 1 millionth to 2 millionths, etc. But we've already seen what it looks like to have five uniform random numbers in a range (whether it's big or small doesn't matter); the gaps tend to be really varied. So we're going to get this sort of imbalance between gap sizes (viewed as a ratio) no matter how many entries we insert.

One other thought experiment: the article mentioned a page with a gap size of one billionth. How many pages will it take for that to balance out so that page doesn't have an unusually small gap any more? How many pages does Wikipedia have?

(This is similar to the reason that infinite space packed with marbles has the same packing density as infinite space packed with bowling balls.)


The gap size is simply the difference between two (presumably) uniform random numbers, which is a distribution you can just look up.

Differences around zero are actually the most likely: https://mathworld.wolfram.com/UniformDifferenceDistribution....


That’s only the case if there are only two articles. In general you’d be looking at the difference of two order statistics (of the uniform distribution). You could probably approximate this quite well as a Poisson process though (since there are so many articles) and so have exponentially distributed gap size.
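A quick simulation makes the exponential approximation tangible. This sketch draws n uniform values, sorts them, and measures what share of the gaps are more than 50 times smaller than the average gap of 1/n. If gaps are roughly exponential with that mean, the share should be close to 1 - exp(-1/50), or about 2%. The sample size and seed are arbitrary choices for the example.

```python
import math
import random

def fraction_of_small_gaps(n, factor, seed=0):
    """Fraction of gaps that are more than `factor` times smaller than average."""
    rng = random.Random(seed)
    points = sorted(rng.random() for _ in range(n))
    gaps = [b - a for a, b in zip(points, points[1:])]
    threshold = (1.0 / n) / factor
    return sum(g < threshold for g in gaps) / len(gaps)

n = 200_000
observed = fraction_of_small_gaps(n, factor=50)
predicted = 1 - math.exp(-1 / 50)  # exponential CDF evaluated at mean / 50

print(f"observed {observed:.4f}, predicted {predicted:.4f}")
```

If that holds at Wikipedia's scale, a gap 50 times smaller than average is not a rare outlier at all: roughly one article in fifty would be at least that unlucky.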


> When a new article is inserted, there is a higher probability it will be inserted in a large gap than a small gap, so it should balance out.

That’s a good point and I’m not entirely sure why it (appears to) not work that way. Maybe it’s because that interval has a higher likelihood, but there is no preference for numbers towards the middle, that would dissect it into (roughly) equal parts?


I think it is working. The worst case was 50x less likely than the average. That sounds like a lot when stated like that, but it's really not when taken in the context of ~6.5 million articles.

A more interesting question would be what the standard deviation of gap size is, not what the worst outlier is.


When this topic came up the last time, I plotted the probability distribution. It was far from 50x, the range was several orders of magnitude, which seems to be the correct behavior. See https://gist.github.com/mormegil-cz/84d0cc34eb5f1234be8966f7...


The most interesting thing here is that all of the 6 million articles on Wikipedia have been seen by 1 person at least once. So if we take all humans and ask them to search 1 unique thing on Wikipedia, would there be any articles that would get 0 hits?


Well, presumably the person who contributed the article is guaranteed to see it, I don't think an article can exist with 0 views.


There are many articles created by bots that may never have been seen by a human.

https://wap.business-standard.com/article/pti-stories/wikipe...


Note that Lsjbot is only active on certain language editions; it doesn't edit English Wikipedia. There are far fewer bot-created articles on enwiki.


The article was about annual views, so not necessarily.


Wonder what the least viewed articles were without the "random article" function?


You'd have a big stack of bot-generated 0 or 1 view articles about moths.


Looking up “fox tossing” on Wikipedia ended up being just as barbaric as I imagined.


That page leads down an unpleasant Wikipedia rabbit hole. It's amazing to me that people thought "Goose pulling" was a pleasant pastime—tie a goose to a frame, ride towards it at a gallop, and grab it by the neck so its head comes off.

https://en.wikipedia.org/wiki/Goose_pulling


Anyone who has had significant experience with Canada geese would understand this pastime.


This is good stuff!

Besides finding hidden gems or undervalued articles, I would really like to see traction tracked. Now that rarely viewed articles are somewhat in the spotlight and get promoted, a before and after comparison would be interesting.

Fun anyway.


Do you mean pageview stats for the mentioned articles? You can add them in this tool, for example. Data for today are not collated yet, but should be visible by tomorrow.

https://pageviews.wmcloud.org/?project=en.wikipedia.org&plat...


I wouldn't equate the least viewed with unpopular. They may not show up well in search for lack of content.

I've come across plenty of articles which were low in views and only have a handful of lines of content. I would love to expand these articles, but the "culture" of Wikipedia actively works against this.


> I would love to expand these articles, but the "culture" of Wikipedia actively works against this.

Can you expand on this? I've only had positive experiences editing for Wikipedia as a newbie.


Easy answer: moths!



