> there’s unfortunately no easy way to sort out the least viewed pages, short of a very slow linear search for the needle in the haystack
So this sentence made me wonder why he didn't actually just go do it; 6M pages isn't really all that big of a data set. It turns out it's a problem of how the data is arranged. The raw files are divided by year and month into one zipped file per hour[0], each about 500MB in size, where pages are listed like this:
    en.m Alcibiades_(character) 1 0
    en.m Alcibiades_DeBlanc 2 0
    en.m Alcibiades_the_Schoolboy 1 0
    en.m Alcide_De_Gasperi 2 0
    en.m Alcide_Herveaux 1 0
    en.m Alcide_Laurin 1 0
    en.m Alcide_de_Gasperi 1 0
    en.m Alcides_Escobar 3 0
    en.m Alcimus_(mythology) 1 0
with en.m (mobile) getting a separate listing from the desktop en, and all the other language codes getting their own listings too. So just collating the data would be a huge job.
The API also doesn't offer a specific list of all pages by time, so you'd have to go and make a separate call for each of the 6M pages for a given year and then collate that data too.
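For a sense of what the collation job looks like, here's a minimal sketch (Python; the filename, and the choice to merge desktop and mobile English, are just illustrative assumptions):

    import gzip
    from collections import Counter

    # Each dump line is: <domain_code> <page_title> <view_count> <byte_count>
    # e.g. "en.m Alcides_Escobar 3 0"
    totals = Counter()
    files = ["pageviews-20190101-000000.gz"]  # really thousands of hourly files
    for path in files:
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                parts = line.split()
                if len(parts) != 4:
                    continue
                project, title, views, _size = parts
                if project in ("en", "en.m"):  # merge desktop and mobile English
                    totals[title] += int(views)

    # The least viewed titles are the tail of most_common()
    print(totals.most_common()[-10:])

And even that only covers English; every other language edition would need the same pass.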
Should be titled differently: all of the least viewed articles are now going to be viewed, a lot. I'm really having trouble myself not visiting them. Mind you, it's about time some obscure moth species articles got some improvement :)
This is the interesting number paradox: the least interesting number in the world ceases to be boring once the interesting property of being the least interesting is attached to it.
Off the top of my head, I can think of two ways to resolve the paradox:
1. Reject the intuitive definition of "interestingness" as ill-formed. We need an objective, non-self-referential definition of the word "interesting".
2. Accept that there are interesting or uninteresting numbers, but their status as objects of interest is tied to a specific time. For example, 1729 might have been an uninteresting number until the conversation between Ramanujan and Hardy. Several of the examples on the Wikipedia page have such temporal limits on when they were / were not interesting.
I would instead say it has more to do with “interestingness” being an impermanent quality.
As an odd extension of this idea (I don’t know of an English equivalent), if this were Spanish we would probably use “estar” instead of “ser” to affix the description.
As someone who lives and breathes social media and Google trends, "interestingness" is something that has both trend and stochastic event-driven temporality.
There was interest in "Ukraine" at a certain background level until events in January 2022 began an increase of interest. Then the invasion in February drove it through the roof. Interest in Google Trends peaked specifically on 24 Feb 2022. However, after the initial invasion, interest steadily and rapidly waned. By 24 Mar, interest had waned to 10% of its invasion peak. By 24 April, 5% of that peak. And 3% by 24 May.
So "Ukraine" interest is still 3-4x times its pre-war baseline (2021 trend line), but it is 1/25th to 1/33rd of peak interest near the start of the invasion.
And, due to a specific awards show incident, interest in Ukraine was eclipsed by searches on "Will Smith" centered around 27-28 Mar 2022. That interest did not peak as high as the peak interest in "Ukraine," and was even more ephemeral. By 02 Apr 2022 interest in Will Smith fell below that of interest in "Ukraine." And now interest in that individual has basically returned to its pre-stochastic-event baseline.
Interest in topics is not only temporal, it is also geographical and social. Interest may rise (or fall) in certain countries or states or even cities. Or its locus may be based on virtual communities or other social reference groups. "Marvel" or "Star Wars" fans might be hyped by certain news, and interest can spike amongst them, while the same news holds only peripheral interest for other social reference groups or for the general public and popular culture.
No, interesting is a value judgement, which is easy until it gets hard. An exact quantitative measure of interest is impossible, as what is interesting is, to a degree, a matter of opinion.
Also a paradox of any value system based on scarcity and obscurity of what is ultimately a zero-marginal-cost production function.
Cult followings in music, art, or fashion; mass-produced luxury fashion labels (where market growth is the kiss of death, though a new brand can, of course, be easily launched, and is); a secret vacation spot (where travel costs are largely the same as to any other point on Earth); etc., etc.
I’m sorry but I cannot follow what you are trying to say. It sounds like you didn’t like what softgrow said but I don’t see anything wrong with it. Mind explaining again?
I noticed moths came up a couple of times; a brief guess as to why: Wikipedia says it's "one of the most speciose orders" (besides flies and beetles). But maybe moths have the most pages because they are so easy to catch with a light in your backyard, so it's far easier to name them all than something like parasitoid wasps [1]?
Although this same paper says that "more species of beetles (>350,000) have been described than any other order of animal, insect or otherwise".
Mother here. The examples listed in the article aren't of much use, as there's not much there you can use to identify species. If you're interested in moths, you'll visit a specialist web site, and if you aren't, you won't visit the Wikipedia pages either. The web site I use is https://www.ukmoths.org.uk, which has photographs of each (UK) species, as well as its range, food plant, and flying time.
Regarding the number of beetle species, JBS Haldane (probably) said that "God is incredibly fond of beetles" (or words to that effect).
As an aside, is mother actually the term? My first readthrough of your comment had me envisioning an incredibly knowledgeable parent with a cool hobby before I realized what you meant!
Interestingly, two of these — one of the disambiguation pages, and one of the least-viewed non-disambiguation pages — were created by the same user, Carlossuarez46, and are both about a location in Iran.
Furthermore, the user Ruigeroeland has contributed to three of the insect pages.
So I guess these users have the distinction of having contributed to several of the least-seen articles on Wikipedia. (They probably also contributed to many widely-seen articles!)
This is addressed at the end of the article: many are probably made with automated tools.
> For example, the 12-word stub Pottallinda (5 views last year) was created on 18 January 2011 by User:Ser Amantio di Nicolao, who happens to be the most active editor in all of Wikipedia (as measured by number of edits). Within 60 seconds of creating this page, the same editor also created Polmalagama, Polommana, Polpitiya, Polwatta, and dozens of other substantially identical articles.
> The mathematician and philosopher Alex Bellos suggested in 2014 that a candidate for the lowest uninteresting number would be 247 because it was, at the time, "the lowest number not to have its own page on Wikipedia".
Perhaps resolvable by moving the self-reference into the definition, e.g. "the smallest number not interesting for any other reason than its uninterestingness".
And a quick way to get there is the Page information link in the sidebar. On the info page for each page[0] is a count and plot of views in the past 30 days, and at the bottom is a link to external tools like the one above.
Great sleuthing. Is there a convenient alternative algorithm that might be used for random article that would also continue to work fine as more articles are added or removed?
You can make a binary tree where each node counts the total (recursive) number of children. Updating these counts is log(N).
To insert a node, generate a random bit string and descend the tree: when the two child counts are equal, take the branch given by the next bit of the string; when they are unequal, take the branch with the smaller count.
To remove a node, just remove it from the tree and update the counts along its path up to the root. (This assumes articles are added about as often as they are removed, so the tree stays roughly balanced.)
To sample, just generate another random bit string and traverse the tree according to it.
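Putting that together, a minimal sketch in Python (my own illustration, not anything Wikipedia runs; one tweak is that sampling descends by the stored subtree counts instead of raw random bits, which keeps the draw exactly uniform even if removals unbalance the tree):

    import random

    class Node:
        def __init__(self, value):
            self.value = value
            self.left = None
            self.right = None
            self.count = 1  # nodes in this subtree, including self

        def _child_counts(self):
            lc = self.left.count if self.left else 0
            rc = self.right.count if self.right else 0
            return lc, rc

        def insert(self, value):
            # Descend toward the smaller subtree; flip a coin on ties.
            self.count += 1
            lc, rc = self._child_counts()
            if lc < rc:
                side = "left"
            elif rc < lc:
                side = "right"
            else:
                side = random.choice(("left", "right"))
            child = getattr(self, side)
            if child is None:
                setattr(self, side, Node(value))
            else:
                child.insert(value)

        def sample(self):
            # Weight each branch by its subtree count; each node ends up
            # with probability exactly 1/count of being returned.
            lc, rc = self._child_counts()
            r = random.randrange(self.count)
            if r < lc:
                return self.left.sample()
            if r < lc + rc:
                return self.right.sample()
            return self.value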
But I agree with the sibling comment that the current non-uniform algorithm is fine for a cute “Random article” button.
If the articles have some reasonably dense small IDs associated with them, then an easy algorithm is to simply pick a random ID between 0 and the max, check if it's a still-existing article, and repeat if not. There are plenty of distributed queue designs capable of distributing IDs to servers so they can be given to new articles, and re-using the id from a deleted article is fine to keep things denser.
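A sketch of that rejection loop (Python; max_id and the exists() lookup are hypothetical stand-ins for whatever the database provides):

    import random

    def random_article(max_id, exists):
        # Expected number of tries is max_id / live_count,
        # so this stays cheap as long as the IDs are kept dense.
        while True:
            candidate = random.randint(0, max_id)
            if exists(candidate):  # e.g. a primary-key lookup
                return candidate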
The database itself may internally store a denser identifier than an autoincrementing primary key, e.g. ctid in Postgres or, under certain conditions, rowid in SQLite. These should be fully dense after a vacuum, and rejection sampling can be used to paper over tombstones generated between vacuums.
Seems like it would be pretty easy to just maintain an alternative index of articles, with an integer ID counting from 1 to N. Then just pick a random number from 1 to N.
It could even be maintained entirely outside the Wikimedia servers, relying on database dumps.
I assume it's for performance reasons, a "select next greater than X" being faster than "get a random row" - not sure if databases even have the concept of returning a random row, or how performant it is.
Even if you used UUIDs, couldn't you just "select count(id) from articles" to get the number of articles, generate a random number between 0 and the count, then "select id from articles offset {index}"?
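For illustration, that approach against SQLite (hypothetical table and column names) would look like this, though the OFFSET scan is exactly the kind of cost the parent comment is worried about:

    import random
    import sqlite3

    con = sqlite3.connect("wiki.db")
    n = con.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
    idx = random.randrange(n)
    # OFFSET walks past idx rows before returning one, i.e. O(n) worst case:
    # fine for a one-off, painful for a hot "Random article" button.
    row = con.execute("SELECT id FROM articles LIMIT 1 OFFSET ?", (idx,)).fetchone()
    print(row[0])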
I don't find the random gap argument convincing. There are so many articles that I think the gaps are going to be very small and reasonably close in size, so as not to matter (I have not done any analysis to back this up).
Sure there will of course be some unlucky articles, but does that actually matter?
No need to speculate on how big or small the gaps may be, the article looks at the actual random gap values used on Wikipedia.
Quote: "The least viewed article in the sample, Erygia sigillata, has a page_random value of 0.500764585777. The article Katherine Hanley is right on its tail with a value of 0.500764582314, which is just 0.000000003 less, or 3e-9 in scientific notation. This is 98% smaller than the average random gap. In other words, Erygia sigillata is an extremely unlucky article as far as the “Random article” button is concerned! It’s 50 times less likely to be landed on than an average article."
If you randomly draw z numbers from a bucket of numbers from 0 to 10000 times x, the gaps will differ, because there is no mechanism that would make a number with close neighbors more or less likely to be drawn than any other remaining number.
It’s only once your supply of numbers runs low that the differences will start to equalize, with the gaps reaching 1 once you have exhausted the supply.
Besides, the author isn’t really making an argument. They are giving you actual data showing the differences to the next lowest number. It’s hard to argue with that.
Right, they said the absolute worst-case outlier was a gap about 50x smaller than average. Which I guess depends on your views, but I would say that's not bad.
> the gaps will differ because there is no mechanism that would make some number with close neighbors less or more likely to be drawn than any other remaining number.
When a new article is inserted, there is a higher probability it will be inserted in a large gap than a small gap, so it should balance out.
I suppose you're right; I am responding to an implied criticism of the randomness method that the author didn't make. They just offered it as an explanation.
Think of it this way. The article presents a random-gap list with only five entries, and it looks terrible; the most common page occurs 9 times as often as the least common.
Now think about a random-gap list with five million entries. With that many entries, will the gaps balance out? In the best case, five of them end up in the range from 0 to 1 millionth, five of them end up in the range from 1 millionth to 2 millionths, etc. But we've already seen what it looks like to have five uniform random numbers in a range (whether it's big or small doesn't matter); the gaps tend to be really varied. So we're going to get this sort of imbalance between gap sizes (viewed as a ratio) no matter how many entries we insert.
One other thought experiment: the article mentioned a page with a gap size of one billionth. How many pages will it take for that to balance out so that page doesn't have an unusually small gap any more? How many pages does Wikipedia have?
(This is similar to the reason that infinite space packed with marbles has the same packing density as infinite space packed with bowling balls.)
That’s only the case if there are only two articles. In general you’d be looking at the difference of two order statistics (of the uniform distribution). You could probably approximate this quite well as a Poisson process though (since there are so many articles) and so have exponentially distributed gap size.
> When a new article is inserted, there is a higher probability it will be inserted in a large gap than a small gap, so it should balance out.
That’s a good point, and I’m not entirely sure why it (apparently) doesn’t work that way. Maybe it’s because a large interval is more likely to receive the new number, but there is no preference for numbers near its middle that would dissect it into (roughly) equal parts?
I think it is working. The worst case was 50x less likely than the average. That sounds like a lot when stated like that, but it's really not when taken in the context of ~6.5 million articles.
A more interesting question would be the standard deviation of the gap sizes, not the worst outlier.
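That's easy to estimate with a quick simulation (Python sketch under the uniform-draw assumption, with 1M points standing in for ~6.5M articles):

    import random
    import statistics

    n = 1_000_000  # stand-in for ~6.5M articles; same qualitative picture
    xs = sorted(random.random() for _ in range(n))
    gaps = [b - a for a, b in zip(xs, xs[1:])]
    mean = statistics.fmean(gaps)
    sd = statistics.pstdev(gaps)
    print(f"mean {mean:.2e}  sd {sd:.2e}  min {min(gaps):.2e}  max {max(gaps):.2e}")
    # For exponentially distributed gaps the standard deviation comes out
    # close to the mean (~1e-6 here), while the minimum gap is typically
    # orders of magnitude smaller -- the "unlucky article" effect.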
The most interesting thing here is that all 6 million articles on Wikipedia have each been viewed at least once. So if we took all humans and asked each to search for one unique thing on Wikipedia, would there be any articles that would get 0 hits?
That page leads down an unpleasant Wikipedia rabbit hole. It's amazing to me that people thought "Goose pulling" was a pleasant pastime—tie a goose to a frame, ride towards it at a gallop, and grab it by the neck so its head comes off.
Besides finding hidden gems or undervalued articles, I would really like to see traction tracked. Now that rarely viewed articles are somewhat in the spotlight and get promoted, a before and after comparison would be interesting.
Do you mean pageview stats for the mentioned articles? You can add them in this, for example. Data for today aren't collated yet; they should be visible by tomorrow.
I wouldn't equate the least viewed with unpopular. They may not show up well in search for lack of content.
I've come across plenty of articles which were low in views and only have a handful of lines of content. I would love to expand these articles, but the "culture" of Wikipedia actively works against this.
[0] For example, https://dumps.wikimedia.org/other/pageviews/2019/2019-01/