This is such a clever way of sampling, kudos to the authors. Back when I was at Pew we tried to map YouTube using random walks through the API's "related videos" endpoint and it seemed like we hit a saturation point after a year, but the magnitude described here suggests there's quite a long tail that flies under the radar. Google started locking down the API almost immediately after we published our study, so I'm glad to see folks still pursuing research with good old-fashioned scraping. Our analysis was at the channel level and focused only on popular ones, but it's interesting how some of the figures on TubeStats are pretty close to what we found (e.g. language distribution): https://www.pewresearch.org/internet/2019/07/25/a-week-in-th...
Perhaps stop and reconsider such a dismissive opinion given that "you've never had this issue before" then? Or go read up a bit more on how crawlers work in 2023.
If your site is very popular and the content changes frequently, you can find yourself getting crawled at a higher frequency than you might want. Google can crawl your site with a high degree of concurrency, hitting many pages at once, which might not be great for your backend services if you're not used to that level of simultaneous traffic.
"Hammered to death" is probably hyperbole but I have worked with several clients who had to use Google's Search Console tooling[0] to rate-limit how often Googlebot crawled their site because it was indeed too much.
I have a website that gets crawled at least 50 times per second. Is that a big deal? No, not really. The site is doing around 10,000 requests per second. I mean, a popular site gets indexed a lot; your webserver should be designed for it. What tech are you using, if I may ask?
My specific case doesn't really matter (and my examples are from some years ago and of smaller clients, not my own setup).
My point was that people ideally provision capacity based on observed or expected traffic, and that crawlers can, and do, sometimes show up and exceed that capacity, with a negative effect on your customers' experience.
But you are correct that it's absolutely manageable. And telling crawlers to slow the F down is one of the tools you can use to manage it. :-)
If your site is popular and you have a problem with crawlers, use robots.txt (in particular the Crawl-delay directive).
also for less friendly crawlers a rate limiter is needed anyway :(
(Of course, the existence of such tools doesn't give carte blanche to any crawler to overload sites. But say a crawler implements some sensing based on response times: a significant load is probably needed before response times go up, which can definitely raise some eyebrows, and with autoscaling it can cost site operators a lot of money.)
I worked at a company back in 2005-2010 where we had a massive problem with Googlebot crawlers hammering our servers, stuff like 10-100x the organic traffic.
That's pre-cloud ubiquity, so scaling up meant buying servers, installing them in a data center, and paying rent for the racks. It was a fucking nightmare to deal with.
This is one of the most important parts of the EU's upcoming Digital Services Act, in my opinion. Platforms have to share data with (vetted) researchers, public interest groups and journalists.
This technique isn't new. Biologists use it to count the number of fish in a lake. (Catch 100 fish, tag them, wait a week, catch 100 fish again, count the number of tagged fish in this batch.)
That's typically the Lincoln-Petersen Estimator. You can use this type of approach to estimate the number of bugs in your code too! If reviewer A catches 4 bugs, and reviewer B catches 5 bugs, with 2 being the same, then you can estimate there are 10 total bugs in the code (7 caught, 3 uncaught) based on the Lincoln-Petersen Estimator.
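For the curious, here's a tiny Python sketch of that calculation (the reviewer numbers are just the ones from the example above, not real data):

    # Lincoln-Petersen: treat each reviewer as an independent "capture" of bugs.
    def lincoln_petersen(n1, n2, overlap):
        # n1, n2: bugs found by each reviewer; overlap: bugs found by both
        return n1 * n2 / overlap

    bugs_a, bugs_b, both = 4, 5, 2
    total = lincoln_petersen(bugs_a, bugs_b, both)   # 10.0 estimated bugs
    caught = bugs_a + bugs_b - both                  # 7 distinct bugs actually found
    print(total, caught, total - caught)             # ~3 bugs still lurking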
A similar approach is “bebugging” or fault seeding: purposely adding bugs to measure the effectiveness of your testing and to estimate how many real bugs remain. (Just don’t forget to remove the seeded bugs!)
That's not actually the technique the authors are using. Catching 100 fish would be analogous to "sample 100 YouTube videos at random", but they don't have a direct method of doing so. Instead, they're guessing possible YouTube video links at random and seeing how many resolve to videos.
In the "100 fish" example, the formula for approximating the total number of fish is:
total ~= (tagged_first_time * caught) / recaptured
(where tagged_first_time = caught = 100 in the example, and "recaptured" is the number of already-tagged fish in the second catch)
In their YouTube sampling method, the formula for approximating the total number of videos is:
total ~= (valid / tried) * 2^64
Notice that this is flipped: in the fish example the main measurement is "recaptured" (the number of fish in the second catch that were already tagged), which is in the denominator. But when counting YouTube videos, the main measurement is "valid" (the number of URLs that resolved to videos), which is in the numerator.
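To make the contrast concrete, here's a small sketch with made-up numbers (only the 2^64 ID space comes from the article; everything else is illustrative):

    # Mark-recapture: the measurement (recaptured) sits in the denominator.
    tagged_first_time, caught, recaptured = 100, 100, 10
    fish_total = tagged_first_time * caught / recaptured    # ~1,000 fish

    # Random ID probing: the measurement (valid hits) sits in the numerator.
    tried, valid = 10**12, 39                                # made-up counts
    video_total = valid / tried * 2**64                      # hit rate scaled up to the ID space

    print(f"{fish_total:.0f} fish, {video_total:.3e} videos")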
Did you understand where the 2^64 came from in their explanation btw?
I would have thought it would be (64^10)*16 according to their description of the string.
The YouTube identifiers are actually 64-bit integers encoded using URL-safe base64. Eleven base64 characters could hold 66 bits, but only 64 are used, so the 11th character carries just 4 bits; hence the limited set of possible characters in that position. (And the two counts agree: 64^10 * 16 = 2^64.)
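If it helps to see it concretely, here's a rough sketch (assuming plain base64url over a big-endian 64-bit integer, which is my guess at the scheme rather than anything YouTube documents):

    import base64, struct, secrets

    def encode_id(n):
        # 8 bytes -> 12 base64url chars, the last of which is '=' padding,
        # leaving an 11-character ID.
        return base64.urlsafe_b64encode(struct.pack(">Q", n)).decode().rstrip("=")

    # 11 chars x 6 bits = 66 bits, but only 64 bits are stored, so the final
    # character carries just 4 bits and can take only 16 values:
    assert 64**10 * 16 == 2**64

    sample = [encode_id(secrets.randbits(64)) for _ in range(1000)]
    print(sample[0])                        # an 11-character ID
    print(sorted({s[-1] for s in sample}))  # at most 16 distinct final characters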
Catching fish is theoretically not perfectly random (risk-averse fish are less likely to get selected/caught) but that's the best method in those circumstances and it's reasonable to argue that the effect is insignificant.
You make a very weak argument, and are simply assuming the conclusion.
What makes it the "best method"? Would it be better to use a seine, or a trap, or hook-and-line? How would we know if there are subpopulations that have different likelihood of capture by different methods?
To say it's "reasonable to argue that the effect is insignificant" is purely assertion. Why is it unreasonable to argue that a fish could learn from the first experience and be less likely to be captured a second time?
If what you mean is that it's better than a completely blind guess, then I'd agree. But it's not clearly the best method nor is it clearly unbiased.
Fair points. But, mark-recapture is about practicality. It's not perfect, but it's a solid compromise between accuracy and feasibility (so I mean best in these regards, to be 100% clear). Sure, different methods might skew results, but this technique is about getting a reliable estimate, not pinpoint accuracy. As for learning behavior in fish, that's considered in many studies (and many other things, like listed here: https://fishbio.com/fate-chance-encounters-mark-recapture-st... ), but overall, it doesn't hugely skew the population estimates. So, again, it's about what works best in the field, not in theory.
> You generate a five character string where one character is a dash – YouTube will autocomplete those URLs and spit out a matching video if one exists.
Won't this mess up stats though? It's like a lake monster randomly swapping an untagged fish with tagged fish as you catch them.
So as usual, the exploitative agents get to destroy the commons and come out on top.
We need to figure out how to target the malicious individuals and groups instead of getting creeped out by them to the point of destroying most of the much-praised democratization of computing. Between this and the locking down of local desktop and mobile software and hardware, we never got the promised "bicycle for the mind".
And what kind of accountability is that? An engagement algorithm is a simple thing that gives people more of what they want. It just turns out that what we want is a lot more negative than most people are willing to admit to themselves.
I would rephrase that to 'what we predictably respond to'.
You can legitimately claim that people respond in a very striking and predictable way to being set on fire, and even find ways to exploit this behavior for your benefit somehow, and it still doesn't make setting people on fire a net benefit or a service to them in any way.
Just because you can condition an intelligent organism in a certain way doesn't make that become a desirable outcome. Maybe you're identifying a doomsday switch, an exploit in the code that resists patching and bricks the machine. If you successfully do that, it's very much on you whether you make the logical leap to 'therefore we must apply this as hard as possible!'
This comment has a remarkable lack of nuance in it. That isn't even remotely close to how human motivation works. We do all kinds of things motivated by emotions that have nothing to do with "liking" it.
I don't think people "like" it as much as hate elicits a response from your brain, like it or not.
If people had perfect self-control, they wouldn't do it. IMO it's somewhat irresponsible for the algorithm makers to profit from that - it's basically selling an unregulated, heavily optimized drug. They downrank scammy content, for instance, which limits its reach - why not also downrank trolling? (Obviously because the former directly impacts profits and the latter doesn't, but still.)
Facebook's original open API was opened so that good actors could benefit from using their data. You can disagree with how it was used, but you can't disagree with the intention.
After the CA scandal, all the big companies have locked down their app data and sell ads strictly through their limited APIs only, so ad buyers have much less control than before.
It's basically saying: you couldn't behave with the open data, so now we'll only do business.
CA was about 3rd parties scraping private user data.
Companies are locking down access to public posts. This has nothing to do with CA, just with companies moving away from the open web towards vertical integration.
Companies requiring users to login to view public posts (Twitter, Instagram, Facebook, Reddit) has nothing to do with protecting user data. It's just that tech companies now want to be in control of who can view their public posts.
I'm a bit hazy on the details of the event, but the spirit still applies: there was more access to the data that wasn't 100% profit-driven. Now it's locked down, as the companies want to cover their asses and don't want another CA.
It is a little more sophisticated. They say they use an exploit where a five-character string containing a dash will get autocompleted by YouTube (I wonder why that is). That improves sampling efficiency by a factor of roughly 32,000, apparently.
This is an interesting way to attack mitigations to the German Tank Problem [0].
I expect the optimal solution is to increase the address space to prevent random samples from collecting enough data to arrive at a statistically significant conclusion. There are probably other good solutions which attempt to vary the distribution in different ways, but a truly random sample should limit that angle.
I didn't read it in the article but this hinges on it being a discrete uniform distribution. Who knows what kind of shenanigans Google did to the identifiers.
Actually the method works regardless of the distribution. It's an interesting and important feature of random sampling. Consider 1000 IDs assigned in the worst (most skewed) way: as a sequence from 0 to 999. If there are 20 videos they will have IDs 0 to 19. If you draw 500 independent random numbers between 0 and 999, each number will have 2% probability of being in the range 0 to 19. So on average you will find that 2% of your 500 attempts find an existing video. From that you conclude that 2% of the whole space of 1000 IDs are assigned. You find correctly that there are 0.02*1000 = 20 videos.
Exactly, and if I make a mistake and assume that the possible space is 0 to 3999, it still works! I'll just need a bigger sample to estimate the number of videos with the same precision. (The method does fail if I exclude valid values, e.g. assume a space of 0 to 499).
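If it helps, here's a quick simulation of exactly that worst case (sequential IDs 0-19 in a space of 1000, numbers from the comment above; the code itself is only illustrative):

    import random

    assigned = set(range(20))     # worst case: 20 videos crammed into IDs 0..19

    def estimate(n_probes, space=1000):
        hits = sum(random.randrange(space) in assigned for _ in range(n_probes))
        return hits / n_probes * space

    print(estimate(500))                   # noisy single run, centered on 20
    print(estimate(200_000))               # larger sample -> very close to 20
    print(estimate(200_000, space=4000))   # overestimating the space still converges to 20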
We assume we know the possible space for YouTube URLs, but it might not be a fair assumption.
Take phone numbers as an analogy. Without foreknowledge or without careful analysis of the distribution of phone numbers, you might assume all numbers are valid, but in fact 555-xxxx is always invalid within each area code [0]. For each set of reserved numbers our address space is that much smaller, which can skew the results of the statistics we gather from it if we don't exclude them from our original calculations.
It may be that YouTube reserves off certain address spaces (eg maybe it can't start with a 0, or maybe two visually similar values cannot be next to each other (eg I and 1), etc), which may make this sampling method (slightly less) accurate than it might otherwise appear.
Would be quite the challenge to use a skewed distribution of the address space that's skewed enough to mitigate this type of scraping while at the same time minimizing the risk of collisions.
This is exactly what springs to mind now that it has emerged that Google "conveniently" autocompletes under certain circumstances, thus making those identifiers more likely to be targeted. This completely skews the sample from the outset.
How does a random sample solve for a clustered, say, distribution? Don't the estimations rely on assumptions of continuity?
Suppose I have addresses /v=0x00 to 0xff, but I only use f0 to ff; if you assume the videos are distributed randomly then your estimates will always be skewed, no?
So I take the addressable space and apply an arbitrary filter before assigning addresses.
Equally random samples will be off by the same amount, but you don't know the sparsity that I've applied with my filter?
As long as the sampling isn't skewed, and is properly random and covers the whole space evenly, it will estimate cardinality correctly regardless of the underlying distribution.
There is no way for clustering to alter the probability of a hit or a miss. There is nowhere to "hide". The probability of a hit remains the proportion of the space which is filled.
It's important to have stats like this (and dislike counts), because YouTube is such a large and public platform that it's borderline a public utility.
from the article:
> It’s possible that YouTube will object to the existence of this resource or the methods we used to create it. Counterpoint: I believe that high level data like this should be published regularly for all large user-generated media platforms. These platforms are some of the most important parts of our digital public sphere, and we need far more information about what’s on them, who creates this content and who it reaches.
The government ought to make it a regulation that platforms must expose stats like these, so they can be collected by the statistics bureaus.
Nobody is stopping users from selecting one of the many YouTube competitors out there (eg - Twitch, Facebook, Vimeo) to host their content. We could also argue that savvy marketers/influencers use multiple hosting platforms.
YouTube's data is critical for YouTube and Google, which is basically an elaborate marketing company.
Governments should only enforce oversight on matters such as user rights and privacy, anticompetitive practices, content subject matter, etc.
Vimeo is not a viable alternative for creators who are trying to monetize their content.
With very limited exception, Vimeo imposes a 2TB/month bandwidth limit [0] on all accounts. If you exceed that limit and don’t agree to pay for your excess usage, Vimeo will shut you down.
> youtube is such a large and public platform that it borderline a public utility.
So have the big banks, large corporations, land... but they all feed off each other and the government. What we want as a community is usually quite different from what they decide to do.
I was expecting to find out how much data YouTube has, but that number wasn't present. I've used the stats to roughly calculate that the average video is 500 seconds long. Then using a bitrate of 400 KB/s and 13 billion videos, that gives us 2.7 exabytes.
I got 400KB/s from some FHD 24-30 fps videos I downloaded, but this is very approximate. YouTube will encode sections containing less perceptible information with less bitrate, and of course, videos come in all kinds of different resolutions and frame rates, with the distribution changing over the history of the site. If we assume every video is 4K with a bitrate of 1.5MB/s, that's 10 exabytes.
This estimate is low for the amount of storage YouTube needs, since it would store popular videos in multiple datacenters, in both VP9 and AV1. It's possible YouTube compresses unpopular videos or transcodes them on-demand from some other format, which would make this estimate high, but I doubt it.
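For anyone who wants to check the arithmetic, this is all it is (same assumptions as above: ~500 s average length, ~13.3 billion public videos, one copy, one encode):

    avg_seconds = 500        # rough average video length from the stats
    n_videos    = 13.3e9     # estimated number of public videos
    fhd_rate    = 400e3      # ~400 KB/s, very approximate 1080p figure
    uhd_rate    = 1.5e6      # ~1.5 MB/s if everything were 4K

    EB = 1e18                # decimal exabyte
    print(avg_seconds * n_videos * fhd_rate / EB)   # ~2.7 EB
    print(avg_seconds * n_videos * uhd_rate / EB)   # ~10 EB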
That storage number is highly likely to be off by an order of magnitude.
400 KB/s, or 3.2 Mbps as we would commonly express it in video encoding, is quite low for an original-quality upload in FHD, commonly known as 1080p.
The 4K video number is just about right for average original upload.
You then have to take into account that YouTube compresses those into at least 2 video codecs, H.264 and VP9. Each codec gets all the resolutions from 320p to 1080p or higher, depending on the original upload quality, with many popular and 4K videos also encoded in AV1 as well. Some even come in HEVC for 360° surround video. Yes, you read that right: H.265 HEVC on YouTube.
And all of that doesn't even include replication or redundancy.
I would not be surprised if the total easily exceeds 100 EB, which is 100 times the size of Dropbox (as of 2020).
I mean, it would explain the minutes-long unskippable ads you get sometimes before a video plays. There's probably an IT maintenance guy somewhere, fetching that old video tape from cold storage and mounting it for playback.
I pine for the day when "hella-" extends the SI prefixes. Sadly, they added "ronna-" and "quetta-" in 2022. Seems like I'll have to wait quite some time.
For anyone wondering, "queca" would be the normal spelling of the "profanity", although it's probably one of the milder ways to refer to having sex. "Fuck" would be "foda" and variations. Queca is more of a funny way of saying having sex, definitely not as serious as "fuck".
Hyundai Kona, on the other hand, was way more serious, and they changed it to another island name for the Portuguese market. Kona's (actual spelling "cona") closest translation would be "cunt", in the US sense in terms of seriousness, not the lighter Australian one.
> Two of the suggestions made were brontobyte (from 'brontosaurus') and hellabyte (from 'hell of a big number'). (Indeed, the Google unit converter function was already stating '1 hellabyte = 1000 yottabytes' [6].) This introduced a new driver for extending the range of SI prefixes: ensuring unofficial names did not get adopted de facto.
On one hand: just two formats? There are more, e.g. H264. And there can be multiple resolutions. On the same hand: there might be or might have been contractual obligations to always deliver certain resolutions in certain formats.
On the other hand: there might be a lot of videos with ridiculously low view counts.
On the third hand: remember that YouTube had to come up with their own transcoding chips. As they say, it's complicated.
Source: a decade ago, I knew the answer to your question and helped the people in charge of the storage bring costs down. (I found out just the other day that one of them, R.L., died this February... RIP)
For resolutions over 1080, it's only VP9 (and I guess AV1 for some videos), at least from the user perspective. 1080 and lower have H264, though. And I don't think the resolutions below 1080 are enough to matter for the estimate. They should affect it by less than 2x.
The lots of videos with low view counts are accounted for by the article. It sounds like the only ones not included are private videos, which are probably not that numerous.
I did the math on this back in 2013, based on the annual reported number of hours uploaded per minute, and came up with 375PB of content, adding 185TB/day, with a 70% annual growth rate. This does not take into account storing multiple encodes or the originals.
Do you know that for certain? I always suspected they would, so they could transcode to better formats in the future, but never found anything to confirm it.
On all of the videos I have uploaded to my YouTube channel, I have a "Download original" option. That stretches back a decade.
Granted, none of them are uncompressed 4K terabyte sized files. I haven't got my originals to do a bit-for-bit comparison. But judging by the filesizes and metadata, they are all the originals.
Google used to ask scaling questions about YouTube for some positions. They often ended up in some big-O line of questioning about syncing log data across a growing, distributed infrastructure. The result was some ridiculous big-O(f(n)) where the function was almost impossible to even describe verbally. Fun fun.
The author notes that they used "cheats". Depending on what these do the iid assumption of the samples being independent could be violated. If it is akin to snowball sampling it could have an "excessive" success rate thereby inflating the numbers.
> Jason found a couple of cheats that makes the method roughly 32,000 times as efficient, meaning our “phone call” connects lots more often
> it was discovered by Jia Zhou et. al. in 2011, and it’s far more efficient than our naïve method. (You generate a five character string where one character is a dash – YouTube will autocomplete those URLs and spit out a matching video if one exists.)
I assume the cheat is something like using the playlist API, which returns individual results for whether a video exists or not.
So you issue an API call to create a playlist with video IDs x, x+1, x+2, ..., and then when you retrieve the list, only x+2 is in it since it's the only ID that's actually assigned.
The data probably wouldn't look so clean if it were skewed. If Google were doing something interesting it probably wouldn't be skewed only by a little bit.
Admittedly, I did not read the paper linked. But my point is not about Google doing something funny. Even if we assume that IDs are truly random and uniformly distributed, that doesn't mean the sampling method is automatically iid. This problem is similar to density estimation, where rejection sampling is super inefficient but converges to the correct solution, while MCMC-type approaches might need to be run multiple times to be sure they have found the solution.
Proving that the cheats and autocomplete do not break sample independence and keep the sampling as random as possible would be needed here, for stats beginners such as me!
Drunk dialing but having a human operator that each time tries to help you connect with someone, even if you mistyped the number... Doesn't look random to me.
However, I did not read the 85-page paper... Maybe it's addressed there.
> By constructing a search query that joins together 32 randomly generated identifiers using the OR operator, the efficiency of each search increases by a factor of 32. To further increase search efficiency, randomly generated identifiers can take advantage of case insensitivity in YouTube’s search engine. A search for either "DQW4W9WGXCQ” or “dqw4w9wgxcq” will return an extant video with the ID “dQw4w9WgXcQ”. In effect, YouTube will search for every upper- and lowercase permutation of the search query, returning all matches. Each alphabetical character in positions 1 to 10 increases search efficiency by a factor of 2. Video identifiers with only alphabetical characters in positions 1 to 10 (valid characters for position 11 do not benefit from case-insensitivity) will maximize search efficiency, increasing search efficiency by a factor of 1024. By constructing search queries with 32 randomly generated alphabetical identifiers, each search can effectively search 32,768 valid video identifiers.
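To make that arithmetic concrete, here's a rough sketch of how the factor comes together (the letters-only ID generation and the 16 valid final characters are my reading of the quoted description, not the authors' actual code):

    import random, string

    # Assumed set of valid final characters (base64 values whose low 2 bits are 0).
    LAST_CHARS = "AEIMQUYcgkosw048"

    def letters_only_id():
        # Letters in positions 1-10, so a case-insensitive search covers
        # 2^10 = 1024 spellings of each ID.
        return "".join(random.choice(string.ascii_letters) for _ in range(10)) \
               + random.choice(LAST_CHARS)

    ids = [letters_only_id() for _ in range(32)]
    query = " OR ".join(f'"{i}"' for i in ids)   # 32 IDs per search query
    print(query[:60], "...")
    print(32 * 2**10)                            # 32,768 IDs effectively checked per search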
They also mention some caveats to this method, namely, that it only includes publicly listed videos:
> As our method uses YouTube search, our random set only includes public videos. While an alternative brute force method, involving entering video IDs directly without the case sensitivity shortcut that requires the search engine, would include unlisted videos, too, it still would not include private videos. If our method did include unlisted videos, we would have omitted them for ethical reasons anyway to respect users’ privacy through obscurity (Selinger & Hartzog, 2018). In addition to this limitation, there are considerations inherent in our use of the case insensitivity shortcut, which trusts the YouTube search engine to provide all matching results, and which oversamples IDs with letters, rather than numbers or symbols, in their first ten characters. We do not believe these factors meaningfully affect the quality of our data, and as noted above a more direct “brute force” method - even for the purpose of generating a purely random sample to compare to our sample - would not be computationally realistic.
That's very clever. Presumably the video ID in the URL is case-sensitive, but then YouTube went out of their way to index a video's ID for text search, which made this possible.
Good observation, but they also acknowledge:
> there are considerations inherent in our use of the case insensitivity shortcut, which trusts the YouTube search engine to provide all matching results, and which oversamples IDs with letters, rather than numbers or symbols, in their first ten characters. We do not believe these factors meaningfully affect the quality of our data, and as noted above a more direct “brute force” method - even for the purpose of generating a purely random sample
In short, I do believe that the sample is valuable, but it is not a true random sample in the spirit in which the post is written; there is a heuristic to get "more hits".
This is a fun dataset. The paper leaves a slight misimpression about channel statistics: IIUC, they do not reweight for sampling propensity when looking at subscriber counts (channels should be weighted by ~1/(# of videos per channel), since the probability of a given channel appearing is proportional to the number of public videos that channel has, as long as the sample is a small fraction of the population).
I skimmed through the article but that's a lot of assumptions there if so.
1. So let's say that possible range of values is true (10 characters of specific range + 1). That would represent one big circle of possible area where videos might be.
2. Distribution of identifiers (valid videos) is everything. If YouTube applied some constraints (or skewing) to IDs that we don't know about, then actual existing video IDs might be a small(er) circle within that bigger circle of possibilities and not equally dispersed throughout, or there might be clumping or whatever... So you'd need to sample the space by throwing darts in a way that gets a silhouette of their skew, or to see if it's random-ish by, I don't know, let's say a Poisson distribution.
Only then one could estimate the size. So is this what they're doing?
I see what you did there. So basically a hit proportion would be overlapping hits divided by samples run, and then an estimated total would be this proportion multiplied by the total space of possibilities. That would work.
It’s really easy for them to block this method: return a random video for a certain percent of non existing identifiers. Throw in a bit of random for good measure
You could keep that assumption and just serve a random video if some external IP dials a non-existent ID.
Presumably, no one would do that except researchers trying to count videos (or randomly find hidden ones?).
You could break the assumption of uniqueness (if an unassigned ID is later assigned) by doing that internally. Not sure that'd be common, but uniqueness doesn't have to be strict for unassigned IDs, and you never actually use the fake ID.
This is easy to detect, though, at least for public videos. Click through to the channel on a different IP and find the video link, or search for the video's title and description, or find a canonical link to the video by any other means. If the IDs don't match, it's been faked.
You misunderstand: this technique was executed by using YouTube _search_ to find videos, not by querying the exact URL. They can doctor search results however they like.
Though if they didn’t say they were doing that we wouldn’t know the method was invalid. Further, that other video would have its own existing uid, so in theory we could know if they’d duplicated it to thwart these measures.
Are the video IDs sequential in the available domain? or just all over the place? Is there anything in common with all known live video IDs that could make it easier to scan the quintillion possibilities?
They could use a random redirect function. Throw the user a video URL which is random, but then redirects to the video that you want. So one could never count the space?
The paper says they made 1.8e10 attempts to produce 1e4 videos. TFA says they now have 2.5e4 videos, so they’ve made >4e10 attempts so far. No way in hell you can scrape YouTube like that without buying access to a big proxy IP pool (e.g. Luminati).
Does that sampling function assume every "area code" contains the same number of usable numbers? In the case of some big sites out there (Twitter, etc.), it's common to have certain shards be much less dense when they hold data that's more requested - e.g. there'd be fewer numbers in the area code Justin Bieber is in, etc. That might skew things considerably.
Yeah that was my immediate response - what makes them think that videos are actually distributed randomly across the space? I would assume that the numbers are more like a UUID which has random sections but also meaningful sections as well, so it is quite likely that certain subsets have skewed or more/less density.
Their estimator does not require any assumption of uniformity. Notice that P(random ID is valid) = total_valid_IDs / total_IDs is the same regardless of the distribution of valid IDs.
I don't think this is correct. In order to generate a random number at all you first have to assume a distribution. If they generate a series of random ids, and those random ids are uniformly distributed across the space of all possible ids - while the valid ids have clusters - then won't their method still give a skewed estimate?
Nope. Let's take a small example. We have 20 bins. You are going to put 3 things in those bins, with each thing getting its own bin.
I'm going to roll a d20, which gives me a uniformly distributed integer from 1 through 20. I'll look into the bin with that number and if there is something there take it.
You don't want me to take any of your things. How can you distribute your 3 items among those 20 bins to minimize the chances that I will take one of your items?
If I were not using a uniformly distributed integer from 1 through 20 and you knew that and knew something about the distribution you could pick bins that my loaded d20 is less likely to choose.
But since my d20 is not loaded, each bin has a 1/20 chance of being the one I try to steal from. Your placement of an item has no effect on the probability that I will get it.
It works the same the other way around. If you place the items using a uniform distribution, then it doesn't matter if I use a loaded d20, or even just always pick bin 1. I'll have the same chance of getting an item no matter how I generate my pick.
In general when you have two parties each picking a number from a given space, if one of the parties is picking uniformly then nothing the other party can do affects the probability of both picking the same number.
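A tiny simulation of that bins example (20 bins, 3 items, uniform picker, as described above) shows the hit rate pinned at 3/20 regardless of placement:

    import random

    BINS, TRIALS = 20, 200_000

    def hit_rate(item_bins):
        # Uniform "d20" pick each trial; count how often it lands on an occupied bin.
        hits = sum(random.randrange(BINS) in item_bins for _ in range(TRIALS))
        return hits / TRIALS

    print(hit_rate({0, 1, 2}))                           # clustered placement -> ~0.15
    print(hit_rate(set(random.sample(range(BINS), 3))))  # random placement    -> ~0.15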
> Let's take a small example. We have 20 bins. You are going to put 3 things in those bins, with each thing getting its own bin.
Now imagine a number of these bins _can hold zero things_ (not 3). E.g. in a world where all bins are the same size, you can always steal 3 things from the bins, whereas in a world where the bin sizes vary, you'd hit a few bins which are guaranteed empty. Doesn't this directly affect the probabilities?
The key here is that the query sample is uniformly distributed, and that this is sufficient. I think some other comments in this thread give some decent intuitions why this is true. Cheers
I'm not sure they've sampled enough videos to be able to make that kind of central limit theorem assertion. A trillionth of a percent is an awfully small sample of the total space.
How did you get the trillionth of a percent? The total space is 2^64 and they found 24,964 videos. Based on their estimated number of videos (13,325,821,970) we can infer[1] that they have made the equivalent of a sample of size 3.46e13 (I say equivalent because of the "cheating" they mention in the article, which improves the efficiency of their method, so their actual number of attempts would be about 32,000 times smaller, I guess). Anyway, as shown in the link below, that's a good sample size since it gives a precision of about 0.6%.
Wait a second. If the IDs are all allocated in a contiguous block, and the author samples the whole space at random—then the estimate will converge to the correct value. If, on the other hand, the IDs are allocated at random, the estimate will also converge to the correct value.
If you believe that there is some structure in YouTube video IDs, that would have no effect on this experiment. It would just reduce the fraction of the total address space that YouTube can use. This is a well-known property of "impure" names, and it means there is a good chance that the IDs have no structure. In other words, the video IDs would be "pure" names.
Is 32,000 a good enough number to estimate the entirety of YouTube's video space? It feels too small for what they are trying to accomplish (especially when they start doing year-by-year analysis).
32000 is just the "cheat factor" by which they increase the method's efficiency.
I'm not sure how much the "cheating" would affect the precision of the result. But assuming it has no effect, it's easy to estimate this precision:
They found X = 24964 videos in a search space of size S = 2^64. For the number of existing videos they report the estimate N = 13,325,821,970. From this we can find their estimate for the probability that a particular ID links to a video: p = N / S ≈ 7.22e-10. So the equivalent number of IDs that they have checked (the number of checks without cheating that would give the same information) is n = X / p ≈ 3.46e13.
Since X is a Binomial, its variance is Var(X)=n⋅P(1-P) (where P is the real proportion corresponding to the estimate p above). And N = X⋅S/n so its variance is Var(X)⋅S^2/n^2. The standard deviation of N is thus σ = S⋅sqrt(P⋅(1-P)/n). Now we don't know P but we can use our estimate p instead to find an estimate of σ!
We find that the standard deviation of their estimator for the number of YouTube videos is approximately S⋅sqrt(p⋅(1-p)/n) ≈ 8.43e7. That's just 0.633% of N so their estimate is quite precise.
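Plugging the numbers in (this just reproduces the figures above):

    from math import sqrt

    S = 2**64                # size of the ID space
    X = 24_964               # videos found
    N = 13_325_821_970       # their estimate of the total number of videos

    p = N / S                         # ~7.22e-10, estimated hit probability per ID
    n = X / p                         # ~3.46e13, equivalent number of IDs checked
    sigma = S * sqrt(p * (1 - p) / n)

    print(f"sigma = {sigma:.3e}  ({sigma / N:.3%} of N)")   # ~8.4e7, ~0.63% of N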
When you're estimating a ratio between two outcomes, the rule of thumb is that you want at least 10-100 samples of each outcome, depending on how much precision you want.
They got 10,000 samples of hits, and a huge number of samples of misses. Their result should be very accurate. (32,000 was a different number)
Anything random-ish will make this estimation method work. On either youtube's side or the measuring side.
Why would they not be random? Nobody has ever found a pattern that I'm aware of, and there are pretty solid claims of past PRNG use. And a leak of the PRNG seed was likely why they mass-privated all unlisted videos a couple years ago.
The beauty of this method is that it doesn't matter. Even if YouTube generated sequential IDs, the researchers could still sample them fairly by testing random numbers.
It would be even cooler if it had a deeper view of categories. Right now the biggest category by far is People & Blogs but it's possible to get much more information if it was broken down into sub-categories.
Turns out there are 2^64 possible YouTube addresses, an enormous number: 18.4 quintillion. There are lots of YouTube videos, but not that many.
Let’s guess for a moment that there are 1 billion YouTube videos – if you picked URLs at random, you’d only get a valid address roughly once every 18.4 billion tries.
I mean, sure, they did reduce that a little:
Jason found a couple of cheats that makes the method roughly 32,000 times as efficient, meaning our “phone call” connects lots more often.
With that in mind, how many attempts did they make to get a hit every three minutes?
It's surprising that they weren't throttled back for making excessive requests.
Finding random videos is fairly easy; these researchers are doing it the dumb brute-force way:
1. Block any famous accounts (1M+ views): these poison your recommendations.
2. Like only old and lowest-view videos.
3. Only click on lowest-viewcount videos that are not marked as recent.
In a few days your home feed and recommendations will consist of random low-ranked junk. Then start blocking the highest-view channels and repeat steps 1-3 to get even more obscure content. No need to scrape millions of IDs or flood the servers with random requests.
The fact that reports on "misinformation" don't look at the denominator when considering the volume of impressions is a great example of selectively reporting statistics to support a preconceived notion.
Not to mention bot farms. The most-viewed source of “misinformation” in the linked 2020 report simply… doesn’t exist anymore. Have we just been hearing about crappy sites buying views these past few years?
The videos they find also show view counts. This allows them to estimate (very roughly!) the view counts across all videos, because it allows them to see the distribution.
The issue with that is if one video or a handful of videos account for a significant portion of all YouTube views. Most likely they will not be in your sample, which could lead to a big underestimate.
It should be obvious, but "misinformation" is an arbitrary political designation and therefore a constantly moving target. "You should wear an N95 mask to avoid getting COVID" was misinformation in March 2020 but not 3 months later. "Vaccines may not prevent COVID transmission" was misinformation from 2020 until sometime in 2022. "COVID infection may confer significant immunity from future infection" was misinformation for about the same period. The "lab leak hypothesis" was a "racist conspiracy theory" from 2020 until 2023, when it was officially endorsed by YouTube's political sponsors. And so on.
Also, it's hard to replicate. For Reddit there are a lot of alternatives because it's mainly text, but YouTube videos require datacenters, which are very expensive. Perhaps some kind of distributed video storage? But that seems hard.
Non-view-weighted and non-impression-weighted stats are interesting but basically useless for disinformation research (which is what the post starts about). To a first approximation every viewpoint and its opposite is out there somewhere on YouTube, along with zero-value videos like security cam backups. The real questions of societal interest involve reach and volume.
Same here. A computer decides I'm not going to see this and that's that, the info on page is 403 forbidden and the word nginx. No recourse or way to let anyone know the thing is broken. If my IP address or language setting were Russian or something I'd at least be able to see a reason for it, but no, Germany and English. On an Android! I know, very dangerous.
Can anyone post what this site is? So I can tell whether this #1 story is worth trying a different ISP or user agent string or whatnot.
It's the blog of a researcher at the University of Massachusetts-Amherst. He does research on the impact of social media on society. It's a Wordpress site, hosted by someone commercial. I know him, and I'm almost certain no blocking would be intentional. If you (or others) give me details about how your access is failing, I'll pass the info on to him so he can try to get it fixed.
> you (or others) give me details about how your access is failing
Well, I click the link and get the access denied page. Do you need an IP address (93.135.167.200), user agent string¹, timestamp, or what would help?
Randomly noticed my accept-language header is super convoluted, mixing English-in-Germany, English, German in Germany, Dutch in the Netherlands, Dutch, and finally USA English, with decreasing priority from undefined to 0.9 to 0.4 in decrements of 0.1. I guess it takes this from my OS settings? Though I haven't configured it explicitly like that, particularly the en-US I'd not use because of the inconsistent date format and unfamiliar units system. Maybe the server finds it weird that six different locales are supported?
Thanks for responding and relaying!
¹ Mozilla/5.0 (Linux; Android 10; Pixel Build/QP1A.190711.019; wv) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Mobile Safari/537.36
I'll pass it on. I'll note that when I tried to create an archive link with archive.ph, its crawler also reported that it failed with 403. The archive.org (unrelated organization) crawler worked as expected. Which makes me think it's probably not specific to your setup, but some sort of anti-DOS protection gone awry. The sudden HN load probably looked abnormal, and it erroneously started blocking legitimate requests from some parts of the internet.
YouTube is in a fairly sad state compared to its former self. Search has become, quite frankly, unusable. I recall searching for a <band> recently at the gym, and instead of just showing me <band>, it showed 3-4 results and then topics completely unrelated to <band>.
The results are entirely off-topic, but related to other interests I have: tech, planes, trains, and automobiles, etc.
Just FYI, in case you didn't notice, you can bypass the stupid thing that makes search useless by using a filter, like filtering by "Type: Video" or something. I'm sure that workaround will stop working some day randomly.
The "For You" section of my youtube search is awful, like terribly awful. It's decided for some reason I enjoy popping and weird skin infection videos. I think this is because I clicked through to an external link to a horse getting a hoof infection shaved once (did not enjoy and did not watch all of it). That's literally the only time I've watched anything like that and since then it's always at the bottom of my search. These videos are never recommended on my front page, though.
The one comment on this post is hilariously unhinged. For posterity:
> NOT big enough; that it can’t be wiped away from the surface earth; with a relatively cyber attack from the Alliance. In fact; Google/YouTube will just be one; of 63 communist controlled platforms; that will be completely erased from existence once the order is given.
I actually was more interested in that comment than the article (not that the article wasn't interesting, but that comment was just wow). Who is the Alliance?? What are the 63 communist controlled platforms?? Who is giving the order to wipe them out?
Semi colons are a little understood and rarely used punctuation symbol, the mark of a true intellectual; it's only natural then, that those of us that know how the world really works use them more than others - and then there are those that think they know how they world works, but are mistaken: they similarly misuse the semicolons.
I wonder how the wingnut in the comments arrived (and whether he read the article at all - wingnuts usually don't):
> NOT big enough; that it can’t be wiped away from the surface earth; with a relatively cyber attack from the Alliance. In fact; Google/YouTube will just be one; of 63 communist controlled platforms; that will be completely erased from existence once the order is given.
I have to wonder what this "Alliance" is supposed to be.
My guess, "The Alliance" would be a QAnon trope, a confederation of select "White Hats" in the US military and parts of US government (and other select humans) and some alien "Arcturian" UFO contingent. They are trying to overthrow the Deep State (corrupt civil/intelligence/military service in USA and Western governments) that are colluding with "globalists" and exploitative capitalists to economically and socially oppress "good white people of European descent" and corrupt their morals and dilute their influence and moral strength. Presumably The Alliance chose Trump to become President in 2016 as part of an attempt to awaken the masses and usher in "The Great Reset / The Storm / The Awakening" where global corruption will be exposed, corrupt politicians, businesspeople, officials, cultural influencers will quickly be judged and executed.
There are many variations on this theme, and though not widely held as a belief, elements seep into the wider discourse and thought.
I'm not sure what the aliens' motivation would be in this scheme, and often they're not part of the story.
EDIT: here's a random page that likely will bring more questions than answers, but it references the elements described above ... https://divinecosmos.com/davids-blog/22143-moment-of-truth-q... ... there are variations of this conspiracy theory tailored to many different tendencies: New Agers, Evangelical Christians, White Supremacists, Conservative Republicans, etc. For a further look, find the interesting interviews with a certain Jan Halper Hayes and her supposed work with the USA Space Force. Some of these promoters are true believers, others are grifters. Also ... grand corporatists and communists are usually conflated in these narratives.
It's sort of a wonder that this screenplay has not yet been a major Hollywood production. I mean, sure, anything QAnon is terribly uncool for us informed types /s, but it's hard to believe that this would not be a major success if done with high production values and a non-cheesy script.
Something vaguely related to next year's A24 release "Civil War", but much more QAnon-adjacent and informed. Even better if Q would denounce it all as an attempt to "innoculate people from the truth".
"We're starting some experiments to understand how the videos YouTube recommends differ from the "average" YouTube video - YouTube likes recommending videos with at least ten thousand views, while the median YouTube video has 39 views."
In experiments by yours truly, it seems that quality goes down as views go up.
As a user, for me, YouTube has been shrinking ever since the first adpocalypse. By that I mean I rarely see new videos or relevant recommendations. It is all stale content, and only a couple of pages instead of going infinite like it used to. My most visited page is the home page, which shows recommended content (it used to be on a dedicated page), and I literally see the same content for days, rarely with new videos. And I have hundreds of subscribed channels. YouTube is a living corpse; I do not care what numbers they put out in public.
Go to the subscribed page and watch the channels you subscribed to. The home page is whatever YouTube is pushing you to watch - it’s not necessarily what you want to watch…
Yeah some people never do that for some reason. I have a simple little tampermonkey script set up to automatically just redirect me to my subscriptions when I go to the home page on accident.
Funny, for me it's the opposite: I hardly ever set foot outside of my subscriptions page except for searching, which is how I find new channels to add to subs.
There's indeed a big difference of objective between YouTube and their users because of that.
However, that's also an opportunity that YouTube hasn't realized yet: they could become a primary platform for content if they had a better algorithm and better search, and it would help them monetize those subscriptions better.
YouTube has one of the lowest per-viewer revenues, and the poor discovery isn't incidental to that.
Looking at things in terms of ratios seems like an odd way of judging whether misinformation is a problem. If 0.5% of the sentences I say in a day are violent threats towards children that'd be a problem, right?
“Won’t you think of the children?” type questions are not very useful, imho.
In your case, no one (barring a few people employing child soldiers) thinks children are fair game to be hurt, so the answer to your question is that yeah, 0.5% of statements being violent threats to children is a problem.
Covid misinformation isn’t as clear cut - reasonable people asked about the lab leak theory while yet others asked why a vax was being rolled out without holding manufacturers liable for adverse outcomes.
We’ve had situations in the past when a new medicine caused children to be born with shortened arms and other birth defects. In this light, it is reasonable to wonder why we should trust a pharma company when they didn’t trust themselves.
But the linked post is not even about the correctness of misinformation - given something has been classified as misinformation, how often is it viewed compared to non-misinformation videos is what they’re trying to answer.
These calculations should be viewed as long-term expected value. Long-term is probably 10-20 years in this case. So, based on your calculations, each video is worth about 2-4 bucks a year.