The same point, or a similar critique, can be made a few ways, I’d say.
Running with the brain/neuron analogy, there’s a measurement problem (as there is in real neuroscience!). The synaptic activity of the “meta-mind” has been recorded through keyboards, smartphones and plain text. These aren’t the native channels of human communication though, the real synapses if you will. Those are more like spoken conversation and physical interaction, all richer phenomena.
To the extent that “textual” communication is now native/normal to humanity, it still covers only part of human interaction, is new, and keeps shifting with tech developments like video/streaming.
So the internet is a lossy representation, apart from whatever other biases it might have, as suggested above.
Do the datasets follow the algorithmic weightings? I thought they included all content for their domains, without weighting by popularity or engagement algorithms.
By looking at the internet, especially Web 2.0 content, you're getting what the engagement algorithms have decided is good for advertisers.
There's plenty of stuff that humanity does that the internet does not incentivize and thus has no representation of.