Exactly. Everybody always talks about turning tech companies into public utilities without really explaining what the utility would be. A public index of the web would be an amazing utility for many of the reasons in this thread and would spur on a ton of new innovation and businesses.
- A cost of membership runs contrary to establishing this group, especially at such a high recurring charge.
- I'm not sure what your software/AWS situation looks like, but 20 million robots.txt files acquired from Common Crawl is something I can analyze on my PC. It doesn't seem to presently justify such high costs.
- Prioritize building a mockup index with an intuitive frontend. This is essential for non-technical people to understand what a public index would be.
- Exclusively talk with EU legislators (they are motivated, whereas nothing will happen in the US).
I think the membership dues are reasonable, and many people agree, as evidenced by their signing up. I think I might start a petition that is free to sign, though. Thank you for the inspiration!
It is possible to analyze those files on a PC, it just takes much longer. The analysis is an iterative process, so the faster the computers, the faster the iterations go. I was analyzing them on my PC with Python for the first year until it got too slow, and now I am using an AWS server with some Rust, which is going much better. I also need to increase the number of files analyzed by about two orders of magnitude soon.
Great idea, very cool. That’s going on the todo list!
And I am going to be reaching out to and speaking with whoever is interested. One of the fun things about this is that it is an international dynamic, with some jurisdictions having abilities that others don’t. For example, the UK CMA has subpoena powers that the US Congress lacks and got a ton of information out of Google and Bing that shocked me. The US can get the CEOs to show up to hearings, while the UK cannot in the same way. Why limit ourselves to one government when there are so many to mix and match from here?
What kind of research is it? Is it just funding you, or can members participate? I do Bayesian and non-Bayesian data analysis and modeling, and it would potentially be fun to poke at the data.
Right now the research has two parts: analyzing websites' robots.txt files for bias, and then talking with search engine operators and website operators to get their stories that validate these ideas. Right now I am the main person working on this research, but I would like to get the robots.txt parsing and analysis code open sourced soon so people can start digging in. Getting people access to the data is trickier, but it feels doable as well. You'd be welcome to join in!
If you would like to read more about all this, please check out https://knuckleheads.club
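For anyone curious what checking a robots.txt file for bias could look like, here is a minimal sketch in Python using the standard library's `urllib.robotparser`. It is not the project's actual code (that has not been open sourced yet). The function names, the sample file, the bot list, and the working definition of "bias" (some crawlers are allowed to fetch a URL while others are blocked) are all assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser


def allowed_agents(robots_txt, agents, url="https://example.com/"):
    """Return {agent: True/False} for whether each crawler may fetch `url`.

    `url` is a placeholder; only its path is matched against the rules.
    """
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


def is_selective(access):
    """Assumed definition of 'bias': at least one agent allowed, one blocked."""
    return any(access.values()) and not all(access.values())


# Hypothetical robots.txt that welcomes one crawler and blocks everyone else.
SAMPLE = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

# "HypotheticalNewBot" stands in for a new search engine's crawler.
access = allowed_agents(SAMPLE, ["Googlebot", "bingbot", "HypotheticalNewBot"])
print(access)
print("selective:", is_selective(access))
```

Run over a large set of files (for example, ones pulled from Common Crawl), a check like this would give a rough count of how many sites let some crawlers in and keep others out.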