2021: https://twitter.com/jackclarkSF/status/1376304266667651078
2022: https://twitter.com/william_g_ray/status/1583574265513017344
2022: https://twitter.com/mtrc/status/1599725875280257024
Common Crawl and the Internet Archive crawls are probably the two most readily available sources for this; you just have to decide where you want to draw the line.
Common Crawl's first crawl of 2020 contains 3.1B pages and is around 100TB: https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-05/inde... Their previous and subsequent crawls are listed in the dropdown here: https://commoncrawl.org/overview
Internet Archive's crawls are here, organized by source: https://archive.org/details/web. Wide Crawl 18 is from mid-2021 and is 68.5TB: https://archive.org/details/wide00018. Wide Crawl 17 is from late 2018 and is 644.4TB: https://archive.org/details/wide00017
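If it helps, here is a rough sketch of what "drawing the line" could look like for Common Crawl. It assumes crawl IDs follow the CC-MAIN-<year>-<week> pattern seen in CC-MAIN-2020-05 above, and it simply skips any ID that does not fit. The function name `crawls_up_to` and the sample list are made up for illustration; you would feed in whatever crawl list you pull from the dropdown yourself.

```python
import re

# Assumed ID pattern: CC-MAIN-<year>-<week>, as in CC-MAIN-2020-05.
CRAWL_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def crawls_up_to(crawl_ids, cutoff_year, cutoff_week=53):
    """Return the crawl IDs dated at or before the cutoff, oldest first."""
    kept = []
    for crawl_id in crawl_ids:
        match = CRAWL_ID.match(crawl_id)
        if match is None:
            continue  # skip IDs that do not fit the assumed pattern
        year, week = int(match.group(1)), int(match.group(2))
        if (year, week) <= (cutoff_year, cutoff_week):
            kept.append((year, week, crawl_id))
    return [crawl_id for _, _, crawl_id in sorted(kept)]


# Illustrative sample list; only CC-MAIN-2020-05 comes from the comment above.
sample = ["CC-MAIN-2021-04", "CC-MAIN-2020-05", "CC-MAIN-2019-51", "not-a-crawl"]
print(crawls_up_to(sample, 2020, 5))
```

Leaving `cutoff_week` at its default keeps everything through the end of the cutoff year, so `crawls_up_to(sample, 2019)` would keep only the 2019 entry.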