
2024 might already be too late, since this sentiment has been shared since at least 2021:

2021: https://twitter.com/jackclarkSF/status/1376304266667651078

2022: https://twitter.com/william_g_ray/status/1583574265513017344

2022: https://twitter.com/mtrc/status/1599725875280257024

Common Crawl and the Internet Archive crawls are probably the two most readily available sources for this; you just have to define where you want to draw the line.

Common Crawl's first crawl of 2020 contains 3.1B pages and is around 100TB: https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-05/inde... Their previous and subsequent crawls are listed in the dropdown here: https://commoncrawl.org/overview
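To make "drawing the line" concrete, here is a rough sketch of picking Common Crawl snapshots that predate a chosen cutoff. It assumes crawl IDs follow the `CC-MAIN-YYYY-WW` pattern seen in the link above and that `WW` is an ISO week number; both are assumptions on my part, and the function names and cutoff date are made up for illustration. IDs that do not fit the pattern are skipped rather than guessed at.

```python
import re
from datetime import date

# Assumed ID shape: CC-MAIN-<4-digit year>-<2-digit week>, e.g. CC-MAIN-2020-05.
CRAWL_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def crawl_start(crawl_id):
    """Return the Monday of the week named in the crawl ID, or None if unparseable."""
    match = CRAWL_ID.match(crawl_id)
    if match is None:
        return None
    year, week = int(match.group(1)), int(match.group(2))
    try:
        # Assumption: WW is an ISO week number (requires Python 3.8+).
        return date.fromisocalendar(year, week, 1)
    except ValueError:
        # Week number out of range for that year.
        return None


def crawls_before(crawl_ids, cutoff):
    """Keep only crawl IDs whose (assumed) start date is strictly before the cutoff."""
    kept = []
    for crawl_id in crawl_ids:
        start = crawl_start(crawl_id)
        if start is not None and start < cutoff:
            kept.append(crawl_id)
    return kept


if __name__ == "__main__":
    # Hypothetical cutoff; choose whatever line you want to draw.
    cutoff = date(2021, 1, 1)
    candidates = ["CC-MAIN-2020-05", "CC-MAIN-2022-49", "CC-MAIN-2012"]
    print(crawls_before(candidates, cutoff))
```

The week in the ID only approximates when the crawl ran, and a snapshot also contains pages written long before it was taken, so treat the cutoff as a coarse filter.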

Internet Archive's crawls are here, organized by source: https://archive.org/details/web. Wide Crawl 18 is from mid-2021 and is 68.5TB: https://archive.org/details/wide00018. Wide Crawl 17 is from late 2018 and is 644.4TB: https://archive.org/details/wide00017




Why is Wide Crawl 18 smaller than 17?


The Tumblr purge was worse than I expected…



