I would've said you should download only archives, but commits are also very important data: they show the actual changes made to the code, which would be very useful for training an AI to suggest code changes.
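For what it's worth, here's a minimal sketch of what I mean by "changes" data: pulling per-commit patches out of a local clone with plain git commands (`repo_path` and `max_commits` are just placeholder names, and this assumes you already have a clone on disk):

```python
import subprocess

def commit_patches(repo_path, max_commits=100):
    """Yield (sha, patch_text) for the most recent commits, newest first."""
    # List the most recent commit hashes on the current branch.
    shas = subprocess.run(
        ["git", "-C", repo_path, "rev-list", f"--max-count={max_commits}", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    for sha in shas:
        # `git show` prints the commit subject followed by its unified diff.
        patch = subprocess.run(
            ["git", "-C", repo_path, "show", "--no-color", "--format=%s", sha],
            capture_output=True, text=True, check=True,
        ).stdout
        yield sha, patch
```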
There are valid, non-evil reasons for git hosts to want to throttle and put up obstacles to scraping as well, whether it comes via crawlers, 'git clone', or whatever. These are very expensive operations.
It appears to be the exact opposite to me: `git clone --depth 1 ...` gives you code that you know exactly how to parse, vs. webpages that have all sorts of semantic issues.
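As a rough sketch of that point (the URL and destination directory here are just examples): a shallow clone leaves you with ordinary files on disk, and there's no HTML to untangle at all.

```python
import subprocess
from pathlib import Path

def shallow_clone(url, dest):
    """Clone only the latest snapshot and return the files it contains."""
    subprocess.run(["git", "clone", "--depth", "1", url, dest], check=True)
    # Every file is just a file; skip git's own metadata directory.
    return [p for p in Path(dest).rglob("*")
            if p.is_file() and ".git" not in p.parts]

# files = shallow_clone("https://example.com/some/repo.git", "repo-snapshot")
```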
I'd assume this is relative to what other operations cost. With 'git clone' you download the whole repository. Compare that to 'git fetch', which, when nothing has changed, boils down to comparing refs with the remote.
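To illustrate the asymmetry, here's a sketch of one way to do the cheap check (this assumes you already hold a clone; `git ls-remote` only asks the server for its ref tips, and you only pay for object transfer when the branch has actually moved):

```python
import subprocess

def remote_head(url, ref="HEAD"):
    """Return the commit hash the remote currently advertises for `ref`."""
    out = subprocess.run(["git", "ls-remote", url, ref],
                         capture_output=True, text=True, check=True).stdout
    return out.split()[0] if out else ""

def update_if_changed(repo_path, url):
    """Fetch only when the remote tip differs from what we have locally."""
    local = subprocess.run(["git", "-C", repo_path, "rev-parse", "HEAD"],
                           capture_output=True, text=True, check=True).stdout.strip()
    if remote_head(url) == local:
        return False  # nothing new; no object transfer at all
    subprocess.run(["git", "-C", repo_path, "fetch", url], check=True)
    return True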
Yeah. Git repositories can grow very large very quickly. A single clone here and there isn't too bad, but if you're scraping tens of thousands of projects, you can easily rack up terabytes of disk usage and network traffic.
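Back-of-envelope only (the average repo size here is a made-up assumption):

```python
avg_repo_mb = 100  # assumed average size of a full clone
repos = 50_000     # "tens of thousands of projects"
total_tb = avg_repo_mb * repos / 1_000_000
print(f"~{total_tb:.0f} TB of transfer and disk per full pass")  # ~5 TB
```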
Seems like a logistical nightmare to me. Git repos interact spectacularly poorly with web scraping in general.