
This is addressed in the same paragraph - you can't scan/download "whole" github because you'll be throttled.



Are you actually throttled if you try to git clone, or is that just the theory? Or is the assumption that it scrapes GitHub through API calls?

Has anyone actually tried? Because I've cloned lots of repos and have never been throttled. I'd go so far as to say the author of that post has never even tried it.
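
(If anyone wants to actually test this instead of speculating, here's roughly what I'd try, in Python. The rate_limit endpoint is real; the two repos are just placeholders for whatever discovery step you'd use, and whether bulk clones get throttled is exactly the point in dispute.)

    import subprocess
    import requests

    # The /rate_limit endpoint shows the REST quota
    # (60 requests/hour unauthenticated, 5000/hour with a token).
    core = requests.get("https://api.github.com/rate_limit").json()["resources"]["core"]
    print(core["limit"], core["remaining"])

    # Plain clones don't count against that quota, but GitHub's abuse
    # detection can still cut you off at sufficient scale.
    repos = ["torvalds/linux", "git/git"]  # placeholder discovery step
    for name in repos:
        subprocess.run(["git", "clone", "--depth", "1",
                        f"https://github.com/{name}.git"], check=False)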


I'm not arguing for or against whether they're in a dominant position; I'm pointing out that the grandparent quoted part of the text (and argued against it) without quoting the author's justification, which is directly relevant to what they say.

> There’s an important notion to address here. Open source code on GitHub might be thought of as “open and freely accessible” but it is not. It’s possible for any person to access and download one single repo from GitHub. It’s not possible for a person to download all repos from Github or a percentage of all repos, they will hit limitations and restrictions when trying to download too many repos. (Unless there’s some special archives or mechanisms I am not aware of).


> Has anyone actually tried? Because I've cloned lots of repos and have never been throttled

(Full disclosure: I have some pretty serious data hoarding issues)

When someone says "I've cloned lots of repos and have never been throttled" I'm afraid I immediately start wondering whether "lots" means multiple GB or multiple TB ... or more!


21 TB of data; they might rate limit you! It might still be possible via proxies, but only for public repos.


Copilot was only trained on public repos. I'd be surprised if you were throttled.


I'd be surprised if they didn't throttle anyone trying to download 21TB of data. And I wouldn't judge them for it.


There’s no need to crawl for your own dataset:

https://pile.eleuther.ai/


  @article{pile,
    title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},
    author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},
    journal={arXiv preprint arXiv:2101.00027},
    year={2020}
  }
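
(For the curious: the Pile is distributed as 30 zstd-compressed JSON-lines shards, and each record has a "text" field plus a "meta" field naming the source subset. A minimal reading sketch, assuming you've already downloaded one shard locally; the filename and the subset filter are just examples.)

    import io
    import json

    import zstandard as zstd  # pip install zstandard

    # Assumed local path to one of the 30 shards (00.jsonl.zst .. 29.jsonl.zst).
    path = "00.jsonl.zst"

    with open(path, "rb") as fh:
        stream = io.TextIOWrapper(zstd.ZstdDecompressor().stream_reader(fh),
                                  encoding="utf-8")
        for i, line in enumerate(stream):
            doc = json.loads(line)
            subset = doc["meta"]["pile_set_name"]  # e.g. "Github", "ArXiv", "Pile-CC"
            if subset == "Github":
                print(doc["text"][:200])
            if i >= 1000:  # just peek at the start of the shard
                break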

So if I understand this correctly, the Pile is for code from 2020 backwards? If I wanted anything released in the past 3 years, say something in the SOTA AI space (where a month is a lifetime), I would need the scraper again?

I don't follow how this can compare to direct, live, unrestricted access. I suppose this is just my own hatred of Microsoft shining through. Of course we should accept the status quo, because how dare we suggest Microsoft could operate in a manner that is anti-competitive.

For anyone else trying to catch up, just rent a datacenter, write a crawler, deal with all the intricacies of keeping it in sync in real-time. This sounds trivial, simple even.

I wonder why nobody is doing it? Perhaps not everyone has access to petabytes of storage space, unlimited bandwidth, unlimited proxy hops, etc.

So the alternative is to buy GitHub?
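
(To be fair to the sarcasm above, the "keep it in sync" part is at least sketchable. GitHub's public events feed is real, though it only exposes a small sliding window of recent activity and is paginated and rate-limited; the token handling and the actual fetching/storage are hand-waved here. Whether this could ever keep up with all of GitHub is, of course, the actual point of contention.)

    import os
    import time

    import requests

    # Toy version of "keep it in sync": poll the public events feed and collect
    # repos that just got pushed to. Real-world scale needs far more than this.
    TOKEN = os.environ.get("GITHUB_TOKEN")  # assumed; raises the quota to 5000/hr
    headers = {"Authorization": f"token {TOKEN}"} if TOKEN else {}

    seen = set()
    while True:
        resp = requests.get("https://api.github.com/events", headers=headers)
        resp.raise_for_status()
        for event in resp.json():
            if event["type"] == "PushEvent":
                repo = event["repo"]["name"]  # e.g. "owner/project"
                if repo not in seen:
                    seen.add(repo)
                    print("would clone or fetch:", repo)
        time.sleep(60)  # stay well inside the rate limit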


> I wonder why nobody is doing it? Perhaps not everyone has access to petabytes of storage space, unlimited bandwidth, unlimited proxy hops, etc.

There are multiple private companies and public institutions that are currently training LLMs.

The work required to train an LLM actually supports a fair use argument, just as it did with Google scanning books.




