The GitHub Archive dataset was updated as well. An example BigQuery query to get the top repositories from 2015 through 2016 YTD, ranked by the number of stars given during that time:
SELECT repo.id, repo.name, COUNT(*) AS num_stars
FROM TABLE_DATE_RANGE([githubarchive:day.], TIMESTAMP('2015-01-01'), TIMESTAMP('2016-12-31'))
WHERE type = "WatchEvent"
GROUP BY repo.id, repo.name
ORDER BY num_stars DESC
LIMIT 1000
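For anyone on the new standard SQL dialect, here is a rough equivalent (a sketch, assuming the day tables keep the githubarchive.day.YYYYMMDD naming, so the wildcard suffix covers the same date range):

#standardSQL
SELECT repo.id, repo.name, COUNT(*) AS num_stars
FROM `githubarchive.day.20*`
WHERE type = 'WatchEvent'
AND _TABLE_SUFFIX BETWEEN '150101' AND '161231'
GROUP BY repo.id, repo.name
ORDER BY num_stars DESC
LIMIT 1000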
Searching file contents for "easter egg" mentions:

SELECT
CONCAT("https://github.com/", repo_name, "/blob/master/", path) AS file_url
FROM
[bigquery-public-data:github_repos.files]
WHERE
id IN (SELECT id FROM [bigquery-public-data:github_repos.contents]
WHERE NOT binary AND LOWER(content) CONTAINS 'easter egg')
AND path NOT LIKE "%.csv"
GROUP BY 1
LIMIT 1000
36s elapsed, 1.79 TB processed (so not free). Using github_repos.sample_files and github_repos.sample_contents instead only costs 31 GB (within the free tier), but surfaces fewer easter eggs :)
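For reference, the sample-table version is just a table-name swap (a sketch; the columns are the same in the full and sample tables):

SELECT
CONCAT("https://github.com/", repo_name, "/blob/master/", path) AS file_url
FROM
[bigquery-public-data:github_repos.sample_files]
WHERE
id IN (SELECT id FROM [bigquery-public-data:github_repos.sample_contents]
WHERE NOT binary AND LOWER(content) CONTAINS 'easter egg')
AND path NOT LIKE "%.csv"
GROUP BY 1
LIMIT 1000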
Out of curiosity, how do you automatically (programmatically) "detect" that a project (repo) is open source? The presence of a LICENSE file containing the text of the BSD/GPL/MIT/... license?
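One starting point (a sketch, not a definitive answer): the github_repos dataset includes a licenses table with GitHub's own per-repo license detection, so you can look it up rather than grepping LICENSE files yourself:

SELECT license, COUNT(*) AS repos
FROM [bigquery-public-data:github_repos.licenses]
GROUP BY license
ORDER BY repos DESC

To check one repo, filter on repo_name. As I understand it, GitHub's detection is roughly the heuristic you describe: matching license-file contents against the known BSD/GPL/MIT/... texts.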
FWIW, the author (Arfon Smith) gave a recent Microsoft Research talk on GitHub and open collaboration for the scientific community: https://www.youtube.com/watch?v=7XOuJFwy270
This is super cool. If you want to benefit from this info in your workflow now, we have analyzed some of this same data at Sourcegraph, and you can see (e.g.) all the repos that call http.NewRequest in Go (https://sourcegraph.com/github.com/golang/go/-/info/GoPackag...) or lots of usages of Joda-Time DateTime in Java (https://sourcegraph.com/github.com/JodaOrg/joda-time/-/info/...). You can search for functions/types/etc. on Sourcegraph to cross-reference them globally. We're working on an API to provide this information to anyone else who wants it; email me at sqs@sourcegraph.com if you are interested in using it.
sample_contents only lists the contents of a 10% sample of all files.
Scanning the full dataset may be hard for people new to BigQuery. I managed to query the full dataset in https://kozikow.wordpress.com/2016/06/29/top-emacs-packages-... . "Resources exceeded during query execution" errors are especially hard to debug, since they can mean any of several things caused BigQuery to run out of memory.
Some BigQuery tricks to make it work (see the sketches after this list):
- TOP/COUNT is faster and more memory-efficient than GROUP BY/ORDER BY.
- Filtering data in a subquery before a JOIN reduces memory usage.
- Regexps and globs are expensive; when a prefix or suffix match is enough, LEFT/RIGHT is faster.
- Avoid reading all file contents, to stay under the 1 TB free scan quota. Only access content after filtering down the paths.
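To make those concrete, two minimal legacy-SQL sketches against the public tables from above (the .el path filter is just an illustrative choice):

-- TOP/COUNT instead of GROUP BY/ORDER BY
SELECT TOP(repo.name, 10), COUNT(*)
FROM [githubarchive:day.20160101]
WHERE type = "WatchEvent"

-- Filter paths in a cheap subquery (RIGHT, not a regexp),
-- then join contents against that much smaller set to keep memory down
SELECT f.path AS path, LENGTH(c.content) AS bytes
FROM [bigquery-public-data:github_repos.sample_contents] c
JOIN (
SELECT id, path FROM [bigquery-public-data:github_repos.sample_files]
WHERE RIGHT(path, 3) = '.el') f
ON c.id = f.id
LIMIT 10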
Since the query only touches 3 columns, it only scans 15.4 GB of data (out of the 1 TB free monthly allowance); BigQuery bills by the columns a query actually references, not the full table size.
More information on the GitHub Archive changes: https://medium.com/@hoffa/github-archive-fully-updated-notic...