The GitHub Archive dataset was updated as well. An example BigQuery query to get the top repositories from 2015 through 2016 YTD, ranked by the number of stars given during that time:
SELECT repo.id, repo.name, COUNT(*) AS num_stars
FROM TABLE_DATE_RANGE([githubarchive:day.], TIMESTAMP('2015-01-01'), TIMESTAMP('2016-12-31'))
WHERE type = "WatchEvent"
GROUP BY repo.id, repo.name
ORDER BY num_stars DESC
LIMIT 1000
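For anyone on the new standard SQL dialect, here is a rough equivalent (a sketch, assuming the day tables keep the githubarchive.day.YYYYMMDD naming, so the wildcard suffix covers the same date range):

#standardSQL
SELECT repo.id, repo.name, COUNT(*) AS num_stars
FROM `githubarchive.day.20*`
WHERE type = 'WatchEvent'
AND _TABLE_SUFFIX BETWEEN '150101' AND '161231'
GROUP BY repo.id, repo.name
ORDER BY num_stars DESC
LIMIT 1000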
Searching file contents for "easter egg" mentions:

SELECT
CONCAT("https://github.com/", repo_name, "/blob/master/", path) AS file_url
FROM
[bigquery-public-data:github_repos.files]
WHERE
id IN (SELECT id FROM [bigquery-public-data:github_repos.contents]
WHERE NOT binary AND LOWER(content) CONTAINS 'easter egg')
AND path NOT LIKE "%.csv"
GROUP BY 1
LIMIT 1000
36s elapsed, 1.79 TB processed (so not free). Using github_repos.sample_files and github_repos.sample_contents instead only costs 31 GB (within the free tier), but surfaces fewer easter eggs :)
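For reference, the sample-table version is just a table-name swap (a sketch; the columns are the same in the full and sample tables):

SELECT
CONCAT("https://github.com/", repo_name, "/blob/master/", path) AS file_url
FROM
[bigquery-public-data:github_repos.sample_files]
WHERE
id IN (SELECT id FROM [bigquery-public-data:github_repos.sample_contents]
WHERE NOT binary AND LOWER(content) CONTAINS 'easter egg')
AND path NOT LIKE "%.csv"
GROUP BY 1
LIMIT 1000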
Out of curiosity, how do you automatically (programmatically) "detect" that a project (repo) is open source? The presence of a LICENSE file containing the text of the BSD/GPL/MIT/... license?
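One starting point (a sketch, not a definitive answer): the github_repos dataset includes a licenses table with GitHub's own per-repo license detection, so you can look it up rather than grepping LICENSE files yourself:

SELECT license, COUNT(*) AS repos
FROM [bigquery-public-data:github_repos.licenses]
GROUP BY license
ORDER BY repos DESC

To check one repo, filter on repo_name. As I understand it, GitHub's detection is roughly the heuristic you describe: matching license-file contents against the known BSD/GPL/MIT/... texts.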
FWIW, the author (Arfon Smith) gave a recent Microsoft Research talk on GitHub and open collaboration for the scientific community: https://www.youtube.com/watch?v=7XOuJFwy270
This is super cool. If you want to benefit from this info in your workflow now, we have analyzed some of this same data at Sourcegraph, and you can see (e.g.) all the repos that call http.NewRequest in Go (https://sourcegraph.com/github.com/golang/go/-/info/GoPackag...) or lots of usages of Joda-Time DateTime in Java (https://sourcegraph.com/github.com/JodaOrg/joda-time/-/info/...). You can search for functions/types/etc. on Sourcegraph to cross-reference them globally. We're working on an API to provide this information to anyone else who wants it; email me at sqs@sourcegraph.com if you are interested in using it.
sample_contents only lists the contents of a 10% sample of all files.
Scanning the full dataset may be hard for people new to BigQuery. I managed to query the full dataset in https://kozikow.wordpress.com/2016/06/29/top-emacs-packages-... . "Resources exceeded during query execution" errors are especially hard to debug, since they can mean any of several things caused BigQuery to run out of memory.
Some BigQuery tricks to make it work (see the sketches after this list):
- TOP/COUNT is faster and more memory-efficient than GROUP BY/ORDER BY.
- Filtering data in a subquery before a JOIN reduces memory usage.
- Regexps and globs are expensive; when a prefix or suffix match is enough, LEFT/RIGHT is faster.
- Avoid reading all file contents, to stay under the 1 TB free scan quota. Only access content after filtering down the paths.
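To make those concrete, two minimal legacy-SQL sketches against the public tables from above (the .el path filter is just an illustrative choice):

-- TOP/COUNT instead of GROUP BY/ORDER BY
SELECT TOP(repo.name, 10), COUNT(*)
FROM [githubarchive:day.20160101]
WHERE type = "WatchEvent"

-- Filter paths in a cheap subquery (RIGHT, not a regexp),
-- then join contents against that much smaller set to keep memory down
SELECT f.path AS path, LENGTH(c.content) AS bytes
FROM [bigquery-public-data:github_repos.sample_contents] c
JOIN (
SELECT id, path FROM [bigquery-public-data:github_repos.sample_files]
WHERE RIGHT(path, 3) = '.el') f
ON c.id = f.id
LIMIT 10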
Since the query only touches 3 columns, it only scans 15.4 GB of data (out of the 1 TB free monthly allowance); BigQuery bills by the columns a query actually references, not the full table size.
More information on the GitHub Archive changes: https://medium.com/@hoffa/github-archive-fully-updated-notic...