Open Football Data (openfootball.github.io)
157 points by vinhnx on May 21, 2014 | 66 comments



This is impressive for the amount of work put into formatting the data and making it easy to use in different ways.

For details and advanced analytics though, this one is much better: https://github.com/soccermetrics/soccermetrics-client-py


Holy fullbacks, Batman! I didn't know about SoccerMetrics. The Python client seems to be just a wrapper for their REST API: http://soccermetrics.github.io/fmrd-summary-api/started.html

Lots of numbers to crunch there!

This is all very nice, but it would be nicer if there were some sort of cheap software that amateur teams could use to gather and then analyze their own data. There's a massive market out there for this sort of thing; the football world is very conservative and tends to move slowly.


Hi,

I'm the founder of Soccermetrics and the creator of the Soccermetrics API. Thanks for the attention.

I've had the site for a little over five years now. I had been working on data models and algorithms for analysis of soccer matches and thought I would have a go at creating a company out of it. I even applied to YC which was a bit of a laugh in retrospect :) Right now I have another job that pays the bills but there are a few projects I do on the side, the API being one of them.

The API is the latest iteration of my data models exposed to the world as well as my attempt to build as close to a REST API as I could. I don't claim perfection and I'm sure others will have their opinion on it, which I welcome.

I wrote the Python client, which is, as you say, a wrapper over the API and serves well as a starting point. If you have ideas on how to extend it, please fork and contribute.

There are a few software tools out there that do what you wish. Statzpack is one, SportyBird is another. I have my doubts about how big the market really is for this kind of service, but everyone is very early in this space.


Thanks so much for your work and for your awesome suggestions!

My family (everyone except me, lol) sort-of runs an amateur football club, so I have some second-hand knowledge of that world (at least in Italy). They tell me that systematic, professional and data-driven approaches are incredibly scarce but very effective. It's a system that still runs on personal networks and a lot of outdated knowledge and "magic", not unlike the baseball world described in Moneyball [1]. As you said it's early days, but still, most coaches under 40 now bring a tablet with them on the bench, and not to take funny pictures.

[1] http://www.amazon.co.uk/gp/product/0393324818/ref=as_li_ss_t...


What your family says is true, but "systematic, professional, and data-driven" packages that are appealing to semi-pro or amateur clubs tend to be expensive. To further complicate things, each club will insist that certain specific events be tracked in order to verify that their gameplan is being implemented, which leads to "custom" data that are actually a tagged composite of basic data points.

By the way, thanks for the pull requests on the client. Developing the client and the API backend has been a learning experience at every step, so I appreciate contributions from experienced developers.


You, my friend, are the best. As a huge soccer fan and a developer, I find getting this sort of data really hard unless you shell out hundreds of dollars a month.

Already thinking about the apps that will use this! Thank you.


And kudos for calling it football :)


Yes this is a stellar idea!


Last I checked it was thousands of dollars a month.

Something to the tune of $25k a year. Is anyone actually paying for this now who can provide pricing?


Agreed. Having this for the World Cup will be awesome for toying around with a couple of games or apps.

The more the better!


Does anyone know if this is available for (American) football?



Author here! That's a pretty nice visualization :-)

I thought I'd just squeeze in a few words about nflgame/nfldb. Both offer access to the same stuff: play-by-play data back to 2009. Both can be used with live games so that they are updated in real time (well, at least as frequently as NFL.com).

nflgame is responsible for pulling the JSON data and provides some rudimentary searching features. But it's slow.

nfldb stores all this data for you in a relational database. It comes with a script that updates the database while games are playing so that you can get access to live data. (It will even migrate the database for you if I've made any changes to the schema.)

Here's a quick example that shows how to get all of Julian Edelman's touchdown plays from last season:

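    import nfldb

    # Connect to the local nfldb database.
    db = nfldb.connect()

    # Build a query restricted to the 2013 regular season.
    q = nfldb.Query(db)
    q.game(season_year=2013, season_type='Regular')
    # Keep only Julian Edelman's plays that scored an offensive touchdown.
    q.player(full_name='Julian Edelman').play(offense_tds=1)
    for play in q.as_plays():
        print(play)

Easy as pie!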

There's an extensive wiki (almost 20,000 words) with tons of examples and explanation: https://github.com/BurntSushi/nfldb/wiki

Other features: aggregating data, player meta data (college, height, weight, etc.) and fuzzy player name matching.


The same project seems to offer NFL play-by-play, but only for two seasons:

https://github.com/opensport/american-football.db

The best public football repository I am aware of is this, though:

http://www.advancedfootballanalytics.com/2010/04/play-by-pla...


A very cool project, but I have one question/issue.

The data format seems to be a custom text format, though admittedly I could be wrong about that. Is it possible to use TSV or CSV instead? It would be far more useful, since it could be directly imported into relational databases, Excel, etc.


What about RSSSF? http://www.rsssf.com/


Impressive, thanks for sharing!


Is there an open database for horse racing?


That depends on your definition of open.

The short answer is no. I've searched long and hard, high and low, for free (beer) horse racing databases for UK/IRE and Australia. To a lesser extent I've searched for HK, FR and GER data. I'm yet to find anything that is comprehensive and no cost.

There's a couple that I do use for UK/IRE racing which cost in the region of £35-£45 per month for access. Betwise/Smartform provides an historical database in MySQL, and daily race card/results updates. UKHorseRacing.co.uk provides CSV files with historical race data, their ratings and race results. I take these CSV files, combine them into a SQLite database and interrogate with R.
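The "combine the files into a SQLite database" step can be sketched roughly as follows. This is a minimal illustration in Python, not the commenter's actual pipeline: the file names, column names (race_date, course, horse, position) and rows are all made up, and the real downloads will have their own layout.

```python
import csv
import io
import sqlite3

# Hypothetical stand-ins for two downloaded CSV files. The column names
# and every row are invented for illustration only.
FILES = {
    "2014-05.csv": (
        "race_date,course,horse,position\n"
        "2014-05-01,Course A,Horse A,1\n"
        "2014-05-01,Course A,Horse B,2\n"
    ),
    "2014-06.csv": (
        "race_date,course,horse,position\n"
        "2014-06-03,Course B,Horse A,3\n"
    ),
}

def load_csvs(conn, files):
    """Append every row of every CSV into a single `results` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "race_date TEXT, course TEXT, horse TEXT, position INTEGER)"
    )
    for text in files.values():
        reader = csv.DictReader(io.StringIO(text))
        rows = [
            (r["race_date"], r["course"], r["horse"], int(r["position"]))
            for r in reader
        ]
        conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # pass a file path to keep the database
load_csvs(conn, FILES)

# Example question: how many wins does each horse have?
wins = conn.execute(
    "SELECT horse, COUNT(*) FROM results WHERE position = 1 GROUP BY horse"
).fetchall()
print(wins)
```

With real files, the in-memory strings would be replaced by `open(path, newline="")`, and the resulting SQLite file can then be queried from R or anything else that speaks SQLite.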

A slightly longer answer is, sort of. The Betfair API is currently open access for non-commercial and low volume use (as far as I'm aware). This will allow you to retrieve basic racing data: the cards before the race with horse name, jockey, barrier etc. and the race results post-race, including the Betfair Starting Price. After interrogating the API, you'll obviously need to compile the data into your own database. A bit of work, but feasible. Betfair has a developer programme and there are API bindings available in a number of different languages. I use R (the R package developed by Betwise, mentioned above), but I know Python is available. One caveat to mention is that Betfair are upgrading their API, so this will obviously have an impact on existing programs using the old one.

If anyone else has additional information or could point me in the direction of something else "free" I'd appreciate it as well.


As well as Betfair's 'live' API, they also provide historical betting data at http://data.betfair.com/

It is free but you need an active account with them to download the CSV files.


At this page you can download all historical Betfair price data in CSV format.

https://promo.betfair.com/betfairsp/prices/index.php



Kinda makes you want to disrupt that whole industry.


The data format bothers me. Why not use a standard one like JSON?


Agreed -- looking at the player data[1], IMO the format type is unrecognizable:

  ## GK / Goalkeepers

  Kawashima|Eiji Kawashima,   20 Mar 1983
  Nishikawa|Shusaku Nishikawa,   18 Jun 1986
  Gonda|Shūichi Gonda,   3 Mar 1989

  ## DF / Defenders

  Inoha|Masahiko Inoha,   28 Aug 1985
  G. Sakai|Gōtoku Sakai,   14 Mar 1991
  Nagatomo|Yuto Nagatomo,   12 Sep 1986
  Uchida|Atsuto Uchida,   27 Mar 1988
  Konno|Yasuyuki Konno,   25 Jan 1983
  Kurihara|Yuzo Kurihara,   18 Sep 1983
  H. Sakai|Hiroki Sakai,   12 Apr 1990
  Yoshida|Maya Yoshida,   24 Aug 1988
  Masato Morishige,      21 May 1987   ## Japan F.C. Tokyo

Comments are marked with a double hash; key fields are either the player's last name or occasionally first initial, space, last name; and there are three different delimiters: pipe, then comma, then tab. Choosing either a consistently delimited format or a more verbose JSON/YAML structure with clear metadata would seem to be a better approach.

[1] https://github.com/openfootball/players/blob/master/asia/jp-...
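For anyone who wants JSON today, the format quoted above is regular enough to convert. The following is a rough sketch based only on the sample lines in this comment, not on the project's own parser. It assumes that a line starting with ## names a position group, that a trailing ## starts a free-text comment, and that the date follows the first comma; `parse_players` is a hypothetical helper name.

```python
import json
from datetime import datetime

SAMPLE = """\
## GK / Goalkeepers

Kawashima|Eiji Kawashima,   20 Mar 1983
Gonda|Shūichi Gonda,   3 Mar 1989

## DF / Defenders

G. Sakai|Gōtoku Sakai,   14 Mar 1991
Masato Morishige,      21 May 1987   ## Japan F.C. Tokyo
"""

def parse_players(text):
    """Turn the pipe/comma/whitespace format into a list of dicts."""
    players, section = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("##"):
            # Assumption: a line that starts with ## names a position group.
            section = line.lstrip("#").strip()
            continue
        # Assumption: a trailing "## ..." is a free-text comment.
        line, _, comment = line.partition("##")
        names, _, born = line.partition(",")
        short, sep, full = names.partition("|")
        if not sep:
            # No pipe on the line: only the full name is given.
            short, full = None, short
        players.append({
            "section": section,
            "short_name": short.strip() if short else None,
            "full_name": full.strip(),
            # %b expects English month abbreviations (default C locale).
            "born": datetime.strptime(born.strip(), "%d %b %Y").date().isoformat(),
            "comment": comment.strip() or None,
        })
    return players

players = parse_players(SAMPLE)
print(json.dumps(players, ensure_ascii=False, indent=2))
```

The same list of dicts could be written out with `csv.DictWriter` instead, which would cover the TSV/CSV request made elsewhere in the thread.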



all the scores are null

how often is the feed updated?


I'm guessing "not often enough"


The size of JSON files is huge compared to delimited data. Languages like Python make it equally easy to consume delimited data and JSON, so it shouldn't matter much.
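A quick way to see the size difference is to serialize the same records both ways and count bytes. The sketch below uses invented fixture rows (the field names and team names are placeholders), so the exact numbers mean nothing; the gap comes mostly from JSON repeating every key name in every record.

```python
import csv
import io
import json

# Made-up fixture rows; only the shape matters here.
rows = [
    {"date": "2014-06-12", "home": "Team %d" % i,
     "away": "Team %d" % (i + 1), "score": "1-0"}
    for i in range(100)
]

# JSON: a list of objects, so the four key names appear in every record.
as_json = json.dumps(rows)

# CSV: the field names appear once, in the header row.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["date", "home", "away", "score"])
writer.writeheader()
writer.writerows(rows)
as_csv = buf.getvalue()

json_bytes = len(as_json.encode("utf-8"))
csv_bytes = len(as_csv.encode("utf-8"))
print(json_bytes, csv_bytes)

# Both forms read back into the same records, which is the "equally easy
# to consume" half of the argument.
assert list(csv.DictReader(io.StringIO(as_csv))) == json.loads(as_json)
```

One caveat on the design choice: CSV only round-trips cleanly here because every value is a string. Nested or typed data is where JSON earns its extra bytes.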


At work I built a system to consume feeds of numerous automotive dealer inventories and the easiest to work with is always comma delimited. There are some people out there who have no business building an XML document, and unfortunately I've had to build adapters for many of them. It takes me a few hours to get set up to consume a new CSV feed and a few days for XML, not counting mapping their industry / category / manufacturer data to ours.


Also have a look at this one: http://www.football-data.co.uk/data.php


Yeah, it's pretty awesome. I built http://test.gmbl.io off that data set to learn to code. Good place to start.

Kickdex is also pretty awesome; they use the Opta data to produce real-time indices for teams and players.


I'm curious if this data is actually public domain. Where are they sourcing it from? Are they legally allowed to redistribute? Etc.


Why wouldn't they? It's just raw facts, presented in their own minimal style.


In the UK this rule applies: http://www.ipo.gov.uk/types/copy/c-otherprotect/c-databaseri...

    For copyright protection to apply, the database must
    have originality in the selection or arrangement of
    the contents and for database right to apply, there
    must have been a substantial investment in obtaining,
    verifying or presenting its contents. It is possible
    that a database will satisfy both these requirements
    so that both copyright and database right apply.

They would have a "database right" if they had placed a person at each match to gather the data and verify it, as that is a substantial investment.

How they originally acquired the data is important and shouldn't be presumed.

However that doesn't stop you from implementing your own database and re-acquiring the facts in some trivial way. Just bear in mind that accessing historical data may breach someone else's database right.

Database rights are usually proven by fake data inserted into the database to catch people copying it.

For example you could argue that the Rare Record Price Guide ( http://www.rarerecordpriceguide.com/ ) is just a collection of facts, and decide to copy it... but you'll discover when sued, that a few of the bands in the guide are fictional and designed to demonstrate that the database is theirs, and that it's not trivial to acquire and verify the data.
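Mechanically, the check a publisher would run is simple. Here is a minimal sketch, assuming records are (artist, title) pairs and that the publisher keeps a private list of planted entries; the function names and all data are hypothetical and say nothing about how any real guide does it.

```python
def normalise(record):
    """Compare case-insensitively and ignore surrounding whitespace."""
    return tuple(field.strip().lower() for field in record)

def find_trap_entries(suspect_records, trap_entries):
    """Return the planted entries that also show up in a suspect dataset."""
    suspect = {normalise(r) for r in suspect_records}
    return sorted(t for t in trap_entries if normalise(t) in suspect)

# Invented data: two planted records that exist nowhere in the real world.
traps = [("The Nonexistents", "Phantom Single"), ("Fake Band", "No Such LP")]

suspect_copy = [
    ("Real Artist", "Real Album"),
    ("the nonexistents ", "Phantom Single"),  # copied, with cosmetic changes
]

hits = find_trap_entries(suspect_copy, traps)
print(hits)
```

A single hit is hard to explain by independent collection, since nobody gathering the facts from scratch could have come across a record that was never released.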


Great replies joosters and buro9. Thanks.

So, for the sake of argument, if the dataset had no fake data then it would be OK? Or would they still need to demonstrate "substantial investment", no matter the state of the data?

If the latter, then that gets weird quick. How many lines of code is considered substantial? How many hours hunched over a microfiche machine? It sounds like it would ultimately depend on the skill of your lawyer.


You need to think of fake data as a broader term than you currently do. If we talk about play by play for American college football, you will notice how ncaa.com, espn.com, foxsports.com and others have slight differences in what a play's down/togo/time/etc. is. It is not as simple as ESPN inserting an entire fake team or fake game; compared to the last example, it would be a real record with a slightly modified price. I analyze college football data sets and can determine where they came from, so I have no doubt that companies can as well.

If you have enough data sources you could theoretically recreate a play by play from all of them and have a data set that would be difficult to prove was stolen from someplace in particular. I say theoretically because (at least with college football) you are often not given enough information to recreate the game (simple example would be how long a play took to execute to determine drive possession time), so often you are left using a best guess method.
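The "recreate it from all of them" idea amounts to a field-by-field vote across sources. A minimal sketch, assuming each source reports one play as a dict of fields; the source names, field names and values are invented, and `consensus` is a hypothetical helper.

```python
from collections import Counter

def consensus(play_versions):
    """Merge one play as reported by several sources, field by field.

    Returns the merged play plus the list of fields the sources disagreed on.
    """
    merged, disputed = {}, []
    fields = sorted({f for version in play_versions.values() for f in version})
    for field in fields:
        values = [v[field] for v in play_versions.values() if field in v]
        # Most common value wins; on a tie the first source listed wins,
        # which is exactly the "best guess" situation described above.
        top, count = Counter(values).most_common(1)[0]
        merged[field] = top
        if count < len(values):
            disputed.append(field)
    return merged, disputed

# Invented reports of the same play from three hypothetical sources.
versions = {
    "source_a": {"down": 2, "to_go": 7, "clock": "10:42"},
    "source_b": {"down": 2, "to_go": 8, "clock": "10:42"},
    "source_c": {"down": 2, "to_go": 7, "clock": "10:41"},
}

merged, disputed = consensus(versions)
print(merged)
print(disputed)
```

A vote can only choose among values that some source reported, so information missing from every source (such as how long a play took) still has to be estimated separately.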


I think you miss the point. If you have to start arguing what is and isn't a substantial effort then you're probably going to fall foul.

This is an open source schema for storing data, why not re-acquire the data from a fresh source and make that open source too? This avoids pulling it from an existing and potentially protected source.

You could have members of the public individually enter historical scores, and each one provide proof of the score (e.g. a photo of a result in a newspaper or a photo of the matchday guide).

You could verify correctness of that acquired data by comparing to a few known data sources (even if they were protected). So long as you were close enough in fact to not alter history it was probably right, and correctable in the future (editable like a wiki).
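That verification step could look something like the sketch below. It assumes a submission is a match key plus a score and that each reference source is a simple lookup from match key to score; the threshold of two agreeing sources, the status strings and all data are placeholder choices, not a recommendation.

```python
def check_submission(submitted, references, min_agreeing=2):
    """Classify a crowd-sourced score by comparing it to reference sources."""
    match = submitted["match"]
    known = [ref[match] for ref in references if match in ref]
    agreeing = sum(1 for score in known if score == submitted["score"])
    if agreeing >= min_agreeing:
        return "accepted"
    if not known:
        return "unverified"  # no reference covers this match at all
    return "needs review"    # too few references agree with the submission

# Invented reference data keyed by (date, home, away).
key = ("1975-04-26", "Team A", "Team B")
references = [{key: "2-1"}, {key: "2-1"}, {key: "2-0"}]

print(check_submission({"match": key, "score": "2-1"}, references))
print(check_submission({"match": key, "score": "3-3"}, references))
print(check_submission(
    {"match": ("1975-05-03", "Team C", "Team D"), "score": "0-0"}, references
))
```

Only the verdict is stored, never the reference values themselves, which keeps the open dataset built from what contributors entered and evidenced.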

If you use one of the existing datasources you'll find yourself with a lawsuit if you reach any reasonable size.


There was a court case regarding "database rights" for horse racing data in the UK, where the British Horseracing Board sued William Hill (a bookmaker) and lost the case. The BHB basically wanted to charge for the right to publish the horse racing fields (i.e. details of each race, the runners, etc).

Details of the case law here: http://www.out-law.com/page-392 And of the court case: http://www.out-law.com/page-5055

The lawsuit backfired completely. The BHB had wanted to charge newspapers for publishing horse racing fixtures, but it inspired the newspapers to turn around and question why the BHB wasn't paying them for devoting pages to the sport.

(Looks like the case involved football fixtures as well.)


An entity called Football Dataco, representing the English and Scottish professional leagues, claimed these leagues' fixture lists and results fell within the scope of their IP even if sourced independently, and they successfully extracted large licensing fees from bookmakers and publishers whilst sending everyone else C&D letters.

There's been quite a bit of legal back and forth:

- http://www.football-dataco.com/
- http://www.bbc.co.uk/news/business-17218968
- http://www.twobirds.com/en/news/articles/2012/football-datac...


Historically, Football DataCo claimed rights to football fixtures in the UK, on grounds of copyright and sui generis database rights, and extracted licensing fees from those wishing to publish them.

That practice ended with the March 2012 ruling in the European Court of Justice [1] that neither right subsists in relation to fixture lists.

However, in a separate ruling [2] Football DataCo successfully argued they did have a database right over live data concerning matches (e.g. goals, goalscorers, cards, etc.).

[1] http://curia.europa.eu/juris/document/document.jsf?docid=119...

[2] http://www.bailii.org/ew/cases/EWHC/Ch/2012/1185.html


I was on the receiving end of a DataCo C&D and had to stop my quite-popular, burgeoning 'prediction game' site - was a huge shame. When you say "goals" in relation to [2], does that mean results cannot be published, or does that only relate to specific times of goals?


Great question. I've long had an idea for a fantasy football type game and shied away from using Premiership football results because of this ruling. However, thanks to the links UVB-76 provided, I've now spent a bit of time going through the details and I'm confident that simple goal and timing data do not fall under this, simply because they do not meet the 'substantial' requirement, as made clear in paragraph 76. It seems uk.practicallaw.com has a great comment on the finding, also supporting this argument.


Yeah, me too. The official statistics of sports leagues are rarely ever in the public domain. Official being the key word here.


This is an interesting lecture [1] at Linuxwochen Wien 2013 that focuses on the usage of football.db. More data should be put into public domain.

[1] https://cfp.linuxwochen.at/en/lww2013/public/events/61


I did something pretty similar, but it seems definitely less comprehensive: https://github.com/llimllib/soccerdata/ . Will be using this, thanks!


This is really cool! Does anyone know if there are similar datasets for other sports out there? Even less clean datasets, as long as they have permissive licensing to allow sanitation and republication.


The gold standard for freely-available sports data is baseball, with the Retrosheet project:

http://retrosheet.org/

The license on the data is a pretty permissive one, simply requiring attribution of the data to the Retrosheet project. Software to process Retrosheet files is available, under the GPL:

http://chadwick.sourceforge.net/doc/index.html


I'll add this: Sean Lahman's Database is also widely used, though it's mainly whole-season statistics, not game by game, along with post-season, all-star games, schools and salaries.

http://www.seanlahman.com/baseball-archive/statistics/

Then of course MLB has a bunch of data here; the PitchF/X data since 2008 is mainly gathered from here.

http://gd2.mlb.com/components/game/mlb/


There are several scrapers to parse the MLB XML data; the most popular (I think) is Baseball On A Stick, in Python:

http://sourceforge.net/projects/baseballonastic/


Thank you for sharing this! I just started looking for MLB Data yesterday. Great timing!


Awesome, thanks for this. Do you know if there are any good sources of historical betting line info for baseball? It would be fun to put together a quantopian-type site for sports betting.


I believe ESPN has NBA play by play data.


ESPN is not going to license or allow you republication rights to their data. ESPN data is also extremely difficult to sanitize (play by play does not match up with box scores, is just plain wrong, different formats, terrible html).


This, this, this. I used their (American) football data for a hobby project and it would take 30 minutes to an hour to clean one week's worth of games. Hypothetically, I found the Sporting News to have much cleaner data.


Which is really a shame since their app offerings are awful.


This looks cool. I see Gold Cup and NA Champion's League repos. Is there a plan to add MLS data? I know some people who would be super excited to get baseball-reference.com level data for MLS.


Is there anywhere to get real-time play-by-play data?


There are several, and you'll pay a lot of money for them.


That's true. There are some circumstances in which Opta let you do interesting things with non-realtime data though:

http://www.optasports.com/playground-section.aspx


Betradar / Sportradar has it.


I don't know where this came from, but I like open formats. Where do the data come from?


It's a shame that this is not being done under the Wikidata framework. Those guys have been thinking about databases like this for a while, and can be reliably trusted to at least keep it up for a reasonable amount of time.


Where's Derby County's stats?! Just kidding this looks great!


awesome, exactly what I need


perfect timing. thanks!



