When Sports Betting Is Legal, the Value of Game Data Soars (nytimes.com)
137 points by jbredeche on July 4, 2018 | 69 comments



"Some legal experts, like Ryan Rodenberg, an associate professor of sports law at Florida State University, believe that, as with musical recordings and other copyrighted material, courts will find that real-time sports data is owned by those who produce it: the leagues and their players.

Others dismiss that view. Marc Edelman, a professor of law at Baruch College, said he believed that only "pre-scripted" events were subject to copyright — meaning that while professional wrestling performances might qualify, football, basketball and other true competitions would not."


I don't think either would fall under copyright. As to the first, the Second Circuit has held (in a widely cited case) that NBA game data is not subject to copyright: https://en.wikipedia.org/wiki/National_Basketball_Ass%27n_v..... The guiding rule, under the Supreme Court's Feist case, is that facts cannot be protected. What happened at an NBA game is a fact; the fact itself cannot be copyrighted. Only an expression of that fact (an article or radio segment reporting on it) can be protected.


Fixed link: https://en.wikipedia.org/wiki/National_Basketball_Ass%27n_v....

> The district court held that Motorola and STATS did not infringe NBA's copyright because only facts from the broadcasts, not the broadcasts themselves were transmitted. The Second Circuit Court agreed with the district court's argument that the "[d]efendants provide purely factual information which any patron of an NBA game could acquire from the arena without any involvement from the director, cameramen, or others who contribute to the originality of the broadcast" [939 F. Supp. at 1094].

It’s a really fascinating line that the court drew. The concept that a certain player scores a 3 pointer with around 2 seconds left on the clock is clearly a fact knowable to every patron. But what about the exact location from which they took the shot? What about the number of milliseconds the ball was in the air, or the angle at which the shot was launched? These are facts that mere humans cannot access with accuracy; they require the involvement of cameras, camera operators, and other entities characteristic of a copyrightable expression of the fact. If I were to use these to create a 3D simulation of a game, would that not be a derivative work of the film from which I obtained sufficient facts to make that simulation? Where do we decide that pixels on a video feed are more special than other parts of the content?


> Where do we decide that pixels on a video feed are more special than other parts of the content?

It comes from the Copyright Act: 17 USC 102 states that copyright protects "original works of authorship fixed in any tangible medium of expression." It also states that in "no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work."

The way this has been interpreted is as a dichotomy between ideas/facts and expression. You might use very expensive equipment to, for example, map the ocean floor. That data cannot be copyrighted. But a video visualization of that data could be copyrighted.

The result is, as you point out, somewhat unintuitive in that case, since generating the data is the hard part, not creating the visualization. But that's where Congress chose to draw the line in order to circumscribe the scope of copyright.


> The result is, as you point out, somewhat unintuitive in that case, since generating the data is the hard part, not creating the visualization. But that's where Congress chose to draw the line in order to circumscribe the scope of copyright.

And then the fun part, where data owners intentionally add false data to catch people. As a false item, it's not protected under the fact exclusion, so any inclusion of it is clear infringement.


Wow really? That's devilishly ingenious. Do you have examples of that?


Here's an example in the cartography world:

https://en.wikipedia.org/wiki/Trap_street


Yeah, there are equivalents in the yellow/white pages industry, from what I've heard. That wikipedia page shows that it doesn't always work, but it is an interesting idea.


What if I use the game/player data to make deep learning models for betting predictions? Could I argue that my (as an athlete) decision making process during play is my intellectual property and any attempt to reverse engineer it is prohibited?

To put it another way, if I build a stock-picking algorithm, the “facts” about what trades it makes are public knowledge, but can it be argued that the algorithm can be protected from reverse engineering attempts?


> NBA game data is not subject to copyright

That is pretty fascinating. In horse racing (where paying for past performance information has been standard operating procedure for more than a century), there is (at least in the US) exactly one company that holds and brokers that information, and it is quite a profitable business.

It's interesting that they theoretically can't claim copyright to protect that business.


This sounds like it would be a similar situation to that of maps.

The fact of what physical geography exists cannot be copyrighted. Anyone is free to produce their own map by measuring the physical world. In theory if everyone used accurate measurements, then all maps would be identical. However a map itself can be copyrighted, and producing your own map by copying another for which you don't have permission is illegal.

But how would anyone ever know whether you went out and measured the real world or just copied someone's map, given that the underlying data is the same?

Historical horse racing data is the same. You could start gathering all the data by recording the outcome of races, and in 10 years you would have 10 years of historical data. Or you could just pretend you have already done that and actually just copy the data from the other company. The result would be indistinguishable.

'Trap streets' are one way the mapping industry deals with this problem - https://en.wikipedia.org/wiki/Trap_street - so I would not be surprised to find some deliberate mistakes in the past performance data provided by this company. Should they be found in your data set, they would be pretty conclusive evidence of the source of your data.

"What colour are your bits" is also a relevant and interesting article - http://ansuz.sooke.bc.ca/entry/23 - Despite what techies would like to believe, the source of your data does matter, even if multiple different sources would actually result in an identical block of data.


Today I learned that you don't have a "database right" in the US.


Well the closest analogy that I know of is market data from stock exchanges.

This seems close to sports data, in that it's not "pre-scripted", and the company that runs the market owns that data and sells it. In a lot of cases it's one of the larger sources of revenue.

I can't see how the stock exchange can own that data but the sports leagues can't own the data about their leagues. Though as someone who used to do sports betting I wish this was the case:)

In the sports analogy the market participants are the same as sports fans watching the game. They get to see the plays unfold (market data) but have to pay for that "privilege".

In that case, even if you are a participant of the stock exchange, you still need to pay for the market data and historical data. Sports leagues could piggyback off this, I guess. Seems like a stretch to me, but I'll bet some league tries this.


The stock exchange is literally the only party which even has the order flow data, whereas sporting events happen in arenas where thousands of people witness them. I wouldn't think a stock exchange could sue me for publishing order flow that I was personally party to, since that would seem draconian, but maybe I'm wrong. Likewise, I don't see how somebody can sue me for recording the fact that I saw Altuve go 2 for 3 with a homer and a single the other day. Again, maybe I'm wrong. Intellectual property doesn't really make much inherent sense, since it's all arbitrary and socially constructed, so lawyers and judges can often come to what ordinary people would think of as bonkers conclusions.


Stock market data is probably not protected either, as they are facts (see my discussion of Feist above). The Second Circuit has not addressed stock market data specifically, as far as I am aware, but held in Barclays that banks' assessments regarding the "target price" of various securities are not protectable: https://scholar.google.com/scholar_case?case=390572068401141....


I'd normally just upvote but I do appreciate the link, as I wouldn't have found this on my own, thx!


The stock exchange gathers, arranges and publishes the data, and that's what's copyrighted*, not the actual event itself.

If you were to be able to access and compile the data independently you could publish it and claim copyright.

This is similar to how the guys in the article are being ejected from the stadia ... because they're controlling access to the source data.

* I am not a lawyer.


You never need the disclaimer that you are not a lawyer.


I agree. Even if you're a judge on the supreme court* , there's no such thing as perfect legal advice because no one can guarantee an outcome.

* Not a judge on the supreme court


It's more to protect me from people coming back to me saying, "Here, you said X and Y, and I had my ass passed around the court before it was handed back to me by a judge."

I'm not a lawyer, so if you're going to court based on what I say please consult with somebody who is.


This is solid legal advice*

* I am not a lawyer.


Betraying my age ... it’s a 90s thing


Especially when it comes to new domains where almost no related cases have ever been brought. No lawyer can tell you the right answer for sure.


Yes, it seems like quite a departure from what copyright has been applied to so far ... I wonder what unforeseen and far-reaching consequences could emerge from pivoting the whole canon of copyright law to ringfence the value in such a niche area.


Maybe people could start claiming copyright over their personal history. If I start running around in artistically meaningful patterns, can I claim that all of the people who track my location and sell it are infringing? I could start a game of Calvinball and be protected until it finished.


You probably signed a contract somewhere allowing location trackers to use your data. Buried somewhere in the terms of service.


My signature is a copyrightable artistic expression and not evidence of consent.


They're way ahead of you. As such you're not entitled to use their service.

Remember when Spotify sought a blank cheque to do what they wanted with your personal data?


There has been a similar debate in professional chess as well. Grandmaster Evgeny Sveshnikov has spoken out about game scores (the sheets of moves that the players write during the game) being the IP of the players. His arguments haven't really caught on, and so far high-level chess games have essentially been in the public domain as soon as they're played.


Surely ancillary data I can glean from watching sports broadcasts is open?

For example, if I use convolutional neural networks to parse broadcast video and derive statistics from that data, how could I not own that information?


As it stands, I believe you own copyright for that data. You've gathered it, composed it and published it. That's what's copyrighted.

Somebody else could independently gather that data and produce alternative, even identical, results, and their work would be copyrighted too.

I am not a lawyer but that's how I understand it.


This is how other factual data copyrights tend to work (such as maps and phone books).

It's not a copyright violation to collect and share the same information. It _is_ a copyright violation to simply copy and republish the data from a given source.

So how can a map-maker tell that someone copied them instead of collecting their own information? They insert watermarks in the form of inaccurate data (such as inserting a small city that doesn't actually exist). See: https://www.google.com/amp/s/gizmodo.com/the-fake-places-tha... and https://en.m.wikipedia.org/wiki/Fictitious_entry.

I wonder what kind of fictitious entry you could inject into sports data that wouldn't compromise the quality of the data too severely?


Sports data would be pretty easy to watermark - a fictional athlete here, a score for a game never played there, etc. None of it would ever be fully guaranteed to not cause someone a problem some day - “no dude, I swear the 1968 Lakers played a pre-season game against the Pistons and lost 123-98” - but there’s enough data that you could watermark it.
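To make the trap-entry idea concrete, here is a minimal sketch of how a publisher might check a suspect dataset for planted records. Everything in it is hypothetical: the `GameResult` record, the team names, and the scores are made up for illustration, and a real publisher would presumably keep the trap list private.

```python
from typing import NamedTuple


class GameResult(NamedTuple):
    """One published result. The fields are an assumed, simplified schema."""
    season: int
    home: str
    away: str
    home_score: int
    away_score: int


def plant_traps(real, traps):
    """Return the published dataset: real results plus fictitious ones."""
    return sorted(set(real) | set(traps))


def find_traps(suspect, traps):
    """Return the fictitious entries that show up in a suspect dataset."""
    return sorted(set(suspect) & set(traps))


# Made-up data: two real games and one game that was never played.
real_results = [
    GameResult(1968, "Team A", "Team B", 101, 99),
    GameResult(1968, "Team C", "Team D", 88, 92),
]
trap_results = [GameResult(1968, "Team A", "Team D", 98, 123)]

published = plant_traps(real_results, trap_results)

# Someone who recorded the games themselves has only the real results.
independent = list(real_results)
# Someone who copied the published file also carries the trap entry.
copied = list(published)

print("independent:", find_traps(independent, trap_results))
print("copied:", find_traps(copied, trap_results))
```

An independently gathered dataset comes back with no matches, while a copied one contains the planted game. That only shows where the data came from; whether that amounts to infringement is the legal question the thread is arguing about.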


Well there's also watermarking for images too right? The idea is that you embed a subtle pattern to the data which only becomes apparent when you apply a particular filter. This apparently operates in such a way as not to compromise the visible image so presumably sports data could be jimmied similarly.


This annoying navigation of copyright law is why I simply ignore it entirely. Copyright is irrelevant in my world view.


Broadcast video is delayed by 10 seconds. Apparently, that's enough to make the non-delayed stream extra valuable.


The delay on broadcast video makes it excellent for anyone who can get ahead of it. Everyone else bets at 10 second delay, but that's plenty of time for the odds to change.


Funny, I was just complaining about how I find pro baseball dull because they play with such skill and training that it basically is scripted. And on the other hand, I've seen documentaries about serendipitous happenings making movie scenes work. So it would seem that perhaps sports is theater and film-making is luck.

Maybe Marc Edelman thinks we've got this copyright thing all backwards.


They are wrong, or at least they are right only in an extremely narrow context. The data, as recorded, is subject to copyright. But the facts behind that data are not. The league can copyright its real-time news feed, which includes things like scores. But those scores are not copyrightable alone. People watching the game are allowed to tell other people the score.

Now if the game is truly rigged (ie pro wrestling) then the "score" is part of a prewritten script. That's different. But even there, fair use is a thing. One cannot stop someone from paraphrasing the plot of a script, such as in a review. So while experts like Rodenberg may be correct, they are correct in such a narrow way that it doesn't matter.


This is not new. Sports betting has been legal in most parts of the world for some time. Football data in Europe is so expensive that very good businesses like Opta can make a very nice living from it. Go get a quote from them for Premier League games alone: thousands a year for only some of the data.

For years, people getting into sports analytics have done so via baseball (and the sabermetrics community) and the NBA, because the data has not been seen as having commercial utility. It's been collected by fans.

That will change dramatically, but it should be resisted. Leagues and players should embrace open data because it will in the long-term lead to analysis that helps them, but more importantly, fosters a deeper interest in their game and therefore makes their own careers more valuable.


> Football data in Europe is so expensive

Depending on which data you need, there are already some good sources of free football data.[1][2][3]

Someone has also conveniently wrapped much of this in an R library.[4]

Football is actually one of the better sports in terms of easily obtainable data at no cost. Extensive datasets for rugby are much more difficult to find, although there are some interesting attempts.[5]

Decent cricket data also exists in a few places[6], but generally requires faster and more regular updating. However, there are R libraries for cricket data too.[7] This one scrapes from the ESPN Cricinfo site.

It is possible to obtain horse racing data for the UK and Ireland at a reasonable price, for personal use[8] and Hong Kong does a great job of making a huge volume of horse racing data available at no cost, but not in a particularly machine usable format (extensive scraping required). Sadly, other large racing jurisdictions such as Australia and the US don't have anything free, or even reasonably priced, as far as I'm aware. Ray Paulick has covered this as a general problem for the sport for a few years now.[9]

[1] http://www.football-data.co.uk/data.php

[2] https://github.com/openfootball

[3] https://github.com/jokecamp/FootballData

[4] https://github.com/dashee87/footballR

[5] http://api.drop22.net/

[6] https://cricsheet.org/

[7] https://github.com/tvganesh/cricketr

[8] https://www.betwise.co.uk/

[9] https://www.paulickreport.com/news/the-biz/gardner-horse-rac...


I would argue that almost all of that information in your post is stats not data.

The type of data that people in this thread are talking about would be more in-line with detailed positional information about each of the players on a football pitch over 90 minutes. In a cricket context, it would be more along the lines of the exact release angle and speed for each of the bowlers.

This type of information is clearly available, as Michael Caley is able to quickly generate xG maps for an entire game[1], but I do not believe it's public.

Your [9] link points out that much more information is available to baseball betters, but even baseball has a significant walled garden in terms of data. For example, the raw data used to generate the stats in [2] is not open to the public.

[1] https://twitter.com/caley_graphics

[2] https://www.youtube.com/watch?v=tzPKlQXo6hk


You make a good point and my post requires clarification.

My links were all to post-event data, not live in-play data sources. I still wouldn't call, for example in a cricket match, the number of wickets taken by a bowler a stat. It's just data. A stat is derived from the data, for example bowling strike rate or economy. Or take the fact that a trainer had a winner at a certain race track: that's just post-event data. If you want to derive further statistics, you have to calculate them yourself.[1]

The links above just have, for the most part, raw event data.

[1] https://blog.betwise.net/2018/06/19/loops-with-r-creating-a-...
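As a rough illustration of the data-versus-stat distinction being argued here, the sketch below derives wickets, strike rate (balls bowled per wicket) and economy (runs conceded per six-ball over) from ball-by-ball records. The `Ball` record and the sample spell are invented, and it assumes every delivery is legal (no wides or no-balls), which real scorecards would have to handle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ball:
    """One delivery: the raw event data (hypothetical, simplified schema)."""
    bowler: str
    runs_conceded: int
    wicket: bool


def bowling_summary(balls, bowler):
    """Derive summary stats for one bowler from ball-by-ball records."""
    own = [b for b in balls if b.bowler == bowler]
    n_balls = len(own)
    runs = sum(b.runs_conceded for b in own)
    wickets = sum(1 for b in own if b.wicket)
    return {
        "balls": n_balls,
        "runs": runs,
        "wickets": wickets,
        # Strike rate: balls bowled per wicket (undefined with no wickets).
        "strike_rate": n_balls / wickets if wickets else None,
        # Economy: runs conceded per six-ball over.
        "economy": runs / (n_balls / 6) if n_balls else None,
    }


# Made-up spell of two overs, for illustration only.
runs_per_ball = [0, 1, 0, 4, 0, 0, 1, 0, 2, 0, 1, 0]
wicket_balls = {2, 9}  # indices of the deliveries that took a wicket
balls = [
    Ball("Bowler A", runs, i in wicket_balls)
    for i, runs in enumerate(runs_per_ball)
]

summary = bowling_summary(balls, "Bowler A")
print(summary)
```

Every number in the summary can be recomputed from the list of deliveries, but the deliveries cannot be recovered from the summary, which is why the granularity of the source matters.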


The number of wickets taken is a stat. The raw data that informs it is the collective set of all balls bowled by a bowler.

I'm not being needlessly pedantic, it's an important distinction when considering the level of analysis that one is able to perform. If you are doing major cricket analytics, you need ball-by-ball information, including as much information about the bowler's position, movement and arm motion, batter's position, movement and stroke information, how the field is set up, conditions of the pitch, situation in the match, etc.

For example, consider a situation where we're attempting to compare two bowlers. Bowler A may have got a wicket off a shot that 95% of batters would not play, whereas Bowler B did not get a wicket despite bowling a ball that achieves a wicket 10% of the time. The stats suggest that bowler A is in better form, but a data-driven view of the game suggests that bowler B is actually in better form.

As it stands, stats are available in abundance for every major sport, but detailed data is not. If a better had access to the latter, and they were able to parse it with an in-depth understanding of the sport, they'd be at a huge advantage versus betters that did not, and they would reap the benefits.
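The two-bowler comparison can be sketched as "actual versus expected wickets", in the spirit of the xG maps mentioned earlier in the thread. The per-ball wicket probabilities are assumptions: 0.10 is the figure given for Bowler B, and 0.05 for Bowler A is one reading of "a shot that 95% of batters would not play". A real model would have to estimate these from tracking data.

```python
def actual_and_expected(deliveries):
    """Return (wickets taken, expected wickets) for a list of deliveries.

    Each delivery is a (wicket_probability, took_wicket) pair, where the
    probability is an assumed model output for a ball of that quality.
    """
    actual = sum(1 for _, took_wicket in deliveries if took_wicket)
    expected = sum(probability for probability, _ in deliveries)
    return actual, expected


# One delivery each, using the probabilities assumed above.
bowler_a = [(0.05, True)]   # poor ball, but the batter gave the wicket away
bowler_b = [(0.10, False)]  # better ball, but no wicket this time

for name, deliveries in [("A", bowler_a), ("B", bowler_b)]:
    actual, expected = actual_and_expected(deliveries)
    print(f"Bowler {name}: actual={actual}, expected={expected:.2f}")
```

On the stat (wickets) Bowler A is ahead, while on expected wickets Bowler B is ahead. The second view is only available to someone holding ball-level data.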


This is a good list, but as others have said, is not the level of detail I'm talking about.

Take the NBA for example. Let's look at this: http://toddwschneider.com/posts/ballr-interactive-nba-shot-c... - this is able to give you super detailed analysis thanks to the NBA's stats API.

The equivalent from Opta is thousands a year per competition. I was fortunate enough to get to play with detailed Opta data and ChyronHego data as part of a Man City hack day a couple of years ago. The latter data simply isn't commercially available.

For cricket, you can do something interesting with ball by ball data, but ideally you want ball tracking data. You want to know speed of release, length, speed and movement after the ball has pitched, and speed after interaction with the batsman along with angles, etc. - and that's just to get started. Ideally you want positional data on fielders, etc. too.

Don't get me wrong, this is a great starting set to get people interested, but there's a way to go for high-quality data being accessible to the hobbyist or academic researcher (although I believe Opta gives academics discounts to help make them "the" standard for clubs, etc.)


Opta data cost is nothing compared to RunningBall (also part of the Perform group).

RunningBall is all the real time data - you pretty much can't run an in-game book without it. It practically runs the in-game betting world.


I think it will also discourage the type of cheating that is almost certainly coming with our society’s new, open embrace of gambling.


This ... doesn't make a lot of sense. American athletes that matchfix will not be using American gambling sites that work heavily in tandem with the FBI.

It would be an extremely dumb cheater that only began to cheat because they could conveniently bet on a site in a state that has jurisdiction over them.


Insider trading happens all of the time, despite it being pretty easy for authorities to track down the participants when they care to do so.

You don’t even have to match fix to alter behavior in a way that has a financial return for things like fantasy sports that are stat based.


I'm not suggesting that match fixing (or point shaving or spot fixing) does not happen. There's a mountain of evidence to the contrary.

I just don't think there will be a massive increase because American sites allow gambling. Asian sites move billions of dollars a year in largely unregulated betting markets. Athletes that wanted to cheat could already do so in relative safety. They would be foolish to start cheating in a situation where both the likelihood of them being caught is increased and the consequences of them being caught are worse.


Most people generally don't have great opsec and foresight when committing crimes. I guarantee you or I would get caught in an attempted scheme because of a single mistake.


You are more than welcome to join in. I've started two open sports data initiatives. The World Cup is the world's biggest sports event (3+ billion fans) but open data (or data services) are hard to find. See the football.db [1] or football.csv [2] projects for more. Enjoy the beautiful game with open data :-). [1] https://github.com/openfootball [2] https://github.com/footballcsv


This [1] was an interesting article from 3 years ago about tennis "Court Siders" - people paid to transmit results of games back to betting syndicates faster than the official bookmaking results services. Kind of low-latency sports betting...

[1] https://www.bbc.co.uk/news/magazine-32402945


“These courtsiders and scrapers operate in the shadows, compromise the legal market, fuel the illegal market and have no vested interest in the integrity of sports.”

It is amusing to see integrity and sports betting in the same sentence. You created a game where people are incentivized to make bets on the next action. What do you expect? Of course people will want to gain an edge any way possible.

Oh this is gambling. The House has to win every time. Sorry I forgot.


Yeah, it's kind of weird. All you are doing is reporting the "market data" faster than the official feed.


I wonder about the technical details. Data gathering hardware, just one button or two or more? What data is important to collect? Just points? What is the edge in doing this? Data quality? Latency? How much faster are you than TV feeds?


It is likely used with live betting systems (common on many gambling sites) where you can bet on the outcome of a single game of tennis. The line for the game will move with each point to reflect the new situation. If a better is able to get information about points seconds ahead of the other betters, they are able to bet at the old odds before the line moves as the rest of the betters react to the new information.

For example, consider a game with evenly matched opponents. At 0-0, the server might have odds at ~1.5, with the non-server having odds at ~2.7[1]. When the score moves to 0-15, the odds might move closer to even, e.g. 1.9/1.9. If you're able to get information about the first point ahead of the crowd and place a bet on the non-server at 2.7 when the true "predicted outcome" is closer to 1.9, you obviously have a massive advantage.

This can also be used to bet on sets or matches, but the advantage is much smaller. Still, a better with any sort of advantage will always win over the long term.

[1] Yes, those don't add up to 100%, welcome to the vig :)
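The arithmetic behind that example can be worked through in a few lines. This is a sketch under stated assumptions: the odds are decimal odds, the vig is removed by simple proportional normalisation (one common simplification, not necessarily what any bookmaker does), and the function names are made up for this illustration.

```python
def implied_probabilities(decimal_odds):
    """Convert decimal odds to the probabilities they imply (vig included)."""
    return [1.0 / odds for odds in decimal_odds]


def overround(decimal_odds):
    """How far the implied probabilities exceed 100%: the book's margin."""
    return sum(implied_probabilities(decimal_odds)) - 1.0


def fair_probabilities(decimal_odds):
    """Strip the vig by scaling the implied probabilities to sum to 1."""
    raw = implied_probabilities(decimal_odds)
    total = sum(raw)
    return [p / total for p in raw]


def expected_value(decimal_odds, win_probability):
    """Expected profit per unit staked at the given odds."""
    return win_probability * decimal_odds - 1.0


before_point = [1.5, 2.7]  # server, non-server at 0-0
after_point = [1.9, 1.9]   # server, non-server at 0-15

print(f"vig at 0-0:  {overround(before_point):.1%}")
print(f"vig at 0-15: {overround(after_point):.1%}")

# Bet the non-server at the stale 2.7 once you know the score is 0-15.
p_non_server = fair_probabilities(after_point)[1]
print(f"EV per unit staked: {expected_value(2.7, p_non_server):.2f}")
```

With these numbers the 0-0 prices carry roughly a 3.7% margin, and backing the non-server at 2.7 when the vig-free probability is 0.5 is worth about 0.35 units per unit staked, which is the "massive advantage" described above.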


In this particular instance (I know the guys who ran that tennis betting syndicate), they had a tennis model that, if you told it who was about to score a point, would tell you roughly what the prices in the market should move to. If you can get that "about to score" info before the rest of the market, you slam in as many bets as possible based on the reasonable chance that the market will move in the correct direction and end up getting 1.4 for something that should be 1.35. Do that all day on every tennis match you can and that's a lot of profit. You don't need spectacular statistical ability, you need minimum latency and execution skill.
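To see why a gap as small as 1.4 versus 1.35 adds up, here is a back-of-the-envelope sketch. It assumes unit stakes, independent bets, and that 1.35 is the true vig-free price; none of that comes from the syndicate itself, and the function names are invented.

```python
import math


def per_bet_stats(price_taken, fair_price):
    """Mean and standard deviation of profit on one unit-stake bet."""
    p_win = 1.0 / fair_price          # win probability implied by fair price
    win, loss = price_taken - 1.0, -1.0
    mean = p_win * win + (1.0 - p_win) * loss
    variance = p_win * win**2 + (1.0 - p_win) * loss**2 - mean**2
    return mean, math.sqrt(variance)


def total_stats(n_bets, price_taken, fair_price):
    """Mean and standard deviation of total profit over independent bets."""
    mean, sd = per_bet_stats(price_taken, fair_price)
    return n_bets * mean, math.sqrt(n_bets) * sd


edge, sd = per_bet_stats(1.4, 1.35)
print(f"edge per bet: {edge:.2%}, sd per bet: {sd:.2f}")

for n in (100, 10_000):
    total_mean, total_sd = total_stats(n, 1.4, 1.35)
    print(f"{n} bets: expected profit {total_mean:.1f} +/- {total_sd:.1f}")
```

The expected profit grows in proportion to the number of bets while the spread grows only with its square root, so a few percent per bet that is noise over 100 bets becomes a fairly reliable return over thousands.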


I knew someone who was paid for this. He attended tennis matches and transmitted by cell phone things like set-defining points (change real time betting odds instantly) and player injuries.


Several of the answers to the questions you’re asking are given in the linked article.


Sort of like the big con in the movie The Sting, but real.


I think it's absurd to try to control "data scraping" at events. Just like you should be able to scrape the public facing pages of any website (as recent legal cases have shown), you should be able to attend an event and collect any data you like.


I'm excited to see what decentralized technologies like Augur and Gnosis do to disrupt some of these conversations. The question raised in the article about "who adjudicates data disputes" is one of the main features of Augur with its decentralized oracles. Also if people can anonymously use a decentralized system to change the odds, it decreases the value of data from these paid sources.


I am a consumer of such real time data and I can tell you from experience that where there is no competition quality drops and prices increase. I really, really hope they leave the market open.


> Data on the second-by-second action — exactly when a goal is scored, where it landed in the net, who had the assist — creates manifold betting opportunities.

Will there be high-frequency sports betting, like HFT on financial markets? Why not?


Go look at the betfair.com exchange (if you are able). They handle several thousand bets/sec at high load. Plenty of high frequency traders there employing the same tricks as the city boys.


As an aside, my friend is a full-time sports better who has an extremely high ROI during football season on one of those fantasy sites. He has difficulty making any money on other sports. I wonder why.


Variance. Unless your friend has sustained this performance over a number of years, it's probably just variance.

Most successful long-time sports betters are not playing on a level playing field. In order to beat the vig over the long term, a successful better must be privy to information that is unavailable to other betters, or at least the majority of betters. This either means a novel analytical technique, which is rare, or inside information.

People who simply watch games closely and are able to discern information that allows them to bet successfully on future games over the long term are exceptionally rare in the real world.


Similarly, my friend is good at driving cars but isn’t much of a helicopter pilot.



