The Empty Promise of Data Moats (a16z.com)
143 points by lxm on May 12, 2019 | 34 comments



Most of the leverage is in the efficiency of exploiting data, not just in technical terms but also in operational economics. This is far more important than the data itself, for a fundamental reason that has not been broadly internalized by the industry at large and which makes it difficult to build a data moat.

Almost any proprietary data model based on proprietary data can be reconstructed from unrelated external data sources with sufficient fidelity to be competitive with that proprietary model, at the cost of being somewhat more expensive to engineer, all things being equal. I've demonstrated this many times in practice. As a corollary, a company with sufficiently efficient end-to-end data infrastructure could, in principle, commoditize all proprietary data model companies. This company does not exist yet, and it would require some very specialized talent, but every necessary ingredient exists.

This is a realizable endgame due to the reality that virtually all data companies, for good and practical reasons, are strongly incentivized to build on infrastructures that are literally orders of magnitude less efficient than is possible in principle. A company that was purpose-built on exceptional end-to-end data infrastructure engineering could capture much of the revenue in these markets in a surprisingly short time by commoditizing the data model and arbitraging efficiency.


This is really interesting and thought-provoking, but I'm skeptical.

1. In my experience, each data source and each data format requires a lot of custom work. Each kind of prediction task requires additional custom work. I don't see this going away. Even if a company develops a solid core of reusable engineering infrastructure, it will always need to be adapted to the problem at hand. At this point, this company would seem more like a consultancy, with non-trivial marginal/variable costs. This reminds me of Palantir, which operates this way - core set of tools and infrastructure, consultants implement and apply these tools/infra at each client company with a lot of custom integration work.

2. Assuming this is not a problem and the shared infrastructure is able to generalize enough of the custom work to be feasible, this thesis actually seems like an argument for the big tech companies dominating all data companies. Google, Microsoft, and Amazon have the engineering talent and resources to develop this hypothetical infrastructure. They also have the internal political will because they can then expose this as cloud APIs. Indeed, it appears they are already attempting this in certain domains.

3. Superior engineering infrastructure is indeed a competitive advantage, but isn't enough of a moat for a single company to dominate this space. Yes, great engineers are hard to find, but there are enough of them for more than one company to feasibly develop this infrastructure, with a lot of money. You can't say the same about trying to buy the social network of Facebook or Instagram.


> A company that was purpose-built on exceptional end-to-end data infrastructure engineering could capture much of the revenue in these markets in a surprisingly short time by commoditizing the data model and arbitraging efficiency.

do you have any examples?


I was with you at the beginning. Can you ELI5?


I can give it a try. :)

There is an assumption that unique data enables unique insights, which can presumably be monetized in some fashion. As long as no one else has access to your unique data, you have pricing power for the unique insights. There are loads of companies, big and small, trying to execute this business model.

A problem with this model is that for practical purposes there are no such things as "unique insights". The only thing unique data grants you is a cheap path to a specific set of insights. For every set of "unique" insights, there are almost always many sets of unrelated data sources that can be analytically combined to deliver the same insights. In the slightly seedy underworld of data model brokers, I've seen some very impressive examples of this. The way these alternative data models make money, despite the creation process being more expensive, is that they are positioned as a cheaper alternative to companies that think "unique data" means they can extract monopoly rent. The explosion in data source availability has slowly made these clever alternatives more prevalent.

Currently, the alternative to having access to the unique data is to do significantly more expensive computations on what are typically larger and more diverse data sets. This keeps it from becoming a runaway race to the bottom due to the higher cost. Note that in both cases, the parties are using conventional data infrastructure stacks with their implied limitations.

In recent years I've done extensive studies of the cost structures of these types of businesses. It turns out that if you can reduce the end-to-end data infrastructure costs by an order of magnitude for the reconstructive approach then your total costs will be far below the break even point for the conventional "unique data" approach. Furthermore, the computational work required to replicate one high-value data model is substantially reusable for other data models, so the more data models you reconstruct, the lower the marginal cost of reconstructing additional data models this way. Done at scale across enough data models, the amortized cost of reconstructing unique data can be less than using the unique data! People have been idly thinking about what it would take to do this for a few years. Collecting the rare skillsets required to pull it off is a major hurdle for any company.
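As a toy illustration of the arithmetic (every number below is invented purely for illustration, not taken from any real study): suppose reconstructing a data model costs 3x the compute of the "unique data" approach on conventional infrastructure, but 70% of that work is reusable across models and your infrastructure is 10x cheaper end to end.

    # Toy amortization model; all parameters are made up for illustration.
    # Baseline (cost of the "unique data" approach per model) is 1.0.
    reconstruct_compute = 3.0   # reconstruction compute vs. baseline, same infrastructure
    infra_efficiency = 10.0     # order-of-magnitude cheaper infrastructure
    reusable_fraction = 0.7     # share of reconstruction work reusable across models

    def amortized_cost(n_models):
        """Per-model cost of the reconstructive approach, relative to the baseline."""
        fixed = reconstruct_compute * reusable_fraction / infra_efficiency
        variable = reconstruct_compute * (1 - reusable_fraction) / infra_efficiency
        return fixed / n_models + variable

    for n in (1, 5, 20):
        print(f"{n:>2} models: {amortized_cost(n):.2f}x the unique-data cost")

With those made-up numbers the reconstructive approach is already cheaper at one model and keeps dropping as more models share the work, which is the shape of the effect I'm describing.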

The notion that it is possible to reduce end-to-end data infrastructure costs by (at least) an order of magnitude relative to conventional data infrastructures is well-supported but it raises another question: why can't the "unique data" companies do the advanced engineering required to have such an infrastructure themselves (it isn't something you can do with open source currently)? The simplest answer is that it is difficult to justify extremely expensive engineering efforts outside of an organization's core expertise solely to prevent erosion of the market value of their unique data. Fundamentally, it is a shift to competing on infrastructure instead of data, which is an improbable transition for companies.


> For every set of "unique" insights, there are almost always many sets of unrelated data sources that can be analytically combined to deliver the same insights. In the slightly seedy underworld of data model brokers, I've seen some very impressive examples of this.

If you could share a couple of examples it would help a lot to get your point across.

FWIW, I agree with the thesis that data intensive computation could be one or two orders of magnitude more efficient than it currently is, with sufficient engineering. Probably Cassandra vs ScyllaDB is a good public example, and ScyllaDB is likely not close to the theoretical optimum at all. But I'm not sure about deriving data from alternative sources. How do you derive movement data for everyone with an Android phone if you're not Google?


Pretty interesting. Thanks. Is this relevant to all sectors of the economy? I can see it being relevant in e-commerce but how about in a field where the company has lots of data but also a proprietary technology? For example, a medical device or pharma company.


Let's say you have a company that tries to make online checkout more efficient by predicting which payment method the user will prefer, and research shows that if the user is first presented with a payment method they already use, they are more likely to buy.

If I subscribe to your API and get the outputs of your predictions, since I can see the kinds of inputs that you tell me are associated with preferring PayPal, I can approximate your PayPal-preference model.

This works for basically any machine learning model-as-a-service. So if I invest heavily in approximating models and then serving them at lower cost than anyone else, that might be a viable business.
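A rough sketch of what that approximation can look like; everything here is made up for illustration (the feature layout, the stand-in "vendor", and the surrogate choice), and a real setup would query the actual paid API instead:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    _hidden_w = rng.normal(size=12)  # the vendor's "secret" model, for this sketch only

    def query_vendor_api(features):
        """Stand-in for the paid prediction API: 1 = "user prefers PayPal".
        In practice this would be an HTTP client for the vendor's endpoint."""
        return (features @ _hidden_w > 0).astype(int)

    # 1. Assemble a pool of plausible checkout feature vectors (random here).
    X_probe = rng.normal(size=(5000, 12))

    # 2. Label the pool with the vendor's own predictions.
    y_vendor = query_vendor_api(X_probe)

    # 3. Fit a cheap surrogate that mimics the vendor's decision boundary.
    X_tr, X_te, y_tr, y_te = train_test_split(X_probe, y_vendor, random_state=0)
    surrogate = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

    # 4. Check agreement with the vendor before serving the surrogate ourselves.
    print("agreement with vendor:", surrogate.score(X_te, y_te))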


The best analogy from "data is the new oil" is that a data breach or privacy event is like the Exxon Valdez.


I work daily with clients who expect a lot from data. Especially when they quote that sentence, I tend to ask them what their car uses as fuel: do they really fill their cars with crude oil, or with a product that was refined and turned into usable gasoline?

Data might be the new oil (I doubt it). But you can't use it raw. You need to work with this raw material and turn it into an end product, depending on the use case.


"Most discussions around data defensibility actually boil down to scale effects, a dynamic that fits a looser definition of network effects in which there is no direct interaction between nodes."

Good distinction between scale and network effects; not every company with scale has a network effect...


> Generating synthetic data is another approach to catch up with incumbents housing large tracks of data. We know of a startup that produced synthetic data to train their systems in the enterprise automation space; as a result, a team with only a handful of engineers was able to bootstrap their minimum viable corpus. That team ultimately beat two massive incumbents relying on their existing data corpuses collected over decades at global scale, neither of which was well-suited for the problem at hand.

Generated data is being reverse-engineered by machine learning? With both the generation and ML written by the same team?


Just to give a practical example. I've been working on a product and I need sales data but I can't get sales data until I have the product. So what do I do?

I made a basic market simulation. In my simulation there is a demand curve which dictates how many buyers exist at or below a given price. From the demand curve I build a population of people, each of whom has a maximum they are willing to pay. Then I pick people out of this population at random, and I pick time intervals between sales from a Poisson distribution. In this way I can test many, many different scenarios in terms of equilibrium price, sales volume, and market size. In about an hour I can have well over a million simulated sales across an exhaustive range of product types.
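A minimal sketch of that kind of simulation, with made-up parameters (a linear demand curve, and sale times modeled as a Poisson process, i.e., exponentially distributed gaps between arrivals):

    import numpy as np

    rng = np.random.default_rng(42)

    def simulate_sales(price, n_population=100_000, p_max=100.0,
                       mean_gap_hours=2.0, horizon_hours=24 * 30):
        """Simulate one month of sales at a fixed price.

        Willingness to pay is drawn uniformly in [0, p_max], which corresponds
        to a linear demand curve; a buyer purchases only if price <= their max.
        """
        willingness = rng.uniform(0.0, p_max, size=n_population)

        sales, t = 0, 0.0
        while True:
            t += rng.exponential(mean_gap_hours)  # time until the next arriving buyer
            if t > horizon_hours:
                break
            if price <= rng.choice(willingness):  # does this buyer accept the price?
                sales += 1
        return sales

    for price in (20, 50, 80):
        s = simulate_sales(price)
        print(f"price={price:>3}  sales={s:>5}  revenue={price * s:>6}")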

I don't know for a fact that my product works outside the simulation, but I'm confident enough in the assumptions I made that it is worth trying. For simulated data, the question is always "will the expected loss from a bad assumption cost more than the cost of acquiring a real data set".


Generated data can be used to increase variety as a way to avoid overfitting. A simple example might be translating or rotating images.

Not an expert, but I'm guessing it must be easier to do this than to improve how the machine learning generalizes from less data.
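For what it's worth, a minimal sketch of what translate/rotate augmentation might look like (the image size, offsets, and angles below are arbitrary):

    import numpy as np
    from scipy.ndimage import rotate, shift

    def augment(image):
        """Return simple translated and rotated variants of a 2-D grayscale image."""
        variants = [image]
        for dx, dy in [(-2, 0), (2, 0), (0, -2), (0, 2)]:   # small translations
            variants.append(shift(image, (dy, dx), mode="nearest"))
        for angle in (-10, 10):                             # small rotations (degrees)
            variants.append(rotate(image, angle, reshape=False, mode="nearest"))
        return variants

    # Example: one 28x28 image becomes 7 training samples.
    img = np.random.rand(28, 28)
    print(len(augment(img)))  # -> 7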


Rotation, etc. of images is known more generally as data augmentation, and you're correct that it's a way to reduce the amount of training data needed to cold start a data product.

For "synthetic data", my understanding is that it's referring to use of ML to generate brand new training samples (e.g. through GANs, though there are limitations [1]).

1 - https://openreview.net/forum?id=rJMw747l_4


Usually when someone says “synthetic data” they mean CGI, not simply transformations of existing data. Using synthetic data is fraught (and presumptuous), as you are assuming you understand the problem domain 100% and are also extremely good at reproducing it. There’s a chance the model is using something specific to the CGI (and not the general reality) to produce its results.

For winning a computer vision competition it’s probably ok but I’d be very careful about using synthetic data for systems I cared about.


I thought "synthetic data" is something that rarely shows up in training image recognition, and is more like randomly generated user data (name, surname, etc.) or data generated from simulations of some processes?


>I thought "synthetic data" is something that rarely shows up in training image recognition

On the contrary, it is used to train models, but it cannot adequately capture the long tail of weird events in the real world. Hence, it cannot be relied upon, as alluded to by the parent commenter. The use of data collected from a simulated environment vs. the real world was discussed at some length by Elon Musk and Andrej Karpathy at the Tesla Autonomy Day event a few weeks ago.


I generally agree that there is no substitute for experience running in production and more data is better - or at least should be, if you can figure out how to take advantage of it.

The thing is, when it comes to weird events, historical data can't be relied on either. The next weird thing may never have happened before.

Predicting the future is hard no matter what you do. Gathering more data and learning more efficiently from what you have are both important. Training on artificial challenges can also be useful.


I'm really glad a16z put this issue to rest because it's the kind of problem that causes insane wheel spin at a certain kind of company. The "data acquisition cost" and "incremental data value" points made me laugh because I've had to solve that problem before.

The most interesting related work on this is a paper covered on HN at some point (http://blog.mldb.ai/blog/posts/2016/01/ml-meets-economics/) about economics' indifference curves and ML's ROC curves.

The big mistake I've seen data companies make is that they approach their market based on customer vertical, assuming that, say, a bank will be like other banks and a health care provider will be like other health care providers, and this is the fatal error. The bank with the same sensitivity to fp/fn/tp/tn rates will have more in common with the health care company with that same sensitivity than it will with other banks.

The basic problem with any data product is the customer's ROC curve, or where they economically benefit from using your data service. Different customers have different sensitivities to false positive/negative and true positive/negative rates, and the customer categories themselves are defined by this sensitivity. E.g., what they have in common is not their vertical but their risk appetite. I have a blog post 3/4 written on this specific topic.

That sensitivity is specifically an artifact of the customer's growth stage as a company, which determines their risk appetite and the economics of the asymmetrical value that the effective ROC curve of your data product describes (see above link).

This is the fundamental problem for an ML/AI company: they will go bankrupt trying to find their 2nd or 3rd insurance company customer because they think the value of their product comes from their next customer being in the same vertical, not from that customer having the same sensitivity to fp/fn/tp/tn.
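To make the sensitivity point concrete, here is a toy sketch (all scores, costs, and prevalences below are made up) of how two customers facing the same ROC curve but with different false-positive/false-negative costs end up at very different operating points:

    import numpy as np
    from sklearn.metrics import roc_curve

    rng = np.random.default_rng(1)

    # Fake model scores: positives score higher on average than negatives.
    y_true = rng.integers(0, 2, size=20_000)
    scores = rng.normal(loc=y_true * 1.2, scale=1.0)

    fpr, tpr, thresholds = roc_curve(y_true, scores)

    def best_operating_point(cost_fp, cost_fn, prevalence=0.5):
        """Pick the ROC point that minimizes expected cost per case."""
        expected_cost = (cost_fp * fpr * (1 - prevalence)
                         + cost_fn * (1 - tpr) * prevalence)
        i = int(np.argmin(expected_cost))
        return thresholds[i], expected_cost[i]

    # e.g. a miss-averse customer (false negatives are 5x worse) vs. an
    # alarm-averse one (false positives are 5x worse).
    for name, c_fp, c_fn in [("miss-averse", 1, 5), ("alarm-averse", 5, 1)]:
        thr, cost = best_operating_point(c_fp, c_fn)
        print(f"{name}: threshold={thr:.2f}, expected cost per case={cost:.3f}")

Same model, same ROC curve, but the economically sensible threshold (and therefore the value the product delivers) is driven by the customer's cost asymmetry, not their vertical.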

Slight aside: it's so important that investors like these can weigh in on these issues and other technical economic factors, because IMO, when I listen to every technical person I know, the #1 cause of internal suffering at companies is people trying to bullshit their investors, and blog posts like these just wipe away a big source of that temptation.


The good thing is there are infinite functions, not just infinite data. So even if you limit yourself to finite data, you can do interesting things.


Amen to this. Enterprises are aware their data is valuable and increasingly have top-of-the-line devops and pipeline tools to manage it. The pendulum is swinging towards data living with vertically integrated brands rather than horizontal services.


How does this reconcile with stuff like this? https://factordaily.com/indian-data-labellers-powering-the-g...

Is this very contextual to the business space (like enterprise startups) that a16z clearly mentions?


The nice thing is that computing power is finally outpacing the growth in human population. An enterprise data startup can actually collect, store, and process some finite amount of data on all 7.5 billion people on the planet. Just find an interesting angle and be better at processing that data than your competitors.


I would call this very scary. Soon a single entity will be able to do full surveillance of the whole planet at reasonable cost. What could go wrong?


Yes, but we should also acknowledge the upside. Soon (or now) a single entity can benefit/innovate for the entire planet at a reasonable cost. What could go right?


As the saying goes, power makes people benevolent, and absolute power makes people absolutely benevolent, right?


That's how a lot of dystopian sci-fi starts...


You may as well call for a worldwide benevolent dictator. A lot of things could go right.


And getting positive consent from each of us 7.5 billion people?


Why would anyone bother doing that?


The comment I'm responding to contains unstated assumptions promoting the surveillance industry, at a time when many people are realizing they don't want to be part of it. We should be clear that this distinction exists, lest we keep giving a pass to misleading corporate statements in the spirit of Google's "don't be evil" that trick naive developers into building more surveillance systems.


I was joking, but there is little in terms of life- /commerce- /click-flow (outside of maybe Europe, and maybe healthcare in the US) that prevents me from surveilling you any way I can. If I have a popular health app, I can surveil even more if you click "accept" to a ToS.


an additional data point to use



