Hacker News
Datasets and Data-driven Startups (measuringmeasures.com)
79 points by mattyb on April 29, 2010 | 16 comments



"Why would you want to start a data-driven startup?"

In our case (http://timetric.com) - it was at least in part due to having spent several years beavering away trying to build products for the management of scientific data, before discovering that outside of science:

a) there's lots of other interesting data that can benefit from better management

b) people will pay for it :-)

The article's a very good overview of lots of the issues we've faced in building a business around data.


To present an alternative viewpoint on this matter, I would advise people looking to start data-driven startups to first think about their application. What do you want to make? Then think about how you can glean information from the normal usage of that product.

For example, I think it's much more risky to think "hmm, what datasets are out there and how can I build a business off of them?" than "hmm, what business do I want to create, and can I monetize through data acquisition?"

In the first case, you are going to be held to the availability of someone else's data. You don't know how it was collected, and you have no idea how it could have been doctored. In the second case, you create your business with a data-driven mindset, and you control all those parameters yourself. Then you are just at the mercy of how well your data reporting tools are working.


I think data is a detail and decisions are solid gold.

"We'll tell you the status of this customer's last forty-seven credit card payments" is a detail to your customers. An enormously non-trivial undertaking, but ultimately a detail.

"Hiya. We're Fair Isaac and we are going to purge the profession of underwriters from the face of the earth, saving you billions of dollars, decreasing your loan processing time from weeks to literally seconds, and allowing your consumer lending to scale in ways you cannot even imagine. By the way, some numbers are involved." is a wee bit more compelling.


Actually I think it depends who your customers of that data are. If you're supplying that data to existing customers as "nice-to-know" details, you're right on.

But perhaps you run an advertising-based business and you're looking to increase your CPM. Well then, those details are suddenly quite valuable for advertisers if they can target a very specific audience.


Exactly. Though I would go even further than that. Similar to "People don't pay for drills, they pay for holes", "People don't pay for data, they pay for decisions". Data without actionable decisions is worthless.


Bloomberg and Reuters would like to disagree. The collection of reliable, cleaned data in the real world is such a chore and has such a strong network effect that it is a good model on which to build a business. E.g. Mint might have cashed out because of the buyout, but Yodlee is going to see a check every year from Intuit for a long, long time to come.


Yes, with emphasis on reliable, cleaned. That data is worthwhile because it is actionable. The value-add in FlightCaster or Fair Isaac is that it converts unactionable to actionable information.


I haven't used either FlightCaster or Fair Isaac extensively. A cursory glance at their websites suggests that both companies actually transform the available data substantially, almost to the point where it is fundamentally new information. FlightCaster does this with their machine learning voodoo, while Fair Isaac does it by summarising and analysing existing data to create a credit score / report. In either case there is a worthwhile value add. From my point of view, what these two companies do is very different from Reuters and Bloomberg, who take great pains to not do analysis (that would be a conflict of interest with their clients). In short - the post I replied to made the assertion that data by itself has no value. I quote - "People don't pay for data, they pay for decisions". The two examples I provided (Reuters and Bloomberg) provide merely scrubbed data; they do not provide analysis. Hence there appears to be a sizeable market for data.


New data is usually costly to obtain. Most application ideas would be unapproachable by a garage-style startup due to the expense of data collection and the lack of proof-of-concept results.

Same idea as exploiting an underpopulated niche in an existing market rather than trying to create an entirely new market.


Well, not exactly. Developers can make use of Scribe, which is Facebook's open-source logging server, or roll their own dead-simple log server. Really it's all about firing events every time a user does something on your site and recording that data in a text file. Then you data-mine later.
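To make the "roll your own" option concrete, here is a minimal sketch in Python. It is not Scribe's API; the function names (log_event, count_actions) and the one-JSON-object-per-line file format are assumptions chosen for illustration.

```python
import json
import time
from collections import Counter


def log_event(log_path, user_id, action, **details):
    """Append one event as a JSON line to a plain text file.

    The field names ("ts", "user", "action") are hypothetical.
    """
    event = {"ts": time.time(), "user": user_id, "action": action, **details}
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")


def count_actions(log_path):
    """Mine the log later: count how often each action occurred."""
    counts = Counter()
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                counts[json.loads(line)["action"]] += 1
    return counts
```

You would call something like log_event("events.log", "u1", "play", song_id=42) from each request handler and run count_actions("events.log") offline. One line per event keeps the file append-only and easy to process with ordinary text tools.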


I think this is a point of confusion because "data-driven businesses" is used elsewhere to mean a business in any industry which uses analytics to optimize operations. In this article it's referring to a company for which the product is the data itself. For the latter type of company, you need external data sources, and their availability is crucial for the viability of the company.


I'm looking to collect or download a dataset for music consisting of info such as artist/album/song. Any idea where to grab this from? There used to be a list hosted on Google around 2 yrs ago but I can't find it anymore.


Yes. Lots of info out there. You can use our API at Grooveshark available here: http://tinysong.com/api

You can also bulk download Discogs here: http://www.discogs.com/data/

And MusicBrainz has one here: http://musicbrainz.org/

You can also bulk download Wikipedia, cross-reference page titles with a set of artist names, and grab whatever information you want if your regex skills are magical.

A general word of caution: User-sourced music information is very very messy. Be prepared for a lot of misspellings, bad metadata, missing information, etc.
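As a rough sketch of the cross-referencing step, the Python below matches artist names against page titles after loose normalization. The normalization rules and the function names (normalize, match_titles) are assumptions for illustration. It only handles case, accents, punctuation, a leading "the", and a trailing parenthetical such as "(band)"; it does not correct misspellings.

```python
import re
import unicodedata


def normalize(name):
    """Reduce a name to a loose comparison key (assumed rules; tune for your data)."""
    # Strip accents by decomposing characters and dropping combining marks.
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower()
    # Drop a trailing parenthetical disambiguator such as " (band)".
    text = re.sub(r"\s*\([^)]*\)\s*$", "", text)
    # Treat "&" as "and", then replace punctuation with spaces.
    text = text.replace("&", " and ")
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    # Ignore a leading "the".
    return re.sub(r"^the ", "", text)


def match_titles(artist_names, page_titles):
    """Map each artist name to the page titles sharing its normalized key."""
    by_key = {}
    for title in page_titles:
        by_key.setdefault(normalize(title), []).append(title)
    return {name: by_key.get(normalize(name), []) for name in artist_names}
```

Exact matching on a normalized key is cheap and catches formatting differences; real misspellings would need fuzzy matching on top, which this sketch leaves out.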


re: extracting wikipedia data

someone's already done this for you. see: dbpedia.org


Cool thanks.


[...] the unglamourous quest to get the data to a point where [...]

That, and once there are results, carting them back into the 'production line' and plugging them where they'll do any good.

Logistics.



