Hacker News new | past | comments | ask | show | jobs | submit login
DeepDive – System that enables developers to analyze data on a deeper level (stanford.edu)
149 points by DyslexicAtheist on Dec 11, 2014 | hide | past | favorite | 19 comments



It does probabilistic inference![1]

So many open source "Knowledge Graph"-y type projects concentrate on building them like databases, with a query language that assumes the data in them is correct. You see this in things like Freebase, DBPedia and Wikidata, where they typically end up in a triple store and you query using SPARQL.
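To make the complaint concrete, here is a toy sketch (names and facts made up, not from any real triple store) of the "query language that assumes the data is correct" style: a lookup over triples that returns answers with no notion of confidence attached.

```python
# Toy triple store queried as if every fact were certainly true.
# Real extracted facts carry uncertainty that this style of querying ignores.

triples = {
    ("Barack_Obama", "bornIn", "Honolulu"),
    ("Honolulu", "locatedIn", "Hawaii"),
}

def query(subject, predicate):
    """SPARQL-ish lookup: returns matching objects, implicitly all 'certain'."""
    return {o for (s, p, o) in triples if s == subject and p == predicate}

print(query("Barack_Obama", "bornIn"))  # no confidence score anywhere
```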

This isn't how the real world works, and there isn't a lot of publicly available software that takes this into account. There aren't even that many papers about it (the Microsoft Probase paper is one, and there is work from the University of Florida(?) about using Markov chains to reason while taking probabilities into account).

I'm excited to take a look at this.

[1] http://deepdive.stanford.edu/doc/general/inference.html


Aside from the work on probabilistic inference, there have also been many papers on "probabilistic databases" over the last 10 years (Chris did his PhD on that topic). That work looked at SQL-style query processing over "uncertain"/"probabilistic" data.

These were some of the major projects: https://homes.cs.washington.edu/~suciu/project-mystiq.html, http://maybms.sourceforge.net/, http://infolab.stanford.edu/trio/, http://www.cs.umd.edu/~amol/PrDB/, http://dl.acm.org/citation.cfm?id=1376686.
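A minimal sketch of the tuple-independent semantics those probabilistic-database projects use (the relation, names, and probabilities below are invented for illustration): each tuple carries a marginal probability, and query answers come back with probabilities computed from them.

```python
# Toy tuple-independent probabilistic database, in the spirit of
# MystiQ/MayBMS/Trio. Each tuple carries P(tuple is actually true).

born_in = [
    ("Obama", "Honolulu", 0.9),  # extractor was fairly confident
    ("Obama", "Chicago",  0.3),  # competing, less likely extraction
]

def prob_answer(person, city):
    """P(this exact tuple is in the answer)."""
    for p, c, pr in born_in:
        if p == person and c == city:
            return pr
    return 0.0

def prob_exists(person):
    """P(at least one matching tuple is true), assuming tuple independence."""
    miss = 1.0
    for p, _, pr in born_in:
        if p == person:
            miss *= (1.0 - pr)
    return 1.0 - miss

print(prob_answer("Obama", "Honolulu"))  # 0.9
print(prob_exists("Obama"))              # 1 - 0.1*0.7 = 0.93
```

The point of the whole research area is doing this kind of probability computation efficiently for full SQL, which is much harder than this two-tuple example suggests.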


This is a fair point. In a similar vein there is also BayesDB http://probcomp.csail.mit.edu/bayesdb/



Is it?

I thought BlinkDB was data warehousing on HDFS? I don't see any mention of inference-like features in the docs.


It's not similar. BlinkDB builds upon the work on sampling-driven approximate SQL query processing (an early project in that space was AQUA at Bell Labs), and extends it to the cloud/HDFS setting.

Although some terms come up in both places (e.g., confidence bounds, noise, etc.), BlinkDB and probabilistic databases are fundamentally different from each other (I have worked on both topics).
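To illustrate the distinction: in the BlinkDB/AQUA style the *data* is certain and the *answer* is approximate because you only scanned a sample. A minimal sketch with synthetic data, attaching a CLT confidence bound to a sampled average:

```python
# Sampling-driven approximate aggregation (BlinkDB/AQUA style), on synthetic
# data: estimate AVG over a uniform sample and report a 95% confidence bound.

import math
import random

random.seed(0)
table = [random.gauss(100.0, 15.0) for _ in range(100_000)]  # fake column

def approx_avg(rows, sample_size, z=1.96):
    sample = random.sample(rows, sample_size)
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    half_width = z * math.sqrt(var / n)  # 95% CI half-width via CLT
    return mean, half_width

mean, hw = approx_avg(table, 5_000)
print(f"avg ~ {mean:.2f} +/- {hw:.2f}")
```

The uncertainty here comes from sampling, not from the tuples themselves, which is exactly why the two lines of work are different.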


There is actually a huge body of work on probabilistic reasoning. Just do a Google Scholar search for "probabilistic reasoning," "probabilistic inference," or "probabilistic logic." You might have to dig around in the results, but there are definitely a lot of papers on the topic.

I hadn't heard about DeepDive until now, but I did previously come across another project that does probabilistic reasoning: http://alchemy.cs.washington.edu. I cannot speak to how they compare since I haven't looked into either in great depth yet.


Yes, I should clarify.

There is a lot of work on probabilistic inference. Huge volumes.

But if you are interested in papers on applying probabilistic inference to unstructured information, there is a lot less out there.

If you are looking for actual implementation reports, there isn't much. The most interesting are the previously mentioned Probase paper, the Google Knowledge Graph paper, and assorted IBM Watson material.

I've just realised that DeepDive is also the Wisci project[1], which is something I've been watching for a while.

That Alchemy implementation requires a structured database of information, and doesn't really deal with how to get that.

[1] http://i.stanford.edu/hazy/home


> there is work from the University of Florida(?) about using Markov chains to reason while taking probabilities into account.

You're talking about the work that Prof. Daisy Zhe Wang and her students are doing over at the DSR lab[1]. Go Gators!

[1] http://dsr.cise.ufl.edu/?page_id=250


It's also being used to aid paleontology research: http://fusion.net/story/30751/paleo-deep-dive-machine-learni...


I am wondering what a ballpark figure would be for how long it would take to set up an instance of this for a given scientific field, for example. Days? Months? Years? I fear it is probably the latter.


I've sat in on Chris' class at Stanford.

I think the answer is probably closer to weeks to months if working with field experts, depending on how deep you want to go.

The core of it is open source.

I think the most exciting thing about it is that it brings more sophisticated computation to the more qualitative sciences.


> The core of it is open source.

What's included in the non-core parts? Are there patents to be licensed?


What about for just a smallish single-machine corpus of documents (1000)?


The size of the corpus isn't the issue (apart from processing time of course).

The key factor in estimating how big a job it is, is how complex your entity extraction and inference rulesets are.
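To make "extraction and inference rulesets" concrete: every relation you care about needs its own candidate extractor plus weighted rules for scoring candidates. A hypothetical, heavily simplified sketch (DeepDive itself uses factor graphs; the logistic scorer, the `spouse` relation, and the weights below are all stand-ins invented for illustration):

```python
# Per-relation candidate extraction + weighted rule scoring, crudely sketched.
# Effort grows with the number of relations and rules you must hand-write.

import math
import re

def extract_candidates(sentence):
    """One hand-written extractor per relation; here a crude 'X married Y'."""
    m = re.search(r"(\w+) married (\w+)", sentence)
    return [("spouse", m.group(1), m.group(2))] if m else []

# Each rule/feature carries a weight (in DeepDive these would be learned).
weights = {"pattern:married": 1.5, "same_sentence": 0.5}

def candidate_probability(features):
    """Logistic score standing in for real factor-graph inference."""
    score = sum(weights.get(f, 0.0) for f in features)
    return 1.0 / (1.0 + math.exp(-score))

cands = extract_candidates("Alice married Bob in 1990.")
p = candidate_probability(["pattern:married", "same_sentence"])
print(cands, round(p, 2))
```

Multiply this by every relation, entity type, and disambiguation rule in a real scientific domain and the weeks-to-months estimate above starts to look plausible.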


This could be a huge value-add to the groups that have invested heavily in human-directed knowledge graph construction (e.g. Project Halo/Aristo at the Allen Institute for AI).


I'm mostly interested in how much it differs from what IBM Watson does. Does IBM rely only on probabilistic inference, or does it do other data mining as well?


It's (very) roughly comparable to parts of it.

Firstly: IBM is increasingly using the Watson brand for things that don't appear directly related to the Jeopardy-winning system (e.g., Watson Analytics). When I talk about Watson here I mean the Question Answering (QA) system.

At a very high level, DeepDive consists of a Knowledge Graph construction tool and a probabilistic querying tool. Compared to Watson, it is missing a natural-language question-parsing tool and any way of dealing with questions that aren't in the KG.

Watson has (very strong) natural language understanding for multi-clause questions, and the Jeopardy version can do things like understand puns. DeepDive doesn't have anything comparable. In the open source space, the closest thing I'm aware of is SEMPRE[1][2].

Watson also has an evidence scoring module, and my understanding is that this can work against unstructured data. DeepDive doesn't have this, and instead relies on probabilistic inference. This is an excellent approach, but it relies on doing content extraction first (i.e., extracting entities and relationships from text and/or other sources). The Microsoft Probase[3] group has published a lot in this area.

[1] http://www-nlp.stanford.edu/joberant/homepage_files/publicat...

[2] https://github.com/percyliang/sempre

[3] http://research.microsoft.com/en-us/projects/probase/





