Open source effort to bring open biomedical data together – anyone interested? (inspiratron.org)
83 points by nikolamilosevic on Dec 13, 2015 | 22 comments



I am very supportive of the idea in general, but I think the author underestimates some of the reasons for the diversity in data source schemas: they often differ because the underlying data is different.

The linked data/semantic web approaches are slowly eating away at the unneeded data diversity, with small shared standards for common data and unique schemas for unique data. The single-endpoint solution unfortunately does not scale; it would be the AOL of biomedical data, or Entrez if you prefer. To be truly open and free, anyone should be able to contribute their data and tools. This means decentralized infrastructure, which means confusion and difficult-to-find information. However, as academia and research are decentralized, their IT infrastructure must match that reality. This leads to infrastructure that can integrate on demand, such as is made possible by the SPARQL SERVICE keyword.
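For readers unfamiliar with federation, here is a minimal sketch of what the SERVICE keyword enables, using Python's SPARQLWrapper; the endpoint URLs and predicates are placeholders, not real services:

```python
# Sketch of a federated SPARQL query: one endpoint pulls in data from
# another at query time, so no central warehouse is needed.
# Endpoint URLs and predicates below are hypothetical.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://example.org/genes/sparql")  # hypothetical endpoint
sparql.setQuery("""
    PREFIX up: <http://purl.uniprot.org/core/>
    SELECT ?gene ?protein WHERE {
        ?gene a up:Gene .
        # Federate out to a second (hypothetical) endpoint for protein data
        SERVICE <https://example.org/proteins/sparql> {
            ?protein up:encodedBy ?gene .
        }
    }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["gene"]["value"], row["protein"]["value"])
```

The point is that the join across the two sources happens at query time, inside the query itself, so each institution keeps running its own endpoint.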

Open source has always been an important part of bio-IT, and that is not going to change. But a single source is not the answer to our problems. We need to make it easier to find information, but most importantly we need to make it easier to answer questions with the data that is available.


Linked data and SPARQL are definitely a very possible solution, and the infrastructure can be decentralized. There is some resistance to these technologies in part of the community, because people are not used to them and, compared to some other data stores, they tend to be a bit slower, but that's another discussion. I do not have anything against these technologies. What I currently don't like is that there are a lot of resources that are technically open source and free, but they are buried somewhere on the internet, sometimes hard to find, and it takes quite a lot of time to review all existing resources. What I wanted to recommend is one central umbrella organization that would be:

(1) a platform for collaboration in the biomedical field;

(2) a central endpoint to all major existing projects, possibly with maturity levels and an internal review to arrange projects into those levels, so it is relatively easy to judge how much you can "trust" a project or its data (see the sketch below);

(3) a central repository for open source NLP, data curation and semantic web tools;

(4) a relevant body that could propose and work on standards for data curation that take into account all field-specific needs.
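To make point (2) concrete, here is a minimal sketch of what a registry entry with maturity levels might look like; the class names, levels, and example project are all invented for illustration:

```python
# Hypothetical registry entry with maturity levels; names are invented.
from dataclasses import dataclass, field
from enum import Enum

class Maturity(Enum):
    SANDBOX = 1      # unreviewed contribution
    INCUBATING = 2   # passed an internal review
    MATURE = 3       # actively maintained and community-trusted

@dataclass
class ProjectRecord:
    name: str
    homepage: str
    maturity: Maturity
    topics: list = field(default_factory=list)

registry = [
    ProjectRecord("example-nlp-tool", "https://example.org/nlp",
                  Maturity.INCUBATING, ["NLP", "curation"]),
]
```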


You have seriously underestimated:

1) the effort needed to develop and maintain such resources – your best hope is to work with government-funded institutes;

2) resistance from the conventions of a particular research field – you can rarely bend how people in a field work on things;

3) culture differences between biologists/doctors and programmers – biologists/doctors think very differently, which programmers frequently overlook;

4) bureaucracy – everyone thinks he/she is the best; when you work with top groups to make things happen, you will find out how problematic it is;

5) technical challenges – since you care about phenotype data: there are no good ways to integrate various phenotypes from multiple sources.

Everyone in biomedical research dreams about integrated resources. I have heard multiple people advocating SPARQL as well. If it were that easy, it would have happened years ago. In the real world, no one is even close. If you want to attract collaborators, learn from Linus: say you have a working prototype and demonstrate how wonderful it is. Ideas are cheap; the difficult part is a clear roadmap to make them happen.


I agree strongly with this. I started out in biomedicine many years ago with the same aims as the OP, but after a lot of experience I think that announcing the database resource is just the first, and easiest, step; all the hard problems are the ones listed by x1k.

What I see happening in large orgs with lots of machine learning resources is the development of new techniques to generate large amounts of homogeneous phenotypic data across many measurement modalities. These large orgs have biologist/doctors: the small number of people cross-trained well enough to move between the two fields with ease. They have gathered enough resources to compel the leading researchers to work with them, and they're starting to publish interesting papers.


Yes, I think those large companies may finally have a slim chance to revolutionize data integration, but it is too early to tell yet. We will see.


Could the largest funding organizations, in order to get a much greater return on their investments (i.e., much wider use and re-use of the results), require use of some standard data format for projects they fund?

(I know very little about the issue, but this seems to be a problem in many fields of academia.)


I have no knowledge of the field, so apologies in advance if this idea does not make sense.

Would it be possible to make a compromise here? Perhaps maintain the decentralized structure, but at the same time introduce standards to categorize data on a global level and allow it to be 'mined' by a centralized entity with API access to that data. Kind of like how Google does indexing and searching, just with more cooperation from those being indexed.
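To make the analogy concrete, here is a minimal sketch of that harvest-and-index pattern, assuming each participating source publishes a small JSON description at a well-known URL; all URLs and field names are invented:

```python
# Minimal sketch of the "Google-style" harvester: decentralized sources
# keep their data, and a central index periodically fetches a small
# metadata description from each. URLs and fields are hypothetical.
import requests

SOURCES = [
    "https://example-lab-a.org/.well-known/dataset.json",
    "https://example-lab-b.org/.well-known/dataset.json",
]

def harvest(urls):
    index = {}
    for url in urls:
        try:
            meta = requests.get(url, timeout=10).json()
        except requests.RequestException:
            continue  # a dead source should not break the whole index
        # Index by declared topic: data stays at the source, and only
        # the description is centralized.
        for topic in meta.get("topics", []):
            index.setdefault(topic, []).append(meta.get("endpoint"))
    return index

if __name__ == "__main__":
    print(harvest(SOURCES))
```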


Yes, this is kinda what I wanted to propose: an umbrella organization that would build infrastructure making it possible to query, index and integrate data across the web. And I am not running away from a decentralized structure or infrastructure.


Glad to see that I got the idea after all.

As I've said, I'm no expert, but I'd be interested in using a project like this to learn more about the field and to find ways to contribute. At the moment I can help set up the initial infrastructure for the project and cover the AWS costs for the first few months until you get some funding. Feel free to reach out if you think this could help; my email is in my profile.


Have you ever looked at the SADI framework? The HCLS W3C note on dataset descriptions is also a good starting point.
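For readers who haven't seen such dataset descriptions, here is a rough rdflib sketch of the general flavor; the dataset URI and values are invented, and the actual W3C note defines the real required properties:

```python
# Toy HCLS-style dataset description: publish machine-readable metadata
# about a dataset so it can be found. URIs and values are invented.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

VOID = Namespace("http://rdfs.org/ns/void#")

g = Graph()
ds = URIRef("https://example.org/dataset/phenotypes")  # hypothetical dataset
g.add((ds, RDF.type, VOID.Dataset))
g.add((ds, DCTERMS.title, Literal("Example phenotype dataset")))
g.add((ds, DCTERMS.description, Literal("Toy description for illustration")))
g.add((ds, VOID.sparqlEndpoint, URIRef("https://example.org/sparql")))

print(g.serialize(format="turtle"))
```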


Hey, Mike from DrugBank here. Send me an email at mike@drugbank.ca to chat more. You should take a look at https://www.openphacts.org/ which has a similar goal to this project. I think one problem is this: http://xkcd.com/927/


Mike - DrugBank is brilliant.

One of the co-architects of OpenPHACTS here. A pointer for developers is dev.openphacts.org. All the source and data is open.

While I agree that standards are hard, I think the major issue is sustaining these things. You need some set of people to code and curate, even if it's small, and you need good uptime/support to gain community trust.


Quite an interesting idea, so it'd be like extending the scope of GA4GH beyond 'just' genomics? https://genomicsandhealth.org/work-products-demonstration-pr...


The Global Alliance for Genomics and Health is a similar idea, but it is not designed to be completely open or linked, and it doesn't acknowledge the semantic web. Practically, it is implemented as CORBA using JSON: everything is an API, and an implicit data model (in JSON) is then produced for each type of concept. AFAIK security is a huge limitation here. The idea of GA4GH is that data silos can communicate some things with each other, but not personally identifiable information.

I work on the project but find it pretty uninspiring. It presents a dark image of the future in which a handful of large tech companies control all of our biomedical data and we have to beg them to allow us to share it. I guess that sounds pretty similar to the present. Just switch bio and social and here we are.


I don't think that GA4GH is literally using CORBA. The data model is not implicit, it's explicit (there is a schema). The "data silos can communicate some things ... but not personally identifiable information" is a constraint placed on the alliance by the legal system. As for the semantic web, every bio project I've seen adopt the semantic web has ultimately failed: it seems like a great idea, but attempts to implement it fully, to the point where it's useful for research, always fail. So I think they're focusing on areas where they are likely to succeed (collection and processing of large amounts of raw and derived data using pretty conventional processes, but at a much larger scale, with a solid authentication and access mechanism).


> I don't think that GA4GH is literally using CORBA.

It's not literally CORBA, but people who spent time implementing literal CORBA in a bioinformatics context (for instance, the original author of https://github.com/bioperl/bioperl-corba-server) have noted that the design pattern and discussions followed by the GA4GH are pretty similar to those had in the EBI when they attempted to unify everything using CORBA.

> The data model is not implicit, it's explicit (there is a schema).

There is a schema, but the semantics of the data model are encoded in the comments of the schema. Without hooking into some kind of ontological basis it doesn't seem possible to avoid this.
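A hypothetical illustration of that point (not the real GA4GH schema): in the first class the semantics live only in a comment, while the mapping below it grounds each field in an invented ontology term that tools could actually resolve:

```python
# Hypothetical illustration only, not the actual GA4GH schema.
from dataclasses import dataclass

@dataclass
class VariantV1:
    # Semantics live only in this comment: "start" is the 0-based,
    # inclusive position on the reference. No tool can check or reuse that.
    start: int
    end: int

# One way to ground the same fields in an ontology: map each field to a
# term IRI that tools can resolve and validate. The IRIs are invented.
VARIANT_FIELD_SEMANTICS = {
    "start": "https://example.org/onto/zero_based_inclusive_start",
    "end": "https://example.org/onto/zero_based_exclusive_end",
}
```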

> As for the semantic web, every bio project I've seen adopt the semantic web has ultimately failed: it seems like a great idea, but attempts to implement it fully, to the point where it's useful for research, always fail.

I'm aware of at least one group in the GA4GH that uses RDF internally, then converts it into the custom schemas produced by the group in order to maintain compatibility with the top-down designs of the project. I believe this is the phenotype group. These are the people most interested in what the author of the linked page is describing, and they have decided to use the technology you believe is doomed to fail. But they aren't failing: as far as I can tell from their presentations, they are one of the two or three groups in the project that have produced a functioning system.

It's very easy to say that hard things are impossible. This tends to keep them that way. I doubt we have any other viable option for building large distributed knowledge systems. The fact that these don't exist does not mean they are impossible to construct, simply that no one has managed to do so yet. People leveled the same kinds of arguments against neural networks until a few years ago, saying they were a nice idea but destined to fail because they were too hard.

> So I think they're focusing on areas where they are likely to succeed (collection and processing of large amounts of raw and derived data using pretty conventional processes, but at a much larger scale, with a solid authentication and access mechanism).

The scales we're talking about are not even an order of magnitude above that which existing techniques allow. So I agree that they will succeed insofar as they simply adopt these existing community-driven standards and slap access control on top. However, in terms of generating new data models for genomics, I'm not so convinced that the centralized design and API-based approach which they are taking will work. I guess we will have to meet back here in a few years and see what happened.


For NNs, we have a clear target: for example, beat HMMs on speech recognition. To achieve that, you write a tool and evaluate it on some standard test data sets. You don't need to interact with many parties. NNs are only technically hard. For GA4GH, things are quite different: it is technically hard, but much simpler than NNs in my view. What is hard is 1) we lack a target and 2) the communication between developers and users. People don't know what we need, what the right approach is, or how to evaluate success. They have changed course back and forth, and there are still clashes between programmers and those with more biological background.


Curious about your affiliation, if you're willing to provide it. I recently joined the Big Data Genomics team at UC Berkeley AMPLab, a GA4GH contributor.

I was initially quite impressed with the GA4GH effort, because it was transparent and produced useful things in terms of schemas and code in GitHub repositories. I am afraid it has now all moved into working groups and private email threads, and very little is happening in the open any more.

I wish I understood the semantic web. I've built more than one system on it but have never found a use case that sells it for me.


Since this sparked quite an interest, I have created a mailing list and wiki. For more information about the idea, you can read here: http://inspiratron.org/blog/2015/12/18/starting-an-effort-to... where you will find links to the mailing list and wiki. I believe that would be a more appropriate place to collect all the efforts that currently exist, index them, and maybe try to integrate them and make them interoperable. Please join the mailing list.


I'm curious if the author knows about DisGeNET: https://en.m.wikipedia.org/wiki/DisGeNET.

The whole idea of the semantic web is exactly what the author is getting at. I'm curious why it is rarely regarded as a serious basis for an effort like the one the author is promoting.


Didn't Galaxy want to do this as well?


ContentMine.org seems to have similar aims.



