Launch HN: Centaur Labs (YC W19) – Labeling Medical Images at Scale
61 points by erikduhaime on April 30, 2019 | 25 comments
Hello HN! We are Erik, Zach, and Tom, the founders of Centaur Labs (https://centaurlabs.io). We’ve built a platform where doctors, other medical professionals, and med students label medical images, improving datasets for AI.

The idea grew out of Erik’s research when he was a PhD student at MIT’s Center for Collective Intelligence. In short, he found that aggregating the opinions of multiple people--even people with little or no medical expertise--could distinguish cancerous moles from benign ones more reliably than individual dermatologists could.

The three of us have been friends since we were undergrads. When we would chat about Erik’s research, it seemed like a no-brainer that there’d be demand for more accurate diagnoses. We all had our frustrations that as patients, you usually have to trust one doctor’s opinion.

So we built a mobile app called DiagnosUs where users around the world analyze medical images and videos. Many are doctors who simply enjoy looking at cases or want to improve their skills. Other users like competing with their peers, seeing themselves on our leaderboards, and winning cash prizes in our competitions.

Different people (and algorithms) have different skills. Using data on how our users perform on cases with “gold standard” answers, we train a machine-learning model to identify how differently-skilled people complement each other and cover each other’s blind spots. The more we learn about our users’ skills and expertise, the better we get at aggregating their opinions. It is a bit like putting together the optimal trivia team: you don’t just need the five best people, you need someone who is good at pop culture, someone who knows sports, etc. Experts trained in the same way often have the same blind spots, so outcomes improve when you include a range of opinions.
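To make this concrete, here’s a toy sketch of confidence-weighted label aggregation (illustrative only--the names, numbers, and weighting scheme are made up, and our real model also accounts for which kinds of cases each reader tends to get right):

    from collections import defaultdict

    # Toy example: weight each reader's vote by their accuracy on past
    # "gold standard" cases, then pick the label with the most total weight.
    def aggregate_labels(votes, gold_accuracy):
        """votes: {user: label}; gold_accuracy: {user: accuracy in (0, 1)}."""
        weight = defaultdict(float)
        for user, label in votes.items():
            weight[label] += gold_accuracy.get(user, 0.5)  # unknown users count as chance
        return max(weight, key=weight.get)

    votes = {"derm_1": "malignant", "med_student_7": "benign", "derm_2": "malignant"}
    gold_accuracy = {"derm_1": 0.91, "med_student_7": 0.78, "derm_2": 0.62}
    print(aggregate_labels(votes, gold_accuracy))  # -> "malignant"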

We initially thought we’d go straight to providing opinions on demand for consumers like ourselves. There aren’t nearly enough doctors in the world to analyze everyone’s medical images. But it didn’t take long to realize that our fledgling startup wasn’t yet prepared to deal with the regulatory issues that would entail.

Meanwhile, we’d been hearing for years that AI was on the verge of replacing radiologists, but the hype didn’t seem to match the reality. Many companies trying to develop medical AI are impeded by bad data. They try to hire doctors to go through thousands or millions of images and re-label them, but this has proven hard to manage and scale.

Our customers have giant medical datasets and want to use them to train AI. But the quality of the data holds them back, and they can’t find nearly enough doctors to label the data accurately. Our platform provides a high volume of labels quickly, and our performance analytics enables us to get highly accurate labels from groups of people with a range of skills.

We’d love to hear from anyone working on medical AI who’s faced the challenge of dealing with flawed datasets. If you’re interested in trying our app, you can download DiagnosUs for iOS in the App Store. Thanks for reading!




What incentive is there for medical students to do this work for free? I don't know of any doctor who would just give away their work for the sake of learning. They get plenty of that during their lectures.

Is this supposed to be a product to help AI radiology startups curate and manage their data? If so, are we talking about semantic segmentation, localization, or what sort of label? A lot of the time, providing more explicit label information requires fewer studies to generalize, but much more work on the labeling side.

I would also wonder about your data sourcing. Just because you don't have an FDA-regulated product doesn't mean you're clear of HIPAA rules. Medical images may contain PII, especially scans that include the face.

Edit: couple of typos...


Our users participate to win cash prizes, to learn, and to compete with others. Everyone loves seeing their name on a leaderboard, especially when they can improve their skills and win cash prizes at the same time!

Today we’re building out our annotation tools. We have bounding boxes, localization, and more depending on the task.

Today we use publicly available datasets and depend on our clients to only provide deidentified images. We’re also working to verify certain users to enable them to see cases with PII.


Tangentially related: it reminded me of Snapshot Serengeti, where volunteers label images of animals. Interestingly, these labels aren't used to train models. They're just used to filter pictures taken by automated motion-sensor cameras. The pictures with actual animals in them are used to study animal migrations.

https://www.zooniverse.org/projects/zooniverse/snapshot-sere...


Yes, we love Zooniverse. Of course, we could do this with some non-medical tasks too. In fact, if you download the app now, try out the “study break” category to classify dog breeds!


I would think you could go a long way with just creating an ESP-style game: https://en.wikipedia.org/wiki/ESP_game

In this talk, the creator also goes into how they created a version to actually locate the key points in the image: https://www.youtube.com/watch?v=tx082gDwGcM


Yes, I am a huge fan of his work! He went on to found Duolingo, which has inspired us a lot. I recommend everyone watch the talk you linked, and also his TED talk.


Really glad to hear you're working on this problem. I've done a bit of grooming medical imaging datasets for AI projects. A big chunk of time is spent working on pipelines to properly de-identify the images. Everything from PHI hidden deep in the dicom headers to patient name burned into the image by the scanner or some workstation that opened it. How are you dealing with those challenges?


That is certainly a challenge. Automated approaches to removing PHI often miss things for the reasons you mentioned, and at the end of the day you need a person to verify that the image is free of PHI. Right now we depend on our clients to remove PHI, but we’re also working on a process where we verify some users’ credentials and have those users review cases for PHI before we release a potentially sensitive case to the crowd.
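For a rough sense of what the automated header-scrubbing step can look like, here’s a minimal pydicom sketch (illustrative only--the tag list is nowhere near complete, and it does nothing about text burned into the pixel data, which is exactly why the human review step matters):

    import pydicom

    # Blank out a few obvious PHI header fields and strip private tags.
    # A real pipeline needs a much longer tag list plus review of the pixels.
    PHI_TAGS = ["PatientName", "PatientID", "PatientBirthDate", "OtherPatientIDs"]

    def deidentify(in_path, out_path):
        ds = pydicom.dcmread(in_path)
        for tag in PHI_TAGS:
            if tag in ds:
                ds.data_element(tag).value = ""
        ds.remove_private_tags()
        ds.save_as(out_path)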


Great concept! Basically human-powered ensemble learning...clever. How many images are you able to process daily?


Thanks!! Human-powered ensemble learning is exactly how we think of it. Our crowd currently does tens of thousands of reads per day, and we expect many more in the future.


We're a medical imaging AI startup, and one of the toughest aspects of getting quality training data is peer review of ground-truth labeling. I love that your platform solves this problem with more than the usual 3 annotators! We'll be in touch shortly.


I was under the impression that companies engaged in medical image diagnostics train their algos on verified clinical data (e.g., for radiology the training set would have biopsy results). Is crowd labeling valuable for medical diagnosis?


Absolutely. Biopsy-proven ground truth is great, but often not available. This is especially true when you want a very large dataset, or if it’s something like a fracture or a bleed rather than cancer.


Awesome. I love to see small teams tackling big problems like this. Good luck guys!


thank you!!


>they could reliably distinguish cancerous moles from benign ones better than individual dermatologists

Do you have anything published about this?


Not yet! I’ll have a paper on skin cancer diagnosis out soon though; stay tuned or reach out directly for more info!


Erik we would be interested. How can we connect?


Reach out! Contact@centaurlabs.io


It's incredibly concerning that we could have "medical students and other health professionals" reading images and providing labeling data that would actually get used to teach AI...


The way we handle this is that we only trust our users based on their past performance on "gold standard" cases, regardless of their credentials. So if a med student (or even a layperson!) looks at cases on the app, they aren't contributing to labeling data that we provide to customers until we've learned to trust them based on their performance on hundreds or thousands of past cases. At the same time, we don't trust people just because they're a doctor if they don't perform well at the task :)
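In toy terms, the gating rule is roughly this (the threshold numbers here are made up, and the real statistics are computed per task):

    # A user's opinions only count toward customer-facing labels once
    # they've proven themselves on enough gold-standard cases.
    MIN_GOLD_CASES = 200
    MIN_GOLD_ACCURACY = 0.85

    def is_trusted(gold_correct, gold_total):
        if gold_total < MIN_GOLD_CASES:
            return False
        return gold_correct / gold_total >= MIN_GOLD_ACCURACY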


We already have kids identifying cars, storefronts and traffic lights. What’s the worst that could happen?


At Gigantum (https://gigantum.com), we've been working with brain imaging researchers on enabling exploratory analyses that are easy to share or reproduce. For now, our collaborators have focused on large public datasets like the Healthy Brain Network (http://fcon_1000.projects.nitrc.org/indi/cmi_healthy_brain_n...) and OpenNeuro (https://openneuro.org/).

I'm curious to know how you're managing the AI engineering side of things - I know there's nothing close to "the right answer" yet in terms of pipelines for brain images. And of course I'd be interested in how folks could collaborate on developing better algorithms for understanding these images (with Gigantum and otherwise).

Certainly, if you have a collaborative project and would like to try Gigantum for coordinating code, data, and computational environments, we'd be happy to support that! We provide a one-click solution to publish a project so that someone else can pick up exactly where you left off.


Hi Gigantum. We actually don't do the computer vision ourselves -- we label the data for companies that do. But your product looks very interesting. Do you work exclusively with brain imaging?


Makes sense. The founding team came out of large-scale brain imaging research, but the goal is that anything you'd do in a notebook or web UI environment (like Jupyter, RStudio, etc.), you should be able to do with the Gigantum Client.

The difference from the standard approach for those tools is that we automate some command-line operations (Git, Docker, etc.) and provide a UI for the rest. We offer a stable foundation for organizing data using Git LFS, along with an optimized S3 storage back end if you need to cherry-pick large datasets.

Our main goal is to improve the quality of "academic" science, but we're open to anything that fits!




