Launch HN: Aquarium (YC S20) – Improve Your ML Dataset Quality

tmshapland · on July 13, 2020

I'm an Aquarium user. There are two ways Aquarium provides value to my company. First, we improved our model performance. Second, I spent less time and less clicks curating my dataset.

Regarding model performance, I used Aquarium to improve the AUC for my model by 18 percentage points (i.e., comparing the AUC for the first model trained on my new dataset to the AUC for my production model).

Regarding dataset curation efficiency, I spent much less time curating my dataset using Aquarium than I would have spent using our own in-house tooling. For example, the embedding-based point cloud allowed me to identify lots of images with an issue at once, rather than image by image, click by click.

This thread has been mostly focused on improving model performance (i.e., my first point), but Aquarium is also valuable for improving model curation labor efficiency (i.e., my second point). For the business owner, dataset curation labor efficiency means less money wasted on having some of your most expensive employees, ML data scientists, clicking around and writing ad-hoc scripts. For the ML practitioner, dataset curation labor efficiency means fewer clicks and less wear on your carpal tunnels.

The founders, Peter and Quinn, didn't ask me to write this. I chose to write it because it's a great product that I think can help a lot of businesses and people.

__sy__ · on July 13, 2020

To second your comment, I think non ML folks don't understand how much of an impact dataset curation can have on model performance. More high-quality data will outshine clever network architectures with less data. I've seen it again and again. But the thing is, it's so hard to really curate your data once the dataset has a lot of "dimensionality" to it (sorry couldn't think of a better word...). To be honest, if I were to pick an area of dev-tool I'm most excited about over the next 5 years, this area is probably it.

__sy__ · on July 13, 2020

Btw, for anyone interested, here's a good/quick talk by Andrej Kapathy on what it will take to build the next software stack. https://www.youtube.com/watch?v=y57wwucbXR8&t=3s

shcheklein · on July 13, 2020

Hey! DVC maintainer and co-founder here. First of all, congrats and let me know if we can help you or you have some collaboration in mind! A few questions - how does workflow look like - do you expect users to upload all data to your service? How can data then be consumed from the platform?

pgao · on July 13, 2020

Thanks!

We don't expect users to upload all data to our service - the type of data we're interested in is "metadata." URLs to the raw data, labels, inferences, embeddings, and any additional attributes for their dataset. Users can POST this to our API and we'll ingest it that way.

If users don't provide their own embeddings, we need access to the raw data so we can run our pretrained models on the data to generate embeddings.

However, if users do provide their own embeddings, we would never need access to the raw data - Aquarium operates on embeddings, so the raw data URLs would be purely for visualization within the UI. This is really nice because it means that we can access restrict URLs so only customers can visualize it (via URL signing endpoints, only authorizing IP addresses within customer VPNs, Okta integration) and Aquarium would operate on relatively anonymized embeddings and metadata.

stev3 · on July 13, 2020

Thanks for all the hard work and congrats on your launch!

I will definitely check this service out for a side project I'm working on that combines basketball and AI (https://www.myshotcount.com/)

pgao · on July 13, 2020

Yup, I saw your form submission through our site! I reached out to you over email, I'm confident we can help out :)

masio12 · on July 13, 2020

I think is a great idea because as you mentioned quality Datasets can make your model work or not at all. However this is not addressing the big elephant in the room. Which is: no matter how much you curate or clean the data, you are limited to the dataset that you have. The big answer would be, how can you get more and better datasets. I think tooling is super important, but the big difference will be, how to collect/generate/capture reliable, defendable, datasets moving forward. I think your idea is complementary to this other project: https://delegate.dev

pgao · on July 13, 2020

I absolutely, 120% agree on the importance of adding the right data. Aquarium helps you with: "what data should I be collecting to improve my model" and "where do I find that data?"

For the latter, Aquarium treats the problem of smart data sampling as a search and retrieval problem. You want to find more examples of a "target" from a large stream of unlabeled data. Aquarium does this by comparing embeddings of the unlabeled data to your "target set" and then sending examples to labeling if they're within a defined distance threshold in embedding space. We don't actually do the labeling, but we wrap around common labeling providers and can integrate into in-house flows with our API.

quinnhj · on July 13, 2020

Other founder here! For a high level overview of this framing of the problem, I recommend reading this Waymo blog post [1].

One nice feature is that by using embeddings produced by a user's model, which has been trained in the context of their domain, we can do this sort of smart sampling in domains we've never seen before. Embeddings are also naturally anonymized, so we can do this without access to a user's potentially private raw data streams.

[1] https://blog.waymo.com/2020/02/content-search.html

TuringNYC · on July 13, 2020

Dear @pgao thank you for the long intro with references and explanations. I went to your website and noticed the "getting started" is a contact form. Curious -- are you making a product to do this, or is it more consulting/advisory? I'm currently creating some fun datasets for public usage and i'd love to be a test rat for your software.

pgao · on July 13, 2020

Hey there, it's a product right now! Our goal is to make it self serve, but we're currently onboarding people one-by-one manually until we can streamline the onboarding flow and build out a self serve process. Feel free to DM me or fill out the form and I can send you our public demo!

hughpeters · on July 13, 2020

Thanks for sharing @pgao! This tool looks really valuable.

> Since embeddings can be extracted from most neural networks, this makes our platform very general. We have successfully analyzed dataset + models operating on images, 3D point clouds from depth sensors, and audio.

Are there any types of datasets/models that this tool would not work well with that you're aware of?

pgao · on July 13, 2020

Thanks a bunch!

I think the biggest issues with this approach is the requirement for embeddings. It's hard sometimes for a customer to understand what layer to pull out of their net to send to us, so sometimes we just use a pretrained net to generate embeddings. One net for audio, one net for imagery, one net for pointclouds, etc.

I'd say that it's harder for this tool to work with structured/tabular data for a few reasons.

One, most structured datasets are domain-specific, so it's not easy to pull a pretrained model off the shelf to generate embeddings - typically we would need a customer to give us the embeddings from their own model in these cases.

Two, neural nets actually aren't the best for certain structured data tasks. Tree-based techniques often get better performance on simpler tasks, which means there's no obvious embedding to pull from the model.

Three, an alternate interpretation is that a feature vector input for structured data tasks is already an embedding! When the input data is low dimensional, you can do anomaly detection and clustering just by histogramming and other basic population statistics on your data, so it's a lot easier than dealing with unstructured data like imagery.

So I wouldn't say that our tooling wouldn't work for structured data, but more that in those types of cases, maybe there's something simpler that works just as well.

fractionalhare · on July 13, 2020

If I understand correctly, it sounds like your platform is primarily intended for improving awareness and understanding of the data a team has, so they know which features to focus on and emphasize.

Do you think you'll get into synthetic data generation as well? In other words, improving dataset quality additively, not just curatively.

pgao · on July 13, 2020

Yes, your interpretation is correct. I don't think we're going to get into synthetic data generation in the near term, mainly due to the amount of effort required + questions about domain transfer. However, we do improve dataset quality additively by sampling the best data to label + retrain on to get the best performance.

Said another way: once you've found "I do badly on green cones," we use similarity search on the embeddings of known green cone examples to find more instances of green cones in the wild. We pick the right examples from streams of unlabeled data, then send it to labeling + add to your dataset so it does better the next time you retrain.

mlthoughts2018 · on July 13, 2020

I like this much better than synthetic data augmentation actually. I think synthetic augmentation, like with GANs is actually a failed concept.

There has long been theoretical limits around how much you can gain by ensembling with a model of known limitations, and this is all that synthetic training data is at root.

You can’t “make up” training data that allows you to escape the ceiling of performance implied by whatever generator process you use for the synthetic data, no differently than you can’t learning a better regression just by bootstrapping a large sample of data from your existing training set.

Algorithmic synthetic data is a big type of fool’s gold.

jononor · on July 13, 2020

Have tested the tool a little bit for audio, and see a potential here. Especially useful for anyone who has a relatively large amount of unlabeled data, and want to be efficient in terms of what samples to spend resources labeling.

pgao · on July 13, 2020

Thanks for the shoutout! We got connected to jononor through our previous r/machinelearning launch: https://www.reddit.com/r/MachineLearning/comments/hjbl4h/p_l...