Launch HN: Aquarium (YC S20) – Improve Your ML Dataset Quality
167 points by pgao on July 13, 2020 | 19 comments
Hi everyone! I’m Peter from Aquarium (https://www.aquariumlearning.com/). We help deep learning developers find problems in their datasets and models, then help fix them by smartly curating their datasets. We want to build the same high-power tooling for data curation that sophisticated ML companies like Cruise, Waymo, and Tesla have and bring it to the masses.

ML models are defined by a combination of code and the data that the code trains on. A programmer must think hard about what behavior they want from their model, assemble a dataset of labeled examples of what they want their model to do, and then train their model on that dataset. As they encounter errors in production, they must collect and label new data that teaches the model to fix those errors, and verify the fixes by monitoring the model's performance on a test set containing previous failure cases. See Andrej Karpathy's Software 2.0 article (https://medium.com/@karpathy/software-2-0-a64152b37c35) for a great description of this workflow.

My cofounder Quinn and I were early engineers at Cruise Automation (YC W14), where we built the perception stack + ML infrastructure for self-driving cars. Quinn was tech lead of the ML infrastructure team and I was tech lead for the Perception team. We frequently ran into problems with our dataset that we needed to fix, and we found that most model improvements came from improving the dataset's variety and quality. Basically, ML models are only as good as the datasets they're trained on.

ML datasets need variety so the model can train on the types of data that it will see in production environments. In one case, a safety driver noticed that our car was not detecting green construction cones. Why? When we looked into our dataset, it turned out that almost all of the cones we had labeled were orange. Our model had not seen many examples of green cones at training time, so it was performing quite badly on this object in production. We found and labeled more green cones into our training dataset, retrained the model, and it detected green cones just fine.

ML datasets need clean and consistent data so the model does not learn the wrong behavior. In another case, we retrained our model on a new batch of data that came from our labelers and it was performing much worse on detecting “slow signs” in our test dataset. After days of careful investigation, we realized it was due to a change to our labeling process that caused our labelers to label many “speed limit signs” as “slow signs,” which was confusing the model and causing it to perform badly on detecting “slow signs.” We fixed our labeling process, did an additional QA pass over our dataset to fix the bad labels, retrained our model on the clean data, and the problems went away.

While there’s a lot of tooling out there to debug and improve code, there’s not a lot of tooling to debug and improve datasets. As a result, it’s extremely painful to identify issues with variety and quality and appropriately modify datasets to fix them. ML engineers often encounter scenarios like:

Your model’s accuracy measured on the test set is at 80%. You abstractly understand that the model is failing on the remaining 20% and you have no idea why.

Your model does great on your test set but performs disastrously when you deploy it to production and you have no idea why.

You retrain your model on some new data that came in, it’s worse, and you have no idea why.

ML teams want to understand what’s in their datasets, find problems in their dataset and model performance, and then edit / sample data to fix these problems. Most teams end up building their own one-off tooling in-house that isn’t very good. This tooling typically relies on naive methods of data curation that are really manual and involve “eyeballing” many examples in your dataset to discover labeling errors / failure patterns. This works well for small datasets but starts to fail as your dataset size grows above a few thousand examples.

Aquarium's technology relies on letting your trained ML model do the work of guiding what parts of the dataset to pay attention to. Users can get started by submitting their labels and corresponding model predictions through our API. Then Aquarium lets users drill into their model performance - for example, visualize all examples where we confused a labeled car for a pedestrian from this date range - so users can understand the different failure modes of a model. Aquarium also finds examples where your model has the highest loss / disagreement with your labeled dataset, which tends to surface many labeling errors (i.e., the model is right and the label is wrong!).
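
To make the loss-ranking idea concrete, here's a simplified sketch (not our actual implementation) for a classification task; it assumes you have integer class labels and the model's softmax outputs as numpy arrays:

    import numpy as np

    def rank_by_disagreement(labels, probs):
        # labels: (N,) integer class ids; probs: (N, C) softmax outputs.
        # The highest-loss examples are the best candidates for label review.
        probs = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
        losses = -np.log(probs[np.arange(len(labels)), labels])
        return np.argsort(losses)[::-1]     # worst examples first

In practice, many of the top-ranked examples turn out to be labeling mistakes rather than model mistakes.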

Users can also provide their model's embeddings for each entry, which are an anonymized representation of what their model “thought” about the data. The neural network embeddings for a datapoint (generated by either our users’ neural networks or by our stable of pretrained nets) encode the input data into a relatively short vector of floats. We can then identify outliers and group together examples in a dataset by analyzing the distances between these embeddings. We also provide a nice thousand-foot-view visualization of embeddings that allows users to zoom into interesting parts of their dataset. (https://youtu.be/DHABgXXe-Fs?t=139)
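
As a rough sketch of the kind of distance analysis this enables (illustrative only - it assumes the embeddings arrive as an (N, D) numpy array and leans on off-the-shelf scikit-learn):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.neighbors import NearestNeighbors

    def outlier_scores(embeddings, k=10):
        # Score each example by its mean distance to its k nearest neighbors;
        # large scores flag rare or unusual data.
        nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
        dists, _ = nn.kneighbors(embeddings)
        return dists[:, 1:].mean(axis=1)  # drop the zero self-distance

    def group_similar(embeddings, n_groups=20):
        # Cluster embeddings so semantically similar examples land in the
        # same group for review.
        return KMeans(n_clusters=n_groups, n_init=10).fit_predict(embeddings)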

Since embeddings can be extracted from most neural networks, this makes our platform very general. We have successfully analyzed dataset + models operating on images, 3D point clouds from depth sensors, and audio.

After finding problems, Aquarium helps users solve them by editing or adding data. After finding bad data, Aquarium integrates into our users’ labeling platforms to automatically correct labeling errors. After finding patterns of model failures, Aquarium samples similar examples from users’ unlabeled datasets (green cones) and sends those to labeling.

Think about this as a platform for interactive learning. By focusing on the most “important” areas of the dataset that the model is consistently getting wrong, we increase the leverage of ML teams to sift through massive datasets and decide on the proper corrective action to improve their model performance.

Our goal is to build tools to reduce or eliminate the need for ML engineers to handhold the process of improving model performance through data curation - basically, Andrej Karpathy’s Operation Vacation concept (https://youtu.be/g2R2T631x7k?t=820) as a service.

If any of those experiences speak to you, we’d love to hear your thoughts and feedback. We’ll be here to answer any questions you might have!




I'm an Aquarium user. There are two ways Aquarium provides value to my company. First, we improved our model performance. Second, I spent less time and fewer clicks curating my dataset.

Regarding model performance, I used Aquarium to improve the AUC for my model by 18 percentage points (i.e., comparing the AUC for the first model trained on my new dataset to the AUC for my production model).

Regarding dataset curation efficiency, I spent much less time curating my dataset using Aquarium than I would have spent using our own in-house tooling. For example, the embedding-based point cloud allowed me to identify lots of images with an issue at once, rather than image by image, click by click.

This thread has been mostly focused on improving model performance (i.e., my first point), but Aquarium is also valuable for improving dataset curation labor efficiency (i.e., my second point). For the business owner, dataset curation labor efficiency means less money wasted on having some of your most expensive employees, ML data scientists, clicking around and writing ad-hoc scripts. For the ML practitioner, it means fewer clicks and less wear on your carpal tunnels.

The founders, Peter and Quinn, didn't ask me to write this. I chose to write it because it's a great product that I think can help a lot of businesses and people.


To second your comment, I think non-ML folks don't understand how much of an impact dataset curation can have on model performance. More high-quality data will outshine clever network architectures with less data. I've seen it again and again. But the thing is, it's so hard to really curate your data once the dataset has a lot of "dimensionality" to it (sorry, couldn't think of a better word...). To be honest, if I were to pick an area of dev tooling I'm most excited about over the next 5 years, this area is probably it.


Btw, for anyone interested, here's a good/quick talk by Andrej Karpathy on what it will take to build the next software stack: https://www.youtube.com/watch?v=y57wwucbXR8&t=3s


Hey! DVC maintainer and co-founder here. First of all, congrats, and let me know if we can help you or you have some collaboration in mind! A few questions: what does the workflow look like? Do you expect users to upload all data to your service? How can the data then be consumed from the platform?


Thanks!

We don't expect users to upload all data to our service - the type of data we're interested in is "metadata": URLs to the raw data, labels, inferences, embeddings, and any additional attributes for the dataset. Users can POST this to our API and we'll ingest it that way.
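
To make that concrete, a single entry looks roughly like this - the field names and endpoint below are made up for illustration rather than our real schema:

    import requests

    # Illustrative payload shape only, not Aquarium's actual API.
    entry = {
        "data_url": "https://customer-bucket.example.com/frames/000123.jpg",
        "label": {"class": "cone", "bbox": [104, 220, 38, 62]},
        "inference": {"class": "cone", "confidence": 0.91},
        "embedding": [0.12, -0.48, 0.07],  # usually hundreds of floats
        "attributes": {"camera": "front", "time_of_day": "dusk"},
    }
    requests.post("https://api.example.com/v1/datasets/my-dataset/entries",
                  json=entry, headers={"Authorization": "Bearer <API_KEY>"})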

If users don't provide their own embeddings, we need access to the raw data so we can run our pretrained models on the data to generate embeddings.

However, if users do provide their own embeddings, we never need access to the raw data - Aquarium operates on embeddings, so the raw data URLs are purely for visualization within the UI. This is really nice because it means we can restrict access to those URLs so only customers can visualize them (via URL signing endpoints, authorizing only IP addresses within customer VPNs, Okta integration), while Aquarium operates on relatively anonymized embeddings and metadata.
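
The URL signing piece is standard object-store functionality. For example, with S3 the viewer only ever sees a short-lived presigned URL (the bucket and key below are placeholders):

    import boto3

    s3 = boto3.client("s3")
    # The raw data stays in the customer's storage; the link expires quickly.
    signed_url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "customer-raw-data", "Key": "frames/000123.jpg"},
        ExpiresIn=300,  # URL stops working after 5 minutes
    )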


Thanks for all the hard work and congrats on your launch!

I will definitely check this service out for a side project I'm working on that combines basketball and AI (https://www.myshotcount.com/)


Yup, I saw your form submission through our site! I reached out to you over email, I'm confident we can help out :)


I think this is a great idea because, as you mentioned, dataset quality can make or break your model. However, it doesn't address the big elephant in the room, which is: no matter how much you curate or clean the data, you are limited to the dataset that you have. The big question is how to get more and better datasets. Tooling is super important, but the big difference will be how to collect/generate/capture reliable, defensible datasets moving forward. I think your idea is complementary to this other project: https://delegate.dev


I absolutely, 120% agree on the importance of adding the right data. Aquarium helps you with: "what data should I be collecting to improve my model" and "where do I find that data?"

For the latter, Aquarium treats the problem of smart data sampling as a search and retrieval problem. You want to find more examples of a "target" from a large stream of unlabeled data. Aquarium does this by comparing embeddings of the unlabeled data to your "target set" and then sending examples to labeling if they're within a defined distance threshold in embedding space. We don't actually do the labeling, but we wrap around common labeling providers and can integrate into in-house flows with our API.
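
Roughly, the retrieval step looks like this simplified sketch (cosine distance against the target set with a fixed threshold; the real pipeline has more moving parts):

    import numpy as np

    def sample_for_labeling(unlabeled_emb, target_emb, threshold=0.3):
        # Return indices of unlabeled examples whose embedding falls within
        # `threshold` cosine distance of any example in the target set.
        u = unlabeled_emb / np.linalg.norm(unlabeled_emb, axis=1, keepdims=True)
        t = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
        cos_dist = 1.0 - u @ t.T  # shape (n_unlabeled, n_target)
        return np.where(cos_dist.min(axis=1) < threshold)[0]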


Other founder here! For a high level overview of this framing of the problem, I recommend reading this Waymo blog post [1].

One nice feature is that by using embeddings produced by a user's model, which has been trained in the context of their domain, we can do this sort of smart sampling in domains we've never seen before. Embeddings are also naturally anonymized, so we can do this without access to a user's potentially private raw data streams.

[1] https://blog.waymo.com/2020/02/content-search.html


Dear @pgao, thank you for the long intro with references and explanations. I went to your website and noticed the "getting started" is a contact form. Curious -- are you making a product to do this, or is it more consulting/advisory? I'm currently creating some fun datasets for public usage and I'd love to be a test rat for your software.


Hey there, it's a product right now! Our goal is to make it self serve, but we're currently onboarding people one-by-one manually until we can streamline the onboarding flow and build out a self serve process. Feel free to DM me or fill out the form and I can send you our public demo!


Thanks for sharing @pgao! This tool looks really valuable.

> Since embeddings can be extracted from most neural networks, this makes our platform very general. We have successfully analyzed dataset + models operating on images, 3D point clouds from depth sensors, and audio.

Are there any types of datasets/models that this tool would not work well with that you're aware of?


Thanks a bunch!

I think the biggest issue with this approach is the requirement for embeddings. It's sometimes hard for a customer to know which layer to pull out of their net to send to us, so sometimes we just use a pretrained net to generate embeddings: one net for audio, one net for imagery, one net for point clouds, etc.
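
For imagery, for example, the pretrained route can be as simple as an ImageNet ResNet with its classification head removed (illustrative sketch using torchvision, not necessarily the exact nets we run):

    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from PIL import Image

    # Generic image embedder: ResNet-50 with its classifier replaced by an
    # identity, so the forward pass returns the 2048-d pooled features.
    embedder = models.resnet50(pretrained=True)
    embedder.fc = torch.nn.Identity()
    embedder.eval()

    preprocess = T.Compose([
        T.Resize(256), T.CenterCrop(224), T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    image = Image.open("frame.jpg").convert("RGB")
    with torch.no_grad():
        embedding = embedder(preprocess(image).unsqueeze(0))  # shape (1, 2048)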

I'd say that it's harder for this tool to work with structured/tabular data for a few reasons.

One, most structured datasets are domain-specific, so it's not easy to pull a pretrained model off the shelf to generate embeddings - typically we would need a customer to give us the embeddings from their own model in these cases.

Two, neural nets actually aren't the best for certain structured data tasks. Tree-based techniques often get better performance on simpler tasks, which means there's no obvious embedding to pull from the model.

Three, an alternate interpretation is that a feature vector input for structured data tasks is already an embedding! When the input data is low dimensional, you can do anomaly detection and clustering just by histogramming and other basic population statistics on your data, so it's a lot easier than dealing with unstructured data like imagery.
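
For example, on low-dimensional tabular features, something as simple as a per-column z-score check (rough sketch, purely illustrative) already surfaces a lot of the odd rows:

    import numpy as np

    def zscore_outliers(features, z_thresh=4.0):
        # Flag rows of a tabular feature matrix where any column sits more
        # than z_thresh standard deviations from that column's mean.
        mu = features.mean(axis=0)
        sigma = features.std(axis=0) + 1e-12  # guard against constant columns
        z = np.abs((features - mu) / sigma)
        return np.where((z > z_thresh).any(axis=1))[0]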

So I wouldn't say that our tooling wouldn't work for structured data, but more that in those types of cases, maybe there's something simpler that works just as well.


If I understand correctly, it sounds like your platform is primarily intended for improving awareness and understanding of the data a team has, so they know which features to focus on and emphasize.

Do you think you'll get into synthetic data generation as well? In other words, improving dataset quality additively, not just curatively.


Yes, your interpretation is correct. I don't think we're going to get into synthetic data generation in the near term, mainly due to the amount of effort required + questions about domain transfer. However, we do improve dataset quality additively by sampling the best data to label + retrain on to get the best performance.

Said another way: once you've found "I do badly on green cones," we use similarity search on the embeddings of known green cone examples to find more instances of green cones in the wild. We pick the right examples from streams of unlabeled data, then send them to labeling and add them to your dataset so the model does better the next time you retrain.


I like this much better than synthetic data augmentation, actually. I think synthetic augmentation, like with GANs, is a failed concept.

There have long been theoretical limits on how much you can gain by ensembling with a model of known limitations, and that is all synthetic training data is at root.

You can't "make up" training data that allows you to escape the ceiling of performance implied by whatever generator process you use for the synthetic data, no differently than you can't learn a better regression just by bootstrapping a large sample of data from your existing training set.

Algorithmic synthetic data is a big type of fool’s gold.


I've tested the tool a little bit for audio, and I see potential here. It's especially useful for anyone who has a relatively large amount of unlabeled data and wants to be efficient about which samples to spend labeling resources on.


Thanks for the shoutout! We got connected to jononor through our previous r/machinelearning launch: https://www.reddit.com/r/MachineLearning/comments/hjbl4h/p_l...



