Love it; one of the biggest things we've preached to users of our computer vision dev-tool is: it doesn't matter how much time you spend on your model if your data is bad.
When you're trying to get a first version of your CV project launched, you can usually pretty much ignore the model (and pick one up off the shelf) because the time you spend on improving your data is orders of magnitude higher leverage.
Someday the model might become the limiting reagent, but that point comes much, much later than people assume (if at all). But the model is the "fun" part for technical people to play with (and academia encourages treating your data as an immutable constant), so it gets a lot of attention at the expense of the data.
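For concreteness, "picking one up off the shelf" can be as little as this. A minimal sketch assuming PyTorch/torchvision (the >= 0.13 weights API) with a placeholder class count:

```python
# A minimal sketch of "picking a model up off the shelf": start from a
# pretrained torchvision backbone and only swap the classification head.
# (torchvision >= 0.13 weights API assumed; NUM_CLASSES is a placeholder.)
import torch
import torchvision.models as models

NUM_CLASSES = 5  # whatever your project actually needs

model = models.resnet18(weights="IMAGENET1K_V1")  # off-the-shelf backbone
model.fc = torch.nn.Linear(model.fc.in_features, NUM_CLASSES)

# From here, nearly all of the remaining leverage is in the dataset you
# feed it, not in swapping architectures.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```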
I have come to see it as a sort of "Maslow's Hierarchy" of data needs, where the base is getting data and the top is producing "results". In academia "results" means publishing papers; in other industries it may be something else.
Somewhere between the top (glorious presentation) and bottom (no data) we can draw a fuzzy line that partitions the "glam stack" and the "drudge stack". The more the reward structure is skewed toward the top, the greater the incentive is to just cook the data.
I think what was surprising for me dealing with junior data scientists was that I had to make this point. Why are you killing yourself trying to get insane accuracy on dirty data? Your energy is better spent getting better data.
Cleaning data is usually very low on the totem pole of priorities for many companies. It is often a very manual and time-consuming process, and support from management to fix it can be very poor.
Relational tables with millions or billions of rows can contain many different kinds of errors. Finding and fixing them can be like finding a few dozen needles in a field of haystacks.
I was working on a tool to make it much easier to find and fix anomalies. I used the Chicago crime data (available for free download on the city's open data portal) to show how easy it was to find and fix problems. Here is a short video of some of the things it found: https://www.youtube.com/watch?v=kqkNeU1LYEQ
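To give a flavor of the kinds of checks involved (this is not the tool itself, just a sketch in pandas; the column names and bounding box are assumptions about the portal's schema):

```python
# Not the tool from the video -- just a sketch of the kinds of checks it
# automates, using pandas on a CSV export of the open data portal.
# The column names and the Chicago bounding box are assumptions.
import pandas as pd

df = pd.read_csv("chicago_crimes.csv")

# Duplicate case numbers
dupes = df[df.duplicated(subset=["Case Number"], keep=False)]

# Coordinates that fall outside a plausible bounding box for Chicago
has_coords = df["Latitude"].notna() & df["Longitude"].notna()
bad_coords = df[
    has_coords
    & ~(df["Latitude"].between(41.6, 42.1) & df["Longitude"].between(-87.95, -87.5))
]

# Rows missing the main categorical field
missing_type = df[df["Primary Type"].isna()]

print(len(dupes), len(bad_coords), len(missing_type))
```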
I contacted the people responsible for managing the data to show them some of the problems and how to fix them. They seemed like they could not be any less interested.
Oh, but it certainly can be, depending on the industry you are in.
It can just be frustrating, because you want to deliver results and this realization can get you stuck in the mud of "we need better measuring processes." However, you also get to advocate for why these measuring practices matter.
It’s not a junior data scientist thing, it’s a bad DS thing. It just shows that they can’t see the forest for the trees, and even if you somehow get them to realize this problem, the next problem will follow with the same result.
Some possible insight: For some of us there is a particular pull to tasks that are significant, complicated, and seem to have a finite solution. Cleaning/parsing existing data strikes me as one of those.
I’ve read the first chapter and the mosquito was… beautiful. Actually lol’ed at that.
Anyway, very good material. It correctly identifies that most engineers don't mistrust their data enough, but it doesn't go far enough in saying that folks like to play with toys (models) instead of doing the tireless, boring, but important work.
I have an ML project I started that involved manually labeling around 10,000 still frames of my hand on a guitar fretboard playing various chord shapes. I made a little web app with a keyboard interface for quickly adding labels to images. I got up to around an image a second when I got in the zone. I finished the dataset, got distracted by the birth of my son, and have literally done none of the “fun stuff” yet!
If anyone wants to have a crack at the data, it’s in git-lfs and here:
If you ever get back into it, I bet you can 10x to 20x that speed with augmented labeling.
Once you have >100 frames labeled you can put a classifier in the loop and only have to label the % of frames it gets wrong.
I usually set up a 10x10 grid view containing only samples the classifier labeled as a single class, mark the ones it got wrong as unlabeled, and move on to the next batch. With an 80% accurate classifier you can get 80 samples labeled every 5 seconds or so.
And if you retrain the classifier regularly on the newly labeled samples you can improve its accuracy and the speed of labeling with it.
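A rough sketch of that loop, with scikit-learn and synthetic features standing in for real frames (the human review step is stubbed out; in practice it is the 10x10 image grid):

```python
# Classifier-in-the-loop labeling sketch. Synthetic features and a stubbed
# review step stand in for real frames and the grid UI described above.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(5000, 64))           # stand-in for frame features
true_y = (X_pool[:, 0] > 0).astype(int)        # hidden "ground truth"
labeled = {i: int(true_y[i]) for i in range(200)}  # ~200 hand-labeled seeds

def review_batch(indices, proposed):
    """Stub for the human pass: return the indices the proposal got wrong."""
    return {i for i in indices if true_y[i] != proposed}

for _round in range(5):                        # a few labeling passes
    idx = np.array(sorted(labeled))
    clf = LogisticRegression(max_iter=1000).fit(
        X_pool[idx], [labeled[i] for i in idx]
    )
    unlabeled = np.array([i for i in range(len(X_pool)) if i not in labeled])
    if len(unlabeled) == 0:
        break
    preds = clf.predict(X_pool[unlabeled])

    # One predicted class at a time, in batches of 100 (the "10x10 view"):
    # accept the ones the classifier got right wholesale, send the rest back.
    for cls in np.unique(preds):
        batch = unlabeled[preds == cls][:100]
        wrong = review_batch(batch, cls)
        for i in batch:
            if i not in wrong:
                labeled[i] = int(cls)
    # Retraining on the newly accepted labels (next iteration) keeps
    # improving accuracy and, with it, labeling speed.

print(f"labeled {len(labeled)} / {len(X_pool)} frames")
```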
A data scientist friend of mine had some success with Figure8 but I haven't used it myself.
Honestly I always roll my own, it's dead fast to throw a simple GUI together in tkinter and it makes it easy to integrate your own models and custom sample rendering/plotting.
That is, if you're doing simple discrete class labeling, as opposed to more complex labeling like box labeling for image segmentation or text-to-speech labeling, for example.
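For the simple discrete case, a minimal sketch of what such a tkinter labeler can look like (the frame paths, class keys, and output CSV are placeholders):

```python
# Minimal roll-your-own tkinter labeler for discrete classes: show an image,
# press a number key to label it and advance. Paths, class names, and the
# output CSV below are placeholders for illustration.
import csv
import glob
import tkinter as tk
from PIL import Image, ImageTk  # Pillow

CLASSES = {"1": "class_a", "2": "class_b", "3": "class_c"}
files = sorted(glob.glob("frames/*.jpg"))
labels = {}
pos = 0

root = tk.Tk()
panel = tk.Label(root)
panel.pack()

def show(i):
    img = ImageTk.PhotoImage(Image.open(files[i]).resize((480, 360)))
    panel.configure(image=img)
    panel.image = img  # keep a reference so it isn't garbage-collected
    root.title(f"{i + 1}/{len(files)}  {files[i]}")

def on_key(event):
    global pos
    if event.char in CLASSES:
        labels[files[pos]] = CLASSES[event.char]
        pos += 1
        if pos < len(files):
            show(pos)
        else:
            with open("labels.csv", "w", newline="") as f:
                csv.writer(f).writerows(labels.items())
            root.destroy()

root.bind("<Key>", on_key)
show(pos)
root.mainloop()
```

Swapping in your own model predictions (e.g., pre-filling a suggested label in the title bar) is a few extra lines, which is the main reason to roll your own rather than use a hosted tool.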
Thx for sharing the link to your work. Interesting idea. A few thoughts:
1) Congrats on the birth of your child. I imagine you have zero time now, but as they get older you start getting your time back. I went through this, and now I can sneak in personal projects while the kids are in their activities, late at night, or early in the morning. Be aware that your body and stamina decline as you age. I am getting close to my mid-40s and I can feel it.
2) The previous poster presented an interesting idea about putting a basic classifier in the loop. The challenge is how you know when the classifier gets it wrong. Confidence scores you get from the logits are extremely flaky. I think one solution is to use metric learning methods (contrastive loss instead of cross-entropy); a toy sketch of the idea follows this list. I have seen some papers that dance around this but have not seen anything fully baked from a scientific perspective.
3) Your task is an interesting action recognition task. You should seriously consider putting it on Kaggle or writing a paper on it (and releasing the dataset). An easy off-the-shelf model you could try for a video classification task like this is X3D, but there are a variety of other methods (I'm a researcher in the field).
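Regarding 2), a toy sketch (not a full metric-learning pipeline) of what flagging samples by embedding distance instead of softmax confidence could look like; the embedding model, centroid approach, and threshold are all assumptions:

```python
# Toy contrast between "trust the softmax" and "trust the distance in an
# embedding space" for deciding which samples to send back to a human.
# The embedding model, class centroids, and radius are assumptions.
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def flag_by_confidence(logits, threshold=0.9):
    """The flaky version: flag anything whose max softmax prob is low."""
    return softmax(logits).max(axis=1) < threshold

def flag_by_distance(embeddings, labeled_embeddings, labeled_y, radius=1.0):
    """Distance to the nearest class centroid in embedding space; samples
    far from every centroid get reviewed instead of auto-accepted."""
    centroids = np.stack(
        [labeled_embeddings[labeled_y == c].mean(axis=0) for c in np.unique(labeled_y)]
    )
    dists = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.min(axis=1) > radius
```

The intuition is that, with an embedding trained under a contrastive objective, "far from every class centroid" tends to be a more honest "I don't know" signal than a softmax score squeezed out of overconfident logits.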
> Anyone is welcome to take this course, regardless of background. To get the most out of this course, we recommend that you:
> Completed an introductory course in machine learning (like 6.036 / 6.390). To learn this on your own, check out Andrew Ng’s ML course, fast.ai’s ML course, or Dive into Deep Learning.
Does anyone have a recommendation for one of these options over the others?
Depends on your background and what you're most interested in learning.
- If you want to really understand the mathematical foundations, take Andrew Ng's course (the full Stanford course, not the Coursera one).
- If you want a quick sampler of ML and the big ideas without investing too much time, take Andrew Ng's Coursera course. I'm not sure what his deeplearning.ai course is like, but it seems that one is in Python.
- If you want to learn by getting your hands dirty, either fast.ai's ML course or Dive into Deep Learning is good. Fast.ai's course will get you into the nitty-gritty parts of training neural networks.
Great to see data-centric AI receiving more attention lately! It's really exciting to see more emphasis being placed on error analysis and model evaluation too. Back in my grad school days, one of my professors always emphasized the importance of mastering error analysis, even more so than building models. We've actually been working on some tools to make error analysis more user-friendly and less boring at https://openlayer.com, if anyone's interested in checking them out. I'm always happy to chat and geek out about data science, so feel free to hit me up!
Scale AI does this by getting high-quality labels.
They have also built some tooling to inspect your labels using your models:
https://dashboard.scale.com/nucleus
Just please remember: clean data alone is not enough. As Nobel Prize-winning economist Ronald Coase said, "if you torture the data long enough, it will confess to anything"... This is something we see quite often at our company working with financial data, so I am sure it applies to other fields as well...