
Generated data can be used to increase variety as a way to avoid overfitting. A simple example might be translating or rotating images.

Not an expert, but I'm guessing it must be easier to do this than to improve how the machine learning generalizes from less data.
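As a rough illustration of the translate/rotate idea, here is a minimal pure-Python sketch. It assumes an image is just a 2D list of pixel values; the helper names `rotate90`, `translate`, and `augment` are made up for this example and not taken from any library. Real pipelines would typically use an image library and finer-grained rotations.

```python
import random

def rotate90(image):
    """Rotate a 2D grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def translate(image, dx, dy, fill=0):
    """Shift a 2D grid by dx columns and dy rows, padding with `fill`."""
    h, w = len(image), len(image[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            # Pixels shifted outside the frame are dropped.
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = image[y][x]
    return out

def augment(image, rng, max_shift=1):
    """Return a randomly rotated and shifted copy of `image` (hypothetical helper)."""
    out = image
    for _ in range(rng.randrange(4)):  # 0 to 3 quarter turns
        out = rotate90(out)
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)
    return translate(out, dx, dy)

if __name__ == "__main__":
    rng = random.Random(0)
    original = [[1, 2], [3, 4]]
    # Each call yields another variant of the same labelled example.
    variants = [augment(original, rng) for _ in range(4)]
    for v in variants:
        print(v)
```

Every variant keeps the label of the original image, which is why this adds variety without requiring any new labelled data.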




Rotation, translation, etc. of images is known more generally as data augmentation, and you're correct that it's a way to reduce the amount of training data needed to cold-start a data product.

As for "synthetic data", my understanding is that it refers to the use of ML to generate brand new training samples (e.g. through GANs, though there are limitations [1]).

1 - https://openreview.net/forum?id=rJMw747l_4


Usually when someone says “synthetic data” they mean CGI, not simply transformations of existing data. Using synthetic data is fraught (and presumptuous), as you are assuming you understand the problem domain 100% and are also extremely good at reproducing it. There’s a chance the model is using something specific to the CGI (and not the general reality) to produce its results.

For winning a computer vision competition it’s probably ok but I’d be very careful about using synthetic data for systems I cared about.


I thought "synthetic data" was something that rarely shows up in training image recognition, and is more like randomly generated user data (name, surname, etc.) or data generated from simulations of some processes?


>I thought "synthetic data" was something that rarely shows up in training image recognition

On the contrary, it is used to train models, but it cannot adequately capture the long tail of weird events in the real world. Hence, it cannot be relied upon, as the parent commenter alluded to. On the question of using data collected from a simulated environment versus the real world: this subject was discussed at some length by Elon Musk and Andrej Karpathy at the Tesla Autonomy Day event a few weeks ago.


I generally agree that there is no substitute for experience running in production and more data is better - or at least should be, if you can figure out how to take advantage of it.

The thing is, when it comes to weird events, historical data can't be relied on either. The next weird thing may never have happened before.

Predicting the future is hard no matter what you do. Gathering more data and learning more efficiently from what you have are both important. Training on artificial challenges can also be useful.



