Usually when someone says “synthetic data” they mean CGI, not simply transformations of existing data. Using synthetic data is fraught (and presumptuous), as you are assuming you understand the problem domain 100% and are also extremely good at reproducing it. There’s a chance the model is using something specific to the CGI (and not the general reality) to produce its results.
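For contrast, here is a minimal sketch of the "transformations of existing data" sense of the term (i.e. data augmentation), assuming Pillow is installed and a hypothetical sample.jpg exists; CGI-style synthetic data would instead mean rendering entirely new scenes:

    from PIL import Image, ImageOps
    import random

    def augment(img):
        # "Transformations of existing data": cheap perturbations
        # of real samples, as opposed to rendering scenes from scratch.
        if random.random() < 0.5:
            img = ImageOps.mirror(img)              # horizontal flip
        return img.rotate(random.uniform(-15, 15))  # small random rotation

    real = Image.open("sample.jpg")  # hypothetical real training image
    variants = [augment(real) for _ in range(8)]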

For winning a computer vision competition it’s probably ok but I’d be very careful about using synthetic data for systems I cared about.




I thought "synthetic data" it's something that rarely shows in training image recognition, and is more like randomly generated user data (name, surname, etc.) or data generated from simulations of some processes?


>I thought "synthetic data" was something that rarely shows up in training image recognition

On the contrary, it is used to train image-recognition models, but it cannot adequately capture the long tail of weird events in the real world, so it cannot be relied on by itself, as the parent commenter alludes to. The tradeoff between data collected from a simulated environment and data from the real world was discussed at some length by Elon Musk and Andrej Karpathy at the Tesla Autonomy Day event a few weeks ago.


I generally agree that there is no substitute for experience running in production, and that more data is better, or at least should be, if you can figure out how to take advantage of it.

The thing is, when it comes to weird events, historical data can't be relied on either. The next weird thing may never have happened before.

Predicting the future is hard no matter what you do. Gathering more data and learning more efficiently from what you have are both important. Training on artificial challenges can also be useful.



