
Many computer problems are probabilistic and occur at very low rates, which makes it hard to draw firm conclusions about the cause and outcome of a given process failure. While I believe cosmic rays are a problem, the issue I'm dealing with now is bad silicon: when you make a bunch of chips (millions), some fraction of them will sometimes compute a few operations incorrectly. Post-manufacture validation doesn't catch every problem, and ultimately some of these machines slip into the serving fleet.

ML people who train on this hardware report more NaN-induced training failures than software bugs alone would explain. It's extremely challenging to debug because most ML code is very robust to small amounts of injected noise, especially independent Gaussian noise (there's literature showing that injecting random noise during training often helps training go faster).
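One rough way to triage a NaN is to rerun the exact same step on the same inputs and see whether it comes back. Here is a minimal sketch of that idea, not anything taken from a real fleet: `has_nonfinite`, `classify_step`, and the label strings are hypothetical names, and `step_fn` stands in for whatever deterministic training step you can replay. A NaN that doesn't reproduce is only a hint at hardware, since nondeterministic software (races, unordered reductions) can look the same.

```python
import math
from typing import Callable, Sequence


def has_nonfinite(values: Sequence[float]) -> bool:
    """Return True if any value is NaN or +/-inf."""
    return any(not math.isfinite(v) for v in values)


def classify_step(step_fn: Callable[[], Sequence[float]], reruns: int = 2) -> str:
    """Replay a supposedly deterministic step and label the outcome.

    "ok"               -> first run produced only finite values
    "reproducible"     -> every rerun also produced a non-finite value,
                          which points at software or numerics
    "suspect-hardware" -> at least one rerun came back clean, so the
                          failure is not deterministic (assumption: the
                          step itself has no nondeterminism of its own)
    """
    first = list(step_fn())
    if not has_nonfinite(first):
        return "ok"
    for _ in range(reruns):
        if not has_nonfinite(list(step_fn())):
            return "suspect-hardware"
    return "reproducible"
```

In practice you'd want to replay the step on a different machine as well, since a bad chip may fail the same way every time on the same input.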

This is a fascinating area and there aren't a lot of people who can really make forward progress in it.



