
Forgetting to seed your RNG is a really classic bug. IMHO RNGs should auto-seed unless explicitly told not to, but since the opposite behaviour was baked into C so many years ago, it has become the de facto default. The worst part is how easy a bug like this is to miss, unless you happen to be printing out the first set of random numbers for some reason.



NumPy does auto-seed the RNG if you don't pass a seed yourself, using platform-specific code to pull some entropy from the OS. So that common case is handled reasonably well, unlike in C. In fact, if you want exactly reproducible results (e.g. in test cases), you have to pass a known seed to override that default behavior.
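
For example, with NumPy's Generator API (just a quick illustration):

    import numpy as np

    rng = np.random.default_rng()      # auto-seeded from OS entropy
    fixed = np.random.default_rng(42)  # explicit seed for reproducible tests

    print(rng.random(3))    # differs from run to run
    print(fixed.random(3))  # identical on every run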

The issue here is a little more subtle: if you fork 10 copies of your Python process, all 10 inherit the current RNG state, and will thereafter produce identical random number sequences. If you were manually forking, you might guess that was a potential problem, and re-seed the RNGs after forking. But PyTorch's data loaders fork a bunch of processes to do things in parallel, so users might not realize that they're using duplicate copies of their RNG state.
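
A minimal demonstration of the fork problem (Unix-only, since os.fork doesn't exist on Windows):

    import os
    import numpy as np

    np.random.seed()  # parent seeds once from OS entropy

    if os.fork() == 0:
        # The child inherits the parent's RNG state byte-for-byte...
        print("child :", np.random.randint(0, 1000, 3))
        os._exit(0)
    else:
        os.wait()
        # ...so both processes print the same three numbers.
        print("parent:", np.random.randint(0, 1000, 3))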


It’s even slightly more subtle than that.

Python multiprocessing doesn’t use fork on Windows. It starts a new process and so shouldn’t be affected by this.

So to trigger this you need num_workers > 0 on your DataLoader and to be running on a non-Windows platform.
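
The usual fix on the fork platforms, if I recall the API correctly, is to re-seed NumPy per worker via the DataLoader's worker_init_fn hook (ToyDataset is just a stand-in here):

    import numpy as np
    import torch
    from torch.utils.data import DataLoader, Dataset

    class ToyDataset(Dataset):
        # Stand-in dataset that draws "augmentation" noise via NumPy.
        def __len__(self):
            return 64
        def __getitem__(self, idx):
            return np.random.rand()  # identical across workers without re-seeding

    def worker_init_fn(worker_id):
        # torch hands each worker a distinct seed; fold it into NumPy's RNG too.
        np.random.seed(torch.initial_seed() % 2**32)

    loader = DataLoader(ToyDataset(), num_workers=4, worker_init_fn=worker_init_fn)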


I get the desire to be pedantic, but does anyone at all train DL models on Windows? (barring toy projects for fun and perhaps debugging) The same can be said about num_workers > 0. You _have to_ fork worker processes unless you're training something super tiny like MNIST and can load the whole dataset onto the GPU.


> does anyone at all train DL models on Windows?

Yes. My last job was at a financial shop that was all Windows. They were doing ML with Python on Windows. Azure has boxes available for this.


Starting with Python 3.8, multiprocessing also spawns new processes by default on macOS (because some system libraries are not fork-safe).

IMHO cross-platform Python projects should call `multiprocessing.set_start_method('spawn')` to get the same behavior everywhere.
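
Something like this, as a sketch; set_start_method has to be called exactly once, under the main guard, before any workers exist:

    import multiprocessing as mp

    def square(x):
        return x * x

    if __name__ == "__main__":
        mp.set_start_method("spawn")  # same behavior on Linux, macOS, Windows
        with mp.Pool(4) as pool:
            print(pool.map(square, range(8)))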


I’m of the opposite opinion and would do away with all auto RNG seeding:

1) It would help reproducibility a great deal, which is so often a pain.

2) Forcing users to actually understand RNG seeding from the time they are novice programmers could help prevent bugs of the sort seen in this post, which I believe stem from having too much faith that RNGs will simply work out of the box as substitutes for ‘real’ random variables.
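
One sketch of that style with NumPy (jitter is a made-up example function): pass the generator explicitly, so seeding is always a visible decision rather than hidden global state:

    import numpy as np

    def jitter(data, rng):
        # The caller must supply a Generator; nothing is seeded implicitly.
        return data + rng.normal(scale=0.01, size=len(data))

    rng = np.random.default_rng(seed=1234)  # the seed is a deliberate, visible choice
    print(jitter(np.zeros(3), rng))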


I have yet another opinion: allow both srand()-style explicit seed initialization and auto-seeding.

But you really do need to reject known bad seeds, which destroy a PRNG's statistical properties. Most PRNGs have a couple of known bad seeds, yet nobody does anything about it. The same goes for hash functions.
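
The classic illustration is the xorshift family, whose all-zero state is a fixed point: seed with 0 and you get 0 forever unless the seeding routine rejects or remaps it (the fallback constant below is an arbitrary choice):

    def xorshift32(state):
        # One step of xorshift32; an all-zero state maps to itself.
        state ^= (state << 13) & 0xFFFFFFFF
        state ^= state >> 17
        state ^= (state << 5) & 0xFFFFFFFF
        return state

    def safe_seed(seed):
        # Remap the known-bad seed instead of silently emitting zeros forever.
        seed %= 2**32
        return seed if seed != 0 else 0x9E3779B9  # arbitrary nonzero fallback

    s = safe_seed(0)
    for _ in range(5):
        s = xorshift32(s)
        print(s)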


Indeed. Almost every time you think you need a random number (like now), you actually need a low-discrepancy sequence.
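
For instance, the golden-ratio Kronecker (additive-recurrence) sequence is a one-line low-discrepancy sequence that fills [0, 1) far more evenly than i.i.d. uniform draws:

    import math

    def kronecker(n, alpha=(math.sqrt(5) - 1) / 2):
        # x_k = frac(k * alpha) with irrational alpha is low-discrepancy.
        return [(k * alpha) % 1.0 for k in range(1, n + 1)]

    print(kronecker(8))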



