
My intuition here is that there is always a fair bit of noise in the (data -> label) pairs. Models tend to err on the side of being _too_ expressive to compensate (just add another layer, right?). Given an over-expressive model, we have to stop training before it gets too far and starts memorizing the training set. Put another way, one input pair will want to push the parameters down one path, another will want a totally different path. If the model is too expressive, we actually don't want to reach a minimum, because reaching one essentially guarantees we've overfit.
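
To make the early-stopping point concrete, here's a minimal sketch (mine, not the author's): plain-NumPy gradient descent fitting a deliberately over-expressive degree-15 polynomial to noisy linear data, stopping when held-out loss stops improving rather than when training loss bottoms out. The degree, learning rate, and patience values are arbitrary choices for illustration.

  import numpy as np

  rng = np.random.default_rng(0)

  # Noisy (data -> label) pairs: the true function is linear, labels are noisy.
  x = rng.uniform(-1, 1, size=200)
  y = 2.0 * x + rng.normal(scale=0.3, size=200)

  # Deliberately over-expressive model: degree-15 polynomial features.
  def features(x, degree=15):
      return np.stack([x**k for k in range(degree + 1)], axis=1)

  X_train, y_train = features(x[:100]), y[:100]
  X_val,   y_val   = features(x[100:]), y[100:]

  w = np.zeros(X_train.shape[1])
  best_val, best_w, patience = np.inf, w.copy(), 0
  for step in range(20000):
      grad = X_train.T @ (X_train @ w - y_train) / len(y_train)
      w -= 0.05 * grad
      val_loss = np.mean((X_val @ w - y_val) ** 2)
      if val_loss < best_val - 1e-6:
          best_val, best_w, patience = val_loss, w.copy(), 0
      else:
          patience += 1
          if patience > 500:   # validation loss stopped improving
              break            # stop before the train loss bottoms out

  w = best_w  # roll back to the parameters with the best held-out loss
  print(f"stopped at step {step}, best val MSE {best_val:.4f}")

Train loss would keep dropping long after this point; the held-out split is what tells you the model has started fitting the label noise rather than the signal.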

What the author is saying is that optimization very quickly becomes a maze, and can turn into a combinatorial game. Each input starts down its own set of corridors (parameters at particular values), and it can take an _extremely_ long time for the maze to end. Any noise in the (data, label) pairs can mean the maze has no end at all if the model is too small; if the model is too big, the point is moot because it will already have overfit.
