Consider trying to classify shapes as either squares or circles, where the model outputs the probability of a circle. The absolute best you can do is to learn the training set completely: assign probability 1 to all the circles and 0 to all the squares.
It is typical to squish the set of all real numbers into the interval (0, 1), e.g. with a sigmoid. Any finite value gets squished to something strictly less than 1, so the model keeps trying to push the number higher and higher. No matter which model you have, you can always go a bit higher. Thus there is no optimal model.
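To make that concrete, here is a minimal sketch (the toy scores, labels, and scaling setup are my own illustration, not from the article): on a perfectly separable training set, scaling the logits by a larger and larger factor keeps shrinking the cross-entropy loss toward 0 without ever reaching it, so no finite model is optimal.

```python
import numpy as np

# Toy logits: positive for circles (label 1), negative for squares (label 0).
scores = np.array([2.0, 1.0, -1.5, -3.0])
labels = np.array([1, 1, 0, 0])

def cross_entropy(scale):
    """Mean cross-entropy when the logits are scaled up by `scale`."""
    z = scale * scores
    # Numerically stable form: -log(sigmoid(z)) = log(1 + e^-z) and
    # -log(1 - sigmoid(z)) = log(1 + e^z).
    return np.mean(np.where(labels == 1,
                            np.logaddexp(0, -z),
                            np.logaddexp(0, z)))

for scale in [1, 10, 100, 1000]:
    print(scale, cross_entropy(scale))  # loss keeps shrinking, never hits 0
```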
The author mentions regularization, but strangely he then proceeds as if it didn't exist, because with regularization you can prove that a minimum exists. Briefly (don't worry if you don't follow this, I include it for other readers): with a regularization term, the loss goes to infinity as the parameter norm goes to infinity. The loss is bounded below, so it has a greatest lower bound (infimum). Take a sequence of points in parameter space whose losses converge to this infimum; since the loss blows up at infinity, this sequence must be bounded, so by Bolzano-Weierstrass it has a convergent subsequence. The loss is a continuous function of the parameters, so the loss at the limit point is the limit of the losses, i.e. it equals the infimum. In other words, the infimum is attained: it is a minimum.
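The same toy example bears this out numerically. A minimal sketch (same assumed toy data as above, with an arbitrary penalty weight of 0.01): adding an L2 penalty makes the objective blow up for large parameters, so a finite minimizer exists.

```python
import numpy as np

scores = np.array([2.0, 1.0, -1.5, -3.0])
labels = np.array([1, 1, 0, 0])

def cross_entropy(scale):
    z = scale * scores
    return np.mean(np.where(labels == 1, np.logaddexp(0, -z), np.logaddexp(0, z)))

lam = 0.01  # arbitrary penalty weight, chosen for the sketch

def regularized(scale):
    # L2-regularized objective: it tends to infinity as scale grows,
    # so it attains its infimum at some finite point.
    return cross_entropy(scale) + lam * scale**2

grid = np.linspace(0.0, 50.0, 5001)
vals = np.array([regularized(s) for s in grid])
best = grid[np.argmin(vals)]
print(best, regularized(best))  # a finite minimizer, unlike the unregularized case
```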