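Just a couple of things: in general it's better not to use MSE after a sigmoid, because convergence is slow. The sigmoid's derivative goes to zero when the output saturates, so the MSE gradient nearly vanishes even when the prediction is badly wrong; cross-entropy doesn't have this problem.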
Also, the "logits" variable doesn't actually hold logits; it holds probabilities. Logits are what you have before applying the sigmoid activation.
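To make both points concrete, here is a minimal sketch in plain Python. It is not taken from the original code: the function names and the example values (a logit of -10 with a target of 1) are my own assumptions, chosen to show a confidently wrong prediction. It compares the gradient with respect to the logit for MSE-after-sigmoid against binary cross-entropy.

```python
import math

def sigmoid(z):
    """Map a logit (any real number) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def mse_grad_wrt_logit(z, y):
    """d/dz of (sigmoid(z) - y)**2; note the extra p * (1 - p) factor."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

def bce_grad_wrt_logit(z, y):
    """d/dz of binary cross-entropy on sigmoid(z); simplifies to p - y."""
    return sigmoid(z) - y

# Hypothetical example values: a confident but wrong prediction.
logit = -10.0          # raw pre-activation score: this is the logit
target = 1.0
prob = sigmoid(logit)  # after the sigmoid: a probability, not a logit

print(f"logit={logit}, probability={prob:.2e}")
print(f"MSE gradient: {mse_grad_wrt_logit(logit, target):.2e}")
print(f"BCE gradient: {bce_grad_wrt_logit(logit, target):.2e}")
```

With these values the MSE gradient comes out on the order of 1e-4, while the cross-entropy gradient is close to -1, so the cross-entropy update is several orders of magnitude larger exactly where the model is most wrong.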