What does the Learning rate determine in Gradient Descent?
How big the steps are at each update: each parameter is moved against its gradient by an amount scaled by the learning rate. Too small a rate makes convergence slow; too large a rate can overshoot the minimum or diverge. Separately, loss surfaces of networks with nonlinear activations are non-convex and can have local minima, and SGD works better in practice for optimizing such non-convex functions.
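The effect of the learning rate can be sketched with plain gradient descent on f(x) = x², whose gradient is 2x (the function and rate values here are illustrative, not from the source):

```python
def gradient_descent(grad, x0, learning_rate, steps):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x

grad = lambda x: 2 * x  # derivative of f(x) = x^2, minimum at x = 0

# A moderate learning rate converges toward the minimum;
# a rate that is too large makes each step overshoot and diverge.
small = gradient_descent(grad, x0=1.0, learning_rate=0.1, steps=50)
large = gradient_descent(grad, x0=1.0, learning_rate=1.1, steps=50)
print(small)  # close to 0
print(large)  # magnitude grows without bound
```

With a rate of 0.1 each step multiplies x by 0.8, so x shrinks toward the minimum; with 1.1 each step multiplies x by -1.2, so x oscillates with growing magnitude.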