What does the Learning rate determine in Gradient Descent?
How big the steps are at each update: each parameter is moved against its gradient by an amount scaled by the learning rate. Too small a rate makes convergence slow; too large a rate can overshoot the minimum or diverge. Separately, loss surfaces of networks with nonlinear activations are non-convex and can have local minima, and SGD works better in practice for optimizing such non-convex functions.
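The effect of the learning rate can be sketched with plain gradient descent on f(x) = x², whose gradient is 2x (the function and rate values here are illustrative, not from the source):

```python
def gradient_descent(grad, x0, learning_rate, steps):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x

grad = lambda x: 2 * x  # derivative of f(x) = x^2, minimum at x = 0

# A moderate learning rate converges toward the minimum;
# a rate that is too large makes each step overshoot and diverge.
small = gradient_descent(grad, x0=1.0, learning_rate=0.1, steps=50)
large = gradient_descent(grad, x0=1.0, learning_rate=1.1, steps=50)
print(small)  # close to 0
print(large)  # magnitude grows without bound
```

With a rate of 0.1 each step multiplies x by 0.8, so x shrinks toward the minimum; with 1.1 each step multiplies x by -1.2, so x oscillates with growing magnitude.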