What might be the reasons that a neural network model is unable to decrease its loss during training?
Ans: A learning rate that is too large will cause the optimization to diverge, while one that is too small will prevent real improvement and may allow the noise inherent in SGD to overwhelm your gradient estimates. Scheduling the learning rate, so that it decreases over the course of training, can help.
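As a minimal sketch of both effects, here is plain gradient descent on f(w) = w**2 (gradient 2*w); the learning-rate values and the simple 1/t decay schedule are illustrative choices, not recommendations:

```python
# Effect of learning rate on gradient descent for f(w) = w**2.
def descend(lr, steps=50, w=1.0, decay=0.0):
    for t in range(steps):
        step_lr = lr / (1.0 + decay * t)   # simple 1/t decay schedule
        w = w - step_lr * 2 * w            # gradient step on f(w) = w**2
    return w

large = descend(lr=1.1)                # |1 - 2*1.1| = 1.2 > 1: diverges
small = descend(lr=0.1)                # |1 - 2*0.1| = 0.8 < 1: converges
decayed = descend(lr=1.1, decay=1.0)   # decay rescues the large rate
print(abs(large) > 1e3, abs(small) < 1e-2, abs(decayed) < 1e-2)
# → True True True
```

The same large rate that diverges at a fixed value converges once the schedule shrinks it, which is the point of learning-rate scheduling.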
Choosing a good minibatch size also influences the learning process indirectly, because the gradient estimate from a larger minibatch tends to have lower variance than one from a smaller minibatch.
There are a number of variants of stochastic gradient descent that use momentum, adaptive learning rates, Nesterov updates, and so on to improve on vanilla SGD.
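As one example of these variants, here is a sketch of classical (heavy-ball) momentum; the hyperparameters and the quadratic objective are illustrative:

```python
# SGD with classical momentum: the velocity accumulates a decaying
# sum of past gradients, smoothing out step-to-step noise.
def sgd_momentum(grad, w, steps=300, lr=0.05, beta=0.9):
    v = 0.0
    for _ in range(steps):
        v = beta * v + grad(w)   # accumulate velocity
        w = w - lr * v           # step along the velocity
    return w

# Minimize f(w) = (w - 3)**2, whose gradient is 2 * (w - 3).
w_star = sgd_momentum(lambda w: 2 * (w - 3), w=0.0)
print(abs(w_star - 3) < 1e-3)
# → True
```

Adaptive methods such as Adam extend this idea by also tracking per-parameter gradient statistics to rescale the step size.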
The scale of the input data can make a large difference when training a neural network. Normalization layers can also improve training; batch normalization, for instance, normalizes activations using minibatch statistics and keeps a running mean and standard deviation of the activations for use at inference time.
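A minimal sketch of the input-scaling point: standardizing a feature column to zero mean and unit standard deviation before training (the raw values here are made up for illustration):

```python
import statistics

def standardize(column):
    # Shift to zero mean, then divide by the (population) std. dev.
    mu = statistics.fmean(column)
    sigma = statistics.pstdev(column)
    return [(x - mu) / sigma for x in column]

raw = [100.0, 200.0, 300.0, 400.0]   # feature on a large scale
scaled = standardize(raw)
print(round(statistics.fmean(scaled), 6), round(statistics.pstdev(scaled), 6))
# → 0.0 1.0
```

With all features on a comparable scale, no single input dominates the early gradients, which typically makes the loss easier to decrease.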
However, while the network is still struggling to decrease the loss on the training data, that is, while it is not learning at all, regularization can obscure what the real problem is, so it is often worth disabling it until training itself works.