In deep learning, the error is reduced by backpropagation, which, because of the chain rule of derivatives, multiplies the gradients of successive layers together. If one layer has already saturated and its gradient has dropped to (near) zero while the other layers are not yet fully trained, you are in trouble: the product of the gradients is (near) zero, so gradient descent can no longer improve the earlier layers in the chain.
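To make the chain-rule effect concrete, here is a minimal NumPy sketch (the layer pre-activation values are purely illustrative, not taken from any particular model): it multiplies the local sigmoid derivatives of a few stacked layers, and a single saturated layer collapses the whole product.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s(x) * (1 - s(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# Hypothetical pre-activation values for five stacked layers.
# The third layer is saturated: its input lies far out on the
# sigmoid's flat region, so its local derivative is almost zero.
pre_activations = [0.5, 0.3, 12.0, -0.2, 0.8]

local_grads = [sigmoid_grad(z) for z in pre_activations]
print("per-layer gradients:", [f"{g:.2e}" for g in local_grads])

# Backpropagation multiplies these local gradients along the chain.
total_grad = np.prod(local_grads)
print("product through the chain:", f"{total_grad:.2e}")
# One near-zero factor drives the entire product toward zero, so the
# earlier layers receive almost no learning signal.
```

Running this, the saturated layer contributes a gradient on the order of 1e-6, and the product through the chain lands around 1e-8, which is effectively no update for the layers below it.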
Next: Long Short-Term Memory Unit (LSTM)