Stochastic Gradient Descent (SGD) is the workhorse of modern machine learning and is widely used to train classifiers, including deep neural networks. This talk will focus on a common variant of SGD implemented in distributed systems, known as local (distributed) SGD, where the model is updated locally at different distributed processing components and averaged only periodically. Since updates are applied locally and averaging happens only periodically, this variant incurs a smaller communication cost, at the price of possibly slower convergence and a residual training error. We provide a thorough convergence analysis of local SGD and offer insights into the key factors that govern the speed and error of training. We also present techniques that can improve the speed of convergence in practice.
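The local-update, periodic-averaging structure described above can be illustrated with a minimal sketch on a synthetic least-squares problem; the worker count K, local-step count H, learning rate eta, and number of rounds are illustrative choices, not values from the talk.

```python
# Minimal sketch of local (periodic-averaging) SGD on synthetic least-squares data.
# K, H, eta, and rounds are illustrative, not taken from the talk.
import numpy as np

rng = np.random.default_rng(0)
d, n, K, H, eta, rounds = 5, 200, 4, 10, 0.05, 20

# Synthetic data, split evenly across K workers.
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
shards = list(zip(np.array_split(X, K), np.array_split(y, K)))

w = np.zeros(d)                       # shared model after each averaging round
for _ in range(rounds):
    local_models = []
    for Xk, yk in shards:             # each worker starts from the averaged model
        wk = w.copy()
        for _ in range(H):            # H local SGD steps before communicating
            i = rng.integers(len(yk))
            grad = (Xk[i] @ wk - yk[i]) * Xk[i]
            wk -= eta * grad
        local_models.append(wk)
    w = np.mean(local_models, axis=0)  # periodic averaging across workers

print("final training loss:", 0.5 * np.mean((X @ w - y) ** 2))
```

Communication occurs only once every H local steps, which is the source of the reduced communication cost; larger H typically trades off against the slower convergence and residual error discussed in the talk.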