What: Data parallelism — splitting the training data across GPUs
- Each GPU holds a full copy of the model and computes the loss and gradients on its own shard of the data
- Gradients are averaged across GPUs to produce a single model update
- Allows scaling to multiple GPUs with minimal code changes
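The averaging step above can be sketched without any framework. This toy example (a hypothetical 1-D linear model, not from the original notes) splits a batch into equal shards, computes each shard's gradient as one simulated "GPU" would, averages the gradients, and applies one shared update; real training would delegate the averaging to a primitive such as PyTorch's DistributedDataParallel.

```python
# Toy, framework-free sketch of data-parallel training: equal data shards,
# per-shard gradients, averaged into one update. Names and the model
# (y_hat = w * x with squared-error loss) are illustrative assumptions.

def shard_gradient(w, xs, ys):
    """Gradient of mean squared error for the model y_hat = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def data_parallel_step(w, xs, ys, num_workers, lr=0.1):
    """One SGD step with the batch split evenly across workers."""
    shard_size = len(xs) // num_workers
    grads = []
    for k in range(num_workers):  # each iteration plays the role of one GPU
        lo, hi = k * shard_size, (k + 1) * shard_size
        grads.append(shard_gradient(w, xs[lo:hi], ys[lo:hi]))
    avg_grad = sum(grads) / num_workers  # the averaging ("all-reduce") step
    return w - lr * avg_grad

# Synthetic data generated from y = 3x; every worker starts from the same w.
xs = [0.5, 1.0, 1.5, 2.0]
ys = [3 * x for x in xs]
w = 0.0
for _ in range(50):
    w = data_parallel_step(w, xs, ys, num_workers=2)
print(round(w, 3))  # converges toward the true weight 3.0
```

With equal-sized shards, the averaged gradient is identical to the full-batch gradient, which is why data parallelism leaves the optimisation itself unchanged.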