Distributed model parallel

Amazon SageMaker Distributed Model Parallel (SMP) is a model parallelism library for training large deep learning models that were previously difficult to train due to GPU memory limitations. SMP automatically and efficiently splits a model across multiple GPUs and instances and coordinates model training, allowing you to increase prediction accuracy by creating larger models with more parameters.

You can use SMP to automatically partition your existing TensorFlow and PyTorch workloads across multiple GPUs with minimal code changes. The SMP API can be accessed through the Amazon SageMaker SDK.
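For example, you enable model parallelism for a training job by passing a ``distribution`` configuration to a SageMaker framework estimator. The following is an illustrative sketch only; the entry point script name, IAM role, instance type, and parameter values are assumptions you would replace with your own.

```python
# Illustrative sketch: launching an SMP training job with the SageMaker Python SDK.
# The entry_point script, role, instance type, and parameter values are assumptions.
from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled": True,
    "parameters": {
        "partitions": 2,       # number of model partitions
        "microbatches": 4,     # split each batch into microbatches for pipelining
        "optimize": "speed",
    },
}

mpi_options = {
    "enabled": True,
    "processes_per_host": 8,
}

estimator = PyTorch(
    entry_point="train.py",            # your training script (assumed name)
    role="<your-iam-role-arn>",
    instance_type="ml.p3.16xlarge",
    instance_count=1,
    framework_version="1.8",
    py_version="py36",
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options,
    },
)

estimator.fit("s3://<your-bucket>/<training-data-prefix>")
```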

Use the following sections to learn more about model parallelism and the SMP library.

We recommend using this documentation alongside the SageMaker Distributed Model Parallel topic in the Amazon SageMaker developer guide.

How to Use this Guide

The SMP library contains a Common API that is shared across frameworks, as well as APIs that are specific to the supported frameworks, TensorFlow and PyTorch. To use SMP, reference the Common API documentation alongside the framework-specific API documentation.
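As a rough illustration of how the framework-specific APIs build on the Common API, the following PyTorch-style sketch shows the typical pattern of initializing the library, wrapping the model and optimizer, and marking the forward and backward pass with the step decorator. The model definition, data loader, and hyperparameters are placeholder assumptions, not part of the library.

```python
# Minimal sketch of an SMP PyTorch training script.
# The model, data_loader, and hyperparameters are placeholder assumptions.
import torch
import torch.nn as nn
import smdistributed.modelparallel.torch as smp

smp.init()  # initialize the library (Common API)

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10))
model = smp.DistributedModel(model)   # partition the model across GPUs
optimizer = smp.DistributedOptimizer(torch.optim.Adam(model.parameters(), lr=1e-3))

@smp.step
def train_step(model, inputs, targets):
    # Runs as pipelined microbatches across the model partitions.
    outputs = model(inputs)
    loss = nn.functional.cross_entropy(outputs, targets)
    model.backward(loss)   # use model.backward instead of loss.backward()
    return loss

device = torch.device("cuda", smp.local_rank())
for inputs, targets in data_loader:   # data_loader is assumed to exist
    inputs, targets = inputs.to(device), targets.to(device)
    optimizer.zero_grad()
    loss = train_step(model, inputs, targets)
    optimizer.step()
```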