The SageMaker Distributed Model Parallel Library Overview

The Amazon SageMaker distributed model parallel library is a model parallelism library for training large deep learning models that were previously difficult to train due to GPU memory limitations. The library automatically and efficiently splits a model across multiple GPUs and instances, and coordinates model training, allowing you to increase prediction accuracy by creating larger models with more parameters.

You can use the library to automatically partition your existing TensorFlow and PyTorch workloads across multiple GPUs with minimal code changes. The library’s API can be accessed through the Amazon SageMaker Python SDK.
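For example, adapting an existing PyTorch training script typically involves initializing the library, wrapping the model and optimizer, and decorating the training step. The following is a minimal sketch based on the library’s PyTorch API; MyModel and train_loader are placeholders, and exact names and arguments can vary across library versions:

    import torch
    import torch.nn.functional as F
    import smdistributed.modelparallel.torch as smp

    smp.init()  # initialize the model parallel runtime

    model = smp.DistributedModel(MyModel())  # wrap the model for automatic partitioning
    optimizer = smp.DistributedOptimizer(
        torch.optim.SGD(model.parameters(), lr=0.01)
    )

    @smp.step  # splits each batch into microbatches and pipelines them across partitions
    def train_step(model, data, target):
        output = model(data)
        loss = F.nll_loss(output, target)
        model.backward(loss)  # the library requires model.backward instead of loss.backward
        return loss

    for data, target in train_loader:  # placeholder data loader
        optimizer.zero_grad()
        loss_mb = train_step(model, data, target)  # returns per-microbatch losses
        optimizer.step()
        if smp.rank() == 0:
            print("loss:", loss_mb.reduce_mean().item())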

Tip

We recommend using this API documentation together with the conceptual guide, SageMaker’s Distributed Model Parallel, in the Amazon SageMaker developer guide. The developer guide provides conceptual background on model parallelism and the library’s core features.

Important

The model parallel library only supports training jobs using CUDA 11. When you define a PyTorch or TensorFlow Estimator with the modelparallel parameter’s enabled option set to True, the training job uses a CUDA 11 container. When you extend or customize your own training image, you must use a CUDA 11 base image. See Extend or Adapt A Docker Container that Contains the Model Parallel Library for more information.
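As an illustration, the following sketch launches a training job with model parallelism enabled through the SageMaker Python SDK. The entry point, role, bucket, and parameter values are placeholders, and you should choose a framework_version whose prebuilt container ships with CUDA 11:

    from sagemaker.pytorch import PyTorch

    estimator = PyTorch(
        entry_point="train.py",  # your adapted training script
        role="<your-sagemaker-execution-role>",
        instance_type="ml.p3.16xlarge",
        instance_count=1,
        framework_version="1.8.1",  # choose a version built against CUDA 11
        py_version="py36",
        distribution={
            "smdistributed": {
                "modelparallel": {
                    "enabled": True,
                    "parameters": {"partitions": 2, "microbatches": 4},
                }
            },
            "mpi": {"enabled": True, "processes_per_host": 8},
        },
    )

    estimator.fit("s3://<your-bucket>/<training-data>")

The mpi section is required alongside modelparallel because the library launches its training processes through MPI.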