Distributed Training APIs

SageMaker distributed training libraries offer both data parallel and model parallel training strategies. They combine software and hardware technologies to improve inter-GPU and inter-node communications. They extend SageMaker’s training capabilities with built-in options that require only small code changes to your training scripts.
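For example, with the SageMaker Python SDK, the data parallel library is enabled through the estimator's distribution argument. The following is a minimal sketch, assuming an existing PyTorch training script (train.py), a SageMaker execution role, and an S3 input path, all of which are placeholders:

    from sagemaker.pytorch import PyTorch

    # Placeholder role ARN and entry point; substitute your own values.
    estimator = PyTorch(
        entry_point="train.py",
        role="arn:aws:iam::111122223333:role/SageMakerRole",
        framework_version="1.13",
        py_version="py39",
        instance_type="ml.p4d.24xlarge",  # a multi-GPU instance type supported by the library
        instance_count=2,
        # Enable the SageMaker distributed data parallel library; the training
        # script itself needs only small changes to use it as the
        # torch.distributed backend.
        distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    )

    estimator.fit("s3://amzn-s3-demo-bucket/training-data")  # placeholder S3 input

The model parallel library is enabled through the same distribution argument, with configuration options specific to the SMP version in use; see the guides referenced below.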

The SageMaker Distributed Model Parallel Library

Note

Since the release of the SageMaker model parallelism (SMP) library v2 in December 2023, this documentation is no longer maintained. The live documentation is available at SageMaker model parallelism library v2 in the Amazon SageMaker User Guide.

The documentation for the SMP library v1.x is archived and available at Run distributed training with the SageMaker model parallelism library in the Amazon SageMaker User Guide, and the SMP v1.x API reference is available in the SageMaker Python SDK v2.199.0 documentation.