Distributed Training APIs

SageMaker distributed training libraries offer both data parallel and model parallel training strategies. They combine software and hardware technologies to improve inter-GPU and inter-node communications. They extend SageMaker’s training capabilities with built-in options that require only small code changes to your training scripts.
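For example, with the SageMaker Python SDK, the data parallel library is enabled through the estimator's distribution argument. The following is a minimal sketch, assuming an existing PyTorch training script (train.py), a SageMaker execution role, and an S3 input path, all of which are placeholders:

    from sagemaker.pytorch import PyTorch

    # Placeholder role ARN and entry point; substitute your own values.
    estimator = PyTorch(
        entry_point="train.py",
        role="arn:aws:iam::111122223333:role/SageMakerRole",
        framework_version="1.13",
        py_version="py39",
        instance_type="ml.p4d.24xlarge",  # a multi-GPU instance type supported by the library
        instance_count=2,
        # Enable the SageMaker distributed data parallel library; the training
        # script itself needs only small changes to use it as the
        # torch.distributed backend.
        distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    )

    estimator.fit("s3://amzn-s3-demo-bucket/training-data")  # placeholder S3 input

The model parallel library is enabled through the same distribution argument, with configuration options specific to the SMP version in use; see the guides referenced below.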

The SageMaker Distributed Model Parallel Library

Note

Since the release of the SageMaker model parallelism (SMP) library v2 in December 2023, this documentation is no longer maintained. The live documentation is available at SageMaker model parallelism library v2 in the Amazon SageMaker User Guide.

The documentation for the SMP library v1.x is archived and available at Run distributed training with the SageMaker model parallelism library in the Amazon SageMaker User Guide, and the SMP v1.x API reference is available in the SageMaker Python SDK v2.199.0 documentation.