Release Notes¶
New features, bug fixes, and improvements are regularly made to the SageMaker model parallelism library.
SageMaker Distributed Model Parallel 1.15.0 Release Notes¶
Date: Apr. 27. 2023
Currency Updates
Added support for PyTorch v2.0.0. Note that the library does not support torch.compile in this release.
New Features
Sharded data parallelism can now be used together with tensor parallelism for PyTorch 1.13.1. This combination allows you to train with smaller global batch sizes while scaling up to large clusters. For more information, see Sharded data parallelism with tensor parallelism in the Amazon SageMaker Developer Guide.
Added support for saving and loading full model checkpoints when using sharded data parallelism. This is enabled by using the standard checkpointing API, smp.save_checkpoint with partial=False. Previously, full checkpoints had to be created by merging partial checkpoint files after training finished.
DistributedTransformer now supports ALiBi position embeddings. When using DistributedTransformer, you can set the use_alibi parameter to True to use the Triton-based flash attention kernels. This helps evaluate sequences longer than those used for training.
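To make the ALiBi item above more concrete, the following is a minimal pure-Python sketch of the linear attention bias described in the ALiBi paper. It is not the library's Triton kernel; the function names alibi_slopes and alibi_bias are hypothetical, and the sketch assumes a power-of-two number of attention heads. Because the bias depends only on the distance between positions, the same formula applies to sequences longer than those seen in training.

```python
import math


def alibi_slopes(num_heads):
    """Per-head slopes: a geometric sequence starting at 2^(-8/num_heads)."""
    if num_heads < 1 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (i + 1) for i in range(num_heads)]


def alibi_bias(slope, seq_len):
    """Causal bias matrix: entry [i][j] is -slope * (i - j) for j <= i.

    Future positions (j > i) are masked with -inf.
    """
    return [
        [-slope * (i - j) if j <= i else -math.inf for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)          # one slope per attention head
bias = alibi_bias(slopes[0], 4)   # bias added to the attention scores of head 0
```

The bias matrix is added to the query-key scores before the softmax, so more distant tokens receive a larger penalty.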
Bug Fixes
When using tensor parallelism, parameters were initialized multiple times unnecessarily. This release fixes the issue so that each parameter is initialized exactly once. This not only saves time, but also ensures that the random generator behavior is similar to the non-tensor parallelism case.
Known issues
Model initialization might take longer with PyTorch 2.0 than with PyTorch 1.13.
Migration to AWS Deep Learning Containers
This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC):
SageMaker training container for PyTorch v2.0.0
763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-sagemaker
SageMaker training container for PyTorch v1.13.1
763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker
Binary file of this version of the library for custom container users:
For PyTorch v2.0.0
https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-2.0.0/build-artifacts/2023-04-14-20-14/smdistributed_modelparallel-1.15.0-cp310-cp310-linux_x86_64.whl
For PyTorch v1.13.1
https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.13.1/build-artifacts/2023-04-17-15-49/smdistributed_modelparallel-1.15.0-cp39-cp39-linux_x86_64.whl
Release History¶
SageMaker Distributed Model Parallel 1.14.0 Release Notes¶
Date: Jan. 30. 2023
Currency Updates
Added support for PyTorch v1.13.1
Improvements
Upgraded the flash-attention (https://github.com/HazyResearch/flash-attention) library to v0.2.6.post1
Migration to AWS Deep Learning Containers
This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC):
SageMaker training container for PyTorch v1.13.1
763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker
Binary file of this version of the library for custom container users:
For PyTorch 1.13.1
https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.13.1/build-artifacts/2023-01-19-18-35/smdistributed_modelparallel-1.14.0-cp39-cp39-linux_x86_64.whl
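The <region> placeholder in the container image URIs on this page must be replaced with the AWS Region in which you run training. A minimal sketch of building the URI is shown below; the helper name dlc_image_uri is hypothetical, and the sketch assumes that the account ID shown in these notes applies to your Region, which you should confirm against the AWS Deep Learning Containers image list.

```python
# Account ID as it appears in the image URIs in these release notes.
DLC_ACCOUNT_ID = "763104351884"


def dlc_image_uri(region, repository, tag):
    """Fill the <region> placeholder of a Deep Learning Container image URI."""
    return f"{DLC_ACCOUNT_ID}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


uri = dlc_image_uri(
    "us-west-2",
    "pytorch-training",
    "1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker",
)
```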
SageMaker Distributed Model Parallel 1.13.0 Release Notes¶
Date: Dec. 15. 2022
New Features
Sharded data parallelism now supports a new backend for collectives called SMDDP Collectives. For supported scenarios, SMDDP Collectives are on by default for the AllGather operation. For more information, see Sharded data parallelism with SMDDP Collectives in the Amazon SageMaker Developer Guide.
Introduced FlashAttention for DistributedTransformer to improve memory usage and computational performance of models such as GPT2, GPTNeo, GPTJ, GPTNeoX, BERT, and RoBERTa.
Bug Fixes
Fixed initialization of lm_head in DistributedTransformer to use a provided range for initialization, when weights are not tied with the embeddings.
Improvements
Introduced an optimization that executes a module with no parameters on the same rank as its parent during pipeline parallelism.
Migration to AWS Deep Learning Containers
This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC):
SageMaker training container for PyTorch v1.12.1
763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker
Binary file of this version of the library for custom container users:
For PyTorch 1.12.1
https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.12.1/build-artifacts/2022-12-08-21-34/smdistributed_modelparallel-1.13.0-cp38-cp38-linux_x86_64.whl
SageMaker Distributed Model Parallel 1.11.0 Release Notes¶
Date: August. 17. 2022
New Features
The following new features are added for PyTorch.
The library implements sharded data parallelism, which is a memory-saving distributed training technique that splits the training state of a model (model parameters, gradients, and optimizer states) across data parallel groups. With sharded data parallelism, you can reduce the per-GPU memory footprint of a model by sharding the training state over multiple GPUs. To learn more, see Sharded Data Parallelism in the Amazon SageMaker Developer Guide.
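The memory saving described above can be estimated with simple arithmetic. The sketch below is a rough back-of-the-envelope calculation, not a measurement of the library. It assumes 16 bytes of training state per parameter (a common figure for mixed-precision Adam: FP16 weights and gradients plus FP32 master weights, momentum, and variance) and ignores activations and temporary buffers. The function name per_gpu_state_gib is hypothetical.

```python
def per_gpu_state_gib(num_params, bytes_per_param, shard_degree):
    """Approximate per-GPU training-state memory in GiB when the state is
    split evenly across shard_degree GPUs."""
    total_bytes = num_params * bytes_per_param
    return total_bytes / shard_degree / 2**30


# A 10-billion-parameter model with an assumed 16 bytes of state per parameter.
unsharded = per_gpu_state_gib(10_000_000_000, 16, shard_degree=1)
sharded = per_gpu_state_gib(10_000_000_000, 16, shard_degree=8)
print(f"unsharded: {unsharded:.1f} GiB, sharded over 8 GPUs: {sharded:.1f} GiB")
```

Under these assumptions, sharding over 8 GPUs reduces the per-GPU training state by a factor of 8.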
Migration to AWS Deep Learning Containers
This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC):
DLC for PyTorch 1.12.0
763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.0-gpu-py38-cu113-ubuntu20.04-sagemaker
Binary file of this version of the library for custom container users:
For PyTorch 1.12.0
https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.12.0/build-artifacts/2022-08-12-16-58/smdistributed_modelparallel-1.11.0-cp38-cp38-linux_x86_64.whl
SageMaker Distributed Model Parallel 1.10.1 Release Notes¶
Date: August. 8. 2022
Currency Updates
Added support for Transformers v4.21.
Migration to AWS Deep Learning Containers
This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC):
DLC for PyTorch 1.11.0
763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.11.0-gpu-py38-cu113-ubuntu20.04-sagemaker
Binary file of this version of the library for custom container users:
For PyTorch 1.11.0
https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.11.0/build-artifacts/2022-07-28-23-07/smdistributed_modelparallel-1.10.1-cp38-cp38-linux_x86_64.whl
SageMaker Distributed Model Parallel 1.10.0 Release Notes¶
Date: July. 19. 2022
New Features
The following new features are added for PyTorch.
Added support for FP16 training by implementing smdistributed.modelparallel modification of Apex FP16_Module and FP16_Optimizer. To learn more, see FP16 Training with Model Parallelism.
New checkpoint APIs for CPU memory usage optimization. To learn more, see Checkpointing Distributed Models and Optimizer States.
Improvements
The SageMaker distributed model parallel library manages and optimizes CPU memory by garbage-collecting non-local parameters in general and during checkpointing.
Changes in the GPT-2 translate functions (smdistributed.modelparallel.torch.nn.huggingface.gpt2) to save memory by not maintaining two copies of weights at the same time.
Migration to AWS Deep Learning Containers
This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC):
DLC for PyTorch 1.11.0
763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.11.0-gpu-py38-cu113-ubuntu20.04-sagemaker
DLC for PyTorch 1.12.0
763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.0-gpu-py38-cu113-ubuntu20.04-sagemaker
Binary file of this version of the library for custom container users:
For PyTorch 1.11.0
https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.11.0/build-artifacts/2022-07-11-19-23/smdistributed_modelparallel-1.10.0-cp38-cp38-linux_x86_64.whl
For PyTorch 1.12.0
https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.12.0/build-artifacts/2022-07-11-19-23/smdistributed_modelparallel-1.10.0-cp38-cp38-linux_x86_64.whl
SageMaker Distributed Model Parallel 1.9.0 Release Notes¶
Date: May. 3. 2022
Currency Updates
Added support for PyTorch 1.11.0
Migration to AWS Deep Learning Containers
This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC):
PyTorch 1.11.0 DLC
763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.11.0-gpu-py38-cu113-ubuntu20.04-sagemaker
Binary file of this version of the library for custom container users:
https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.11.0/build-artifacts/2022-04-20-17-05/smdistributed_modelparallel-1.9.0-cp38-cp38-linux_x86_64.whl
SageMaker Distributed Model Parallel 1.8.1 Release Notes¶
Date: April. 23. 2022
New Features
Added support for more configurations of the Hugging Face Transformers GPT-2 and GPT-J models with tensor parallelism: scale_attn_weights, scale_attn_by_inverse_layer_idx, reorder_and_upcast_attn. To learn more about these features, see the GPT-2 and GPT-J model configuration classes in the Hugging Face Transformers documentation.
Added support for activation checkpointing of modules which pass keyword value arguments and arbitrary structures in their forward methods. This helps support activation checkpointing with Hugging Face Transformers models even when tensor parallelism is not enabled.
Bug Fixes
Fixed a correctness issue with tensor parallelism for the GPT-J model, which was due to improper scaling during gradient reduction for some layer normalization modules.
Fixed the creation of unnecessary additional processes which take up some GPU memory on GPU 0 when the smp.allgather collective is called.
Improvements
Improved activation offloading so that activations are preloaded on a per-layer basis, rather than all activations for a microbatch at once as in earlier versions. This not only improves memory efficiency and performance, but also makes activation offloading a useful feature for non-pipeline parallelism cases.
Migration to AWS Deep Learning Containers
This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers:
HuggingFace 4.17.0 DLC with PyTorch 1.10.2
763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04
The binary file of this version of the library for custom container users:
https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.10.0/build-artifacts/2022-04-14-03-58/smdistributed_modelparallel-1.8.1-cp38-cp38-linux_x86_64.whl
SageMaker Distributed Model Parallel 1.8.0 Release Notes¶
Date: March. 23. 2022
New Features
Added tensor parallelism support for the GPT-J model. When using the GPT-J model of Hugging Face Transformers v4.17.0 with tensor parallelism, the SageMaker model parallel library automatically replaces the model with a tensor parallel distributed GPT-J model. For more information, see Support for Hugging Face Transformer Models in the Amazon SageMaker Model Parallel Training developer guide.
Migration to AWS Deep Learning Containers
This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers:
HuggingFace 4.17.0 DLC with PyTorch 1.10.2
763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04
The binary file of this version of the library for custom container users:
https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.10.0/build-artifacts/2022-03-12-00-33/smdistributed_modelparallel-1.8.0-cp38-cp38-linux_x86_64.whl
SageMaker Distributed Model Parallel 1.7.0 Release Notes¶
Date: March. 07. 2022
Currency Updates
Support for PyTorch 1.10.2
Support for Hugging Face Transformers 4.16.2
Improvements
Additional support for the PyTorch API for Tensor Parallelism.
Added support for FP32 residual addition to avoid overflow (NaN loss values) for large models with more than 100 billion parameters when using FP16. This is integrated into the following module: smp.nn.DistributedTransformerOutputLayer
Added support for the following two NVIDIA Megatron fused kernels:
Fusion of attention masking and softmax (fused_softmax)
Fusion of bias addition and GELU activation (fused_bias_gelu)
To learn more about these options and how to use them, see the smp.tensor_parallelism context manager.
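The FP32 residual addition mentioned above addresses the limited range of FP16, whose largest finite value is 65504. The following sketch uses only the Python standard library to illustrate the idea and does not reflect how the library implements it. The helper round_to is hypothetical, and treating an overflow as infinity is an assumption that mimics typical floating-point hardware behavior.

```python
import math
import struct


def round_to(fmt, x):
    """Round x to an IEEE format ('e' = FP16, 'f' = FP32).

    struct raises OverflowError when x does not fit; this sketch maps that
    case to infinity, as floating-point hardware typically would.
    """
    try:
        return struct.unpack(fmt, struct.pack(fmt, x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


hidden = round_to("e", 40000.0)    # layer output, representable in FP16
residual = round_to("e", 40000.0)  # residual branch, representable in FP16

sum_fp16 = round_to("e", hidden + residual)  # exceeds the FP16 range
sum_fp32 = round_to("f", hidden + residual)  # stays finite in FP32
```

Once a value becomes infinite, later operations such as subtracting two infinities produce NaN, which is how an overflow can surface as a NaN loss.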
Migration to AWS Deep Learning Containers
This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers:
PyTorch 1.10.2
763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.10.2-gpu-py38-cu113-ubuntu20.04-sagemaker
SageMaker Distributed Model Parallel 1.6.0 Release Notes¶
Date: December. 20. 2021
New Features
PyTorch
Added extended memory-saving features for PyTorch 1.8.1:
For more information, see the following documentation:
Migration to AWS Deep Learning Containers
This version passed benchmark testing and is migrated to the following AWS Deep Learning Container(s):
Deep Learning Container for PyTorch 1.8.1:
763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.8.1-gpu-py36-cu111-ubuntu18.04
SageMaker Distributed Model Parallel 1.5.0 Release Notes¶
Date: November. 03. 2021
New Features
PyTorch
Currency update for PyTorch 1.10.0
Migration to AWS Deep Learning Containers
This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers:
Deep Learning Container for PyTorch 1.10.0:
763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.10.0-gpu-py38-cu113-ubuntu20.04-sagemaker
SageMaker Distributed Model Parallel 1.4.0 Release Notes¶
Date: June. 29. 2021
New Features
TensorFlow
Added support for TensorFlow v2.5.0.
Added support for keras.model.fit().
Migration to AWS Deep Learning Containers
This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers:
Deep Learning Container for TensorFlow 2.5.0:
763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.5.0-gpu-py37-cu112-ubuntu18.04-v1.0
Deep Learning Container for PyTorch 1.9.1:
763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.9.1-gpu-py38-cu111-ubuntu20.04
SageMaker Distributed Model Parallel 1.3.1 Release Notes¶
New Features
TensorFlow
Exposes a new decorator, register_post_partition_hook. This allows invoking the decorated methods just after model partition but before executing the first step, for example, to load a checkpoint. Refer to the SageMaker distributed model parallel API documentation for more information.
Bug Fixes
PyTorch
Improved memory efficiency when using active microbatches by clearing activations at the end of each microbatch.
TensorFlow
Fixed issue that caused hangs when training some models with XLA enabled.
Known Issues
PyTorch
A crash was observed when optimizer.step() was called for certain optimizers such as AdaDelta, when the partition on which this method was called had no local parameters assigned to it after partitioning. This is due to a bug in PyTorch which has since been fixed. Until the fix reaches the next PyTorch release, call optimizer.step() only on processes that have at least one local parameter. You can check this with len(list(model.local_parameters())) > 0.
A performance regression still exists when training on SMP with PyTorch 1.7.1 compared to 1.6. The root cause was found to be slower .grad method calls in PyTorch 1.7.1 compared to 1.6. See the related discussion: https://github.com/pytorch/pytorch/issues/50636. This issue does not exist with PyTorch 1.8.
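The workaround for the optimizer.step() crash described above can be wrapped in a small guard. This is a minimal sketch: the helper name step_if_local_params is hypothetical, and the two stand-in classes exist only so that the snippet runs without the library. In a training script, model would be the smp.DistributedModel instance and optimizer the real optimizer.

```python
def step_if_local_params(model, optimizer):
    """Call optimizer.step() only if this process owns at least one parameter.

    Returns True when the step was taken and False when it was skipped.
    """
    if len(list(model.local_parameters())) > 0:
        optimizer.step()
        return True
    return False


# Stand-ins used only to exercise the guard outside a training job.
class _FakeModel:
    def __init__(self, params):
        self._params = params

    def local_parameters(self):
        return iter(self._params)


class _FakeOptimizer:
    def __init__(self):
        self.steps = 0

    def step(self):
        self.steps += 1


opt = _FakeOptimizer()
skipped = step_if_local_params(_FakeModel([]), opt)      # no local parameters
stepped = step_if_local_params(_FakeModel(["w"]), opt)   # one local parameter
```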
SageMaker Distributed Model Parallel 1.3.0 Release Notes¶
New Features
PyTorch
Add support for PyTorch 1.8
Adds a new method to DistributedModel, register_comm_hook (for PyTorch 1.8 and newer only). This method behaves the same as the method with the same name in the torch.nn.parallel.DistributedDataParallel API. Refer to the SageMaker distributed model parallel API documentation for more information.
Improvements
Adds a configuration active_microbatches to the SageMaker SDK API for launching jobs, to control the number of active microbatches during training. This helps limit memory usage in cases where the number of microbatches is high. Refer to the SageMaker Python SDK parameters API documentation for more information.
Adds a configuration deterministic_server to the SageMaker SDK API for launching jobs, which ensures that the execution server for pipeline parallelism processes requests in a deterministic order across data parallel ranks. Refer to the SageMaker Python SDK parameters API documentation for more information.
Parameter passing is now supported in module.forward methods for DistributedModel and its submodules. This removes the restriction of having to pass nn.Parameter to the __init__ call and making it a member of the module to use it.
Bug Fixes
PyTorch
Fixed a case where training hangs because a module contains computation that requires grads but is not used by the final output of the module. Such a situation now raises an error with suggestions on making the computation compatible.
Fixed an issue with buffers which caused the buffers to not be on the correct device after a model is partitioned, and not be synchronized across steps (when broadcast_buffers is True). This could have caused correctness issues in models with buffers.
Known Issues
PyTorch
mp_barrier and get_mp_process_group are wrongly marked as deprecated methods. Ignore the deprecation warning.
A crash was observed when optimizer.step() was called for certain optimizers such as AdaDelta, when the partition on which this method was called had no local parameters assigned to it after partitioning. This is due to a bug in PyTorch which has since been fixed. Until the fix reaches the next PyTorch release, call optimizer.step() only on processes that have at least one local parameter. You can check this with len(list(model.local_parameters())) > 0.
A performance regression still exists when training on SMP with PyTorch 1.7.1 compared to 1.6. The root cause was found to be slower .grad method calls in PyTorch 1.7.1 compared to 1.6. See the related discussion: https://github.com/pytorch/pytorch/issues/50636. This issue does not exist with PyTorch 1.8.
SageMaker Distributed Model Parallel 1.2.0 Release Notes¶
New Features
PyTorch
Add support for PyTorch 1.7.1
Adds support for the gradient_as_bucket_view (PyTorch 1.7.1 only), find_unused_parameters (PyTorch 1.7.1 only) and broadcast_buffers options to smp.DistributedModel. These options behave the same as the options with the same names in the torch.nn.parallel.DistributedDataParallel API. Refer to the SageMaker distributed model parallel API documentation for more information.
Adds support for the join (PyTorch 1.7.1 only) context manager, which is used in conjunction with an instance of smp.DistributedModel to train with uneven inputs across participating processes.
Adds support for _register_comm_hook (PyTorch 1.7.1 only), which registers the callable as a communication hook for DDP. NOTE: As in DDP, this is an experimental API and subject to change.
TensorFlow
Adds support for TensorFlow 2.4.1
Bug Fixes
PyTorch
Serialization: Fix a bug with serialization/flattening where instances of subclasses of dict/OrderedDict were serialized/deserialized or internally flattened/unflattened as regular dicts.
TensorFlow
Fix a bug that may cause a hang during evaluation when there is no model input for one partition.
Known Issues
PyTorch
A performance regression was observed when training on SMP with PyTorch 1.7.1 compared to 1.6.0. The root cause was found to be slower .grad method calls in PyTorch 1.7.1 compared to 1.6.0. See the related discussion: https://github.com/pytorch/pytorch/issues/50636.
SageMaker Distributed Model Parallel 1.1.0 Release Notes¶
New Features
The following sections describe new feature releases that are common across frameworks and that are framework specific.
Common across frameworks
Custom slicing support (smp_slice method) for objects passed to smp.step decorated functions
To pass an object to smp.step that contains tensors that need to be split across microbatches and is not an instance of list, dict, tuple or set, you should implement the smp_slice method for the object.
Below is an example of how to use this with PyTorch:
```python
import torch
import smdistributed.modelparallel.torch as smp


class CustomType:
    def __init__(self, tensor):
        self.data = tensor

    # SMP calls this to slice the object, passing the total number of
    # microbatches (num_mb), the current microbatch index (mb), and the
    # axis along which to split.
    def smp_slice(self, num_mb, mb, axis):
        dim_size = list(self.data.size())[axis]
        split_size = dim_size // num_mb
        sliced_tensor = self.data.narrow(axis, mb * split_size, split_size)
        return CustomType(sliced_tensor)


custom_obj = CustomType(torch.ones(4,))


# `model` is assumed to be an smp.DistributedModel created elsewhere.
@smp.step()
def step(custom_obj):
    loss = model(custom_obj)
    model.backward(loss)
    return loss
```
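To see what the smp_slice protocol does without a GPU or the library installed, the following sketch emulates the call pattern with a plain Python list. The class Batch and the loop that plays the role of the library are hypothetical stand-ins. The sketch only supports axis 0 and, like the example above, drops any remainder when the size is not divisible by the number of microbatches.

```python
class Batch:
    """Stand-in container that implements the smp_slice method for a list."""

    def __init__(self, rows):
        self.rows = rows

    def smp_slice(self, num_mb, mb, axis):
        # A flat list has a single axis, so only axis 0 is meaningful here.
        if axis != 0:
            raise ValueError("this sketch only slices along axis 0")
        split_size = len(self.rows) // num_mb
        start = mb * split_size
        return Batch(self.rows[start:start + split_size])


batch = Batch([0, 1, 2, 3, 4, 5])

# Emulates slicing once per microbatch index, as described in the text above.
num_mb = 3
microbatches = [batch.smp_slice(num_mb, mb, 0).rows for mb in range(num_mb)]
```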
PyTorch
Add support for smp.DistributedModel.cpu()
smp.DistributedModel.cpu() allgathers parameters and buffers across all mp_ranks and moves them to the CPU.
Add trace_memory_usage option to smp.DistributedModel to measure memory usage per module
Adds the trace_memory_usage option to smp.DistributedModel. This attempts to measure memory usage per module during tracing. If this is disabled, memory usage is estimated through the sizes of tensors returned from the module. This option is disabled by default.
Bug Fixes
PyTorch
torch.nn.Sequential: Fix a bug with torch.nn.Sequential which causes a failure with the error message "shouldnt go less than 0, there is a bug" when the inputs to the first module don't require grads.
smp.DistributedModel: Fix a bug with DistributedModel execution when a module has multiple parents. The bug surfaces with the error message: actual_parent should be different than module_execution_stack parent only for torch.nn.ModuleList
apex.optimizers.FusedNovoGrad: Fix a bug with apex.optimizers.FusedNovoGrad which surfaces with the error message: KeyError: 'exp_avg_sq'
Improvements
Usability
PyTorch
smp.DistributedModel: Improve the error message when the forward pass on smp.DistributedModel is called outside the smp.step decorated function.
smp.load: Add user friendly error messages when loading checkpoints with smp.load.
Partitioning Algorithm
PyTorch
Better memory balancing by taking into account the existing modules already assigned to the parent, while partitioning the children of a given module.
Performance
TensorFlow
Addresses long preprocessing times introduced by the SMP XLA optimizer when dealing with large graphs and a large number of microbatches. BERT (large) preprocessing time goes down from 40 minutes to 6 minutes on p3.16xlarge.
Known Issues
PyTorch
Serialization for Torch in SMP converts instances of dict subclasses into plain dict objects instead of preserving the subclass. One use case that fails because of this issue is when a user implements a subclass of OrderedDict with a custom __getitem__ method. After serialization/deserialization in SMP, indexing on the object will lead to errors. A workaround is to use the dict keys to access the underlying item.
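The effect of this known issue can be reproduced in plain Python. The sketch below does not call the library: the conversion dict(original) merely mimics a subclass instance being replaced by a regular dict, and the class DefaultingDict is a hypothetical example of an OrderedDict subclass with a custom __getitem__.

```python
from collections import OrderedDict


class DefaultingDict(OrderedDict):
    """OrderedDict subclass whose __getitem__ returns 0 for missing keys."""

    def __getitem__(self, key):
        if key in self:
            return super().__getitem__(key)
        return 0


original = DefaultingDict(a=1)
flattened = dict(original)  # mimics the subclass being replaced by a plain dict

value_before = original["missing"]  # custom __getitem__ supplies the default

try:
    flattened["missing"]            # the custom behavior is gone
    raised = False
except KeyError:
    raised = True

# Accessing items through keys that are known to exist, or with an explicit
# default, does not depend on the subclass behavior.
value_known = flattened["a"]
value_safe = flattened.get("missing", 0)
```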