DJL Classes¶
DJLModel¶
- class sagemaker.djl_inference.model.DJLModel(model_id, *args, **kwargs)¶
Bases: FrameworkModel
A DJL SageMaker Model that can be deployed to a SageMaker Endpoint.
Initialize a DJLModel.
- Parameters
model_id (str) – This is either the HuggingFace Hub model_id, or the Amazon S3 location containing the uncompressed model artifacts (i.e. not a tar.gz file). The model artifacts are expected to be in HuggingFace pre-trained model format (i.e. model should be loadable from the huggingface transformers from_pretrained api, and should also include tokenizer configs if applicable).
role (str) – An AWS IAM role specified with either the name or full ARN. The Amazon SageMaker training jobs and APIs that create Amazon SageMaker endpoints use this role to access model artifacts. After the endpoint is created, the inference code might use the IAM role, if it needs to access an AWS resource.
djl_version (str) – DJL Serving version you want to use for serving your model for inference. Defaults to None. If not provided, the latest available version of DJL Serving is used. This is not used if image_uri is provided.
task (str) – The HuggingFace/NLP task you want to launch this model for. Defaults to None. If not provided, the task will be inferred from the model architecture by DJL.
dtype (str) – The data type to use for loading your model. Accepted values are “fp32”, “fp16”, “bf16”, “int8”. Defaults to “fp32”.
number_of_partitions (int) – The number of GPUs to partition the model across. The partitioning strategy is determined by the selected backend. If DeepSpeed is selected, this is tensor parallelism. If HuggingFace Accelerate is selected, this is a naive sharding strategy that splits the model layers across the available resources. Defaults to None. If not provided, no model partitioning is done.
min_workers (int) – The minimum number of worker processes. Defaults to None. If not provided, DJL Serving will automatically detect the minimum workers.
max_workers (int) – The maximum number of worker processes. Defaults to None. If not provided, DJL Serving will automatically detect the maximum workers.
job_queue_size (int) – The request job queue size. Defaults to None. If not specified, defaults to 1000.
parallel_loading (bool) – Whether to load model workers in parallel. Defaults to False, in which case DJL Serving will load the model workers sequentially to reduce the risk of running out of memory. Set to True if you want to reduce model loading time and know that peak memory usage will not cause out of memory issues.
model_loading_timeout (int) – The worker model loading timeout in seconds. Defaults to None. If not provided, the default is 240 seconds.
prediction_timeout (int) – The worker predict call (handler) timeout in seconds. Defaults to None. If not provided, the default is 120 seconds.
entry_point (str) – This can either be the absolute or relative path to the Python source file that should be executed as the entry point to model hosting, or a Python module that is installed in the container. If source_dir is specified, then entry_point must point to a file located at the root of source_dir. Defaults to None.
image_uri (str) – A Docker image URI. Defaults to None. If not specified, a default image for DJL Serving will be used based on djl_version. If djl_version is not specified, the latest available container version will be used.
predictor_cls (callable[str, sagemaker.session.Session]) – A function to call to create a predictor with an endpoint name and SageMaker Session. If specified, deploy() returns the result of invoking this function on the created endpoint name.
**kwargs – Keyword arguments passed to the superclass FrameworkModel and, subsequently, its superclass Model.
Tip
Instantiating a DJLModel will return an instance of either DeepSpeedModel or HuggingFaceAccelerateModel based on our framework recommendation for the model type. If you want to use a specific framework to deploy your model with, we recommend instantiating that specific model class directly. The available framework-specific classes are DeepSpeedModel and HuggingFaceAccelerateModel.
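For illustration, the recommendation-based dispatch described in the tip above can be sketched as follows. The helper name and the architecture set are hypothetical; the SDK's actual selection logic is internal to DJLModel:

```python
# Hypothetical sketch of engine recommendation (NOT the SDK's real logic).
# The set of architectures DeepSpeed handles well is an assumption here.
DEEPSPEED_RECOMMENDED = {"bloom", "gpt2", "gpt_neo", "gptj", "opt"}

def recommended_model_class(architecture: str) -> str:
    """Return the DJLModel subclass name recommended for an architecture."""
    if architecture in DEEPSPEED_RECOMMENDED:
        return "DeepSpeedModel"
    # Fall back to naive layer sharding via HuggingFace Accelerate
    return "HuggingFaceAccelerateModel"
```

If you need deterministic behavior, instantiate DeepSpeedModel or HuggingFaceAccelerateModel directly rather than relying on the recommendation.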
- package_for_edge(**_)¶
Not implemented.
DJLModels do not support SageMaker edge.
- Raises
NotImplementedError
- compile(**_)¶
Not implemented.
DJLModels do not support SageMaker Neo compilation.
- Raises
NotImplementedError
- transformer(**_)¶
Not implemented.
DJLModels do not support SageMaker Batch Transform.
- Raises
NotImplementedError
- right_size(**_)¶
Not implemented.
DJLModels do not support SageMaker Inference Recommendation Jobs.
- Raises
NotImplementedError
- partition(instance_type, s3_output_uri=None, s3_output_prefix='aot-partitioned-checkpoints', job_name=None, volume_size=30, volume_kms_key=None, output_kms_key=None, use_spot_instances=False, max_wait=None, enable_network_isolation=False)¶
Partitions the model using a SageMaker training job. This is a synchronous API call.
- Parameters
instance_type (str) – The EC2 instance type to partition this Model. For example, ‘ml.p4d.24xlarge’.
s3_output_uri (str) – S3 location for saving the training result (model artifacts and output files). If not specified, results are stored to a default bucket. If the bucket with the specific name does not exist, it will be created.
s3_output_prefix (str) – Name of the prefix under which all the partitioned checkpoints are uploaded. If not provided, the default value is aot-partitioned-checkpoints.
job_name (str) – Training job name. If not specified, a unique training job name will be created.
volume_size (int) – Size in GB of the storage volume to use for storing input and output data during training (default: 30).
volume_kms_key (str) – Optional. KMS key ID for encrypting EBS volume attached to the training instance (default: None).
output_kms_key (str) – Optional. KMS key ID for encrypting the training output (default: None).
use_spot_instances (bool) – Specifies whether to use SageMaker Managed Spot instances for training. If enabled, the max_wait arg should also be set. More information: https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html (default: False).
max_wait (int) – Timeout in seconds waiting for the spot training job (default: None). After this amount of time, Amazon SageMaker will stop waiting for the managed spot training job to complete.
enable_network_isolation (bool) – Specifies whether the container will run in network isolation mode (default: False). Network isolation mode restricts the container's access to outside networks (such as the Internet). The container does not make any inbound or outbound network calls. Also known as Internet-free mode.
- Returns
None
- deploy(instance_type, initial_instance_count=1, serializer=None, deserializer=None, endpoint_name=None, tags=None, kms_key=None, wait=True, data_capture_config=None, volume_size=None, model_data_download_timeout=None, container_startup_health_check_timeout=None, **kwargs)¶
Deploy this Model to an Endpoint and optionally return a Predictor.
Create a SageMaker Model and EndpointConfig, and deploy an Endpoint from this Model. If self.predictor_cls is not None, this method returns the result of invoking self.predictor_cls on the created endpoint name.
The name of the created model is accessible in the name field of this Model after deploy returns.
The name of the created endpoint is accessible in the endpoint_name field of this Model after deploy returns.
- Parameters
instance_type (str) – The EC2 instance type to deploy this Model to. For example, ‘ml.p4d.24xlarge’.
initial_instance_count (int) – The initial number of instances to run in the Endpoint created from this Model. It needs to be at least 1 (default: 1).
serializer (BaseSerializer) – A serializer object, used to encode data for an inference endpoint (default: None). If serializer is not None, then serializer will override the default serializer. The default serializer is set by the predictor_cls.
deserializer (BaseDeserializer) – A deserializer object, used to decode data from an inference endpoint (default: None). If deserializer is not None, then deserializer will override the default deserializer. The default deserializer is set by the predictor_cls.
endpoint_name (str) – The name of the endpoint to create (default: None). If not specified, a unique endpoint name will be created.
tags (Optional[Tags]) – The list of tags to attach to this specific endpoint.
kms_key (str) – The ARN of the KMS key that is used to encrypt the data on the storage volume attached to the instance hosting the endpoint.
wait (bool) – Whether the call should wait until the deployment of this model completes (default: True).
data_capture_config (sagemaker.model_monitor.DataCaptureConfig) – Specifies configuration related to Endpoint data capture for use with Amazon SageMaker Model Monitoring. Default: None.
volume_size (int) – The size, in GB, of the ML storage volume attached to the individual inference instance associated with the production variant. Currently only Amazon EBS gp2 storage volumes are supported.
model_data_download_timeout (int) – The timeout value, in seconds, to download and extract model data from Amazon S3 to the individual inference instance associated with this production variant.
container_startup_health_check_timeout (int) – The timeout value, in seconds, for your inference container to pass the health check by SageMaker Hosting. For more information about health checks, see: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html#your-algorithms-inference-algo-ping-requests
- Returns
- Invocation of self.predictor_cls on the created endpoint name, if self.predictor_cls is not None. Otherwise, returns None.
- Return type
callable[string, sagemaker.session.Session] or None
- prepare_container_def(instance_type=None, accelerator_type=None, serverless_inference_config=None, accept_eula=None)¶
A container definition with framework configuration set in model environment variables.
- generate_serving_properties(serving_properties=None)¶
Generates the DJL Serving configuration to use for the model.
The configuration is generated using the arguments passed to the Model during initialization. If a serving.properties file is found in self.source_dir, those configurations are merged with the Model parameters, with Model parameters taking priority.
- Parameters
serving_properties (dict) – Dictionary containing existing model server configuration obtained from self.source_dir. Defaults to None.
- Returns
The model server configuration to use when deploying this model to SageMaker.
- Return type
dict
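The merge behavior described above, with Model parameters taking priority over values read from a serving.properties file, can be sketched as follows (the helper name is illustrative and not part of the SDK):

```python
def merge_serving_properties(file_props: dict, model_props: dict) -> dict:
    """Merge a serving.properties dict from source_dir with Model parameters.

    Model parameters win on conflicts, mirroring the priority described
    in the documentation above. None-valued Model parameters are skipped
    so they do not clobber file-provided settings.
    """
    merged = dict(file_props)
    merged.update({k: v for k, v in model_props.items() if v is not None})
    return merged
```

For example, a dtype set on the Model overrides an option.dtype entry found in the serving.properties file, while file-only keys are preserved.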
DeepSpeedModel¶
- class sagemaker.djl_inference.model.DeepSpeedModel(model_id, *args, **kwargs)¶
Bases: DJLModel
A DJL DeepSpeed SageMaker Model that can be deployed to a SageMaker Endpoint.
Initialize a DeepSpeedModel.
- Parameters
model_id (str) – This is either the HuggingFace Hub model_id, or the Amazon S3 location containing the uncompressed model artifacts (i.e. not a tar.gz file). The model artifacts are expected to be in HuggingFace pre-trained model format (i.e. model should be loadable from the huggingface transformers from_pretrained api, and should also include tokenizer configs if applicable).
role (str) – An AWS IAM role specified with either the name or full ARN. The Amazon SageMaker training jobs and APIs that create Amazon SageMaker endpoints use this role to access model artifacts. After the endpoint is created, the inference code might use the IAM role, if it needs to access an AWS resource.
tensor_parallel_degree (int) – The number of GPUs to shard a single instance of the model across via tensor parallelism. This should be set to greater than 1 if the size of the model is larger than the memory available on a single GPU on the instance. Defaults to None. If not set, no tensor parallel sharding is done.
max_tokens (int) – The maximum number of tokens (input + output tokens) the DeepSpeed engine is configured for. Defaults to None. If not set, the DeepSpeed default of 1024 is used.
low_cpu_mem_usage (bool) – Whether to limit CPU memory usage to 1x model size during model loading. This is an experimental feature in HuggingFace. This is useful when loading multiple instances of your model in parallel. Defaults to False.
enable_cuda_graph (bool) – Whether to enable CUDA graph replay to accelerate inference passes. This cannot be used with tensor parallelism greater than 1. Defaults to False.
triangular_masking (bool) – Whether to use triangular attention mask. This is application specific. Defaults to True.
return_tuple (bool) – Whether the transformer layers need to return a tuple or a Tensor. Defaults to True.
**kwargs – Keyword arguments passed to the superclasses DJLModel, FrameworkModel, and Model.
Tip
You can find additional parameters for initializing this class at DJLModel, FrameworkModel, and Model.
- generate_serving_properties(serving_properties=None)¶
Generates the DJL Serving configuration to use for the model.
The configuration is generated using the arguments passed to the Model during initialization. If a serving.properties file is found in self.source_dir, those configurations are merged with the Model parameters, with Model parameters taking priority.
- Parameters
serving_properties (dict) – Dictionary containing existing model server configuration obtained from self.source_dir. Defaults to None.
- Returns
The model server configuration to use when deploying this model to SageMaker.
- Return type
dict
- partition(instance_type, s3_output_uri=None, s3_output_prefix='aot-partitioned-checkpoints', job_name=None, volume_size=30, volume_kms_key=None, output_kms_key=None, use_spot_instances=False, max_wait=None, enable_network_isolation=False)¶
Partitions the model using a SageMaker training job. This is a synchronous API call.
- Parameters
instance_type (str) – The EC2 instance type to partition this Model. For example, ‘ml.p4d.24xlarge’.
s3_output_uri (str) – S3 location for saving the training result (model artifacts and output files). If not specified, results are stored to a default bucket. If the bucket with the specific name does not exist, it will be created.
s3_output_prefix (str) – Name of the prefix under which all the partitioned checkpoints are uploaded. If not provided, the default value is aot-partitioned-checkpoints.
job_name (str) – Training job name. If not specified, a unique training job name will be created.
volume_size (int) – Size in GB of the storage volume to use for storing input and output data during training (default: 30).
volume_kms_key (str) – Optional. KMS key ID for encrypting EBS volume attached to the training instance (default: None).
output_kms_key (str) – Optional. KMS key ID for encrypting the training output (default: None).
use_spot_instances (bool) – Specifies whether to use SageMaker Managed Spot instances for training. If enabled, the max_wait arg should also be set. More information: https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html (default: False).
max_wait (int) – Timeout in seconds waiting for the spot training job (default: None). After this amount of time, Amazon SageMaker will stop waiting for the managed spot training job to complete.
enable_network_isolation (bool) – Specifies whether the container will run in network isolation mode (default: False). Network isolation mode restricts the container's access to outside networks (such as the Internet). The container does not make any inbound or outbound network calls. Also known as Internet-free mode.
- Returns
None
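Putting the DeepSpeed parameters above together, the serving configuration for a DeepSpeed model can be sketched as a serving.properties-style dictionary. The key names follow common DJL Serving conventions but are assumptions here, not the verified output of generate_serving_properties:

```python
def deepspeed_serving_properties(model_id, tensor_parallel_degree=None,
                                 max_tokens=None, dtype="fp16"):
    """Sketch of a serving.properties-style config for a DeepSpeed model.

    Key names are illustrative of DJL Serving conventions (assumed),
    not the SDK's exact emitted configuration.
    """
    props = {
        "engine": "DeepSpeed",
        "option.model_id": model_id,
        "option.dtype": dtype,
    }
    # Only emit optional settings when they were explicitly provided,
    # matching the "Defaults to None" semantics documented above.
    if tensor_parallel_degree is not None:
        props["option.tensor_parallel_degree"] = tensor_parallel_degree
    if max_tokens is not None:
        props["option.max_tokens"] = max_tokens
    return props
```

In practice you would inspect the real output via generate_serving_properties() rather than building this dict by hand.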
HuggingFaceAccelerateModel¶
- class sagemaker.djl_inference.model.HuggingFaceAccelerateModel(model_id, *args, **kwargs)¶
Bases: DJLModel
A DJL Hugging Face SageMaker Model that can be deployed to a SageMaker Endpoint.
Initialize a HuggingFaceAccelerateModel.
- Parameters
model_id (str) – This is either the HuggingFace Hub model_id, or the Amazon S3 location containing the uncompressed model artifacts (i.e. not a tar.gz file). The model artifacts are expected to be in HuggingFace pre-trained model format (i.e. model should be loadable from the huggingface transformers from_pretrained api, and should also include tokenizer configs if applicable).
role (str) – An AWS IAM role specified with either the name or full ARN. The Amazon SageMaker training jobs and APIs that create Amazon SageMaker endpoints use this role to access model artifacts. After the endpoint is created, the inference code might use the IAM role, if it needs to access an AWS resource.
number_of_partitions (int) – The number of GPUs to partition the model across. The partitioning strategy is determined by the device_map setting. If device_map is not specified, the default HuggingFace strategy will be used.
device_id (int) – The device_id to use for instantiating the model. If provided, the model will only be instantiated once on the indicated device. Do not set this if you have also specified data_parallel_degree. Defaults to None.
device_map (str or dict) – The HuggingFace accelerate device_map to use. Defaults to None.
load_in_8bit (bool) – Whether to load the model in int8 precision using bitsandbytes quantization. This is only supported for select model architectures. Defaults to False. If dtype is int8, then this is set to True.
low_cpu_mem_usage (bool) – Whether to limit CPU memory usage to 1x model size during model loading. This is an experimental feature in HuggingFace. This is useful when loading multiple instances of your model in parallel. Defaults to False.
**kwargs – Keyword arguments passed to the superclasses DJLModel, FrameworkModel, and Model.
Tip
You can find additional parameters for initializing this class at DJLModel, FrameworkModel, and Model.
- generate_serving_properties(serving_properties=None)¶
Generates the DJL Serving configuration to use for the model.
The configuration is generated using the arguments passed to the Model during initialization. If a serving.properties file is found in self.source_dir, those configurations are merged with the Model parameters, with Model parameters taking priority.
- Parameters
serving_properties (dict) – Dictionary containing existing model server configuration obtained from self.source_dir. Defaults to None.
- Returns
The model server configuration to use when deploying this model to SageMaker.
- Return type
dict
- partition(instance_type, s3_output_uri=None, s3_output_prefix='aot-partitioned-checkpoints', job_name=None, volume_size=30, volume_kms_key=None, output_kms_key=None, use_spot_instances=False, max_wait=None, enable_network_isolation=False)¶
Partitions the model using a SageMaker training job. This is a synchronous API call.
- Parameters
instance_type (str) – The EC2 instance type to partition this Model. For example, ‘ml.p4d.24xlarge’.
s3_output_uri (str) – S3 location for saving the training result (model artifacts and output files). If not specified, results are stored to a default bucket. If the bucket with the specific name does not exist, it will be created.
s3_output_prefix (str) – Name of the prefix under which all the partitioned checkpoints are uploaded. If not provided, the default value is aot-partitioned-checkpoints.
job_name (str) – Training job name. If not specified, a unique training job name will be created.
volume_size (int) – Size in GB of the storage volume to use for storing input and output data during training (default: 30).
volume_kms_key (str) – Optional. KMS key ID for encrypting EBS volume attached to the training instance (default: None).
output_kms_key (str) – Optional. KMS key ID for encrypting the training output (default: None).
use_spot_instances (bool) – Specifies whether to use SageMaker Managed Spot instances for training. If enabled, the max_wait arg should also be set. More information: https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html (default: False).
max_wait (int) – Timeout in seconds waiting for the spot training job (default: None). After this amount of time, Amazon SageMaker will stop waiting for the managed spot training job to complete.
enable_network_isolation (bool) – Specifies whether the container will run in network isolation mode (default: False). Network isolation mode restricts the container's access to outside networks (such as the Internet). The container does not make any inbound or outbound network calls. Also known as Internet-free mode.
- Returns
None
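The interaction between dtype and load_in_8bit documented for HuggingFaceAccelerateModel (an int8 dtype implies bitsandbytes quantization) can be sketched as follows. The helper is hypothetical, not SDK code:

```python
def accelerate_load_kwargs(dtype="fp32", device_map=None, load_in_8bit=False):
    """Sketch of HuggingFace Accelerate loading options (hypothetical helper).

    Mirrors the documented rule: if dtype is int8, load_in_8bit is
    forced to True. device_map is passed through only when provided.
    """
    if dtype == "int8":
        load_in_8bit = True  # int8 dtype implies bitsandbytes quantization
    kwargs = {"load_in_8bit": load_in_8bit}
    if device_map is not None:
        # e.g. "auto", or an explicit layer-to-device mapping dict
        kwargs["device_map"] = device_map
    return kwargs
```

Note that device_id and device_map serve different purposes: device_id pins the whole model to one device, while device_map distributes layers across devices.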
FasterTransformerModel¶
- class sagemaker.djl_inference.model.FasterTransformerModel(model_id, *args, **kwargs)¶
Bases: DJLModel
A DJL FasterTransformer SageMaker Model that can be deployed to a SageMaker Endpoint.
Initialize a FasterTransformerModel.
- Parameters
model_id (str) – This is either the HuggingFace Hub model_id, or the Amazon S3 location containing the uncompressed model artifacts (i.e. not a tar.gz file). The model artifacts are expected to be in HuggingFace pre-trained model format (i.e. model should be loadable from the huggingface transformers from_pretrained api, and should also include tokenizer configs if applicable).
role (str) – An AWS IAM role specified with either the name or full ARN. The Amazon SageMaker training jobs and APIs that create Amazon SageMaker endpoints use this role to access model artifacts. After the endpoint is created, the inference code might use the IAM role, if it needs to access an AWS resource.
tensor_parllel_degree (int) – The number of GPUs to shard a single instance of the model across via tensor parallelism. This should be set to greater than 1 if the size of the model is larger than the memory available on a single GPU on the instance. Defaults to None. If not set, no tensor parallel sharding is done.
**kwargs – Keyword arguments passed to the superclasses DJLModel, FrameworkModel, and Model.
Tip
You can find additional parameters for initializing this class at DJLModel, FrameworkModel, and Model.
DJLPredictor¶
- class sagemaker.djl_inference.model.DJLPredictor(endpoint_name, sagemaker_session=None, serializer=<sagemaker.base_serializers.JSONSerializer object>, deserializer=<sagemaker.base_deserializers.JSONDeserializer object>, component_name=None)¶
Bases: Predictor
A Predictor for inference against DJL Model Endpoints.
This is able to serialize Python lists, dictionaries, and numpy arrays to multidimensional tensors for DJL inference.
Initialize a DJLPredictor.
- Parameters
endpoint_name (str) – The name of the endpoint to perform inference on.
sagemaker_session (sagemaker.session.Session) – Session object that manages interactions with Amazon SageMaker APIs and any other AWS services needed. If not specified, the estimator creates one using the default AWS configuration chain.
serializer (sagemaker.serializers.BaseSerializer) – Optional. Default serializes input data to json format.
deserializer (sagemaker.deserializers.BaseDeserializer) – Optional. Default parses the response from json format to dictionary.
component_name (str) – Optional. Name of the Amazon SageMaker inference component corresponding to the predictor.
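Since the default serializer and deserializer are JSON-based, a typical text-generation round trip through a DJLPredictor looks like the following sketch. The payload schema shown is a common DJL large-model-inference convention and is an assumption here, not taken from this API reference:

```python
import json

# What JSONSerializer would send to the endpoint (schema assumed):
payload = {
    "inputs": "The capital of France is",
    "parameters": {"max_new_tokens": 32, "temperature": 0.7},
}
body = json.dumps(payload)

# What JSONDeserializer would hand back after the endpoint responds
# (here we just round-trip the request to show the JSON wire format):
decoded = json.loads(body)
```

With a real endpoint you would call predictor.predict(payload) and receive the deserialized response dict directly; the serializer/deserializer arguments let you swap in other wire formats.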