DJL Classes

DJLModel

class sagemaker.djl_inference.model.DJLModel(model_id, *args, **kwargs)

Bases: FrameworkModel

A DJL SageMaker Model that can be deployed to a SageMaker Endpoint.

Initialize a DJLModel.

Parameters
  • model_id (str) – This is either the HuggingFace Hub model_id, or the Amazon S3 location containing the uncompressed model artifacts (i.e. not a tar.gz file). The model artifacts are expected to be in HuggingFace pre-trained model format (i.e. the model should be loadable via the HuggingFace Transformers from_pretrained API, and should also include tokenizer configs if applicable).

  • role (str) – An AWS IAM role specified with either the name or full ARN. The Amazon SageMaker training jobs and APIs that create Amazon SageMaker endpoints use this role to access model artifacts. After the endpoint is created, the inference code might use the IAM role, if it needs to access an AWS resource.

  • djl_version (str) – DJL Serving version you want to use for serving your model for inference. Defaults to None. If not provided, the latest available version of DJL Serving is used. This is not used if image_uri is provided.

  • task (str) – The HuggingFace/NLP task you want to launch this model for. Defaults to None. If not provided, the task will be inferred from the model architecture by DJL.

  • dtype (str) – The data type to use for loading your model. Accepted values are “fp32”, “fp16”, “bf16”, “int8”. Defaults to “fp32”.

  • number_of_partitions (int) – The number of GPUs to partition the model across. The partitioning strategy is determined by the selected backend. If DeepSpeed is selected, this is tensor parallelism. If HuggingFace Accelerate is selected, this is a naive sharding strategy that splits the model layers across the available resources. Defaults to None. If not provided, no model partitioning is done.

  • min_workers (int) – The minimum number of worker processes. Defaults to None. If not provided, DJL Serving will automatically detect the minimum workers.

  • max_workers (int) – The maximum number of worker processes. Defaults to None. If not provided, DJL Serving will automatically detect the maximum workers.

  • job_queue_size (int) – The request job queue size. Defaults to None. If not provided, DJL Serving defaults to 1000.

  • parallel_loading (bool) – Whether to load model workers in parallel. Defaults to False, in which case DJL Serving will load the model workers sequentially to reduce the risk of running out of memory. Set to True if you want to reduce model loading time and know that peak memory usage will not cause out of memory issues.

  • model_loading_timeout (int) – The worker model loading timeout in seconds. Defaults to None. If not provided, the default is 240 seconds.

  • prediction_timeout (int) – The worker predict call (handler) timeout in seconds. Defaults to None. If not provided, the default is 120 seconds.

  • entry_point (str) – This can either be the absolute or relative path to the Python source file that should be executed as the entry point to model hosting, or a Python module that is installed in the container. If source_dir is specified, then entry_point must point to a file located at the root of source_dir. Defaults to None.

  • image_uri (str) – A docker image URI. Defaults to None. If not specified, a default image for DJL Serving will be used based on djl_version. If djl_version is not specified, the latest available container version will be used.

  • predictor_cls (callable[str, sagemaker.session.Session]) – A function to call to create a predictor with an endpoint name and SageMaker Session. If specified, deploy() returns the result of invoking this function on the created endpoint name.

  • **kwargs – Keyword arguments passed to the superclass FrameworkModel and, subsequently, its superclass Model.

Tip

Instantiating a DJLModel will return an instance of either DeepSpeedModel or HuggingFaceAccelerateModel based on our framework recommendation for the model type.

If you want to deploy your model with a specific framework, we recommend instantiating that framework-specific model class directly. The available framework-specific classes are DeepSpeedModel and HuggingFaceAccelerateModel.
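
For example, a minimal usage sketch (the model id, IAM role ARN, and instance type below are illustrative assumptions, not recommendations):

    from sagemaker.djl_inference.model import DJLModel

    # Hypothetical HuggingFace Hub model id and IAM role; substitute your own.
    model = DJLModel(
        "EleutherAI/gpt-j-6B",
        role="arn:aws:iam::123456789012:role/SageMakerRole",
        dtype="fp16",
        number_of_partitions=4,  # shard the model across 4 GPUs
    )

    # Creates the endpoint and returns a DJLPredictor for it.
    predictor = model.deploy("ml.g5.12xlarge")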

package_for_edge(**_)

Not implemented.

DJLModels do not support SageMaker edge.

Raises

NotImplementedError

compile(**_)

Not implemented.

DJLModels do not support SageMaker Neo compilation.

Raises

NotImplementedError

transformer(**_)

Not implemented.

DJLModels do not support SageMaker Batch Transform.

Raises

NotImplementedError

right_size(**_)

Not implemented.

DJLModels do not support SageMaker Inference Recommendation Jobs.

Raises

NotImplementedError

partition(instance_type, s3_output_uri=None, s3_output_prefix='aot-partitioned-checkpoints', job_name=None, volume_size=30, volume_kms_key=None, output_kms_key=None, use_spot_instances=False, max_wait=None, enable_network_isolation=False)

Partitions the model using a SageMaker Training Job. This is a synchronous API call.

Parameters
  • instance_type (str) – The EC2 instance type to partition this Model. For example, ‘ml.p4d.24xlarge’.

  • s3_output_uri (str) – S3 location for saving the training result (model artifacts and output files). If not specified, results are stored to a default bucket. If the bucket with the specific name does not exist, it will be created.

  • s3_output_prefix (str) – The S3 prefix under which the partitioned checkpoints are uploaded. If not provided, defaults to aot-partitioned-checkpoints.

  • job_name (str) – Training job name. If not specified, a unique training job name will be created.

  • volume_size (int) – Size in GB of the storage volume to use for storing input and output data during training (default: 30).

  • volume_kms_key (str) – Optional. KMS key ID for encrypting EBS volume attached to the training instance (default: None).

  • output_kms_key (str) – Optional. KMS key ID for encrypting the training output (default: None).

  • use_spot_instances (bool) –

    Specifies whether to use SageMaker Managed Spot instances for training. If enabled then the max_wait arg should also be set.

    More information: https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html (default: False).

  • max_wait (int) – Timeout in seconds waiting for the spot training job. After this amount of time, Amazon SageMaker stops waiting for the managed spot training job to complete (default: None).

  • enable_network_isolation (bool) – Specifies whether container will run in network isolation mode (default: False). Network isolation mode restricts the container access to outside networks (such as the Internet). The container does not make any inbound or outbound network calls. Also known as Internet-free mode.

Returns

None
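
For example, a hedged sketch of ahead-of-time partitioning, given a DJLModel instance model as constructed earlier (the instance type and bucket are illustrative assumptions):

    # Synchronously runs a SageMaker Training Job that partitions the model and
    # uploads the checkpoints under s3_output_uri/aot-partitioned-checkpoints.
    model.partition(
        "ml.p4d.24xlarge",
        s3_output_uri="s3://my-bucket/partitioned-models",  # hypothetical bucket
    )

    # Assumption: after partitioning, the model can be deployed as usual and
    # the SDK points the endpoint at the partitioned checkpoints.
    predictor = model.deploy("ml.p4d.24xlarge")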

deploy(instance_type, initial_instance_count=1, serializer=None, deserializer=None, endpoint_name=None, tags=None, kms_key=None, wait=True, data_capture_config=None, volume_size=None, model_data_download_timeout=None, container_startup_health_check_timeout=None, **kwargs)

Deploy this Model to an Endpoint and optionally return a Predictor.

Create a SageMaker Model and EndpointConfig, and deploy an Endpoint from this Model. If self.predictor_cls is not None, this method returns the result of invoking self.predictor_cls on the created endpoint name.

The name of the created model is accessible in the name field of this Model after deploy returns.

The name of the created endpoint is accessible in the endpoint_name field of this Model after deploy returns.

Parameters
  • instance_type (str) – The EC2 instance type to deploy this Model to. For example, ‘ml.p4d.24xlarge’.

  • initial_instance_count (int) – The initial number of instances to run in the Endpoint created from this Model. Must be at least 1 (default: 1).

  • serializer (BaseSerializer) – A serializer object, used to encode data for an inference endpoint (default: None). If serializer is not None, then serializer will override the default serializer. The default serializer is set by the predictor_cls.

  • deserializer (BaseDeserializer) – A deserializer object, used to decode data from an inference endpoint (default: None). If deserializer is not None, then deserializer will override the default deserializer. The default deserializer is set by the predictor_cls.

  • endpoint_name (str) – The name of the endpoint to create (default: None). If not specified, a unique endpoint name will be created.

  • tags (Optional[Tags]) – The list of tags to attach to this specific endpoint.

  • kms_key (str) – The ARN of the KMS key that is used to encrypt the data on the storage volume attached to the instance hosting the endpoint.

  • wait (bool) – Whether the call should wait until the deployment of this model completes (default: True).

  • data_capture_config (sagemaker.model_monitor.DataCaptureConfig) – Specifies configuration related to Endpoint data capture for use with Amazon SageMaker Model Monitoring. Default: None.

  • volume_size (int) – The size, in GB, of the ML storage volume attached to the individual inference instance associated with the production variant. Currently only Amazon EBS gp2 storage volumes are supported.

  • model_data_download_timeout (int) – The timeout value, in seconds, to download and extract model data from Amazon S3 to the individual inference instance associated with this production variant.

  • container_startup_health_check_timeout (int) – The timeout value, in seconds, for your inference container to pass health check by SageMaker Hosting. For more information about health checks, see: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html#your-algorithms-inference-algo-ping-requests

Returns

Invocation of self.predictor_cls on the created endpoint name, if self.predictor_cls is not None. Otherwise, None.

Return type

callable[string, sagemaker.session.Session] or None
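
For example, a sketch of a deploy call on the DJLModel instance model from the earlier sketch, with an explicit endpoint name and a longer health-check window, which large models often need (all values are illustrative assumptions):

    predictor = model.deploy(
        "ml.g5.12xlarge",
        initial_instance_count=1,
        endpoint_name="my-djl-endpoint",  # hypothetical endpoint name
        container_startup_health_check_timeout=600,  # allow slow model loading
    )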

prepare_container_def(instance_type=None, accelerator_type=None, serverless_inference_config=None, accept_eula=None)

A container definition with framework configuration set in model environment variables.

Returns

A container definition object usable with the CreateModel API.

Return type

dict[str, str]
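
For example, a sketch of inspecting the container definition on the model instance from the earlier sketch (the instance type is illustrative, and the commented keys are assumptions based on the CreateModel API shape):

    container_def = model.prepare_container_def(instance_type="ml.g5.12xlarge")
    # Expect a dict resembling the CreateModel container definition,
    # with entries such as Image, Environment, and ModelDataUrl.
    print(container_def)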

generate_serving_properties(serving_properties=None)

Generates the DJL Serving configuration to use for the model.

The configuration is generated using the arguments passed to the Model during initialization. If a serving.properties file is found in self.source_dir, that configuration is merged with the Model parameters, with the Model parameters taking priority.

Parameters
  • serving_properties (dict) – Dictionary containing existing model server configuration obtained from self.source_dir. Defaults to None.

Returns

The model server configuration to use when deploying this model to SageMaker.

Return type

dict
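
For example, a sketch of inspecting the generated configuration on the model instance from the earlier sketch before deploying (the exact keys vary by engine and SDK version; those named in the comment are illustrative):

    props = model.generate_serving_properties()
    # Typically a flat dict of DJL Serving properties, e.g. keys like
    # "engine" and "option.tensor_parallel_degree" (illustrative names).
    for key, value in props.items():
        print(f"{key}={value}")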

serving_image_uri(region_name)

Create a URI for the serving image.

Parameters

region_name (str) – AWS region where the image is uploaded.

Returns

The appropriate image URI based on the given parameters.

Return type

str
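
For example, a one-line sketch on the model instance from the earlier sketch (the region is illustrative):

    image_uri = model.serving_image_uri("us-east-1")  # hypothetical region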

DeepSpeedModel

class sagemaker.djl_inference.model.DeepSpeedModel(model_id, *args, **kwargs)

Bases: DJLModel

A DJL DeepSpeed SageMaker Model that can be deployed to a SageMaker Endpoint.

Initialize a DeepSpeedModel.

Parameters
  • model_id (str) – This is either the HuggingFace Hub model_id, or the Amazon S3 location containing the uncompressed model artifacts (i.e. not a tar.gz file). The model artifacts are expected to be in HuggingFace pre-trained model format (i.e. the model should be loadable via the HuggingFace Transformers from_pretrained API, and should also include tokenizer configs if applicable).

  • role (str) – An AWS IAM role specified with either the name or full ARN. The Amazon SageMaker training jobs and APIs that create Amazon SageMaker endpoints use this role to access model artifacts. After the endpoint is created, the inference code might use the IAM role, if it needs to access an AWS resource.

  • tensor_parallel_degree (int) – The number of GPUs to shard a single instance of the model across via tensor parallelism. This should be set to greater than 1 if the size of the model is larger than the memory available on a single GPU on the instance. Defaults to None. If not set, no tensor parallel sharding is done.

  • max_tokens (int) – The maximum number of tokens (input + output tokens) the DeepSpeed engine is configured for. Defaults to None. If not set, the DeepSpeed default of 1024 is used.

  • low_cpu_mem_usage (bool) – Whether to limit CPU memory usage to 1x model size during model loading. This is an experimental feature in HuggingFace. This is useful when loading multiple instances of your model in parallel. Defaults to False.

  • enable_cuda_graph (bool) – Whether to enable CUDA graph replay to accelerate inference passes. This cannot be used with tensor parallelism greater than 1. Defaults to False.

  • triangular_masking (bool) – Whether to use triangular attention mask. This is application specific. Defaults to True.

  • return_tuple (bool) – Whether the transformer layers need to return a tuple or a Tensor. Defaults to True.

  • **kwargs – Keyword arguments passed to the superclasses DJLModel, FrameworkModel, and Model.

Tip

You can find additional parameters for initializing this class at DJLModel, FrameworkModel, and Model.
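
For example, a minimal sketch of instantiating DeepSpeedModel directly (the model id, role, and parallelism degree are illustrative assumptions):

    from sagemaker.djl_inference.model import DeepSpeedModel

    model = DeepSpeedModel(
        "EleutherAI/gpt-neox-20b",  # hypothetical Hub model id
        role="arn:aws:iam::123456789012:role/SageMakerRole",
        tensor_parallel_degree=4,  # shard across 4 GPUs via tensor parallelism
        dtype="fp16",
        max_tokens=2048,  # input + output token budget for the engine
    )
    predictor = model.deploy("ml.g5.24xlarge")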

generate_serving_properties(serving_properties=None)

Generates the DJL Serving configuration to use for the model.

The configuration is generated using the arguments passed to the Model during initialization. If a serving.properties file is found in self.source_dir, that configuration is merged with the Model parameters, with the Model parameters taking priority.

Parameters
  • serving_properties (dict) – Dictionary containing existing model server configuration obtained from self.source_dir. Defaults to None.

Returns

The model server configuration to use when deploying this model to SageMaker.

Return type

dict

partition(instance_type, s3_output_uri=None, s3_output_prefix='aot-partitioned-checkpoints', job_name=None, volume_size=30, volume_kms_key=None, output_kms_key=None, use_spot_instances=False, max_wait=None, enable_network_isolation=False)

Partitions the model using a SageMaker Training Job. This is a synchronous API call.

Parameters
  • instance_type (str) – The EC2 instance type to partition this Model. For example, ‘ml.p4d.24xlarge’.

  • s3_output_uri (str) – S3 location for saving the training result (model artifacts and output files). If not specified, results are stored to a default bucket. If the bucket with the specific name does not exist, it will be created.

  • s3_output_prefix (str) – The S3 prefix under which the partitioned checkpoints are uploaded. If not provided, defaults to aot-partitioned-checkpoints.

  • job_name (str) – Training job name. If not specified, a unique training job name will be created.

  • volume_size (int) – Size in GB of the storage volume to use for storing input and output data during training (default: 30).

  • volume_kms_key (str) – Optional. KMS key ID for encrypting EBS volume attached to the training instance (default: None).

  • output_kms_key (str) – Optional. KMS key ID for encrypting the training output (default: None).

  • use_spot_instances (bool) –

    Specifies whether to use SageMaker Managed Spot instances for training. If enabled then the max_wait arg should also be set.

    More information: https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html (default: False).

  • max_wait (int) – Timeout in seconds waiting for the spot training job. After this amount of time, Amazon SageMaker stops waiting for the managed spot training job to complete (default: None).

  • enable_network_isolation (bool) – Specifies whether container will run in network isolation mode (default: False). Network isolation mode restricts the container access to outside networks (such as the Internet). The container does not make any inbound or outbound network calls. Also known as Internet-free mode.

Returns

None

HuggingFaceAccelerateModel

class sagemaker.djl_inference.model.HuggingFaceAccelerateModel(model_id, *args, **kwargs)

Bases: DJLModel

A DJL Hugging Face SageMaker Model that can be deployed to a SageMaker Endpoint.

Initialize a HuggingFaceAccelerateModel.

Parameters
  • model_id (str) – This is either the HuggingFace Hub model_id, or the Amazon S3 location containing the uncompressed model artifacts (i.e. not a tar.gz file). The model artifacts are expected to be in HuggingFace pre-trained model format (i.e. the model should be loadable via the HuggingFace Transformers from_pretrained API, and should also include tokenizer configs if applicable).

  • role (str) – An AWS IAM role specified with either the name or full ARN. The Amazon SageMaker training jobs and APIs that create Amazon SageMaker endpoints use this role to access model artifacts. After the endpoint is created, the inference code might use the IAM role, if it needs to access an AWS resource.

  • number_of_partitions (int) – The number of GPUs to partition the model across. The partitioning strategy is determined by the device_map setting. If device_map is not specified, the default HuggingFace strategy will be used.

  • device_id (int) – The device_id to use for instantiating the model. If provided, the model will only be instantiated once on the indicated device. Do not set this if you have also specified data_parallel_degree. Defaults to None.

  • device_map (str or dict) – The HuggingFace accelerate device_map to use. Defaults to None.

  • load_in_8bit (bool) – Whether to load the model in int8 precision using bitsandbytes quantization. This is only supported for select model architectures. Defaults to False. If dtype is int8, this is set to True.

  • low_cpu_mem_usage (bool) – Whether to limit CPU memory usage to 1x model size during model loading. This is an experimental feature in HuggingFace. This is useful when loading multiple instances of your model in parallel. Defaults to False.

  • **kwargs – Keyword arguments passed to the superclasses DJLModel, FrameworkModel, and Model.

Tip

You can find additional parameters for initializing this class at DJLModel, FrameworkModel, and Model.
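
For example, a minimal sketch of instantiating HuggingFaceAccelerateModel directly (the model id, role, and placement settings are illustrative assumptions):

    from sagemaker.djl_inference.model import HuggingFaceAccelerateModel

    model = HuggingFaceAccelerateModel(
        "google/flan-t5-xxl",  # hypothetical Hub model id
        role="arn:aws:iam::123456789012:role/SageMakerRole",
        number_of_partitions=2,
        device_map="auto",  # let Accelerate place layers across devices
        load_in_8bit=True,  # int8 loading; select architectures only
    )
    predictor = model.deploy("ml.g5.12xlarge")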

generate_serving_properties(serving_properties=None)

Generates the DJL Serving configuration to use for the model.

The configuration is generated using the arguments passed to the Model during initialization. If a serving.properties file is found in self.source_dir, that configuration is merged with the Model parameters, with the Model parameters taking priority.

Parameters
  • serving_properties (dict) – Dictionary containing existing model server configuration obtained from self.source_dir. Defaults to None.

Returns

The model server configuration to use when deploying this model to SageMaker.

Return type

dict

partition(instance_type, s3_output_uri=None, s3_output_prefix='aot-partitioned-checkpoints', job_name=None, volume_size=30, volume_kms_key=None, output_kms_key=None, use_spot_instances=False, max_wait=None, enable_network_isolation=False)

Partitions the model using a SageMaker Training Job. This is a synchronous API call.

Parameters
  • instance_type (str) – The EC2 instance type to partition this Model. For example, ‘ml.p4d.24xlarge’.

  • s3_output_uri (str) – S3 location for saving the training result (model artifacts and output files). If not specified, results are stored to a default bucket. If the bucket with the specific name does not exist, it will be created.

  • s3_output_prefix (str) – The S3 prefix under which the partitioned checkpoints are uploaded. If not provided, defaults to aot-partitioned-checkpoints.

  • job_name (str) – Training job name. If not specified, a unique training job name will be created.

  • volume_size (int) – Size in GB of the storage volume to use for storing input and output data during training (default: 30).

  • volume_kms_key (str) – Optional. KMS key ID for encrypting EBS volume attached to the training instance (default: None).

  • output_kms_key (str) – Optional. KMS key ID for encrypting the training output (default: None).

  • use_spot_instances (bool) –

    Specifies whether to use SageMaker Managed Spot instances for training. If enabled then the max_wait arg should also be set.

    More information: https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html (default: False).

  • max_wait (int) – Timeout in seconds waiting for the spot training job. After this amount of time, Amazon SageMaker stops waiting for the managed spot training job to complete (default: None).

  • enable_network_isolation (bool) – Specifies whether container will run in network isolation mode (default: False). Network isolation mode restricts the container access to outside networks (such as the Internet). The container does not make any inbound or outbound network calls. Also known as Internet-free mode.

Returns

None

FasterTransformerModel

class sagemaker.djl_inference.model.FasterTransformerModel(model_id, *args, **kwargs)

Bases: DJLModel

A DJL FasterTransformer SageMaker Model that can be deployed to a SageMaker Endpoint.

Initialize a FasterTransformerModel.

Parameters
  • model_id (str) – This is either the HuggingFace Hub model_id, or the Amazon S3 location containing the uncompressed model artifacts (i.e. not a tar.gz file). The model artifacts are expected to be in HuggingFace pre-trained model format (i.e. the model should be loadable via the HuggingFace Transformers from_pretrained API, and should also include tokenizer configs if applicable).

  • role (str) – An AWS IAM role specified with either the name or full ARN. The Amazon SageMaker training jobs and APIs that create Amazon SageMaker endpoints use this role to access model artifacts. After the endpoint is created, the inference code might use the IAM role, if it needs to access an AWS resource.

  • tensor_parallel_degree (int) – The number of GPUs to shard a single instance of the model across via tensor parallelism. This should be set to greater than 1 if the size of the model is larger than the memory available on a single GPU on the instance. Defaults to None. If not set, no tensor parallel sharding is done.

  • **kwargs – Keyword arguments passed to the superclasses DJLModel, FrameworkModel, and Model.

Tip

You can find additional parameters for initializing this class at DJLModel, FrameworkModel, and Model.
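
For example, a minimal sketch of instantiating FasterTransformerModel directly (the model id, role, and degree are illustrative assumptions):

    from sagemaker.djl_inference.model import FasterTransformerModel

    model = FasterTransformerModel(
        "google/flan-t5-xl",  # hypothetical Hub model id
        role="arn:aws:iam::123456789012:role/SageMakerRole",
        tensor_parallel_degree=2,  # shard across 2 GPUs
    )
    predictor = model.deploy("ml.g5.12xlarge")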

DJLPredictor

class sagemaker.djl_inference.model.DJLPredictor(endpoint_name, sagemaker_session=None, serializer=<sagemaker.base_serializers.JSONSerializer object>, deserializer=<sagemaker.base_deserializers.JSONDeserializer object>, component_name=None)

Bases: Predictor

A Predictor for inference against DJL Model Endpoints.

This is able to serialize Python lists, dictionaries, and numpy arrays to multidimensional tensors for DJL inference.

Initialize a DJLPredictor.

Parameters
  • endpoint_name (str) – The name of the endpoint to perform inference on.

  • sagemaker_session (sagemaker.session.Session) – Session object that manages interactions with Amazon SageMaker APIs and any other AWS services needed. If not specified, the estimator creates one using the default AWS configuration chain.

  • serializer (sagemaker.serializers.BaseSerializer) – Optional. By default, serializes input data to JSON format.

  • deserializer (sagemaker.deserializers.BaseDeserializer) – Optional. By default, parses the JSON response into a dictionary.

  • component_name (str) – Optional. Name of the Amazon SageMaker inference component corresponding to the predictor.
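
For example, a sketch of invoking an existing DJL endpoint (the endpoint name and payload shape are illustrative assumptions; the expected payload depends on the task and any custom handler):

    from sagemaker.djl_inference.model import DJLPredictor

    predictor = DJLPredictor("my-djl-endpoint")  # hypothetical endpoint name

    # The default JSON serializer/deserializer round-trip Python dicts.
    result = predictor.predict({
        "inputs": "Large language models are",
        "parameters": {"max_new_tokens": 64},  # illustrative generation params
    })
    print(result)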