DJL Classes
DJLModel
- class sagemaker.djl_inference.DJLModel(model_id=None, engine=None, djl_version='latest', djl_framework=None, task=None, dtype=None, tensor_parallel_degree=None, min_workers=None, max_workers=None, job_queue_size=None, parallel_loading=False, model_loading_timeout=None, prediction_timeout=None, predictor_cls=<class 'sagemaker.djl_inference.djl_predictor.DJLPredictor'>, huggingface_hub_token=None, **kwargs)
Bases: Model
A DJL SageMaker Model that can be deployed to a SageMaker Endpoint.
Initialize a SageMaker model using one of the DJL Model Serving Containers.
- Parameters:
model_id (str) – Either the HuggingFace Hub model_id or the Amazon S3 location containing the uncompressed model artifacts (i.e. not a tar.gz file). The model artifacts are expected to be in HuggingFace pre-trained model format (i.e. the model should be loadable via the HuggingFace transformers from_pretrained API, and should also include tokenizer configs if applicable). The model artifact location must be specified using either the model_id parameter, the model_data parameter, or the HF_MODEL_ID environment variable in the env parameter.
engine (str) – The DJL inference engine to use for your model. Defaults to None. If not provided, the engine is inferred based on the task. If no task is provided, the Python engine is used.
djl_version (str) – The DJL Serving version to use for serving your model for inference. Defaults to None. If not provided, the latest available version of DJL Serving is used. This is not used if image_uri is provided.
djl_framework (str) – The DJL container to use. This is used along with djl_version to fetch the image_uri of the DJL inference container. This is not used if image_uri is provided.
task (str) – The HuggingFace/NLP task you want to launch this model for. Defaults to None. If not provided, the task will be inferred from the model architecture by DJL.
tensor_parallel_degree (int) – The number of accelerators to partition the model across using tensor parallelism. Defaults to None. If not provided, the maximum number of available accelerators will be used.
min_workers (int) – The minimum number of worker processes. Defaults to None. If not provided, DJL Serving will automatically detect the minimum workers.
max_workers (int) – The maximum number of worker processes. Defaults to None. If not provided, DJL Serving will automatically detect the maximum workers.
job_queue_size (int) – The request job queue size. Defaults to None. If not specified, defaults to 1000.
parallel_loading (bool) – Whether to load model workers in parallel. Defaults to False, in which case DJL Serving will load the model workers sequentially to reduce the risk of running out of memory. Set to True if you want to reduce model loading time and know that peak memory usage will not cause out of memory issues.
model_loading_timeout (int) – The worker model loading timeout in seconds. Defaults to None. If not provided, the default is 240 seconds.
prediction_timeout (int) – The worker predict call (handler) timeout in seconds. Defaults to None. If not provided, the default is 120 seconds.
predictor_cls (Callable[[string, sagemaker.session.Session], Any]) – A function to call to create a predictor with an endpoint name and SageMaker Session. If specified, deploy() returns the result of invoking this function on the created endpoint name.
huggingface_hub_token (str) – The HuggingFace Hub token to use for downloading the model artifacts for a model stored on the HuggingFace Hub. Defaults to None. If not provided, the token must be specified in the HF_TOKEN environment variable in the env parameter.
**kwargs – Keyword arguments passed to the superclass FrameworkModel and, subsequently, its superclass Model.
dtype (str | None) –
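For orientation, here is a minimal deployment sketch. The model id, role ARN, and instance type below are illustrative assumptions, not values prescribed by the SDK:

```python
from sagemaker.djl_inference import DJLModel

# Placeholder values -- substitute your own IAM role ARN and model id.
role_arn = "arn:aws:iam::123456789012:role/MySageMakerRole"

model = DJLModel(
    model_id="google/flan-t5-xl",   # HuggingFace Hub id, or an S3 prefix of uncompressed artifacts
    role=role_arn,                  # forwarded via **kwargs to the Model superclass
    task="text2text-generation",
    tensor_parallel_degree=1,       # number of accelerators to shard the model across
)

# deploy() returns a DJLPredictor (the default predictor_cls) bound to the new endpoint.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
)
```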
- serving_image_uri(region_name, instance_type=None, accelerator_type=None, serverless_inference_config=None)
Placeholder docstring
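A hedged sketch of calling this method; the region and instance type are assumptions, and the return value is expected to be the container image URI used for serving:

```python
# Assuming `model` is the DJLModel constructed above.
image_uri = model.serving_image_uri(
    region_name="us-east-1",
    instance_type="ml.g5.2xlarge",
)
print(image_uri)  # expected: an ECR URI for the matching DJL inference container
```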
- package_for_edge(**_)
Not implemented.
DJLModels do not support SageMaker edge.
- Raises:
NotImplementedError
- compile(**_)
Not implemented.
DJLModels do not support SageMaker Neo compilation.
- Raises:
NotImplementedError
- transformer(**_)
Not implemented.
DJLModels do not support SageMaker Batch Transform.
- Raises:
NotImplementedError
- right_size(**_)
Not implemented.
DJLModels do not support SageMaker Inference Recommendation Jobs.
- Raises:
NotImplementedError
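A short sketch of what callers should expect from these stubs, assuming they raise NotImplementedError as documented above:

```python
# Each unsupported method raises rather than silently doing nothing.
for method in (model.package_for_edge, model.compile, model.transformer, model.right_size):
    try:
        method()
    except NotImplementedError as err:
        print(f"{method.__name__}: {err}")
```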
DJLPredictor
- class sagemaker.djl_inference.DJLPredictor(endpoint_name, sagemaker_session=None, serializer=<sagemaker.base_serializers.JSONSerializer object>, deserializer=<sagemaker.base_deserializers.JSONDeserializer object>, component_name=None)
Bases: Predictor
A Predictor for inference against DJL Model Endpoints.
This is able to serialize Python lists, dictionaries, and numpy arrays to multidimensional tensors for DJL inference.
Initialize a DJLPredictor.
- Parameters:
endpoint_name (str) – The name of the endpoint to perform inference on.
sagemaker_session (sagemaker.session.Session) – Session object that manages interactions with Amazon SageMaker APIs and any other AWS services needed. If not specified, the estimator creates one using the default AWS configuration chain.
serializer (sagemaker.serializers.BaseSerializer) – Optional. The default serializes input data to JSON format.
deserializer (sagemaker.deserializers.BaseDeserializer) – Optional. The default parses the response from JSON format into a dictionary.
component_name (str) – Optional. The name of the Amazon SageMaker inference component corresponding to the predictor.
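A minimal usage sketch, assuming an existing DJL endpoint named "my-djl-endpoint"; the payload schema shown follows a text-generation style handler and is an assumption, since request schemas vary by task:

```python
from sagemaker.djl_inference import DJLPredictor

predictor = DJLPredictor(endpoint_name="my-djl-endpoint")  # placeholder endpoint name

# The default JSON serializer/deserializer round-trip Python dicts and lists.
result = predictor.predict({
    "inputs": "The capital of France is",
    "parameters": {"max_new_tokens": 32},
})
print(result)
```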