Processing¶
This module contains code related to the Processor class, which is used for Amazon SageMaker Processing jobs. These jobs let users perform data pre-processing, post-processing, feature engineering, data validation, model evaluation, and interpretation on Amazon SageMaker.
- class sagemaker.processing.Processor(role=None, image_uri=None, instance_count=None, instance_type=None, entrypoint=None, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)¶
Bases: object
Handles Amazon SageMaker Processing tasks.
Initializes a Processor instance. The Processor handles Amazon SageMaker Processing tasks.
- Parameters
role (str or PipelineVariable) – An AWS IAM role name or ARN. Amazon SageMaker Processing uses this role to access AWS resources, such as data stored in Amazon S3.
image_uri (str or PipelineVariable) – The URI of the Docker image to use for the processing jobs.
instance_count (int or PipelineVariable) – The number of instances to run a processing job with.
instance_type (str or PipelineVariable) – The type of EC2 instance to use for processing, for example, ‘ml.c4.xlarge’.
entrypoint (list[str] or list[PipelineVariable]) – The entrypoint for the processing job (default: None). This is in the form of a list of strings that make a command.
volume_size_in_gb (int or PipelineVariable) – Size in GB of the EBS volume to use for storing data during processing (default: 30).
volume_kms_key (str or PipelineVariable) – A KMS key for the processing volume (default: None).
output_kms_key (str or PipelineVariable) – The KMS key ID for processing job outputs (default: None).
max_runtime_in_seconds (int or PipelineVariable) – Timeout in seconds (default: None). After this amount of time, Amazon SageMaker terminates the job, regardless of its current status. If max_runtime_in_seconds is not specified, the default value is 24 hours.
base_job_name (str) – Prefix for processing job name. If not specified, the processor generates a default job name, based on the processing image name and current timestamp.
sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.
env (dict[str, str] or dict[str, PipelineVariable]) – Environment variables to be passed to the processing jobs (default: None).
tags (Optional[Tags]) – Tags to be passed to the processing job (default: None). For more, see https://docs.aws.amazon.com/sagemaker/latest/dg/API_Tag.html.
network_config (NetworkConfig) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets.
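For orientation, here is a minimal sketch of constructing a generic Processor for a custom container and running it; the role ARN, image URI, and S3 paths are illustrative placeholders rather than values taken from this reference.

from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput

# All ARNs, image URIs, and S3 paths below are placeholders.
processor = Processor(
    role="arn:aws:iam::123456789012:role/MySageMakerRole",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-processor:latest",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    entrypoint=["python3", "/opt/ml/code/process.py"],
)
processor.run(
    inputs=[ProcessingInput(source="s3://my-bucket/raw-data/",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://my-bucket/processed-data/")],
)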
- JOB_CLASS_NAME = 'processing-job'¶
- run(inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None, kms_key=None)¶
Runs a processing job.
- Parameters
inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).
outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).
arguments (list[str] or list[PipelineVariable]) – A list of string arguments to be passed to a processing job (default: None).
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.
experiment_config (dict[str, str]) – Experiment management configuration. Optionally, the dict can contain three keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’. The behavior of setting these keys is as follows:
- If ExperimentName is supplied but TrialName is not, a Trial will be automatically created and the job’s Trial Component associated with the Trial.
- If TrialName is supplied and the Trial already exists, the job’s Trial Component will be associated with the Trial.
- If neither ExperimentName nor TrialName is supplied, the Trial Component will be unassociated.
- TrialComponentDisplayName is used for display in Studio.
- Both ExperimentName and TrialName will be ignored if the Processor instance is built with PipelineSession. However, the value of TrialComponentDisplayName is honored for display in Studio.
kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).
- Returns
None or pipeline step arguments in case the Processor instance is built with PipelineSession.
- Raises
ValueError – if logs is True but wait is False.
- class sagemaker.processing.ScriptProcessor(role=None, image_uri=None, command=None, instance_count=None, instance_type=None, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)¶
Bases: Processor
Handles Amazon SageMaker processing tasks for jobs using a machine learning framework.
Initializes a ScriptProcessor instance. The ScriptProcessor handles Amazon SageMaker Processing tasks for jobs using a machine learning framework, which allows for providing a script to be run as part of the Processing Job.
- Parameters
role (str or PipelineVariable) – An AWS IAM role name or ARN. Amazon SageMaker Processing uses this role to access AWS resources, such as data stored in Amazon S3.
image_uri (str or PipelineVariable) – The URI of the Docker image to use for the processing jobs.
command ([str]) – The command to run, along with any command-line flags. Example: [“python3”, “-v”].
instance_count (int or PipelineVariable) – The number of instances to run a processing job with.
instance_type (str or PipelineVariable) – The type of EC2 instance to use for processing, for example, ‘ml.c4.xlarge’.
volume_size_in_gb (int or PipelineVariable) – Size in GB of the EBS volume to use for storing data during processing (default: 30).
volume_kms_key (str or PipelineVariable) – A KMS key for the processing volume (default: None).
output_kms_key (str or PipelineVariable) – The KMS key ID for processing job outputs (default: None).
max_runtime_in_seconds (int or PipelineVariable) – Timeout in seconds (default: None). After this amount of time, Amazon SageMaker terminates the job, regardless of its current status. If max_runtime_in_seconds is not specified, the default value is 24 hours.
base_job_name (str) – Prefix for processing job name. If not specified, the processor generates a default job name, based on the processing image name and current timestamp.
sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.
env (dict[str, str] or dict[str, PipelineVariable]) – Environment variables to be passed to the processing jobs (default: None).
tags (Optional[Tags]) – Tags to be passed to the processing job (default: None). For more, see https://docs.aws.amazon.com/sagemaker/latest/dg/API_Tag.html.
network_config (NetworkConfig) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets.
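As a quick illustration, the sketch below runs a local script inside a user-supplied container image; the image URI, role ARN, S3 paths, and script arguments are placeholders.

from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

# Image URI, role ARN, and S3 paths are placeholders.
script_processor = ScriptProcessor(
    role="arn:aws:iam::123456789012:role/MySageMakerRole",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-sklearn:latest",
    command=["python3"],
    instance_count=1,
    instance_type="ml.m5.xlarge",
)
script_processor.run(
    code="preprocess.py",  # local script; uploaded to S3 automatically
    inputs=[ProcessingInput(source="s3://my-bucket/raw/",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output")],
    arguments=["--train-test-split-ratio", "0.2"],
)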
- get_run_args(code, inputs=None, outputs=None, arguments=None)¶
Returns a RunArgs object.
For processors (PySparkProcessor, SparkJarProcessor) that have special run() arguments, this object contains the normalized arguments for passing to ProcessingStep.
- Parameters
code (str) – This can be an S3 URI or a local path to a file with the framework script to run.
inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).
outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).
arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).
- run(code, inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None, kms_key=None)¶
Runs a processing job.
- Parameters
code (str) – This can be an S3 URI or a local path to a file with the framework script to run.
inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).
outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).
arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.
experiment_config (dict[str, str]) – Experiment management configuration. Optionally, the dict can contain three keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’. The behavior of setting these keys is as follows:
- If ExperimentName is supplied but TrialName is not, a Trial will be automatically created and the job’s Trial Component associated with the Trial.
- If TrialName is supplied and the Trial already exists, the job’s Trial Component will be associated with the Trial.
- If neither ExperimentName nor TrialName is supplied, the Trial Component will be unassociated.
- TrialComponentDisplayName is used for display in Studio.
- Both ExperimentName and TrialName will be ignored if the Processor instance is built with PipelineSession. However, the value of TrialComponentDisplayName is honored for display in Studio.
kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).
- Returns
None or pipeline step arguments in case the Processor instance is built with PipelineSession.
- class sagemaker.processing.ProcessingJob(sagemaker_session, job_name, inputs, outputs, output_kms_key=None)¶
Bases: _Job
Provides functionality to start, describe, and stop processing jobs.
Initializes a Processing job.
- Parameters
sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.
job_name (str) – Name of the Processing job.
inputs (list[ProcessingInput]) – A list of ProcessingInput objects.
outputs (list[ProcessingOutput]) – A list of ProcessingOutput objects.
output_kms_key (str) – The output KMS key associated with the job (default: None).
- classmethod start_new(processor, inputs, outputs, experiment_config)¶
Starts a new processing job using the provided inputs and outputs.
- Parameters
processor (Processor) – The Processor instance that started the job.
inputs (list[ProcessingInput]) – A list of ProcessingInput objects.
outputs (list[ProcessingOutput]) – A list of ProcessingOutput objects.
experiment_config (dict[str, str]) – Experiment management configuration. Optionally, the dict can contain three keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’. The behavior of setting these keys is as follows:
- If ExperimentName is supplied but TrialName is not, a Trial will be automatically created and the job’s Trial Component associated with the Trial.
- If TrialName is supplied and the Trial already exists, the job’s Trial Component will be associated with the Trial.
- If neither ExperimentName nor TrialName is supplied, the Trial Component will be unassociated.
- TrialComponentDisplayName is used for display in Studio.
- Returns
The instance of ProcessingJob created using the Processor.
- Return type
ProcessingJob
- classmethod from_processing_name(sagemaker_session, processing_job_name)¶
Initializes a ProcessingJob from a processing job name.
- Parameters
sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed.
processing_job_name (str) – Name of the processing job.
- Returns
The instance of ProcessingJob created from the job name.
- Return type
ProcessingJob
- classmethod from_processing_arn(sagemaker_session, processing_job_arn)¶
Initializes a ProcessingJob from a Processing ARN.
- Parameters
sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed.
processing_job_arn (str) – ARN of the processing job.
- Returns
The instance of ProcessingJob created from the processing job’s ARN.
- Return type
ProcessingJob
- wait(logs=True)¶
Waits for the processing job to complete.
- Parameters
logs (bool) – Whether to show the logs produced by the job (default: True).
- describe()¶
Prints out a response from the DescribeProcessingJob API call.
- stop()¶
Stops the processing job.
- static prepare_app_specification(container_arguments, container_entrypoint, image_uri)¶
Prepares a dict that represents a ProcessingJob’s AppSpecification.
- Parameters
container_arguments (list[str]) – The arguments for a container used to run a processing job.
container_entrypoint (list[str]) – The entrypoint for a container used to run a processing job.
image_uri (str) – The container image to be run by the processing job.
- Returns
Represents AppSpecification which configures the processing job to run a specified Docker container image.
- Return type
dict
- static prepare_output_config(kms_key_id, outputs)¶
Prepares a dict that represents a ProcessingOutputConfig.
- Parameters
kms_key_id (str) – The AWS Key Management Service (AWS KMS) key that Amazon SageMaker uses to encrypt the processing job output. KmsKeyId can be an ID of a KMS key, the ARN of a KMS key, the alias of a KMS key, or the ARN of a KMS key alias. The KmsKeyId is applied to all outputs.
outputs (list[dict]) – Output configuration information for a processing job.
- Returns
Represents output configuration for the processing job.
- Return type
dict
- static prepare_processing_resources(instance_count, instance_type, volume_kms_key_id, volume_size_in_gb)¶
Prepares a dict that represents the ProcessingResources.
- Parameters
instance_count (int) – The number of ML compute instances to use in the processing job. For distributed processing jobs, specify a value greater than 1. The default value is 1.
instance_type (str) – The ML compute instance type for the processing job.
volume_kms_key_id (str) – The AWS Key Management Service (AWS KMS) key that Amazon SageMaker uses to encrypt data on the storage volume attached to the ML compute instance(s) that run the processing job.
volume_size_in_gb (int) – The size of the ML storage volume in gigabytes that you want to provision. You must specify sufficient ML storage for your scenario.
- Returns
Represents ProcessingResources which identifies the resources, ML compute instances, and ML storage volumes to deploy for a processing job.
- Return type
dict
- class sagemaker.processing.ProcessingInput(source=None, destination=None, input_name=None, s3_data_type='S3Prefix', s3_input_mode='File', s3_data_distribution_type='FullyReplicated', s3_compression_type='None', s3_input=None, dataset_definition=None, app_managed=False)¶
Bases: object
Accepts parameters that specify an Amazon S3 input for a processing job.
Also provides a method to turn those parameters into a dictionary.
Initializes a ProcessingInput instance. ProcessingInput accepts parameters that specify an Amazon S3 input for a processing job and provides a method to turn those parameters into a dictionary.
- Parameters
source (str or PipelineVariable) – The source for the input. If a local path is provided, it will automatically be uploaded to S3 under: “s3://<default-bucket-name>/<job-name>/input/<input-name>”.
destination (str or PipelineVariable) – The destination of the input.
input_name (str or PipelineVariable) – The name for the input. If a name is not provided, one will be generated (e.g. “input-1”).
s3_data_type (str or PipelineVariable) – Valid options are “ManifestFile” or “S3Prefix”.
s3_input_mode (str or PipelineVariable) – Valid options are “Pipe”, “File” or “FastFile”.
s3_data_distribution_type (str or PipelineVariable) – Valid options are “FullyReplicated” or “ShardedByS3Key”.
s3_compression_type (str or PipelineVariable) – Valid options are “None” or “Gzip”.
s3_input (S3Input) – Metadata of data objects stored in S3.
dataset_definition (DatasetDefinition) – DatasetDefinition input.
app_managed (bool or PipelineVariable) – Whether the input is managed by SageMaker or the application.
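For example, a sharded S3 input for a distributed job might look like the following sketch; the bucket and prefix are placeholders.

from sagemaker.processing import ProcessingInput

input_shard = ProcessingInput(
    source="s3://my-bucket/training-data/",      # placeholder S3 prefix
    destination="/opt/ml/processing/input/data",
    input_name="training-data",
    s3_data_distribution_type="ShardedByS3Key",  # split objects across instances
    s3_input_mode="File",
)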
- class sagemaker.processing.ProcessingOutput(source=None, destination=None, output_name=None, s3_upload_mode='EndOfJob', app_managed=False, feature_store_output=None)¶
Bases: object
Accepts parameters that specify an Amazon S3 output for a processing job.
It also provides a method to turn those parameters into a dictionary.
Initializes a ProcessingOutput instance. ProcessingOutput accepts parameters that specify an Amazon S3 output for a processing job and provides a method to turn those parameters into a dictionary.
- Parameters
source (str or PipelineVariable) – The source for the output.
destination (str or PipelineVariable) – The destination of the output. If a destination is not provided, one will be generated: “s3://<default-bucket-name>/<job-name>/output/<output-name>” (Note: this does not apply when used with ProcessingStep).
output_name (str or PipelineVariable) – The name of the output. If a name is not provided, one will be generated (e.g. “output-1”).
s3_upload_mode (str or PipelineVariable) – Valid options are “EndOfJob” or “Continuous”.
app_managed (bool or PipelineVariable) – Whether the output is managed by SageMaker or the application.
feature_store_output (FeatureStoreOutput) – Configuration for processing job outputs of FeatureStore.
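For example, an output that streams files to S3 as they are written might look like this sketch; the destination is a placeholder (omit it to let SageMaker generate one).

from sagemaker.processing import ProcessingOutput

metrics_output = ProcessingOutput(
    source="/opt/ml/processing/output/metrics",
    destination="s3://my-bucket/jobs/metrics/",  # placeholder destination
    output_name="metrics",
    s3_upload_mode="Continuous",                 # upload as files are written
)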
- class sagemaker.processing.RunArgs(code, inputs=None, outputs=None, arguments=None)¶
Bases: object
Accepts parameters that correspond to ScriptProcessors.
An instance of this class is returned from the get_run_args() method on processors, and is used for normalizing the arguments so that they can be passed to ProcessingStep.
- Parameters
code (str) – This can be an S3 URI or a local path to a file with the framework script to run.
inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).
outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).
arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).
Method generated by attrs for class RunArgs.
- class sagemaker.processing.FeatureStoreOutput(**kwargs)¶
Bases: ApiObject
Configuration for processing job outputs in Amazon SageMaker Feature Store.
Init ApiObject.
- feature_group_name = None¶
- class sagemaker.processing.FrameworkProcessor(estimator_cls, framework_version, role=None, instance_count=None, instance_type=None, py_version='py3', image_uri=None, command=None, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, code_location=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)¶
Bases: ScriptProcessor
Handles Amazon SageMaker processing tasks for jobs using a machine learning framework.
Initializes a FrameworkProcessor instance. The FrameworkProcessor handles Amazon SageMaker Processing tasks for jobs using a machine learning framework, which allows for a set of Python scripts to be run as part of the Processing Job.
- Parameters
estimator_cls (type) – A subclass of the Framework estimator.
framework_version (str) – The version of the framework. Value is ignored when image_uri is provided.
role (str or PipelineVariable) – An AWS IAM role name or ARN. Amazon SageMaker Processing uses this role to access AWS resources, such as data stored in Amazon S3.
instance_count (int or PipelineVariable) – The number of instances to run a processing job with.
instance_type (str or PipelineVariable) – The type of EC2 instance to use for processing, for example, ‘ml.c4.xlarge’.
py_version (str) – Python version you want to use for executing your model training code. One of ‘py2’ or ‘py3’. Defaults to ‘py3’. Value is ignored when image_uri is provided.
image_uri (str or PipelineVariable) – The URI of the Docker image to use for the processing jobs (default: None).
command ([str]) – The command to run, along with any command-line flags to precede the code script. Example: [“python3”, “-v”]. If not provided, [“python”] will be chosen (default: None).
volume_size_in_gb (int or PipelineVariable) – Size in GB of the EBS volume to use for storing data during processing (default: 30).
volume_kms_key (str or PipelineVariable) – A KMS key for the processing volume (default: None).
output_kms_key (str or PipelineVariable) – The KMS key ID for processing job outputs (default: None).
code_location (str) – The S3 prefix URI where custom code will be uploaded (default: None). The code file uploaded to S3 is ‘code_location/job-name/source/sourcedir.tar.gz’. If not specified, the default code location is ‘s3://{sagemaker-default-bucket}’.
max_runtime_in_seconds (int or PipelineVariable) – Timeout in seconds (default: None). After this amount of time, Amazon SageMaker terminates the job, regardless of its current status. If max_runtime_in_seconds is not specified, the default value is 24 hours.
base_job_name (str) – Prefix for processing name. If not specified, the processor generates a default job name, based on the processing image name and current timestamp (default: None).
sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain (default: None).
env (dict[str, str] or dict[str, PipelineVariable]) – Environment variables to be passed to the processing jobs (default: None).
tags (Optional[Tags]) – Tags to be passed to the processing job (default: None). For more, see https://docs.aws.amazon.com/sagemaker/latest/dg/API_Tag.html.
network_config (NetworkConfig) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets (default: None).
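A minimal sketch of using a FrameworkProcessor with a framework estimator class follows; the estimator class, framework version, role ARN, and local paths are illustrative assumptions, not prescriptions.

from sagemaker.processing import FrameworkProcessor
from sagemaker.sklearn.estimator import SKLearn

# Role ARN, framework version, and local paths are placeholders.
processor = FrameworkProcessor(
    estimator_cls=SKLearn,
    framework_version="1.2-1",
    role="arn:aws:iam::123456789012:role/MySageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)
processor.run(
    code="process.py",         # entry point at the root of source_dir
    source_dir="src/",         # bundled and uploaded as sourcedir.tar.gz
    dependencies=["my_lib/"],  # extra local libraries copied into the container
)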
- framework_entrypoint_command = ['/bin/bash']¶
- get_run_args(code, source_dir=None, dependencies=None, git_config=None, inputs=None, outputs=None, arguments=None, job_name=None)¶
Returns a RunArgs object.
This object contains the normalized inputs, outputs and arguments needed when using a FrameworkProcessor in a ProcessingStep.
- Parameters
code (str) – This can be an S3 URI or a local path to a file with the framework script to run. See the code argument in sagemaker.processing.FrameworkProcessor.run().
source_dir (str) – Path (absolute, relative, or an S3 URI) to a directory with any other processing source code dependencies aside from the entrypoint file (default: None). See the source_dir argument in sagemaker.processing.FrameworkProcessor.run().
dependencies (list[str]) – A list of paths to directories (absolute or relative) with any additional libraries that will be exported to the container (default: []). See the dependencies argument in sagemaker.processing.FrameworkProcessor.run().
git_config (dict[str, str]) – Git configurations used for cloning files. See the git_config argument in sagemaker.processing.FrameworkProcessor.run().
inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).
outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).
arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).
job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.
- run(code, source_dir=None, dependencies=None, git_config=None, inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None, kms_key=None, codeartifact_repo_arn=None)¶
Runs a processing job.
- Parameters
code (str) – This can be an S3 URI or a local path to a file with the framework script to run. Path (absolute or relative) to the local Python source file which should be executed as the entry point to training. When code is an S3 URI, source_dir, dependencies, and git_config are ignored. If source_dir is specified, then code must point to a file located at the root of source_dir.
source_dir (str) – Path (absolute, relative, or an S3 URI) to a directory with any other processing source code dependencies aside from the entry point file (default: None). If source_dir is an S3 URI, it must point to a file named sourcedir.tar.gz. The structure within this directory is preserved when processing on Amazon SageMaker.
dependencies (list[str]) – A list of paths to directories (absolute or relative) with any additional libraries that will be exported to the container (default: []). The library folders will be copied to SageMaker in the same folder where the entrypoint is copied. If ‘git_config’ is provided, ‘dependencies’ should be a list of relative locations to directories with any additional libraries needed in the Git repo (default: None).
git_config (dict[str, str]) – Git configurations used for cloning files, including repo, branch, commit, 2FA_enabled, username, password and token. The repo field is required. All other fields are optional. repo specifies the Git repository where your training script is stored. If you don’t provide branch, the default value ‘master’ is used. If you don’t provide commit, the latest commit in the specified branch is used. For example, the following config:
>>> git_config = {'repo': 'https://github.com/aws/sagemaker-python-sdk.git',
>>>               'branch': 'test-branch-git-config',
>>>               'commit': '329bfcf884482002c05ff7f44f62599ebc9f445a'}
results in cloning the repo specified in ‘repo’, then checking out the ‘test-branch-git-config’ branch, and checking out the specified commit.
2FA_enabled, username, password and token are used for authentication. For GitHub (or other Git) accounts, set 2FA_enabled to ‘True’ if two-factor authentication is enabled for the account, otherwise set it to ‘False’. If you do not provide a value for 2FA_enabled, a default value of ‘False’ is used. CodeCommit does not support two-factor authentication, so do not provide “2FA_enabled” with CodeCommit repositories.
For GitHub and other Git repos, when SSH URLs are provided, it doesn’t matter whether 2FA is enabled or disabled; you should either have no passphrase for the SSH key pairs, or have the ssh-agent configured so that you will not be prompted for the SSH passphrase when you run the ‘git clone’ command with SSH URLs. When HTTPS URLs are provided: if 2FA is disabled, then either token or username and password will be used for authentication if provided (token prioritized); if 2FA is enabled, only token will be used for authentication if provided. If the required authentication info is not provided, the Python SDK will try to use local credentials storage to authenticate. If that also fails, an error message will be thrown.
For CodeCommit repos, 2FA is not supported, so ‘2FA_enabled’ should not be provided. There is no token in CodeCommit, so ‘token’ should not be provided either. When ‘repo’ is an SSH URL, the requirements are the same as for GitHub-like repos. When ‘repo’ is an HTTPS URL, username and password will be used for authentication if they are provided; otherwise, the Python SDK will try to use either the CodeCommit credential helper or local credential storage for authentication.
inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).
outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).
arguments (list[str] or list[PipelineVariable]) – A list of string arguments to be passed to a processing job (default: None).
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.
experiment_config (dict[str, str]) – Experiment management configuration. Optionally, the dict can contain three keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’. The behavior of setting these keys is as follows:
- If ExperimentName is supplied but TrialName is not, a Trial will be automatically created and the job’s Trial Component associated with the Trial.
- If TrialName is supplied and the Trial already exists, the job’s Trial Component will be associated with the Trial.
- If neither ExperimentName nor TrialName is supplied, the Trial Component will be unassociated.
- TrialComponentDisplayName is used for display in Studio.
- Both ExperimentName and TrialName will be ignored if the Processor instance is built with PipelineSession. However, the value of TrialComponentDisplayName is honored for display in Studio.
kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).
codeartifact_repo_arn (str) – The ARN of the CodeArtifact repository that should be logged into before installing dependencies (default: None).
- Returns
None or pipeline step arguments in case the Processor instance is built with PipelineSession.
This module is the entry point for running Spark processing scripts.
This module contains code related to Spark Processors, which are used for Processing jobs. These jobs let customers perform data pre-processing, post-processing, feature engineering, data validation, and model evaluation on SageMaker using Spark and PySpark.
- class sagemaker.spark.processing.PySparkProcessor(role=None, instance_type=None, instance_count=None, framework_version=None, py_version=None, container_version=None, image_uri=None, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, configuration_location=None, dependency_location=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)¶
Bases: _SparkProcessorBase
Handles Amazon SageMaker processing tasks for jobs using PySpark.
Initializes a PySparkProcessor instance. The PySparkProcessor handles Amazon SageMaker processing tasks for jobs using SageMaker PySpark.
- Parameters
framework_version (str) – The version of SageMaker PySpark.
py_version (str) – The version of Python.
container_version (str) – The version of the Spark container.
role (str) – An AWS IAM role name or ARN. Amazon SageMaker Processing uses this role to access AWS resources, such as data stored in Amazon S3 (default: None). If not specified, the value from the defaults configuration file will be used.
instance_type (str or PipelineVariable) – Type of EC2 instance to use for processing, for example, ‘ml.c4.xlarge’.
instance_count (int or PipelineVariable) – The number of instances to run the Processing job with. Defaults to 1.
volume_size_in_gb (int or PipelineVariable) – Size in GB of the EBS volume to use for storing data during processing (default: 30).
volume_kms_key (str or PipelineVariable) – A KMS key for the processing volume.
output_kms_key (str or PipelineVariable) – The KMS key id for all ProcessingOutputs.
configuration_location (str) – The S3 prefix URI where the user-provided EMR application configuration will be uploaded (default: None). If not specified, the default configuration location is ‘s3://{sagemaker-default-bucket}’.
dependency_location (str) – The S3 prefix URI where Spark dependencies will be uploaded (default: None). If not specified, the default dependency location is ‘s3://{sagemaker-default-bucket}’.
max_runtime_in_seconds (int or PipelineVariable) – Timeout in seconds. After this amount of time, Amazon SageMaker terminates the job regardless of its current status.
base_job_name (str) – Prefix for processing job name. If not specified, the processor generates a default job name, based on the processing image name and current timestamp.
sagemaker_session (sagemaker.session.Session) – Session object which manages interactions with Amazon SageMaker APIs and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.
env (dict[str, str] or dict[str, PipelineVariable]) – Environment variables to be passed to the processing job.
tags (Optional[Tags]) – List of tags to be passed to the processing job.
network_config (sagemaker.network.NetworkConfig) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets.
image_uri (Optional[Union[str, PipelineVariable]]) – The URI of the Docker image to use for the processing jobs (default: None).
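The sketch below submits a PySpark script on two instances; the framework version, role ARN, and S3 paths are placeholders.

from sagemaker.spark.processing import PySparkProcessor

spark_processor = PySparkProcessor(
    base_job_name="spark-preprocess",
    framework_version="3.3",  # placeholder Spark container version
    role="arn:aws:iam::123456789012:role/MySageMakerRole",
    instance_count=2,
    instance_type="ml.m5.xlarge",
)
spark_processor.run(
    submit_app="preprocess.py",  # local PySpark script, uploaded automatically
    arguments=["--input", "s3://my-bucket/raw/",
               "--output", "s3://my-bucket/processed/"],
    spark_event_logs_s3_uri="s3://my-bucket/spark-event-logs/",
)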
- get_run_args(submit_app, submit_py_files=None, submit_jars=None, submit_files=None, inputs=None, outputs=None, arguments=None, job_name=None, configuration=None, spark_event_logs_s3_uri=None)¶
Returns a RunArgs object.
This object contains the normalized inputs, outputs and arguments needed when using a PySparkProcessor in a ProcessingStep.
- Parameters
submit_app (str) – Path (local or S3) to Python file to submit to Spark as the primary application. This is translated to the code property on the returned RunArgs object.
submit_py_files (list[str]) – List of paths (local or S3) to provide for the spark-submit --py-files option.
submit_jars (list[str]) – List of paths (local or S3) to provide for the spark-submit --jars option.
submit_files (list[str]) – List of paths (local or S3) to provide for the spark-submit --files option.
inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).
outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).
arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).
job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.
configuration (list[dict] or dict) – Configuration for Hadoop, Spark, or Hive. List or dictionary of EMR-style classifications. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html
spark_event_logs_s3_uri (str) – S3 path where Spark application events will be published.
- run(submit_app, submit_py_files=None, submit_jars=None, submit_files=None, inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None, configuration=None, spark_event_logs_s3_uri=None, kms_key=None)¶
Runs a processing job.
- Parameters
submit_app (str) – Path (local or S3) to Python file to submit to Spark as the primary application.
submit_py_files (list[str] or list[PipelineVariable]) – List of paths (local or S3) to provide for the spark-submit --py-files option.
submit_jars (list[str] or list[PipelineVariable]) – List of paths (local or S3) to provide for the spark-submit --jars option.
submit_files (list[str] or list[PipelineVariable]) – List of paths (local or S3) to provide for the spark-submit --files option.
inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).
outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).
arguments (list[str] or list[PipelineVariable]) – A list of string arguments to be passed to a processing job (default: None).
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.
experiment_config (dict[str, str]) – Experiment management configuration. Optionally, the dict can contain three keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’. The behavior of setting these keys is as follows:
- If ExperimentName is supplied but TrialName is not, a Trial will be automatically created and the job’s Trial Component associated with the Trial.
- If TrialName is supplied and the Trial already exists, the job’s Trial Component will be associated with the Trial.
- If neither ExperimentName nor TrialName is supplied, the Trial Component will be unassociated.
- TrialComponentDisplayName is used for display in Studio.
configuration (list[dict] or dict) – Configuration for Hadoop, Spark, or Hive. List or dictionary of EMR-style classifications. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html
spark_event_logs_s3_uri (str or PipelineVariable) – S3 path where Spark application events will be published.
kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).
- class sagemaker.spark.processing.SparkJarProcessor(role=None, instance_type=None, instance_count=None, framework_version=None, py_version=None, container_version=None, image_uri=None, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, configuration_location=None, dependency_location=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)¶
Bases: _SparkProcessorBase
Handles Amazon SageMaker processing tasks for jobs using Spark with Java or Scala Jars.
Initializes a SparkJarProcessor instance. The SparkJarProcessor handles Amazon SageMaker processing tasks for jobs using SageMaker Spark.
- Parameters
framework_version (str) – The version of SageMaker PySpark.
py_version (str) – The version of Python.
container_version (str) – The version of the Spark container.
role (str) – An AWS IAM role name or ARN. Amazon SageMaker Processing uses this role to access AWS resources, such as data stored in Amazon S3 (default: None). If not specified, the value from the defaults configuration file will be used.
instance_type (str or PipelineVariable) – Type of EC2 instance to use for processing, for example, ‘ml.c4.xlarge’.
instance_count (int or PipelineVariable) – The number of instances to run the Processing job with. Defaults to 1.
volume_size_in_gb (int or PipelineVariable) – Size in GB of the EBS volume to use for storing data during processing (default: 30).
volume_kms_key (str or PipelineVariable) – A KMS key for the processing volume.
output_kms_key (str or PipelineVariable) – The KMS key id for all ProcessingOutputs.
configuration_location (str) – The S3 prefix URI where the user-provided EMR application configuration will be uploaded (default: None). If not specified, the default configuration location is ‘s3://{sagemaker-default-bucket}’.
dependency_location (str) – The S3 prefix URI where Spark dependencies will be uploaded (default: None). If not specified, the default dependency location is ‘s3://{sagemaker-default-bucket}’.
max_runtime_in_seconds (int or PipelineVariable) – Timeout in seconds. After this amount of time, Amazon SageMaker terminates the job regardless of its current status.
base_job_name (str) – Prefix for processing job name. If not specified, the processor generates a default job name, based on the processing image name and current timestamp.
sagemaker_session (sagemaker.session.Session) – Session object which manages interactions with Amazon SageMaker APIs and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.
env (dict[str, str] or dict[str, PipelineVariable]) – Environment variables to be passed to the processing job.
tags (Optional[Tags]) – Tags to be passed to the processing job.
network_config (sagemaker.network.NetworkConfig) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets.
image_uri (Optional[Union[str, PipelineVariable]]) – The URI of the Docker image to use for the processing jobs (default: None).
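A minimal sketch for submitting a Spark application packaged as a jar follows; the jar path, main class, role ARN, and version are placeholders.

from sagemaker.spark.processing import SparkJarProcessor

jar_processor = SparkJarProcessor(
    base_job_name="spark-jar-job",
    framework_version="3.3",  # placeholder Spark container version
    role="arn:aws:iam::123456789012:role/MySageMakerRole",
    instance_count=2,
    instance_type="ml.m5.xlarge",
)
jar_processor.run(
    submit_app="s3://my-bucket/jars/my-etl-assembly.jar",  # placeholder jar
    submit_class="com.example.etl.Main",                   # placeholder class
    arguments=["--output", "s3://my-bucket/processed/"],
)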
- get_run_args(submit_app, submit_class=None, submit_jars=None, submit_files=None, inputs=None, outputs=None, arguments=None, job_name=None, configuration=None, spark_event_logs_s3_uri=None)¶
Returns a RunArgs object.
This object contains the normalized inputs, outputs and arguments needed when using a SparkJarProcessor in a ProcessingStep.
- Parameters
submit_app (str) – Path (local or S3) to Jar file to submit to Spark as the primary application. This is translated to the code property on the returned RunArgs object.
submit_class (str) – Java class reference to submit to Spark as the primary application.
submit_jars (list[str]) – List of paths (local or S3) to provide for the spark-submit --jars option.
submit_files (list[str]) – List of paths (local or S3) to provide for the spark-submit --files option.
inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).
outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).
arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).
job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.
configuration (list[dict] or dict) – Configuration for Hadoop, Spark, or Hive. List or dictionary of EMR-style classifications. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html
spark_event_logs_s3_uri (str) – S3 path where Spark application events will be published.
- run(submit_app, submit_class, submit_jars=None, submit_files=None, inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None, configuration=None, spark_event_logs_s3_uri=None, kms_key=None)¶
Runs a processing job.
- Parameters
submit_app (str) – Path (local or S3) to Jar file to submit to Spark as the primary application.
submit_class (str or PipelineVariable) – Java class reference to submit to Spark as the primary application.
submit_jars (list[str] or list[PipelineVariable]) – List of paths (local or S3) to provide for the spark-submit --jars option.
submit_files (list[str] or list[PipelineVariable]) – List of paths (local or S3) to provide for the spark-submit --files option.
inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).
outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).
arguments (list[str] or list[PipelineVariable]) – A list of string arguments to be passed to a processing job (default: None).
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.
experiment_config (dict[str, str]) – Experiment management configuration. Optionally, the dict can contain three keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’. The behavior of setting these keys is as follows:
- If ExperimentName is supplied but TrialName is not, a Trial will be automatically created and the job’s Trial Component associated with the Trial.
- If TrialName is supplied and the Trial already exists, the job’s Trial Component will be associated with the Trial.
- If neither ExperimentName nor TrialName is supplied, the Trial Component will be unassociated.
- TrialComponentDisplayName is used for display in Studio.
configuration (list[dict] or dict) – Configuration for Hadoop, Spark, or Hive. List or dictionary of EMR-style classifications. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html
spark_event_logs_s3_uri (str or PipelineVariable) – S3 path where Spark application events will be published.
kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).
- class sagemaker.spark.processing.FileType(value)¶
Bases: Enum
Enum of file types.
- JAR = 1¶
- PYTHON = 2¶
- FILE = 3¶
- class sagemaker.spark.processing.SparkConfigUtils¶
Bases: object
Utility class for Spark configurations.
- static validate_configuration(configuration)¶
Validates the user-provided Hadoop/Spark/Hive configuration.
This ensures that the list or dictionary the user provides will serialize to JSON matching the schema of EMR’s application configuration.
- Parameters
configuration (Dict) – A dict that contains the configuration overrides to the default values. For more information, please visit: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html
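For reference, an EMR-style classification passed as the configuration argument to a Spark processor’s run() might look like the following sketch; the property values are illustrative.

# A minimal EMR-style classification list; values are illustrative.
configuration = [
    {
        "Classification": "spark-defaults",
        "Properties": {
            "spark.executor.memory": "4g",
            "spark.executor.cores": "2",
        },
    }
]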
SageMaker Clarify¶
This module configures the SageMaker Clarify bias and model explainability processor jobs.
- class sagemaker.clarify.DatasetType(value)¶
Bases: Enum
Enum to store the different dataset types supported in the analysis config file.
- TEXTCSV = 'text/csv'¶
- JSONLINES = 'application/jsonlines'¶
- JSON = 'application/json'¶
- PARQUET = 'application/x-parquet'¶
- IMAGE = 'application/x-image'¶
- class sagemaker.clarify.TimeSeriesJSONDatasetFormat(value)¶
Bases: Enum
Possible dataset formats for JSON time series data files.
Below is an example COLUMNS dataset for time series explainability:
{ "ids": [1, 2], "timestamps": [3, 4], "target_ts": [5, 6], "rts1": [0.25, 0.5], "rts2": [1.25, 1.5], "scv1": [10, 20], "scv2": [30, 40] }
For this example, JMESPaths are specified when creating TimeSeriesDataConfig as follows:
item_id="ids"
timestamp="timestamps"
target_time_series="target_ts"
related_time_series=["rts1", "rts2"]
static_covariates=["scv1", "scv2"]
Below is an example ITEM_RECORDS dataset for time series explainability:
[ { "id": 1, "scv1": 10, "scv2": "red", "timeseries": [ {"timestamp": 1, "target_ts": 5, "rts1": 0.25, "rts2": 10}, {"timestamp": 2, "target_ts": 6, "rts1": 0.35, "rts2": 20}, {"timestamp": 3, "target_ts": 4, "rts1": 0.45, "rts2": 30} ] }, { "id": 2, "scv1": 20, "scv2": "blue", "timeseries": [ {"timestamp": 1, "target_ts": 4, "rts1": 0.25, "rts2": 40}, {"timestamp": 2, "target_ts": 2, "rts1": 0.35, "rts2": 50} ] } ]
For this example, JMESPaths are specified when creating TimeSeriesDataConfig as follows:
item_id="[*].id"
timestamp="[*].timeseries[].timestamp"
target_time_series="[*].timeseries[].target_ts"
related_time_series=["[*].timeseries[].rts1", "[*].timeseries[].rts2"]
static_covariates=["[*].scv1", "[*].scv2"]
Below is an example TIMESTAMP_RECORDS dataset for time series explainability:
[ {"id": 1, "timestamp": 1, "target_ts": 5, "scv1": 10, "rts1": 0.25}, {"id": 1, "timestamp": 2, "target_ts": 6, "scv1": 10, "rts1": 0.5}, {"id": 1, "timestamp": 3, "target_ts": 3, "scv1": 10, "rts1": 0.75}, {"id": 2, "timestamp": 5, "target_ts": 10, "scv1": 20, "rts1": 1} ]
For this example, JMESPaths are specified when creating TimeSeriesDataConfig as follows:
item_id="[*].id"
timestamp="[*].timestamp"
target_time_series="[*].target_ts"
related_time_series=["[*].rts1"]
static_covariates=["[*].scv1"]
- COLUMNS = 'columns'¶
- ITEM_RECORDS = 'item_records'¶
- TIMESTAMP_RECORDS = 'timestamp_records'¶
- class sagemaker.clarify.SegmentationConfig(name_or_index, segments, config_name=None, display_aliases=None)¶
Bases: object
Config object that defines segment(s) of the dataset on which metrics are computed.
Initializes a segmentation configuration for a dataset column.
- Parameters
name_or_index (str or int) – The name or index of the column in the dataset on which the segment(s) is defined.
segments (List[List[str or int]]) – Each list of values represents one segment. If N lists are provided, N+1 segments are generated; the additional segment, denoted as the ‘__default__’ segment, covers the rest of the values that are not covered by these lists. For continuous columns, a segment must be given as strings in interval notation (e.g. [“[1, 4]”] or [“(2, 5]”]). A segment can also be composed of multiple intervals (e.g. [“[1, 4]”, “(5, 6]”] is one segment). For categorical columns, each segment should contain one or more of the categorical values for the categorical column, which may be strings or integers. For example, for a continuous column, segments could be [[“[1, 4]”, “(5, 6]”], [“(7, 9)”]]; this generates 3 segments, including the default segment. For a categorical column with values (“A”, “B”, “C”, “D”), segments could be [[“A”, “B”]]; this generates 2 segments, including the default segment.
config_name (str) – Name of the segment config (default: None).
display_aliases (List[str]) – Display aliases for the segments to be shown in the analysis output and report. This list should be the same length as the number of lists provided in segments, or have one additional display alias for the default segment.
- Raises
ValueError – when the name_or_index is None, segments is invalid, or a wrong number of display_aliases are specified.
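For example, the sketch below defines two explicit segments on a hypothetical “age” column, with display aliases covering the implicit default segment as well.

from sagemaker.clarify import SegmentationConfig

segmentation_config = SegmentationConfig(
    name_or_index="age",                           # placeholder column name
    segments=[["[18, 30)"], ["[30, 50)"]],         # two intervals, one segment each
    config_name="age-groups",                      # placeholder config name
    display_aliases=["young", "middle", "other"],  # extra alias for '__default__'
)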
- class sagemaker.clarify.TimeSeriesDataConfig(target_time_series, item_id, timestamp, related_time_series=None, static_covariates=None, dataset_format=None)¶
Bases: object
Config object for TimeSeries explainability data configuration fields.
Initializes TimeSeries explainability data configuration fields.
- Parameters
target_time_series (str or int) – A string or a zero-based integer index. Used to locate the target time series in the shared input dataset. If this parameter is a string, then all other parameters except dataset_format must be strings or lists of strings. If this parameter is an int, then all other parameters except dataset_format must be ints or lists of ints.
item_id (str or int) – A string or a zero-based integer index. Used to locate item id in the shared input dataset.
timestamp (str or int) – A string or a zero-based integer index. Used to locate timestamp in the shared input dataset.
related_time_series (list[str] or list[int]) – Optional. An array of strings or array of zero-based integer indices. Used to locate all related time series in the shared input dataset (if present).
static_covariates (list[str] or list[int]) – Optional. An array of strings or array of zero-based integer indices. Used to locate all static covariate fields in the shared input dataset (if present).
dataset_format (TimeSeriesJSONDatasetFormat) – Describes the format of the data files provided for analysis. Should only be provided when dataset is in JSON format.
- Raises
ValueError – If any required arguments are not provided or are the wrong type.
- get_time_series_data_config()¶
Returns part of an analysis config dictionary.
- class sagemaker.clarify.DataConfig(s3_data_input_path, s3_output_path, s3_analysis_config_output_path=None, label=None, headers=None, features=None, dataset_type='text/csv', s3_compression_type='None', joinsource=None, facet_dataset_uri=None, facet_headers=None, predicted_label_dataset_uri=None, predicted_label_headers=None, predicted_label=None, excluded_columns=None, segmentation_config=None, time_series_data_config=None)¶
Bases: object
Config object related to configurations of the input and output dataset.
Initializes a configuration of both input and output datasets.
- Parameters
s3_data_input_path (str) – Dataset S3 prefix/object URI.
s3_output_path (str) – S3 prefix to store the output.
s3_analysis_config_output_path (str) – S3 prefix to store the analysis config output. If this field is None, then the s3_output_path will be used to store the analysis_config output.
label (str) – Target attribute of the model required by bias metrics. Specified as column name or index for CSV dataset or a JMESPath expression for JSON/JSON Lines. Required parameter except for when the input dataset does not contain the label. Note: For JSON, the JMESPath query must result in a list of labels for each sample. For JSON Lines, it must result in the label for each line. Only a single label per sample is supported at this time.
headers ([str]) – List of column names in the dataset. If not provided, Clarify will generate headers to use internally. For time series explainability cases, please provide headers in the order of item_id, timestamp, target_time_series, all related_time_series columns, and then all static_covariate columns.
features (str) – JMESPath expression to locate the feature values if the dataset format is JSON/JSON Lines. Note: For JSON, the JMESPath query must result in a 2-D list (or a matrix) of feature values. For JSON Lines, it must result in a 1-D list of features for each line.
dataset_type (str) – Format of the dataset. Valid values are "text/csv" for CSV, "application/jsonlines" for JSON Lines, "application/json" for JSON, and "application/x-parquet" for Parquet.
s3_compression_type (str) – Valid options are "None" or "Gzip".
joinsource (str or int) – The name or index of the column in the dataset that acts as an identifier column (for instance, while performing a join). This column is only used as an identifier, and not used for any other computations. This is an optional field in all cases except:
- The dataset contains more than one file and save_local_shap_values is set to true in ShapConfig, and/or
- when the dataset and/or facet dataset and/or predicted label dataset are in separate files.
facet_dataset_uri (str) – Dataset S3 prefix/object URI that contains facet attribute(s), used for bias analysis on datasets without facets.
- If the dataset and the facet dataset are one single file each, then the original dataset and facet dataset must have the same number of rows.
- If the dataset and facet dataset are in multiple files (either one), then an index column, joinsource, is required to join the two datasets.
Clarify will not use the joinsource column and columns present in the facet dataset when calling model inference APIs. Note: this is only supported for "text/csv" dataset type.
facet_headers (list[str]) – List of column names in the facet dataset.
predicted_label_dataset_uri (str) – Dataset S3 prefix/object URI with predicted labels, which are used directly for analysis instead of making model inference API calls.
- If the dataset and the predicted label dataset are one single file each, then the original dataset and predicted label dataset must have the same number of rows.
- If the dataset and predicted label dataset are in multiple files (either one), then an index column, joinsource, is required to join the two datasets.
Note: this is only supported for "text/csv" dataset type.
predicted_label_headers (list[str]) – List of column names in the predicted label dataset.
predicted_label (str or int) – Predicted label of the target attribute of the model required for running bias analysis. Specified as column name or index for CSV data, or a JMESPath expression for JSON/JSON Lines. Clarify uses the predicted labels directly instead of making model inference API calls. Note: For JSON, the JMESPath query must result in a list of predicted labels for each sample. For JSON Lines, it must result in the predicted label for each line. Only a single predicted label per sample is supported at this time.
excluded_columns (list[int] or list[str]) – A list of names or indices of the columns which are to be excluded from making model inference API calls.
segmentation_config (list[SegmentationConfig]) – A list of SegmentationConfig objects.
time_series_data_config (TimeSeriesDataConfig) – Optional. A config object for TimeSeries data specific fields, required for TimeSeries explainability use cases.
- Raises
ValueError – when the dataset_type is invalid, predicted label dataset parameters are used with an unsupported dataset_type, or facet dataset parameters are used with an unsupported dataset_type.
- get_config()¶
Returns part of an analysis config dictionary.
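A minimal sketch of a CSV DataConfig for a bias analysis follows; the bucket, paths, and column names are placeholders.

from sagemaker.clarify import DataConfig

data_config = DataConfig(
    s3_data_input_path="s3://my-bucket/dataset/train.csv",  # placeholder input
    s3_output_path="s3://my-bucket/clarify-output/",        # placeholder output
    label="approved",                                       # placeholder label column
    headers=["age", "income", "gender", "approved"],        # placeholder headers
    dataset_type="text/csv",
)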
- class sagemaker.clarify.BiasConfig(label_values_or_threshold, facet_name, facet_values_or_threshold=None, group_name=None)¶
Bases: object
Config object with user-defined bias configurations of the input dataset.
Initializes a configuration of the sensitive groups in the dataset.
- Parameters
label_values_or_threshold ([int or float or str]) – List of label value(s) or threshold to indicate positive outcome used for bias metrics. The appropriate threshold depends on the problem type:
- Binary: The list has one positive value.
- Categorical: The list has one or more (but not all) categories which are the positive values.
- Regression: The list should include one threshold that defines the exclusive lower bound of positive values.
facet_name (str or int or list[str] or list[int]) – Sensitive attribute column name (or index in the input data) to use when computing bias metrics. It can also be a list of names (or indexes) for computing metrics for multiple sensitive attributes.
facet_values_or_threshold ([int or float or str] or [[int or float or str]]) – The parameter controls the values of the sensitive group. If facet_name is a scalar, then it can be None or a list. Depending on the data type of the facet column, the values mean:
Binary data: None means computing the bias metrics for each binary value. Or add one binary value to the list, to compute its bias metrics only.
Categorical data: None means computing the bias metrics for each category. Or add one or more (but not all) categories to the list, to compute their bias metrics vs. the other categories.
Continuous data: The list should include one and only one threshold which defines the exclusive lower bound of a sensitive group.
If facet_name is a list, then facet_values_or_threshold can be None if all facets are of binary or categorical type. Otherwise, facet_values_or_threshold should be a list, and each element is the value or threshold of the corresponding facet.
group_name (str) – Optional column name or index to indicate a group column to be used for the bias metrics Conditional Demographic Disparity in Labels (CDDL) and Conditional Demographic Disparity in Predicted Labels (CDDPL).
- Raises
ValueError – if the number of facet names doesn't equal the number of facet values or thresholds.
- get_config()¶
Returns a dictionary of bias detection configurations, part of the analysis config
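As a hedged illustration, a minimal BiasConfig for a binary label and a single categorical facet; the column name and values are placeholders:
from sagemaker.clarify import BiasConfig

# Hypothetical label/facet values, for illustration only.
bias_config = BiasConfig(
    label_values_or_threshold=[1],         # positive outcome value(s)
    facet_name="gender",                   # sensitive attribute column
    facet_values_or_threshold=["female"],  # sensitive group value(s)
)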
- class sagemaker.clarify.TimeSeriesModelConfig(forecast)¶
Bases:
object
Config object for TimeSeries predictor configuration fields.
Initializes model configuration fields for TimeSeries explainability use cases.
- Parameters
forecast (str) – JMESPath expression to extract the forecast result.
- Raises
ValueError – when forecast is not a string or not provided.
- get_time_series_model_config()¶
Returns TimeSeries model config dictionary
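A minimal sketch, assuming a hypothetical endpoint that returns a JSON body whose forecast can be reached with the JMESPath expression "predictions":
from sagemaker.clarify import TimeSeriesModelConfig

# "predictions" is a hypothetical JMESPath expression, for illustration.
ts_model_config = TimeSeriesModelConfig(forecast="predictions")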
- class sagemaker.clarify.ModelConfig(model_name=None, instance_count=None, instance_type=None, accept_type=None, content_type=None, content_template=None, record_template=None, custom_attributes=None, accelerator_type=None, endpoint_name_prefix=None, target_model=None, endpoint_name=None, time_series_model_config=None)¶
Bases:
object
Config object related to a model and its endpoint to be created.
Initializes a configuration of a model and the endpoint to be created for it.
- Parameters
model_name (str) – Model name (as created by CreateModel). Cannot be set when endpoint_name is set. Must be set with instance_count and instance_type.
instance_count (int) – The number of instances of a new endpoint for model inference. Cannot be set when endpoint_name is set. Must be set with model_name and instance_type.
instance_type (str) – The type of EC2 instance to use for model inference; for example, "ml.c5.xlarge". Cannot be set when endpoint_name is set. Must be set with instance_count and model_name.
accept_type (str) – The model output format to be used for getting inferences with the shadow endpoint. Valid values are "text/csv" for CSV, "application/jsonlines" for JSON Lines, and "application/json" for JSON. Default is the same as content_type.
content_type (str) – The model input format to be used for getting inferences with the shadow endpoint. Valid values are "text/csv" for CSV, "application/jsonlines" for JSON Lines, and "application/json" for JSON. Default is the same as dataset_format.
content_template (str) – A template string used to construct the model input from dataset instances. It is only used, and required, when model_content_type is "application/jsonlines" or "application/json". When model_content_type is "application/jsonlines", the template should have one and only one placeholder, $features, which will be replaced by a features list for each record to form the model inference input. When model_content_type is "application/json", the template can have either the placeholder $record, which will be replaced by a single record templated by record_template (only a single record at a time will be sent to the model), or the placeholder $records, which will be replaced by a list of records, each templated by record_template.
record_template (str) – A template string used to construct each record of the model input from dataset instances. It is only used, and required, when model_content_type is "application/json". The template string may contain one of the following:
The placeholder $features, which will be substituted by the array of feature values, and/or an optional placeholder $feature_names, which will be substituted by the array of feature names.
Exactly one placeholder $features_kvp, which will be substituted by the key-value pairs of feature name and feature value.
Or, for each feature, if "A" is the feature name in the headers configuration, the placeholder syntax "${A}" (the double quotes are part of the placeholder), which will be substituted by the feature value.
record_template will be used in conjunction with content_template to construct the model input.
Examples:
Given:
headers: ["A", "B"]
features: [[0, 1], [3, 4]]
Example model input 1:
{ "instances": [[0, 1], [3, 4]], "feature_names": ["A", "B"] }
content_template and record_template to construct the above:
content_template: "{\"instances\": $records}"
record_template: "$features"
Example model input 2:
[ { "A": 0, "B": 1 }, { "A": 3, "B": 4 } ]
content_template and record_template to construct the above:
content_template: "$records"
record_template: "$features_kvp"
Or, alternatively:
content_template: "$records"
record_template: "{\"A\": \"${A}\", \"B\": \"${B}\"}"
Example model input 3 (single record only):
{ "A": 0, "B": 1 }
content_template and record_template to construct the above:
content_template: "$record"
record_template: "$features_kvp"
custom_attributes (str) – Provides additional information about a request for an inference submitted to a model hosted at an Amazon SageMaker endpoint. The information is an opaque value that is forwarded verbatim. You could use this value, for example, to provide an ID that you can use to track a request or to provide other metadata that a service endpoint was programmed to process. The value must consist of no more than 1024 visible US-ASCII characters as specified in Section 3.3.6. Field Value Components of the Hypertext Transfer Protocol (HTTP/1.1).
accelerator_type (str) – SageMaker Elastic Inference accelerator type to deploy to the model endpoint instance for making inferences to the model.
endpoint_name_prefix (str) – The endpoint name prefix of a new endpoint. Must follow the pattern ^[a-zA-Z0-9](-*[a-zA-Z0-9]).
target_model (str) – Sets the target model name when using a multi-model endpoint. For more information about multi-model endpoints, see https://docs.aws.amazon.com/sagemaker/latest/dg/multi-model-endpoints.html
endpoint_name (str) – Sets the endpoint_name when reusing an existing endpoint. Cannot be set when model_name, instance_count, and instance_type are set.
time_series_model_config (TimeSeriesModelConfig) – Optional. A config object for TimeSeries predictor specific fields, required for TimeSeries explainability use cases.
- Raises
ValueError – when endpoint_name_prefix is invalid, accept_type is invalid, content_type is invalid, content_template has no "$features" placeholder, both endpoint_name and [model_name, instance_count, instance_type] are set, or both endpoint_name and endpoint_name_prefix are set.
- get_predictor_config()¶
Returns part of the predictor dictionary of the analysis config.
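A minimal sketch of a ModelConfig for a CSV model behind a shadow endpoint; the model name is a hypothetical placeholder:
from sagemaker.clarify import ModelConfig

# "my-model" is a hypothetical model name, for illustration only.
model_config = ModelConfig(
    model_name="my-model",
    instance_count=1,
    instance_type="ml.c5.xlarge",
    accept_type="text/csv",
    content_type="text/csv",
)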
- class sagemaker.clarify.ModelPredictedLabelConfig(label=None, probability=None, probability_threshold=None, label_headers=None)¶
Bases:
object
Config object to extract a predicted label from the model output.
Initializes a model output config to extract the predicted label or predicted score(s).
The following examples show different parameter configurations depending on the endpoint:
Regression task: The model returns the score, e.g. 1.2. We don't need to specify anything. For JSON output, e.g. {'score': 1.2}, we can set label='score'.
Binary classification:
The model returns a single probability score. We want to classify as "yes" predictions with a probability score over 0.2. We can set probability_threshold=0.2 and label_headers="yes".
The model returns {"probability": 0.3}, for which we would like to apply a threshold of 0.5 to obtain a predicted label in {0, 1}. In this case we can set label="probability".
The model returns a tuple of the predicted label and the probability. In this case we can set label=0.
Multiclass classification:
The model returns {'labels': ['cat', 'dog', 'fish'], 'probabilities': [0.35, 0.25, 0.4]}. In this case we would set probability='probabilities' and label='labels', and infer the predicted label to be 'fish'.
The model returns {'predicted_label': 'fish', 'probabilities': [0.35, 0.25, 0.4]}. In this case we would set label='predicted_label'.
The model returns [0.35, 0.25, 0.4]. In this case, we can set label_headers=['cat','dog','fish'] and infer the predicted label to be 'fish'.
- Parameters
label (str or int) – Index or JMESPath expression to locate the prediction in the model output. If this is a predicted label of the same type as the label in the dataset, no further arguments need to be specified.
probability (str or int) – Index or JMESPath expression to locate the predicted score(s) in the model output.
probability_threshold (float) – An optional value for binary prediction tasks in which the model returns a probability, to indicate the threshold to convert the prediction to a boolean value. Default is 0.5.
label_headers (list[str]) – List of headers, each for a predicted score in the model output. For bias analysis, it is used to extract the label value with the highest score as the predicted label. For explainability jobs, it is used to beautify the analysis report by replacing placeholders like 'label0'.
- Raises
TypeError – when the probability_threshold cannot be cast to a float.
- get_predictor_config()¶
Returns probability_threshold and the predictor config dictionary.
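For instance, a sketch for the multiclass case above, where the model returns labels and probabilities in a JSON body:
from sagemaker.clarify import ModelPredictedLabelConfig

# JMESPath expressions matching the multiclass example above.
predicted_label_config = ModelPredictedLabelConfig(
    label="labels",
    probability="probabilities",
)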
- class sagemaker.clarify.ExplainabilityConfig¶
Bases:
ABC
Abstract config class to configure an explainability method.
- abstract get_explainability_config()¶
Returns config.
- class sagemaker.clarify.PDPConfig(features=None, grid_resolution=15, top_k_features=10)¶
Bases:
ExplainabilityConfig
Config class for Partial Dependence Plots (PDP).
PDPs show the marginal effect (the dependence) a subset of features has on the predicted outcome of an ML model.
When PDP is requested (by passing in a PDPConfig to the explainability_config parameter of SageMakerClarifyProcessor), the Partial Dependence Plots are included in the output report and the corresponding values are included in the analysis output.
Initializes PDP config.
- Parameters
features (None or list) – List of feature names or indices for which partial dependence plots are computed and plotted. When ShapConfig is provided, this parameter is optional, as Clarify will compute the partial dependence plots for the top features based on SHAP attributions. When ShapConfig is not provided, features must be provided.
grid_resolution (int) – When using numerical features, this integer represents the number of buckets that the range of values must be divided into. This decides the granularity of the grid in which the PDP are plotted.
top_k_features (int) – Sets the number of top SHAP attributes used to compute partial dependence plots.
- get_explainability_config()¶
Returns PDP config dictionary.
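A minimal sketch; the feature names are hypothetical placeholders:
from sagemaker.clarify import PDPConfig

# Hypothetical feature names, for illustration only.
pdp_config = PDPConfig(
    features=["age", "income"],
    grid_resolution=20,
)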
- class sagemaker.clarify.TextConfig(granularity, language)¶
Bases:
object
Config object to handle text features for text explainability.
SHAP analysis breaks down longer text into chunks (e.g. tokens, sentences, or paragraphs) and replaces them with the strings specified in the baseline for that feature. The SHAP value of a chunk then captures how much replacing it affects the prediction.
Initializes a text configuration.
- Parameters
granularity (str) – Determines the granularity to which text features are broken down. Accepted values are "token", "sentence", or "paragraph". SHAP values are computed for these units.
language (str) – Specifies the language of the text features. Accepted values are one of the following: "chinese", "danish", "dutch", "english", "french", "german", "greek", "italian", "japanese", "lithuanian", "multi-language", "norwegian bokmål", "polish", "portuguese", "romanian", "russian", "spanish", "afrikaans", "albanian", "arabic", "armenian", "basque", "bengali", "bulgarian", "catalan", "croatian", "czech", "estonian", "finnish", "gujarati", "hebrew", "hindi", "hungarian", "icelandic", "indonesian", "irish", "kannada", "kyrgyz", "latvian", "ligurian", "luxembourgish", "macedonian", "malayalam", "marathi", "nepali", "persian", "sanskrit", "serbian", "setswana", "sinhala", "slovak", "slovenian", "swedish", "tagalog", "tamil", "tatar", "telugu", "thai", "turkish", "ukrainian", "urdu", "vietnamese", "yoruba". Use "multi-language" for a mix of multiple languages. The corresponding two-letter ISO codes are also accepted.
- Raises
ValueError – when granularity is not in the list of supported values or language is not in the list of supported values.
- get_text_config()¶
Returns a text config dictionary, part of the analysis config dictionary.
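A minimal sketch for sentence-level explanations of English text:
from sagemaker.clarify import TextConfig

text_config = TextConfig(
    granularity="sentence",
    language="english",
)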
- class sagemaker.clarify.ImageConfig(model_type, num_segments=None, feature_extraction_method=None, segment_compactness=None, max_objects=None, iou_threshold=None, context=None)¶
Bases:
object
Config object for handling images.
Initializes a config object for Computer Vision (CV) image explainability.
SHAP for CV explainability generates heat maps that visualize feature attributions for input images. These heat maps highlight the image's features according to how much they contribute to the CV model prediction.
"IMAGE_CLASSIFICATION" and "OBJECT_DETECTION" are the two supported CV use cases.
- Parameters
model_type (str) – Specifies the type of CV model and use case. Accepted options: "IMAGE_CLASSIFICATION" or "OBJECT_DETECTION".
num_segments (None or int) – Approximate number of segments to generate when running SKLearn's SLIC method for image segmentation, to generate features/superpixels. The default is None. When set to None, runs SLIC with 20 segments.
feature_extraction_method (None or str) – Method used for extracting features from the image (e.g. "segmentation"). Default is "segmentation".
segment_compactness (None or float) – Balances color proximity and space proximity. Higher values give more weight to space proximity, making superpixel shapes more square/cubic. We recommend exploring possible values on a log scale, e.g., 0.01, 0.1, 1, 10, 100, before refining around a chosen value. The default is None. When set to None, runs with the default value of 5.
max_objects (None or int) – Maximum number of objects displayed when running SHAP with an "OBJECT_DETECTION" model. The object detection algorithm may detect more than max_objects objects in a single image. In that case, the algorithm displays the top max_objects objects according to confidence score. Default value is None. In the "OBJECT_DETECTION" case, passing in None leads to a default value of 3.
iou_threshold (None or float) – Minimum intersection over union for the object bounding box to consider its confidence score for computing SHAP values, in the range [0.0, 1.0]. Used only for the "OBJECT_DETECTION" case, where passing in None sets the default value of 0.5.
context (None or float) – The portion of the image outside the bounding box used in SHAP analysis, in the range [0.0, 1.0]. If set to 1.0, the whole image is considered; if set to 0.0, only the image inside the bounding box is considered. Only used for the "OBJECT_DETECTION" case, where passing in None sets the default value of 1.0.
- get_image_config()¶
Returns the image config part of an analysis config dictionary.
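A minimal sketch for an image classification model:
from sagemaker.clarify import ImageConfig

image_config = ImageConfig(
    model_type="IMAGE_CLASSIFICATION",
    num_segments=20,  # approximate number of SLIC superpixels
)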
- class sagemaker.clarify.SHAPConfig(baseline=None, num_samples=None, agg_method=None, use_logit=False, save_local_shap_values=True, seed=None, num_clusters=None, text_config=None, image_config=None, features_to_explain=None)¶
Bases:
ExplainabilityConfig
Config class for SHAP.
The SHAP algorithm calculates feature attributions by computing the contribution of each feature to the prediction outcome, using the concept of Shapley values.
These attributions can be provided for specific predictions (locally) and at a global level for the model as a whole.
Initializes config for SHAP analysis.
- Parameters
baseline (None or str or list or dict) – Baseline dataset for the Kernel SHAP algorithm, accepted in the form of an S3 object URI, a list of rows (with at least one element), or None (for no input baseline). The baseline dataset must have the same format as the input dataset specified in DataConfig. Each row must have only the feature columns/values and omit the label column/values. If None, a baseline will be calculated automatically on the input dataset using K-means (for numerical data) or K-prototypes (if there is categorical data).
num_samples (None or int) – Number of samples to be used in the Kernel SHAP algorithm. This number determines the size of the generated synthetic dataset used to compute the SHAP values. If not provided, the Clarify job chooses a proper value according to the count of features.
agg_method (None or str) – Aggregation method for global SHAP values. Valid values are "mean_abs" (mean of absolute SHAP values for all instances), "median" (median of SHAP values for all instances), and "mean_sq" (mean of squared SHAP values for all instances). If None is provided, the Clarify job uses the method "mean_abs".
use_logit (bool) – Indicates whether to apply the logit function to model predictions. Default is False. If use_logit is true, then the SHAP values will have log-odds units.
save_local_shap_values (bool) – Indicates whether to save the local SHAP values in the output location. Default is True.
seed (int) – Seed value to get deterministic SHAP values. Default is None.
num_clusters (None or int) – If a baseline is not provided, Clarify automatically computes a baseline dataset via a clustering algorithm (K-means/K-prototypes), which takes num_clusters as a parameter. num_clusters will be the resulting size of the baseline dataset. If not provided, the Clarify job uses a default value.
text_config (TextConfig) – Config object for handling text features. Default is None.
image_config (ImageConfig) – Config for handling image features. Default is None.
features_to_explain (Optional[List[Union[str, int]]]) – A list of names or indices of dataset features to compute SHAP values for. If not provided, SHAP values are computed for all features by default. Currently only supported for tabular datasets.
- Raises
ValueError – when agg_method is invalid, baseline and num_clusters are provided together, or features_to_explain is specified when text_config or image_config is provided.
- get_explainability_config()¶
Returns a shap config dictionary.
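A minimal sketch, letting Clarify compute the baseline automatically:
from sagemaker.clarify import SHAPConfig

shap_config = SHAPConfig(
    num_samples=100,        # size of the synthetic dataset
    agg_method="mean_abs",  # aggregation for global SHAP values
    save_local_shap_values=True,
)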
- class sagemaker.clarify.AsymmetricShapleyValueConfig(direction='chronological', granularity='timewise', num_samples=None, baseline=None)¶
Bases:
ExplainabilityConfig
Config class for Asymmetric Shapley value algorithm for time series explainability.
Asymmetric Shapley Values are a variant of the Shapley value that drops the symmetry axiom [1]. We use these to determine how features contribute to the forecasting outcome. Asymmetric Shapley values can take into account the temporal dependencies of the time series that forecasting models take as input.
[1] Frye, Christopher, Colin Rowat, and Ilya Feige. “Asymmetric shapley values: incorporating causal knowledge into model-agnostic explainability.” NeurIPS (2020). https://doi.org/10.48550/arXiv.1910.06358
Initializes config for time series explainability with Asymmetric Shapley Values.
AsymmetricShapleyValueConfig is used only for TimeSeries explainability purposes.
- Parameters
direction (str) – Type of explanation to be used. Available explanation types are "chronological", "anti_chronological", and "bidirectional".
granularity (str) – Explanation granularity to be used. Available granularity options are "timewise" and "fine_grained".
num_samples (None or int) – Number of samples to be used in the Asymmetric Shapley Value forecasting algorithm. Only applicable when using "fine_grained" explanations.
baseline (str or dict) – Link to a baseline configuration, or a dictionary for it. The baseline config is used to replace out-of-coalition values for the corresponding datasets (also known as background data). For temporal data (target time series, related time series), the baseline value types are "zero", where all out-of-coalition values will be replaced with 0.0, or "mean", where all out-of-coalition values will be replaced with the average of a time series. For static data (static covariates), a baseline value for each covariate should be provided for each possible item_id. An example config follows, where item1 and item2 are item ids:
{ "target_time_series": "zero", "related_time_series": "zero", "static_covariates": { "item1": [1, 1], "item2": [0, 1] } }
- Raises
ValueError – when direction or granularity are not valid, num_samples is not provided for fine-grained explanations, num_samples is provided for non-fine-grained explanations, or when direction is not "chronological" while granularity is "fine_grained".
- get_explainability_config()¶
Returns an asymmetric shap config dictionary.
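A minimal sketch for timewise explanations with a zero baseline:
from sagemaker.clarify import AsymmetricShapleyValueConfig

asv_config = AsymmetricShapleyValueConfig(
    direction="chronological",
    granularity="timewise",
    baseline={"target_time_series": "zero", "related_time_series": "zero"},
)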
- class sagemaker.clarify.SageMakerClarifyProcessor(role=None, instance_count=None, instance_type=None, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, max_runtime_in_seconds=None, sagemaker_session=None, env=None, tags=None, network_config=None, job_name_prefix=None, version=None, skip_early_validation=False)¶
Bases:
Processor
Handles SageMaker Processing tasks to compute bias metrics and model explanations.
Initializes a SageMakerClarifyProcessor to compute bias metrics and model explanations.
Instance of Processor.
- Parameters
role (str) – An AWS IAM role name or ARN. Amazon SageMaker Processing uses this role to access AWS resources, such as data stored in Amazon S3.
instance_count (int) – The number of instances to run a processing job with.
instance_type (str) – The type of EC2 instance to use for the processing job; for example, "ml.c5.xlarge".
volume_size_in_gb (int) – Size in GB of the EBS volume to use for storing data during processing (default: 30).
volume_kms_key (str) – A KMS key for the processing volume (default: None).
output_kms_key (str) – The KMS key ID for processing job outputs (default: None).
max_runtime_in_seconds (int) – Timeout in seconds (default: None). After this amount of time, Amazon SageMaker terminates the job, regardless of its current status. If max_runtime_in_seconds is not specified, the default value is 86400 seconds (24 hours).
sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the Processor creates a Session using the default AWS configuration chain.
env (dict[str, str]) – Environment variables to be passed to the processing jobs (default: None).
tags (Optional[Tags]) – Tags to be passed to the processing job (default: None). For more, see https://docs.aws.amazon.com/sagemaker/latest/dg/API_Tag.html.
network_config (NetworkConfig) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets.
job_name_prefix (str) – Processing job name prefix.
version (str) – Clarify version to use.
skip_early_validation (bool) – Whether to skip schema validation of the generated analysis_schema.json.
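A minimal sketch of constructing the processor; the IAM role ARN is a hypothetical placeholder:
from sagemaker.clarify import SageMakerClarifyProcessor

# Hypothetical IAM role ARN, for illustration only.
clarify_processor = SageMakerClarifyProcessor(
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)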
- run(**_)¶
Overrides the base class method but defers to the specific run_* methods.
- run_pre_training_bias(data_config, data_bias_config, methods='all', wait=True, logs=True, job_name=None, kms_key=None, experiment_config=None)¶
Runs a ProcessingJob to compute pre-training bias methods.
Computes the requested methods on the input data. The methods compare metrics (e.g. fraction of examples) for the sensitive group(s) vs. the other examples.
- Parameters
data_config (DataConfig) – Config of the input/output data.
data_bias_config (BiasConfig) – Config of sensitive groups.
methods (str or list[str]) – Selects a subset of potential metrics: ["CI", "DPL", "KL", "JS", "LP", "TVD", "KS", "CDDL"]. Defaults to "all" to run all metrics if left unspecified.
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
job_name (str) – Processing job name. When job_name is not specified, if job_name_prefix in SageMakerClarifyProcessor is specified, the job name will be composed of job_name_prefix and the current timestamp; otherwise "Clarify-Pretraining-Bias" is used as the prefix.
kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).
experiment_config (dict[str, str]) – Experiment management configuration. Optionally, the dict can contain three keys: 'ExperimentName', 'TrialName', and 'TrialComponentDisplayName'.
The behavior of setting these keys is as follows:
If 'ExperimentName' is supplied but 'TrialName' is not, a Trial will be automatically created and the job's Trial Component associated with the Trial.
If 'TrialName' is supplied and the Trial already exists, the job's Trial Component will be associated with the Trial.
If both 'ExperimentName' and 'TrialName' are not supplied, the Trial Component will be unassociated.
'TrialComponentDisplayName' is used for display in Amazon SageMaker Studio.
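Putting the pieces together, a hedged sketch of a pre-training bias run, reusing the data_config and bias_config sketches above:
# Assumes clarify_processor, data_config, and bias_config from the sketches above.
clarify_processor.run_pre_training_bias(
    data_config=data_config,
    data_bias_config=bias_config,
    methods=["CI", "DPL"],  # or "all"
    wait=True,
)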
- run_post_training_bias(data_config, data_bias_config, model_config=None, model_predicted_label_config=None, methods='all', wait=True, logs=True, job_name=None, kms_key=None, experiment_config=None)¶
Runs a ProcessingJob to compute post-training bias methods.
Spins up a model endpoint and runs inference over the input dataset in the s3_data_input_path (from the DataConfig) to obtain predicted labels. Using model predictions, computes the requested post-training bias methods that compare metrics (e.g. accuracy, precision, recall) for the sensitive group(s) versus the other examples.
- Parameters
data_config (DataConfig) – Config of the input/output data.
data_bias_config (BiasConfig) – Config of sensitive groups.
model_config (ModelConfig) – Config of the model and its endpoint to be created. This is required unless predicted_label_dataset_uri or predicted_label is provided in data_config.
model_predicted_label_config (ModelPredictedLabelConfig) – Config of how to extract the predicted label from the model output.
methods (str or list[str]) – Selector of a subset of potential metrics: ["DPPL", "DI", "DCA", "DCR", "RD", "DAR", "DRR", "AD", "CDDPL", "TE", "FT"]. Defaults to "all" to run all metrics if left unspecified.
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
job_name (str) – Processing job name. When job_name is not specified, if job_name_prefix in SageMakerClarifyProcessor is specified, the job name will be composed of job_name_prefix and the current timestamp; otherwise "Clarify-Posttraining-Bias" is used as the prefix.
kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).
experiment_config (dict[str, str]) – Experiment management configuration. Optionally, the dict can contain three keys: 'ExperimentName', 'TrialName', and 'TrialComponentDisplayName'.
The behavior of setting these keys is as follows:
If 'ExperimentName' is supplied but 'TrialName' is not, a Trial will be automatically created and the job's Trial Component associated with the Trial.
If 'TrialName' is supplied and the Trial already exists, the job's Trial Component will be associated with the Trial.
If both 'ExperimentName' and 'TrialName' are not supplied, the Trial Component will be unassociated.
'TrialComponentDisplayName' is used for display in Amazon SageMaker Studio.
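A hedged sketch, reusing the configs from the earlier sketches:
# Assumes clarify_processor, data_config, bias_config, model_config,
# and predicted_label_config from the sketches above.
clarify_processor.run_post_training_bias(
    data_config=data_config,
    data_bias_config=bias_config,
    model_config=model_config,
    model_predicted_label_config=predicted_label_config,
    methods="all",
)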
- run_bias(data_config, bias_config, model_config=None, model_predicted_label_config=None, pre_training_methods='all', post_training_methods='all', wait=True, logs=True, job_name=None, kms_key=None, experiment_config=None)¶
Runs a ProcessingJob to compute the requested bias methods.
Computes metrics for both the pre-training and the post-training methods. To calculate post-training methods, it spins up a model endpoint and runs inference over the input examples in s3_data_input_path (from the DataConfig) to obtain predicted labels.
- Parameters
data_config (DataConfig) – Config of the input/output data.
bias_config (BiasConfig) – Config of sensitive groups.
model_config (ModelConfig) – Config of the model and its endpoint to be created. This is required unless predicted_label_dataset_uri or predicted_label is provided in data_config.
model_predicted_label_config (ModelPredictedLabelConfig) – Config of how to extract the predicted label from the model output.
pre_training_methods (str or list[str]) – Selector of a subset of potential metrics: ["CI", "DPL", "KL", "JS", "LP", "TVD", "KS", "CDDL"]. Defaults to "all" to run all metrics if left unspecified.
post_training_methods (str or list[str]) – Selector of a subset of potential metrics: ["DPPL", "DI", "DCA", "DCR", "RD", "DAR", "DRR", "AD", "CDDPL", "TE", "FT"]. Defaults to "all" to run all metrics if left unspecified.
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
job_name (str) – Processing job name. When job_name is not specified, if job_name_prefix in SageMakerClarifyProcessor is specified, the job name will be composed of job_name_prefix and the current timestamp; otherwise "Clarify-Bias" is used as the prefix.
kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).
experiment_config (dict[str, str]) – Experiment management configuration. Optionally, the dict can contain three keys: 'ExperimentName', 'TrialName', and 'TrialComponentDisplayName'.
The behavior of setting these keys is as follows:
If 'ExperimentName' is supplied but 'TrialName' is not, a Trial will be automatically created and the job's Trial Component associated with the Trial.
If 'TrialName' is supplied and the Trial already exists, the job's Trial Component will be associated with the Trial.
If both 'ExperimentName' and 'TrialName' are not supplied, the Trial Component will be unassociated.
'TrialComponentDisplayName' is used for display in Amazon SageMaker Studio.
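A hedged sketch combining pre- and post-training bias in one run, reusing the earlier sketches:
# Assumes clarify_processor, data_config, bias_config, and model_config
# from the sketches above.
clarify_processor.run_bias(
    data_config=data_config,
    bias_config=bias_config,
    model_config=model_config,
    pre_training_methods="all",
    post_training_methods="all",
)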
- run_explainability(data_config, model_config, explainability_config, model_scores=None, wait=True, logs=True, job_name=None, kms_key=None, experiment_config=None)¶
Runs a ProcessingJob computing feature attributions.
Spins up a model endpoint.
Currently, only SHAP and Partial Dependence Plots (PDP) are supported as explainability methods. You can request both methods or one at a time with the explainability_config parameter.
When SHAP is requested in the explainability_config, the SHAP algorithm calculates the feature importance for each input example in the s3_data_input_path of the DataConfig, by creating num_samples copies of the example with a subset of features replaced with values from the baseline. It then runs model inference to see how the model's prediction changes with the replaced features. If the model output returns multiple scores, importance is computed for each score. Across examples, feature importance is aggregated using agg_method.
When PDP is requested in the explainability_config, the PDP algorithm calculates the dependence of the target response on the input features and marginalizes over the values of all other input features. The Partial Dependence Plots are included in the output report and the corresponding values are included in the analysis output.
- Parameters
data_config (DataConfig) – Config of the input/output data.
model_config (ModelConfig) – Config of the model and its endpoint to be created.
explainability_config (ExplainabilityConfig or list) – Config of the specific explainability method, or a list of ExplainabilityConfig objects. Currently, SHAP and PDP are the two supported methods. You can request multiple methods at once by passing in a list of sagemaker.clarify.ExplainabilityConfig objects.
model_scores (int or str or ModelPredictedLabelConfig) – Index or JMESPath expression to locate the predicted scores in the model output. This is not required if the model output is a single score. Alternatively, it can be an instance of ModelPredictedLabelConfig to provide more parameters like label_headers.
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
job_name (str) – Processing job name. When job_name is not specified, if job_name_prefix in SageMakerClarifyProcessor is specified, the job name will be composed of job_name_prefix and the current timestamp; otherwise "Clarify-Explainability" is used as the prefix.
kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).
experiment_config (dict[str, str]) – Experiment management configuration. Optionally, the dict can contain three keys: 'ExperimentName', 'TrialName', and 'TrialComponentDisplayName'.
The behavior of setting these keys is as follows:
If 'ExperimentName' is supplied but 'TrialName' is not, a Trial will be automatically created and the job's Trial Component associated with the Trial.
If 'TrialName' is supplied and the Trial already exists, the job's Trial Component will be associated with the Trial.
If both 'ExperimentName' and 'TrialName' are not supplied, the Trial Component will be unassociated.
'TrialComponentDisplayName' is used for display in Amazon SageMaker Studio.
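A hedged sketch of a SHAP explainability run, reusing the earlier sketches:
# Assumes clarify_processor, data_config, model_config, and shap_config
# from the sketches above.
clarify_processor.run_explainability(
    data_config=data_config,
    model_config=model_config,
    explainability_config=shap_config,  # or [shap_config, pdp_config]
)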
- run_bias_and_explainability(data_config, model_config, explainability_config, bias_config, pre_training_methods='all', post_training_methods='all', model_predicted_label_config=None, wait=True, logs=True, job_name=None, kms_key=None, experiment_config=None)¶
Runs a ProcessingJob computing feature attributions.
For bias: Computes metrics for both the pre-training and the post-training methods. To calculate post-training methods, it spins up a model endpoint and runs inference over the input examples in s3_data_input_path (from the DataConfig) to obtain predicted labels.
For explainability: Spins up a model endpoint.
Currently, only SHAP and Partial Dependence Plots (PDP) are supported as explainability methods. You can request both methods or one at a time with the explainability_config parameter.
When SHAP is requested in the explainability_config, the SHAP algorithm calculates the feature importance for each input example in the s3_data_input_path of the DataConfig, by creating num_samples copies of the example with a subset of features replaced with values from the baseline. It then runs model inference to see how the model's prediction changes with the replaced features. If the model output returns multiple scores, importance is computed for each score. Across examples, feature importance is aggregated using agg_method.
When PDP is requested in the explainability_config, the PDP algorithm calculates the dependence of the target response on the input features and marginalizes over the values of all other input features. The Partial Dependence Plots are included in the output report and the corresponding values are included in the analysis output.
- Parameters
data_config (DataConfig) – Config of the input/output data.
model_config (ModelConfig) – Config of the model and its endpoint to be created.
explainability_config (ExplainabilityConfig or list) – Config of the specific explainability method, or a list of ExplainabilityConfig objects. Currently, SHAP and PDP are the two supported methods. You can request multiple methods at once by passing in a list of sagemaker.clarify.ExplainabilityConfig objects.
bias_config (BiasConfig) – Config of sensitive groups.
pre_training_methods (str or list[str]) – Selector of a subset of potential metrics: ["CI", "DPL", "KL", "JS", "LP", "TVD", "KS", "CDDL"]. Defaults to "all" to run all metrics if left unspecified.
post_training_methods (str or list[str]) – Selector of a subset of potential metrics: ["DPPL", "DI", "DCA", "DCR", "RD", "DAR", "DRR", "AD", "CDDPL", "TE", "FT"]. Defaults to "all" to run all metrics if left unspecified.
model_predicted_label_config (int or str or ModelPredictedLabelConfig) – Index or JMESPath expression to locate the predicted scores in the model output. This is not required if the model output is a single score. Alternatively, it can be an instance of ModelPredictedLabelConfig to provide more parameters like label_headers.
wait (bool) – Whether the call should wait until the job completes (default: True).
logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).
job_name (str) – Processing job name. When job_name is not specified, if job_name_prefix in SageMakerClarifyProcessor is specified, the job name will be composed of job_name_prefix and the current timestamp; otherwise "Clarify-Explainability" is used as the prefix.
kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).
experiment_config (dict[str, str]) – Experiment management configuration. Optionally, the dict can contain three keys: 'ExperimentName', 'TrialName', and 'TrialComponentDisplayName'.
The behavior of setting these keys is as follows:
If 'ExperimentName' is supplied but 'TrialName' is not, a Trial will be automatically created and the job's Trial Component associated with the Trial.
If 'TrialName' is supplied and the Trial already exists, the job's Trial Component will be associated with the Trial.
If both 'ExperimentName' and 'TrialName' are not supplied, the Trial Component will be unassociated.
'TrialComponentDisplayName' is used for display in Amazon SageMaker Studio.
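A hedged sketch of a combined bias and explainability run, reusing the earlier sketches:
# Assumes clarify_processor, data_config, model_config, shap_config,
# and bias_config from the sketches above.
clarify_processor.run_bias_and_explainability(
    data_config=data_config,
    model_config=model_config,
    explainability_config=shap_config,
    bias_config=bias_config,
)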