Processing

This module contains code related to the Processor class, which is used for Amazon SageMaker Processing Jobs. These jobs let users perform data pre-processing, post-processing, feature engineering, data validation, model evaluation, and interpretation on Amazon SageMaker.

class sagemaker.processing.Processor(role=None, image_uri=None, instance_count=None, instance_type=None, entrypoint=None, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)

Bases: object

Handles Amazon SageMaker Processing tasks.

Initializes a Processor instance.

Parameters
  • role (str or PipelineVariable) – An AWS IAM role name or ARN. Amazon SageMaker Processing uses this role to access AWS resources, such as data stored in Amazon S3.

  • image_uri (str or PipelineVariable) – The URI of the Docker image to use for the processing jobs.

  • instance_count (int or PipelineVariable) – The number of instances to run a processing job with.

  • instance_type (str or PipelineVariable) – The type of EC2 instance to use for processing, for example, ‘ml.c4.xlarge’.

  • entrypoint (list[str] or list[PipelineVariable]) – The entrypoint for the processing job (default: None). This is in the form of a list of strings that make a command.

  • volume_size_in_gb (int or PipelineVariable) – Size in GB of the EBS volume to use for storing data during processing (default: 30).

  • volume_kms_key (str or PipelineVariable) – A KMS key for the processing volume (default: None).

  • output_kms_key (str or PipelineVariable) – The KMS key ID for processing job outputs (default: None).

  • max_runtime_in_seconds (int or PipelineVariable) – Timeout in seconds (default: None). After this amount of time, Amazon SageMaker terminates the job, regardless of its current status. If max_runtime_in_seconds is not specified, the default value is 24 hours.

  • base_job_name (str) – Prefix for processing job name. If not specified, the processor generates a default job name, based on the processing image name and current timestamp.

  • sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.

  • env (dict[str, str] or dict[str, PipelineVariable]) – Environment variables to be passed to the processing jobs (default: None).

  • tags (Optional[Tags]) – Tags to be passed to the processing job (default: None). For more, see https://docs.aws.amazon.com/sagemaker/latest/dg/API_Tag.html.

  • network_config (NetworkConfig) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets.
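A minimal sketch of how the constructor parameters above fit together. The role ARN, image URI, entrypoint path, and the helper names `processor_kwargs` and `build_processor` are placeholders invented for illustration. The SDK import is deferred so that the argument-building part can be inspected without the `sagemaker` package or AWS credentials.

```python
def processor_kwargs(role, image_uri):
    """Collect keyword arguments for sagemaker.processing.Processor."""
    return {
        "role": role,
        "image_uri": image_uri,
        "instance_count": 1,
        "instance_type": "ml.c4.xlarge",
        # The entrypoint is a list of strings that together form one command.
        "entrypoint": ["python3", "/opt/ml/processing/input/code/process.py"],  # hypothetical path
        "volume_size_in_gb": 30,
        "max_runtime_in_seconds": 3600,
        "base_job_name": "example-processing",
        "env": {"LOG_LEVEL": "INFO"},
    }


def build_processor(kwargs):
    # Deferred import: requires the sagemaker package and AWS credentials.
    from sagemaker.processing import Processor

    return Processor(**kwargs)


kwargs = processor_kwargs(
    role="arn:aws:iam::111122223333:role/ExampleProcessingRole",  # placeholder
    image_uri="111122223333.dkr.ecr.us-east-1.amazonaws.com/example:latest",  # placeholder
)
```

Calling `build_processor(kwargs)` would create the processor; nothing is submitted to SageMaker until `run()` is called.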

JOB_CLASS_NAME = 'processing-job'

run(inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None, kms_key=None)

Runs a processing job.

Parameters
  • inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).

  • outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).

  • arguments (list[str] or list[PipelineVariable]) – A list of string arguments to be passed to a processing job (default: None).

  • wait (bool) – Whether the call should wait until the job completes (default: True).

  • logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).

  • job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.

  • experiment_config (dict[str, str]) – Experiment management configuration. Optionally, the dict can contain three keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’. The behavior of setting these keys is as follows:

    • If ExperimentName is supplied but TrialName is not, a Trial will be automatically created and the job’s Trial Component associated with the Trial.
    • If TrialName is supplied and the Trial already exists, the job’s Trial Component will be associated with the Trial.
    • If neither ExperimentName nor TrialName is supplied, the trial component will be unassociated.
    • TrialComponentDisplayName is used for display in Studio.
    • Both ExperimentName and TrialName will be ignored if the Processor instance is built with PipelineSession. However, the value of TrialComponentDisplayName is honored for display in Studio.

  • kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).

Returns

None or pipeline step arguments in case the Processor instance is built with PipelineSession

Raises

ValueError – if logs is True but wait is False.
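The interaction between wait and logs can be sketched as follows. This is an illustration of the documented rule, not the SDK's own validation code; `check_wait_and_logs` and `run_in_background` are hypothetical helper names.

```python
def check_wait_and_logs(wait=True, logs=True):
    """Mirror the documented rule: showing logs requires waiting for the job."""
    if logs and not wait:
        raise ValueError("logs can only be shown when wait is True")
    return wait, logs


def run_in_background(processor, inputs=None, outputs=None):
    """Start a job without blocking; both flags must be turned off together."""
    wait, logs = check_wait_and_logs(wait=False, logs=False)
    return processor.run(inputs=inputs, outputs=outputs, wait=wait, logs=logs)
```

Because logs defaults to True, a call such as `run(wait=False)` on its own would hit the ValueError described above, so a non-blocking call passes `logs=False` explicitly.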

class sagemaker.processing.ScriptProcessor(role=None, image_uri=None, command=None, instance_count=None, instance_type=None, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)

Bases: Processor

Handles Amazon SageMaker processing tasks for jobs using a machine learning framework.

Initializes a ScriptProcessor instance.

The ScriptProcessor handles Amazon SageMaker Processing tasks for jobs using a machine learning framework, which allows for providing a script to be run as part of the Processing Job.

Parameters
  • role (str or PipelineVariable) – An AWS IAM role name or ARN. Amazon SageMaker Processing uses this role to access AWS resources, such as data stored in Amazon S3.

  • image_uri (str or PipelineVariable) – The URI of the Docker image to use for the processing jobs.

  • command (list[str]) – The command to run, along with any command-line flags. Example: [“python3”, “-v”].

  • instance_count (int or PipelineVariable) – The number of instances to run a processing job with.

  • instance_type (str or PipelineVariable) – The type of EC2 instance to use for processing, for example, ‘ml.c4.xlarge’.

  • volume_size_in_gb (int or PipelineVariable) – Size in GB of the EBS volume to use for storing data during processing (default: 30).

  • volume_kms_key (str or PipelineVariable) – A KMS key for the processing volume (default: None).

  • output_kms_key (str or PipelineVariable) – The KMS key ID for processing job outputs (default: None).

  • max_runtime_in_seconds (int or PipelineVariable) – Timeout in seconds (default: None). After this amount of time, Amazon SageMaker terminates the job, regardless of its current status. If max_runtime_in_seconds is not specified, the default value is 24 hours.

  • base_job_name (str) – Prefix for processing job name. If not specified, the processor generates a default job name, based on the processing image name and current timestamp.

  • sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.

  • env (dict[str, str] or dict[str, PipelineVariable]) – Environment variables to be passed to the processing jobs (default: None).

  • tags (Optional[Tags]) – Tags to be passed to the processing job (default: None). For more, see https://docs.aws.amazon.com/sagemaker/latest/dg/API_Tag.html.

  • network_config (NetworkConfig) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets.

get_run_args(code, inputs=None, outputs=None, arguments=None)

Returns a RunArgs object.

For processors (PySparkProcessor, SparkJarProcessor) that have special run() arguments, this object contains the normalized arguments for passing to ProcessingStep.

Parameters
  • code (str) – This can be an S3 URI or a local path to a file with the framework script to run.

  • inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).

  • outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).

  • arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).

run(code, inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None, kms_key=None)

Runs a processing job.

Parameters
  • code (str) – This can be an S3 URI or a local path to a file with the framework script to run.

  • inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).

  • outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).

  • arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).

  • wait (bool) – Whether the call should wait until the job completes (default: True).

  • logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).

  • job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.

  • experiment_config (dict[str, str]) – Experiment management configuration. Optionally, the dict can contain three keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’. The behavior of setting these keys is as follows:

    • If ExperimentName is supplied but TrialName is not, a Trial will be automatically created and the job’s Trial Component associated with the Trial.
    • If TrialName is supplied and the Trial already exists, the job’s Trial Component will be associated with the Trial.
    • If neither ExperimentName nor TrialName is supplied, the trial component will be unassociated.
    • TrialComponentDisplayName is used for display in Studio.
    • Both ExperimentName and TrialName will be ignored if the Processor instance is built with PipelineSession. However, the value of TrialComponentDisplayName is honored for display in Studio.

  • kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).

Returns

None or pipeline step arguments in case the Processor instance is built with PipelineSession
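A sketch of running a script with ScriptProcessor, using only the parameters documented above. The script name, bucket, and container paths are assumptions chosen for illustration (the `/opt/ml/processing/...` locations are a common convention, not something this reference specifies), and `script_job_spec` and `run_script_job` are hypothetical helpers.

```python
def script_job_spec(bucket):
    """Describe one input, one output, and the arguments for a script job."""
    return {
        "code": "preprocess.py",  # hypothetical local script
        "input": {
            "source": f"s3://{bucket}/raw/",
            "destination": "/opt/ml/processing/input",  # assumed container path
        },
        "output": {
            "source": "/opt/ml/processing/output",  # assumed container path
            "output_name": "clean",
        },
        "arguments": ["--train-split", "0.8"],  # arguments are passed as strings
    }


def run_script_job(role, image_uri, spec):
    # Deferred import: requires the sagemaker package and AWS credentials.
    from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor

    processor = ScriptProcessor(
        role=role,
        image_uri=image_uri,
        command=["python3"],
        instance_count=1,
        instance_type="ml.c4.xlarge",
    )
    processor.run(
        code=spec["code"],
        inputs=[ProcessingInput(**spec["input"])],
        outputs=[ProcessingOutput(**spec["output"])],
        arguments=spec["arguments"],
        wait=True,
        logs=True,
    )
    return processor


spec = script_job_spec("example-bucket")
```

Keeping the job description as plain data makes it easy to review before anything is submitted; only `run_script_job` touches AWS.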

class sagemaker.processing.ProcessingJob(sagemaker_session, job_name, inputs, outputs, output_kms_key=None)

Bases: _Job

Provides functionality to start, describe, and stop processing jobs.

Initializes a Processing job.

Parameters
  • sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.

  • job_name (str) – Name of the Processing job.

  • inputs (list[ProcessingInput]) – A list of ProcessingInput objects.

  • outputs (list[ProcessingOutput]) – A list of ProcessingOutput objects.

  • output_kms_key (str) – The output KMS key associated with the job (default: None).

classmethod start_new(processor, inputs, outputs, experiment_config)

Starts a new processing job using the provided inputs and outputs.

Parameters
  • processor (Processor) – The Processor instance that started the job.

  • inputs (list[ProcessingInput]) – A list of ProcessingInput objects.

  • outputs (list[ProcessingOutput]) – A list of ProcessingOutput objects.

  • experiment_config (dict[str, str]) – Experiment management configuration. Optionally, the dict can contain three keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’. The behavior of setting these keys is as follows:

    • If ExperimentName is supplied but TrialName is not, a Trial will be automatically created and the job’s Trial Component associated with the Trial.
    • If TrialName is supplied and the Trial already exists, the job’s Trial Component will be associated with the Trial.
    • If neither ExperimentName nor TrialName is supplied, the trial component will be unassociated.
    • TrialComponentDisplayName is used for display in Studio.

Returns

The instance of ProcessingJob created using the Processor.

Return type

ProcessingJob
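The experiment_config association rules can be summarized in a small decision function. This is a sketch of the documented behavior only; `trial_association` is a hypothetical name, and the returned strings are labels chosen for this example rather than anything the SDK produces.

```python
def trial_association(experiment_config=None):
    """Describe how a job's Trial Component is associated, per the rules above."""
    config = experiment_config or {}
    experiment = config.get("ExperimentName")
    trial = config.get("TrialName")

    if trial:
        # Applies when the named Trial already exists.
        return f"associate with existing trial {trial!r}"
    if experiment:
        # No TrialName: a Trial is created automatically in the experiment.
        return f"create a new trial in experiment {experiment!r}"
    return "unassociated"


example_config = {
    "ExperimentName": "churn-experiment",  # placeholder names
    "TrialComponentDisplayName": "Preprocessing",
}
```

TrialComponentDisplayName does not affect the association; it only controls how the component is labeled in Studio.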

classmethod from_processing_name(sagemaker_session, processing_job_name)

Initializes a ProcessingJob from a processing job name.

Parameters
  • processing_job_name (str) – Name of the processing job.

  • sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.

Returns

The instance of ProcessingJob created from the job name.

Return type

ProcessingJob

classmethod from_processing_arn(sagemaker_session, processing_job_arn)

Initializes a ProcessingJob from a Processing ARN.

Parameters
  • processing_job_arn (str) – ARN of the processing job.

  • sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.

Returns

The instance of ProcessingJob created from the processing job’s ARN.

Return type

ProcessingJob
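The two lookup methods take different identifiers for the same job. The sketch below shows how they relate, assuming ARNs follow the usual `arn:aws:sagemaker:<region>:<account>:processing-job/<name>` layout. `job_name_from_arn` and `attach_to_job` are hypothetical helpers, not part of the SDK.

```python
def job_name_from_arn(processing_job_arn):
    """Return the job name that ends a processing job ARN."""
    _, separator, name = processing_job_arn.rpartition("processing-job/")
    if not separator or not name:
        raise ValueError(f"not a processing job ARN: {processing_job_arn!r}")
    return name


def attach_to_job(sagemaker_session, identifier):
    """Attach to an existing job given either its ARN or its name."""
    # Deferred import: requires the sagemaker package and AWS credentials.
    from sagemaker.processing import ProcessingJob

    if identifier.startswith("arn:"):
        return ProcessingJob.from_processing_arn(
            sagemaker_session=sagemaker_session, processing_job_arn=identifier
        )
    return ProcessingJob.from_processing_name(
        sagemaker_session=sagemaker_session, processing_job_name=identifier
    )


example_arn = "arn:aws:sagemaker:us-east-1:111122223333:processing-job/example-job"  # placeholder
```

The returned ProcessingJob can then be used with wait(), describe(), and stop().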

wait(logs=True)

Waits for the processing job to complete.

Parameters

logs (bool) – Whether to show the logs produced by the job (default: True).

describe()

Prints out a response from the DescribeProcessingJob API call.

stop()

Stops the processing job.

static prepare_app_specification(container_arguments, container_entrypoint, image_uri)

Prepares a dict that represents a ProcessingJob’s AppSpecification.

Parameters
  • container_arguments (list[str]) – The arguments for a container used to run a processing job.

  • container_entrypoint (list[str]) – The entrypoint for a container used to run a processing job.

  • image_uri (str) – The container image to be run by the processing job.

Returns

Represents AppSpecification which configures the processing job to run a specified Docker container image.

Return type

dict
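For orientation, a dict of this kind might be assembled as shown below. The key names follow the AppSpecification structure of the CreateProcessingJob API; omitting empty fields is an assumption of this sketch, and `app_specification` is a hypothetical stand-in rather than the SDK's implementation.

```python
def app_specification(image_uri, container_entrypoint=None, container_arguments=None):
    """Build an AppSpecification-shaped dict for a processing job."""
    spec = {"ImageUri": image_uri}
    if container_entrypoint:
        spec["ContainerEntrypoint"] = list(container_entrypoint)
    if container_arguments:
        spec["ContainerArguments"] = list(container_arguments)
    return spec


example_spec = app_specification(
    image_uri="111122223333.dkr.ecr.us-east-1.amazonaws.com/example:latest",  # placeholder
    container_entrypoint=["python3", "process.py"],
    container_arguments=["--mode", "full"],
)
```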

static prepare_output_config(kms_key_id, outputs)

Prepares a dict that represents a ProcessingOutputConfig.

Parameters
  • kms_key_id (str) – The AWS Key Management Service (AWS KMS) key that Amazon SageMaker uses to encrypt the processing job output. KmsKeyId can be an ID of a KMS key, an ARN of a KMS key, or an alias of a KMS key. The KmsKeyId is applied to all outputs.

  • outputs (list[dict]) – Output configuration information for a processing job.

Returns

Represents output configuration for the processing job.

Return type

dict
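A sketch of the resulting shape, using the ProcessingOutputConfig field names from the CreateProcessingJob API. The bucket, paths, and the helper name `output_config` are illustrative assumptions, as is the choice to leave out KmsKeyId when no key is given.

```python
def output_config(outputs, kms_key_id=None):
    """Build a ProcessingOutputConfig-shaped dict; the key applies to all outputs."""
    config = {"Outputs": list(outputs)}
    if kms_key_id is not None:
        config["KmsKeyId"] = kms_key_id
    return config


example_outputs = [
    {
        "OutputName": "clean",
        "S3Output": {
            "S3Uri": "s3://example-bucket/clean/",  # placeholder
            "LocalPath": "/opt/ml/processing/output",  # assumed container path
            "S3UploadMode": "EndOfJob",
        },
    }
]
example_config = output_config(example_outputs, kms_key_id="alias/example-key")
```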

static prepare_processing_resources(instance_count, instance_type, volume_kms_key_id, volume_size_in_gb)

Prepares a dict that represents the ProcessingResources.

Parameters
  • instance_count (int) – The number of ML compute instances to use in the processing job. For distributed processing jobs, specify a value greater than 1. The default value is 1.

  • instance_type (str) – The ML compute instance type for the processing job.

  • volume_kms_key_id (str) – The AWS Key Management Service (AWS KMS) key that Amazon SageMaker uses to encrypt data on the storage volume attached to the ML compute instance(s) that run the processing job.

  • volume_size_in_gb (int) – The size of the ML storage volume in gigabytes that you want to provision. You must specify sufficient ML storage for your scenario.

Returns

Represents ProcessingResources which identifies the resources, ML compute instances, and ML storage volumes to deploy for a processing job.

Return type

dict
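The structure can be pictured as a single ClusterConfig entry, following the ProcessingResources shape of the CreateProcessingJob API. This is a sketch with a hypothetical helper name; the validation and the omission of an unset volume key are choices made for the example.

```python
def processing_resources(instance_type, instance_count=1, volume_size_in_gb=30,
                         volume_kms_key_id=None):
    """Build a ProcessingResources-shaped dict for a processing job."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    cluster = {
        "InstanceCount": instance_count,  # greater than 1 for distributed jobs
        "InstanceType": instance_type,
        "VolumeSizeInGB": volume_size_in_gb,
    }
    if volume_kms_key_id is not None:
        cluster["VolumeKmsKeyId"] = volume_kms_key_id
    return {"ClusterConfig": cluster}


example_resources = processing_resources("ml.c4.xlarge", instance_count=2)
```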

static prepare_stopping_condition(max_runtime_in_seconds)

Prepares a dict that represents the job’s StoppingCondition.

Parameters

max_runtime_in_seconds (int) – Specifies the maximum runtime in seconds.

Returns

dict
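A small sketch of the stopping condition, using the MaxRuntimeInSeconds field of the CreateProcessingJob API. The processor documentation above gives 24 hours as the default when no timeout is specified; this example fills that value in explicitly, which may differ from where the SDK or the service actually applies it. `stopping_condition` is a hypothetical helper.

```python
DEFAULT_MAX_RUNTIME_IN_SECONDS = 24 * 60 * 60  # the documented 24-hour default


def stopping_condition(max_runtime_in_seconds=None):
    """Build a StoppingCondition-shaped dict for a processing job."""
    if max_runtime_in_seconds is None:
        max_runtime_in_seconds = DEFAULT_MAX_RUNTIME_IN_SECONDS
    if max_runtime_in_seconds <= 0:
        raise ValueError("max_runtime_in_seconds must be positive")
    return {"MaxRuntimeInSeconds": int(max_runtime_in_seconds)}
```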

class sagemaker.processing.ProcessingInput(source=None, destination=None, input_name=None, s3_data_type='S3Prefix', s3_input_mode='File', s3_data_distribution_type='FullyReplicated', s3_compression_type='None', s3_input=None, dataset_definition=None, app_managed=False)

Bases: object

Accepts parameters that specify an Amazon S3 input for a processing job.

Also provides a method to turn those parameters into a dictionary.

Initializes a ProcessingInput instance.

ProcessingInput accepts parameters that specify an Amazon S3 input for a processing job and provides a method to turn those parameters into a dictionary.

Parameters
  • source (str or PipelineVariable) – The source for the input. If a local path is provided, it will automatically be uploaded to S3 under: “s3://<default-bucket-name>/<job-name>/input/<input-name>”.

  • destination (str or PipelineVariable) – The destination of the input.

  • input_name (str or PipelineVariable) – The name for the input. If a name is not provided, one will be generated (e.g., “input-1”).

  • s3_data_type (str or PipelineVariable) – Valid options are “ManifestFile” or “S3Prefix”.

  • s3_input_mode (str or PipelineVariable) – Valid options are “Pipe”, “File” or “FastFile”.

  • s3_data_distribution_type (str or PipelineVariable) – Valid options are “FullyReplicated” or “ShardedByS3Key”.

  • s3_compression_type (str or PipelineVariable) – Valid options are “None” or “Gzip”.

  • s3_input (S3Input) – Metadata of data objects stored in S3.

  • dataset_definition (DatasetDefinition) – DatasetDefinition input.

  • app_managed (bool or PipelineVariable) – Whether the input is managed by SageMaker or by the application.
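The valid option values and the default upload location listed above can be checked with a short sketch. The helper names are hypothetical, and numbering generated input names by position is an assumption; only the URI pattern and the option values come from the parameter descriptions.

```python
VALID_OPTIONS = {
    "s3_data_type": {"ManifestFile", "S3Prefix"},
    "s3_input_mode": {"Pipe", "File", "FastFile"},
    "s3_data_distribution_type": {"FullyReplicated", "ShardedByS3Key"},
    "s3_compression_type": {"None", "Gzip"},
}


def validate_input_options(**options):
    """Reject option values that are not listed for ProcessingInput."""
    for name, value in options.items():
        if name not in VALID_OPTIONS:
            raise KeyError(f"unknown option: {name}")
        if value not in VALID_OPTIONS[name]:
            raise ValueError(f"{name} must be one of {sorted(VALID_OPTIONS[name])}")
    return options


def default_upload_uri(default_bucket, job_name, input_index):
    """Where a local source would be uploaded when the input has no name."""
    input_name = f"input-{input_index}"  # generated name, e.g. "input-1"
    return f"s3://{default_bucket}/{job_name}/input/{input_name}"
```

Note that the compression option “None” is the string 'None', not the Python value None.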

class sagemaker.processing.ProcessingOutput(source=None, destination=None, output_name=None, s3_upload_mode='EndOfJob', app_managed=False, feature_store_output=None)

Bases: object

Accepts parameters that specify an Amazon S3 output for a processing job.

It also provides a method to turn those parameters into a dictionary.

Initializes a ProcessingOutput instance.

ProcessingOutput accepts parameters that specify an Amazon S3 output for a processing job and provides a method to turn those parameters into a dictionary.

Parameters
  • source (str or PipelineVariable) – The source for the output.

  • destination (str or PipelineVariable) – The destination of the output. If a destination is not provided, one will be generated: “s3://<default-bucket-name>/<job-name>/output/<output-name>” (Note: this does not apply when used with ProcessingStep).

  • output_name (str or PipelineVariable) – The name of the output. If a name is not provided, one will be generated (e.g., “output-1”).

  • s3_upload_mode (str or PipelineVariable) – Valid options are “EndOfJob” or “Continuous”.

  • app_managed (bool or PipelineVariable) – Whether the output is managed by SageMaker or by the application.

  • feature_store_output (FeatureStoreOutput) – Configuration for processing job outputs of FeatureStore.
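The defaulting of output_name and destination can be sketched as below. This models the documented pattern with plain dicts rather than ProcessingOutput objects; `normalize_outputs` is a hypothetical helper, and numbering generated names by list position is an assumption.

```python
def normalize_outputs(outputs, default_bucket, job_name):
    """Fill in missing output names and S3 destinations."""
    normalized = []
    for index, output in enumerate(outputs, start=1):
        name = output.get("output_name") or f"output-{index}"
        destination = output.get("destination") or (
            f"s3://{default_bucket}/{job_name}/output/{name}"
        )
        normalized.append({**output, "output_name": name, "destination": destination})
    return normalized


example = normalize_outputs(
    [
        {"source": "/opt/ml/processing/output"},  # assumed container path
        {"source": "/opt/ml/processing/report", "output_name": "report"},
    ],
    default_bucket="example-bucket",
    job_name="job-1",
)
```

As the destination description notes, this generated location does not apply when the output is used with ProcessingStep.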

class sagemaker.processing.RunArgs(code, inputs=None, outputs=None, arguments=None)

Bases: object

Accepts parameters that correspond to ScriptProcessors.

An instance of this class is returned from the get_run_args() method on processors, and is used for normalizing the arguments so that they can be passed to ProcessingStep.

Parameters
  • code (str) – This can be an S3 URI or a local path to a file with the framework script to run.

  • inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).

  • outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).

  • arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).

Method generated by attrs for class RunArgs.

class sagemaker.processing.FeatureStoreOutput(**kwargs)

Bases: ApiObject

Configuration for processing job outputs in Amazon SageMaker Feature Store.

Init ApiObject.

feature_group_name = None

class sagemaker.processing.FrameworkProcessor(estimator_cls, framework_version, role=None, instance_count=None, instance_type=None, py_version='py3', image_uri=None, command=None, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, code_location=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)

Bases: ScriptProcessor

Handles Amazon SageMaker processing tasks for jobs using a machine learning framework.

Initializes a FrameworkProcessor instance.

The FrameworkProcessor handles Amazon SageMaker Processing tasks for jobs using a machine learning framework, which allows for a set of Python scripts to be run as part of the Processing Job.

Parameters
  • estimator_cls (type) – A subclass of the Framework estimator

  • framework_version (str) – The version of the framework. Value is ignored when image_uri is provided.

  • role (str or PipelineVariable) – An AWS IAM role name or ARN. Amazon SageMaker Processing uses this role to access AWS resources, such as data stored in Amazon S3.

  • instance_count (int or PipelineVariable) – The number of instances to run a processing job with.

  • instance_type (str or PipelineVariable) – The type of EC2 instance to use for processing, for example, ‘ml.c4.xlarge’.

  • py_version (str) – Python version you want to use for executing your model training code. One of ‘py2’ or ‘py3’. Defaults to ‘py3’. Value is ignored when image_uri is provided.

  • image_uri (str or PipelineVariable) – The URI of the Docker image to use for the processing jobs (default: None).

  • command (list[str]) – The command to run, along with any command-line flags to precede the `code script`. Example: [“python3”, “-v”]. If not provided, [“python”] will be chosen (default: None).

  • volume_size_in_gb (int or PipelineVariable) – Size in GB of the EBS volume to use for storing data during processing (default: 30).

  • volume_kms_key (str or PipelineVariable) – A KMS key for the processing volume (default: None).

  • output_kms_key (str or PipelineVariable) – The KMS key ID for processing job outputs (default: None).

  • code_location (str) – The S3 prefix URI where custom code will be uploaded (default: None). The code file uploaded to S3 is ‘code_location/job-name/source/sourcedir.tar.gz’. If not specified, the default code location is ‘s3://{sagemaker-default-bucket}’.

  • max_runtime_in_seconds (int or PipelineVariable) – Timeout in seconds (default: None). After this amount of time, Amazon SageMaker terminates the job, regardless of its current status. If max_runtime_in_seconds is not specified, the default value is 24 hours.

  • base_job_name (str) – Prefix for processing job name. If not specified, the processor generates a default job name, based on the processing image name and current timestamp (default: None).

  • sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain (default: None).

  • env (dict[str, str] or dict[str, PipelineVariable]) – Environment variables to be passed to the processing jobs (default: None).

  • tags (Optional[Tags]) – Tags to be passed to the processing job (default: None). For more, see https://docs.aws.amazon.com/sagemaker/latest/dg/API_Tag.html.

  • network_config (NetworkConfig) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets (default: None).
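A sketch of constructing a FrameworkProcessor and of where code_location places the packaged source. The SKLearn estimator class, the framework version string, and the bucket names are illustrative choices, not requirements; `source_archive_uri` and `build_sklearn_processor` are hypothetical helpers, and the import is deferred so the path logic runs without the SDK.

```python
def source_archive_uri(job_name, code_location=None, default_bucket="example-default-bucket"):
    """Return the S3 location of sourcedir.tar.gz for a job, per code_location."""
    prefix = (code_location or f"s3://{default_bucket}").rstrip("/")
    return f"{prefix}/{job_name}/source/sourcedir.tar.gz"


def build_sklearn_processor(role, code_location=None):
    # Deferred imports: require the sagemaker package and AWS credentials.
    from sagemaker.processing import FrameworkProcessor
    from sagemaker.sklearn.estimator import SKLearn

    return FrameworkProcessor(
        estimator_cls=SKLearn,
        framework_version="1.2-1",  # placeholder; ignored when image_uri is provided
        role=role,
        instance_count=1,
        instance_type="ml.c4.xlarge",
        code_location=code_location,
    )
```

For example, `source_archive_uri("job-1", "s3://example-bucket/code")` gives `s3://example-bucket/code/job-1/source/sourcedir.tar.gz`.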

framework_entrypoint_command = ['/bin/bash']

get_run_args(code, source_dir=None, dependencies=None, git_config=None, inputs=None, outputs=None, arguments=None, job_name=None)

Returns a RunArgs object.

This object contains the normalized inputs, outputs and arguments needed when using a FrameworkProcessor in a ProcessingStep.

Parameters
  • code (str) – This can be an S3 URI or a local path to a file with the framework script to run. See the code argument in sagemaker.processing.FrameworkProcessor.run().

  • source_dir (str) – Path (absolute, relative, or an S3 URI) to a directory with any other processing source code dependencies aside from the entrypoint file (default: None). See the source_dir argument in sagemaker.processing.FrameworkProcessor.run().

  • dependencies (list[str]) – A list of paths to directories (absolute or relative) with any additional libraries that will be exported to the container (default: []). See the dependencies argument in sagemaker.processing.FrameworkProcessor.run().

  • git_config (dict[str, str]) – Git configurations used for cloning files. See the git_config argument in sagemaker.processing.FrameworkProcessor.run().

  • inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).

  • outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).

  • arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).

  • job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.

run(code, source_dir=None, dependencies=None, git_config=None, inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None, kms_key=None, codeartifact_repo_arn=None)

Runs a processing job.

Parameters
  • code (str) – This can be an S3 URI or a local path to a file with the framework script to run. For a local file, this is the path (absolute or relative) to the Python source file that is executed as the entry point of the processing job. When code is an S3 URI, source_dir, dependencies, and git_config are ignored. If source_dir is specified, then code must point to a file located at the root of source_dir.

  • source_dir (str) – Path (absolute, relative or an S3 URI) to a directory with any other processing source code dependencies aside from the entry point file (default: None). If source_dir is an S3 URI, it must point to a file named sourcedir.tar.gz. The structure within this directory is preserved when processing on Amazon SageMaker.

  • dependencies (list[str]) – A list of paths to directories (absolute or relative) with any additional libraries that will be exported to the container (default: None). The library folders will be copied to SageMaker in the same folder where the entrypoint is copied. If ‘git_config’ is provided, ‘dependencies’ should be a list of relative locations to directories with any additional libraries needed in the Git repo.

  • git_config (dict[str, str]) –

    Git configurations used for cloning files, including repo, branch, commit, 2FA_enabled, username, password and token. The repo field is required. All other fields are optional. repo specifies the Git repository where your training script is stored. If you don’t provide branch, the default value ‘master’ is used. If you don’t provide commit, the latest commit in the specified branch is used.

    Example

    The following config:

    >>> git_config = {'repo': 'https://github.com/aws/sagemaker-python-sdk.git',
    ...               'branch': 'test-branch-git-config',
    ...               'commit': '329bfcf884482002c05ff7f44f62599ebc9f445a'}

    results in cloning the repo specified in ‘repo’, checking out the ‘test-branch-git-config’ branch, and then checking out the specified commit.

    2FA_enabled, username, password and token are used for authentication. For GitHub (or other Git) accounts, set 2FA_enabled to ‘True’ if two-factor authentication is enabled for the account, otherwise set it to ‘False’. If you do not provide a value for 2FA_enabled, a default value of ‘False’ is used. CodeCommit does not support two-factor authentication, so do not provide “2FA_enabled” with CodeCommit repositories.

    For GitHub and other Git repos, when SSH URLs are provided, it doesn’t matter whether 2FA is enabled or disabled; you should either have no passphrase for the SSH key pairs, or have the ssh-agent configured so that you are not prompted for the SSH passphrase when you run ‘git clone’ with SSH URLs. When HTTPS URLs are provided: if 2FA is disabled, then either token or username+password will be used for authentication if provided (token prioritized); if 2FA is enabled, only token will be used for authentication if provided. If the required authentication info is not provided, the Python SDK will try to use local credentials storage to authenticate. If that also fails, an error is raised.

    For CodeCommit repos, 2FA is not supported, so ‘2FA_enabled’ should not be provided. There is no token in CodeCommit, so ‘token’ should not be provided either. When ‘repo’ is an SSH URL, the requirements are the same as for GitHub-like repos. When ‘repo’ is an HTTPS URL, username+password will be used for authentication if they are provided; otherwise, the Python SDK will try to use either the CodeCommit credential helper or local credential storage for authentication.

  • inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).

  • outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).

  • arguments (list[str] or list[PipelineVariable]) – A list of string arguments to be passed to a processing job (default: None).

  • wait (bool) – Whether the call should wait until the job completes (default: True).

  • logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).

  • job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.

  • experiment_config (dict[str, str]) – Experiment management configuration. Optionally, the dict can contain three keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’. The behavior of setting these keys is as follows:

    • If ExperimentName is supplied but TrialName is not, a Trial will be automatically created and the job’s Trial Component associated with the Trial.
    • If TrialName is supplied and the Trial already exists, the job’s Trial Component will be associated with the Trial.
    • If neither ExperimentName nor TrialName is supplied, the trial component will be unassociated.
    • TrialComponentDisplayName is used for display in Studio.
    • Both ExperimentName and TrialName will be ignored if the Processor instance is built with PipelineSession. However, the value of TrialComponentDisplayName is honored for display in Studio.

  • kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).

  • codeartifact_repo_arn (str) – The ARN of the CodeArtifact repository that should be logged into before installing dependencies (default: None).

Returns

None or pipeline step arguments in case the Processor instance is built with PipelineSession
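The HTTPS authentication rules described for git_config can be sketched as a decision function. This models the documented precedence only and is not the SDK's implementation. `https_auth_method` is a hypothetical name, 2FA_enabled is assumed to be a boolean, and raising an error for CodeCommit-incompatible fields is a choice made for this example.

```python
def https_auth_method(git_config, codecommit=False):
    """Pick the credential source for an HTTPS repo URL, per the rules above."""
    token = git_config.get("token")
    has_password = bool(git_config.get("username") and git_config.get("password"))

    if codecommit:
        # CodeCommit has neither 2FA nor tokens; rejecting them is this sketch's choice.
        if "2FA_enabled" in git_config or token:
            raise ValueError("2FA_enabled and token do not apply to CodeCommit")
        if has_password:
            return "username+password"
        return "credential helper or local credentials"

    if token:
        return "token"  # preferred whether or not 2FA is enabled
    two_factor = bool(git_config.get("2FA_enabled", False))
    if has_password and not two_factor:
        return "username+password"
    # With 2FA enabled, a password alone is not used; fall back to local storage.
    return "local credentials"
```

SSH URLs are outside this sketch: for those, the text above relies on a passphrase-free key pair or a configured ssh-agent instead.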

This module is the entry point for running Spark processing scripts.

This module contains code related to Spark Processors, which are used for Processing jobs. These jobs let customers perform data pre-processing, post-processing, feature engineering, data validation, and model evaluation on SageMaker using Spark and PySpark.

class sagemaker.spark.processing.PySparkProcessor(role=None, instance_type=None, instance_count=None, framework_version=None, py_version=None, container_version=None, image_uri=None, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, configuration_location=None, dependency_location=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)

Bases: _SparkProcessorBase

Handles Amazon SageMaker processing tasks for jobs using PySpark.

Initialize a PySparkProcessor instance.

The PySparkProcessor handles Amazon SageMaker processing tasks for jobs using SageMaker PySpark.

Parameters
  • framework_version (str) – The version of SageMaker PySpark.

  • py_version (str) – The version of Python.

  • container_version (str) – The version of the Spark container.

  • role (str) – An AWS IAM role name or ARN. Amazon SageMaker Processing uses this role to access AWS resources, such as data stored in Amazon S3 (default: None). If not specified, the value from the defaults configuration file will be used.

  • instance_type (str or PipelineVariable) – Type of EC2 instance to use for processing, for example, ‘ml.c4.xlarge’.

  • instance_count (int or PipelineVariable) – The number of instances to run the Processing job with. Defaults to 1.

  • volume_size_in_gb (int or PipelineVariable) – Size in GB of the EBS volume to use for storing data during processing (default: 30).

  • volume_kms_key (str or PipelineVariable) – A KMS key for the processing volume.

  • output_kms_key (str or PipelineVariable) – The KMS key id for all ProcessingOutputs.

  • configuration_location (str) – The S3 prefix URI where the user-provided EMR application configuration will be uploaded (default: None). If not specified, the default configuration location is ‘s3://{sagemaker-default-bucket}’.

  • dependency_location (str) – The S3 prefix URI where Spark dependencies will be uploaded (default: None). If not specified, the default dependency location is ‘s3://{sagemaker-default-bucket}’.

  • max_runtime_in_seconds (int or PipelineVariable) – Timeout in seconds. After this amount of time Amazon SageMaker terminates the job regardless of its current status.

  • base_job_name (str) – Prefix for the processing job name. If not specified, the processor generates a default job name, based on the processing image name and current timestamp.

  • sagemaker_session (sagemaker.session.Session) – Session object which manages interactions with Amazon SageMaker APIs and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.

  • env (dict[str, str] or dict[str, PipelineVariable]) – Environment variables to be passed to the processing job.

  • tags (Optional[Tags]) – List of tags to be passed to the processing job.

  • network_config (sagemaker.network.NetworkConfig) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets.

  • image_uri (Optional[Union[str, PipelineVariable]]) –

get_run_args(submit_app, submit_py_files=None, submit_jars=None, submit_files=None, inputs=None, outputs=None, arguments=None, job_name=None, configuration=None, spark_event_logs_s3_uri=None)

Returns a RunArgs object.

This object contains the normalized inputs, outputs and arguments needed when using a PySparkProcessor in a ProcessingStep.

Parameters
  • submit_app (str) – Path (local or S3) to Python file to submit to Spark as the primary application. This is translated to the code property on the returned RunArgs object.

  • submit_py_files (list[str]) – List of paths (local or S3) to provide for the spark-submit --py-files option.

  • submit_jars (list[str]) – List of paths (local or S3) to provide for the spark-submit --jars option.

  • submit_files (list[str]) – List of paths (local or S3) to provide for the spark-submit --files option.

  • inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).

  • outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).

  • arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).

  • job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.

  • configuration (list[dict] or dict) – Configuration for Hadoop, Spark, or Hive. List or dictionary of EMR-style classifications. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html

  • spark_event_logs_s3_uri (str) – S3 path where spark application events will be published to.

run(submit_app, submit_py_files=None, submit_jars=None, submit_files=None, inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None, configuration=None, spark_event_logs_s3_uri=None, kms_key=None)

Runs a processing job.

Parameters
  • submit_app (str) – Path (local or S3) to Python file to submit to Spark as the primary application

  • submit_py_files (list[str] or list[PipelineVariable]) – List of paths (local or S3) to provide for the spark-submit --py-files option.

  • submit_jars (list[str] or list[PipelineVariable]) – List of paths (local or S3) to provide for the spark-submit --jars option.

  • submit_files (list[str] or list[PipelineVariable]) – List of paths (local or S3) to provide for the spark-submit --files option.

  • inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).

  • outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).

  • arguments (list[str] or list[PipelineVariable]) – A list of string arguments to be passed to a processing job (default: None).

  • wait (bool) – Whether the call should wait until the job completes (default: True).

  • logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).

  • job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.

  • experiment_config (dict[str, str]) –

    Experiment management configuration. Optionally, the dict can contain three keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’. The behavior of setting these keys is as follows:

    • If ExperimentName is supplied but TrialName is not, a Trial will be automatically created and the job’s Trial Component associated with the Trial.

    • If TrialName is supplied and the Trial already exists, the job’s Trial Component will be associated with the Trial.

    • If neither ExperimentName nor TrialName is supplied, the Trial Component will be unassociated.

    • TrialComponentDisplayName is used for display in Studio.

  • configuration (list[dict] or dict) – Configuration for Hadoop, Spark, or Hive. List or dictionary of EMR-style classifications. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html

  • spark_event_logs_s3_uri (str or PipelineVariable) – S3 path where spark application events will be published to.

  • kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).
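
As a rough illustration of how the run() parameters above fit together, the sketch below assembles the keyword arguments for a PySpark job, including an EMR-style configuration list. The script paths, bucket name, and property values are placeholders chosen for this example. The actual SDK call is shown only in comments because it needs the SageMaker Python SDK and AWS credentials.

```python
import json

# EMR-style classification list; the property values are placeholders.
configuration = [
    {
        "Classification": "spark-defaults",
        "Properties": {"spark.executor.memory": "2g", "spark.executor.cores": "1"},
    }
]

run_kwargs = {
    "submit_app": "./preprocess.py",          # hypothetical local script
    "submit_py_files": ["./helpers.py"],      # hypothetical extra module
    "arguments": ["--input", "s3://example-bucket/raw/"],  # hypothetical bucket
    "configuration": configuration,
    "spark_event_logs_s3_uri": "s3://example-bucket/spark-events/",
    "wait": True,
    "logs": True,  # only meaningful because wait is True
}

# The configuration has to serialize to JSON, so a round trip is a cheap sanity check.
serialized = json.dumps(configuration)
print(serialized)

# With the SDK installed and credentials configured, the call would look like
# this (not executed here; values in angle brackets must be filled in):
#   from sagemaker.spark.processing import PySparkProcessor
#   processor = PySparkProcessor(role="<role-arn>", instance_type="ml.c4.xlarge",
#                                instance_count=1, framework_version="<version>")
#   processor.run(**run_kwargs)
```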

class sagemaker.spark.processing.SparkJarProcessor(role=None, instance_type=None, instance_count=None, framework_version=None, py_version=None, container_version=None, image_uri=None, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, configuration_location=None, dependency_location=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)

Bases: _SparkProcessorBase

Handles Amazon SageMaker processing tasks for jobs using Spark with Java or Scala Jars.

Initialize a SparkJarProcessor instance.

The SparkJarProcessor handles Amazon SageMaker processing tasks for jobs using SageMaker Spark.

Parameters
  • framework_version (str) – The version of SageMaker PySpark.

  • py_version (str) – The version of Python.

  • container_version (str) – The version of the Spark container.

  • role (str) – An AWS IAM role name or ARN. Amazon SageMaker Processing uses this role to access AWS resources, such as data stored in Amazon S3 (default: None). If not specified, the value from the defaults configuration file will be used.

  • instance_type (str or PipelineVariable) – Type of EC2 instance to use for processing, for example, ‘ml.c4.xlarge’.

  • instance_count (int or PipelineVariable) – The number of instances to run the Processing job with. Defaults to 1.

  • volume_size_in_gb (int or PipelineVariable) – Size in GB of the EBS volume to use for storing data during processing (default: 30).

  • volume_kms_key (str or PipelineVariable) – A KMS key for the processing volume.

  • output_kms_key (str or PipelineVariable) – The KMS key id for all ProcessingOutputs.

  • configuration_location (str) – The S3 prefix URI where the user-provided EMR application configuration will be uploaded (default: None). If not specified, the default configuration location is ‘s3://{sagemaker-default-bucket}’.

  • dependency_location (str) – The S3 prefix URI where Spark dependencies will be uploaded (default: None). If not specified, the default dependency location is ‘s3://{sagemaker-default-bucket}’.

  • max_runtime_in_seconds (int or PipelineVariable) – Timeout in seconds. After this amount of time Amazon SageMaker terminates the job regardless of its current status.

  • base_job_name (str) – Prefix for the processing job name. If not specified, the processor generates a default job name, based on the processing image name and current timestamp.

  • sagemaker_session (sagemaker.session.Session) – Session object which manages interactions with Amazon SageMaker APIs and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.

  • env (dict[str, str] or dict[str, PipelineVariable]) – Environment variables to be passed to the processing job.

  • tags (Optional[Tags]) – Tags to be passed to the processing job.

  • network_config (sagemaker.network.NetworkConfig) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets.

  • image_uri (Optional[Union[str, PipelineVariable]]) –

get_run_args(submit_app, submit_class=None, submit_jars=None, submit_files=None, inputs=None, outputs=None, arguments=None, job_name=None, configuration=None, spark_event_logs_s3_uri=None)

Returns a RunArgs object.

This object contains the normalized inputs, outputs and arguments needed when using a SparkJarProcessor in a ProcessingStep.

Parameters
  • submit_app (str) – Path (local or S3) to Jar file to submit to Spark as the primary application. This is translated to the code property on the returned RunArgs object.

  • submit_class (str) – Java class reference to submit to Spark as the primary application.

  • submit_jars (list[str]) – List of paths (local or S3) to provide for the spark-submit --jars option.

  • submit_files (list[str]) – List of paths (local or S3) to provide for the spark-submit --files option.

  • inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).

  • outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).

  • arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).

  • job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.

  • configuration (list[dict] or dict) – Configuration for Hadoop, Spark, or Hive. List or dictionary of EMR-style classifications. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html

  • spark_event_logs_s3_uri (str) – S3 path where spark application events will be published to.

run(submit_app, submit_class, submit_jars=None, submit_files=None, inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None, configuration=None, spark_event_logs_s3_uri=None, kms_key=None)

Runs a processing job.

Parameters
  • submit_app (str) – Path (local or S3) to Jar file to submit to Spark as the primary application

  • submit_class (str or PipelineVariable) – Java class reference to submit to Spark as the primary application

  • submit_jars (list[str] or list[PipelineVariable]) – List of paths (local or S3) to provide for the spark-submit --jars option.

  • submit_files (list[str] or list[PipelineVariable]) – List of paths (local or S3) to provide for the spark-submit --files option.

  • inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).

  • outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).

  • arguments (list[str] or list[PipelineVariable]) – A list of string arguments to be passed to a processing job (default: None).

  • wait (bool) – Whether the call should wait until the job completes (default: True).

  • logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).

  • job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.

  • experiment_config (dict[str, str]) –

    Experiment management configuration. Optionally, the dict can contain three keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’. The behavior of setting these keys is as follows:

    • If ExperimentName is supplied but TrialName is not, a Trial will be automatically created and the job’s Trial Component associated with the Trial.

    • If TrialName is supplied and the Trial already exists, the job’s Trial Component will be associated with the Trial.

    • If neither ExperimentName nor TrialName is supplied, the Trial Component will be unassociated.

    • TrialComponentDisplayName is used for display in Studio.

  • configuration (list[dict] or dict) – Configuration for Hadoop, Spark, or Hive. List or dictionary of EMR-style classifications. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html

  • spark_event_logs_s3_uri (str or PipelineVariable) – S3 path where spark application events will be published to.

  • kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).
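
The submit_* parameters above correspond to spark-submit options. The sketch below shows that correspondence by building an illustrative spark-submit argument list. The helper spark_submit_args, the Jar paths, and the class name are assumptions made for this example; how the SDK and the Spark container assemble the real command is not described in this reference.

```python
def spark_submit_args(submit_app, submit_class=None, submit_jars=None,
                      submit_files=None, arguments=None):
    """Build an illustrative spark-submit argument list (hypothetical helper)."""
    args = ["spark-submit"]
    if submit_class:
        args += ["--class", submit_class]
    if submit_jars:
        # spark-submit takes a comma-separated list for --jars.
        args += ["--jars", ",".join(submit_jars)]
    if submit_files:
        # Likewise, --files takes a comma-separated list.
        args += ["--files", ",".join(submit_files)]
    args.append(submit_app)
    args += list(arguments or [])
    return args


example_args = spark_submit_args(
    "s3://example-bucket/jars/app.jar",          # hypothetical primary Jar
    submit_class="com.example.Preprocess",       # hypothetical main class
    submit_jars=["deps/a.jar", "deps/b.jar"],
    arguments=["--date", "2024-01-01"],
)
print(" ".join(example_args))
```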

class sagemaker.spark.processing.FileType(value)

Bases: Enum

Enum of file type

JAR = 1
PYTHON = 2
FILE = 3

class sagemaker.spark.processing.SparkConfigUtils

Bases: object

Util class for spark configurations

static validate_configuration(configuration)

Validates the user-provided Hadoop/Spark/Hive configuration.

This ensures that the list or dictionary the user provides will serialize to JSON matching the schema of EMR’s application configuration.

Parameters

configuration (Dict) – A dict that contains the configuration overrides to the default values. For more information, please visit: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html
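
To make the expected shape concrete, here is a simplified structural check for EMR-style classifications. It is a sketch under assumptions, not the logic of validate_configuration: the helper name is_emr_style is hypothetical, and it checks only the Classification, Properties, and nested Configurations keys used by EMR application configuration.

```python
def is_emr_style(configuration):
    """Return True if configuration looks like EMR-style classifications.

    Hypothetical, simplified check; the SDK's own validation may be stricter.
    """
    items = configuration if isinstance(configuration, list) else [configuration]
    for item in items:
        if not isinstance(item, dict):
            return False
        if not isinstance(item.get("Classification"), str):
            return False
        if not isinstance(item.get("Properties", {}), dict):
            return False
        # Nested classifications use the same shape under "Configurations".
        if not is_emr_style(item.get("Configurations", [])):
            return False
    return True


good = [{"Classification": "spark-defaults",
         "Properties": {"spark.executor.memory": "2g"}}]
bad = {"Properties": {"spark.executor.memory": "2g"}}  # no Classification key

print(is_emr_style(good), is_emr_style(bad))
```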

static validate_s3_uri(spark_output_s3_path)

Validate whether the URI uses an S3 scheme.

In the future, this validation will perform deeper S3 validation.

Parameters

spark_output_s3_path (str) – The URI of the Spark output S3 Path.
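
A scheme check of this kind can be sketched with the standard library. The helper uses_s3_scheme is hypothetical and is not the SDK implementation. Accepting only the "s3" scheme and requiring a bucket name are assumptions of this sketch.

```python
from urllib.parse import urlparse


def uses_s3_scheme(uri):
    """Return True if uri has the s3 scheme and names a bucket (hypothetical helper)."""
    parsed = urlparse(uri)
    # Requiring a non-empty bucket (netloc) is an extra assumption of this sketch.
    return parsed.scheme == "s3" and bool(parsed.netloc)


print(uses_s3_scheme("s3://example-bucket/spark-events/"))
print(uses_s3_scheme("/tmp/spark-events"))
```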

This module configures the SageMaker Clarify bias and model explainability processor jobs.

SageMaker Clarify

class sagemaker.clarify.DatasetType(value)

Bases: Enum

Enum to store different dataset types supported in the Analysis config file

TEXTCSV = 'text/csv'
JSONLINES = 'application/jsonlines'
JSON = 'application/json'
PARQUET = 'application/x-parquet'
IMAGE = 'application/x-image'

class sagemaker.clarify.TimeSeriesJSONDatasetFormat(value)

Bases: Enum

Possible dataset formats for JSON time series data files.

Below is an example COLUMNS dataset for time series explainability:

{
    "ids": [1, 2],
    "timestamps": [3, 4],
    "target_ts": [5, 6],
    "rts1": [0.25, 0.5],
    "rts2": [1.25, 1.5],
    "scv1": [10, 20],
    "scv2": [30, 40]
}

For this example, JMESPaths are specified when creating TimeSeriesDataConfig as follows:

item_id="ids"
timestamp="timestamps"
target_time_series="target_ts"
related_time_series=["rts1", "rts2"]
static_covariates=["scv1", "scv2"]

Below is an example ITEM_RECORDS dataset for time series explainability:

[
    {
        "id": 1,
        "scv1": 10,
        "scv2": "red",
        "timeseries": [
            {"timestamp": 1, "target_ts": 5, "rts1": 0.25, "rts2": 10},
            {"timestamp": 2, "target_ts": 6, "rts1": 0.35, "rts2": 20},
            {"timestamp": 3, "target_ts": 4, "rts1": 0.45, "rts2": 30}
        ]
    },
    {
        "id": 2,
        "scv1": 20,
        "scv2": "blue",
        "timeseries": [
            {"timestamp": 1, "target_ts": 4, "rts1": 0.25, "rts2": 40},
            {"timestamp": 2, "target_ts": 2, "rts1": 0.35, "rts2": 50}
        ]
    }
]

For this example, JMESPaths are specified when creating TimeSeriesDataConfig as follows:

item_id="[*].id"
timestamp="[*].timeseries[].timestamp"
target_time_series="[*].timeseries[].target_ts"
related_time_series=["[*].timeseries[].rts1", "[*].timeseries[].rts2"]
static_covariates=["[*].scv1", "[*].scv2"]

Below is an example TIMESTAMP_RECORDS dataset for time series explainability:

[
    {"id": 1, "timestamp": 1, "target_ts": 5, "scv1": 10, "rts1": 0.25},
    {"id": 1, "timestamp": 2, "target_ts": 6, "scv1": 10, "rts1": 0.5},
    {"id": 1, "timestamp": 3, "target_ts": 3, "scv1": 10, "rts1": 0.75},
    {"id": 2, "timestamp": 5, "target_ts": 10, "scv1": 20, "rts1": 1}
]

For this example, JMESPaths are specified when creating TimeSeriesDataConfig as follows:

item_id="[*].id"
timestamp="[*].timestamp"
target_time_series="[*].target_ts"
related_time_series=["[*].rts1"]
static_covariates=["[*].scv1"]

COLUMNS = 'columns'
ITEM_RECORDS = 'item_records'
TIMESTAMP_RECORDS = 'timestamp_records'

class sagemaker.clarify.SegmentationConfig(name_or_index, segments, config_name=None, display_aliases=None)

Bases: object

Config object that defines segment(s) of the dataset on which metrics are computed.

Initializes a segmentation configuration for a dataset column.

Parameters
  • name_or_index (str or int) – The name or index of the column in the dataset on which the segment(s) is defined.

  • segments (List[List[str or int]]) – Each list of values represents one segment. If N lists are provided, N+1 segments are generated; the additional segment, denoted the ‘__default__’ segment, covers the values not included in any of these lists. For continuous columns, a segment must be given as strings in interval notation (e.g., [“[1, 4]”] or [“(2, 5]”]). A segment can also be composed of multiple intervals (e.g., [“[1, 4]”, “(5, 6]”] is one segment). For categorical columns, each segment should contain one or more of the column’s categorical values, which may be strings or integers. For example, for a continuous column, segments could be [[“[1, 4]”, “(5, 6]”], [“(7, 9)”]], which generates 3 segments including the default segment. For a categorical column with values (“A”, “B”, “C”, “D”), segments could be [[“A”, “B”]], which generates 2 segments including the default segment.

  • config_name (str) –

  • display_aliases (List[str]) – Display names for the segments, used in the analysis output and report. This list should be the same length as the number of lists provided in segments, or one longer to include a display alias for the default segment.

Raises

ValueError – when the name_or_index is None, segments is invalid, or a wrong number of display_aliases are specified.

to_dict()

Returns SegmentationConfig as a dict.

Return type

Dict[str, Any]
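
The interval notation used by segments for continuous columns can be illustrated with a small stand-alone sketch. It is not Clarify's implementation: the helpers in_interval and segment_of are hypothetical, handle continuous columns only, and return the index of the matching segment or the string "__default__" for values that no segment covers.

```python
import re

# Matches interval strings such as "[1, 4]", "(5, 6]" or "(7, 9)".
_INTERVAL = re.compile(
    r"^\s*([\[(])\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*([\])])\s*$"
)


def in_interval(value, interval):
    """Return True if value lies in an interval written in interval notation."""
    match = _INTERVAL.match(interval)
    if match is None:
        raise ValueError(f"not an interval: {interval!r}")
    left, low, high, right = match.groups()
    low, high = float(low), float(high)
    above = value >= low if left == "[" else value > low
    below = value <= high if right == "]" else value < high
    return above and below


def segment_of(value, segments):
    """Return the index of the first segment containing value, else '__default__'."""
    for index, intervals in enumerate(segments):
        if any(in_interval(value, interval) for interval in intervals):
            return index
    return "__default__"


segments = [["[1, 4]", "(5, 6]"], ["(7, 9)"]]  # 2 lists, so 3 segments in total
for value in (3, 5, 6, 8, 9):
    print(value, segment_of(value, segments))
```

Here 5 and 9 fall into the default segment because "(5, 6]" excludes 5 and "(7, 9)" excludes 9.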

class sagemaker.clarify.TimeSeriesDataConfig(target_time_series, item_id, timestamp, related_time_series=None, static_covariates=None, dataset_format=None)

Bases: object

Config object for TimeSeries explainability data configuration fields.

Initializes TimeSeries explainability data configuration fields.

Parameters
  • target_time_series (str or int) – A string or a zero-based integer index. Used to locate the target time series in the shared input dataset. If this parameter is a string, then all other parameters except dataset_format must be strings or lists of strings. If this parameter is an int, then all other parameters except dataset_format must be ints or lists of ints.

  • item_id (str or int) – A string or a zero-based integer index. Used to locate item id in the shared input dataset.

  • timestamp (str or int) – A string or a zero-based integer index. Used to locate timestamp in the shared input dataset.

  • related_time_series (list[str] or list[int]) – Optional. An array of strings or array of zero-based integer indices. Used to locate all related time series in the shared input dataset (if present).

  • static_covariates (list[str] or list[int]) – Optional. An array of strings or array of zero-based integer indices. Used to locate all static covariate fields in the shared input dataset (if present).

  • dataset_format (TimeSeriesJSONDatasetFormat) – Describes the format of the data files provided for analysis. Should only be provided when dataset is in JSON format.

Raises

ValueError – If any required arguments are not provided or are the wrong type.

get_time_series_data_config()

Returns part of an analysis config dictionary.
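
When the dataset is JSON, the string arguments of TimeSeriesDataConfig are JMESPath expressions. The sketch below shows, in plain Python, what the expressions from the ITEM_RECORDS example select, assuming standard JMESPath projection and flatten semantics. It deliberately avoids any third-party JMESPath library, and how Clarify consumes the selected values is not covered here.

```python
item_records = [
    {"id": 1, "scv1": 10, "scv2": "red", "timeseries": [
        {"timestamp": 1, "target_ts": 5, "rts1": 0.25, "rts2": 10},
        {"timestamp": 2, "target_ts": 6, "rts1": 0.35, "rts2": 20},
        {"timestamp": 3, "target_ts": 4, "rts1": 0.45, "rts2": 30},
    ]},
    {"id": 2, "scv1": 20, "scv2": "blue", "timeseries": [
        {"timestamp": 1, "target_ts": 4, "rts1": 0.25, "rts2": 40},
        {"timestamp": 2, "target_ts": 2, "rts1": 0.35, "rts2": 50},
    ]},
]

# item_id="[*].id": one id per item.
item_ids = [item["id"] for item in item_records]

# timestamp="[*].timeseries[].timestamp": flatten every item's points, then project.
timestamps = [point["timestamp"] for item in item_records for point in item["timeseries"]]

# target_time_series="[*].timeseries[].target_ts": same flattening, different field.
targets = [point["target_ts"] for item in item_records for point in item["timeseries"]]

# static_covariates=["[*].scv1", ...]: one value per item, not per point.
scv1 = [item["scv1"] for item in item_records]

print(item_ids, timestamps, targets, scv1)
```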

class sagemaker.clarify.DataConfig(s3_data_input_path, s3_output_path, s3_analysis_config_output_path=None, label=None, headers=None, features=None, dataset_type='text/csv', s3_compression_type='None', joinsource=None, facet_dataset_uri=None, facet_headers=None, predicted_label_dataset_uri=None, predicted_label_headers=None, predicted_label=None, excluded_columns=None, segmentation_config=None, time_series_data_config=None)

Bases: object

Config object related to configurations of the input and output dataset.

Initializes a configuration of both input and output datasets.

Parameters
  • s3_data_input_path (str) – Dataset S3 prefix/object URI.

  • s3_output_path (str) – S3 prefix to store the output.

  • s3_analysis_config_output_path (str) – S3 prefix to store the analysis config output. If this field is None, then the s3_output_path will be used to store the analysis_config output.

  • label (str) – Target attribute of the model required by bias metrics. Specified as column name or index for CSV dataset or a JMESPath expression for JSON/JSON Lines. Required parameter except for when the input dataset does not contain the label. Note: For JSON, the JMESPath query must result in a list of labels for each sample. For JSON Lines, it must result in the label for each line. Only a single label per sample is supported at this time.

  • headers ([str]) – List of column names in the dataset. If not provided, Clarify will generate headers to use internally. For time series explainability cases, please provide headers in the order of item_id, timestamp, target_time_series, all related_time_series columns, and then all static_covariate columns.

  • features (str) – JMESPath expression to locate the feature values if the dataset format is JSON/JSON Lines. Note: For JSON, the JMESPath query must result in a 2-D list (or a matrix) of feature values. For JSON Lines, it must result in a 1-D list of features for each line.

  • dataset_type (str) – Format of the dataset. Valid values are "text/csv" for CSV, "application/jsonlines" for JSON Lines, "application/json" for JSON, and "application/x-parquet" for Parquet.

  • s3_compression_type (str) – Valid options are "None" or "Gzip".

  • joinsource (str or int) –

    The name or index of the column in the dataset that acts as an identifier column (for instance, while performing a join). This column is only used as an identifier, and not used for any other computations. This is an optional field in all cases except:

    • The dataset contains more than one file and save_local_shap_values is set to true in ShapConfig, and/or

    • When the dataset and/or facet dataset and/or predicted label dataset are in separate files.

  • facet_dataset_uri (str) –

    Dataset S3 prefix/object URI that contains facet attribute(s), used for bias analysis on datasets without facets.

    • If the dataset and the facet dataset are one single file each, then the original dataset and facet dataset must have the same number of rows.

    • If the dataset and facet dataset are in multiple files (either one), then an index column, joinsource, is required to join the two datasets.

    Clarify will not use the joinsource column and columns present in the facet dataset when calling model inference APIs. Note: this is only supported for "text/csv" dataset type.

  • facet_headers (list[str]) – List of column names in the facet dataset.

  • predicted_label_dataset_uri (str) –

    Dataset S3 prefix/object URI with predicted labels, which are used directly for analysis instead of making model inference API calls.

    • If the dataset and the predicted label dataset are one single file each, then the original dataset and predicted label dataset must have the same number of rows.

    • If the dataset and predicted label dataset are in multiple files (either one), then an index column, joinsource, is required to join the two datasets.

    Note: this is only supported for "text/csv" dataset type.

  • predicted_label_headers (list[str]) – List of column names in the predicted label dataset

  • predicted_label (str or int) – Predicted label of the target attribute of the model required for running bias analysis. Specified as column name or index for CSV data, or a JMESPath expression for JSON/JSON Lines. Clarify uses the predicted labels directly instead of making model inference API calls. Note: For JSON, the JMESPath query must result in a list of predicted labels for each sample. For JSON Lines, it must result in the predicted label for each line. Only a single predicted label per sample is supported at this time.

  • excluded_columns (list[int] or list[str]) – A list of names or indices of the columns which are to be excluded from making model inference API calls.

  • segmentation_config (list[SegmentationConfig]) – A list of SegmentationConfig objects.

  • time_series_data_config (TimeSeriesDataConfig) – Optional. A config object for TimeSeries data specific fields, required for TimeSeries explainability use cases.

Raises

ValueError – when the dataset_type is invalid, predicted label dataset parameters are used with an unsupported dataset_type, or facet dataset parameters are used with an unsupported dataset_type.

get_config()

Returns part of an analysis config dictionary.
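
The notes on label and features describe different result shapes for JSON and JSON Lines datasets. The sketch below mirrors them in plain Python. The sample data and the expressions quoted in the comments (such as data[*].features) are made up for this example, and the list comprehensions stand in for JMESPath evaluation.

```python
import json

# JSON dataset: a single document that holds every sample.
json_dataset = {"data": [
    {"features": [25, 1], "label": 0},
    {"features": [40, 0], "label": 1},
]}
# Stand-in for features="data[*].features": a 2-D list, one row per sample.
json_features = [sample["features"] for sample in json_dataset["data"]]
# Stand-in for label="data[*].label": a list with one label per sample.
json_labels = [sample["label"] for sample in json_dataset["data"]]

# JSON Lines dataset: one JSON object per line, so queries run per line.
jsonlines_dataset = (
    '{"features": [25, 1], "label": 0}\n'
    '{"features": [40, 0], "label": 1}'
)
records = [json.loads(line) for line in jsonlines_dataset.splitlines()]
# Stand-in for features="features": a 1-D list for each line.
line_features = [record["features"] for record in records]
# Stand-in for label="label": a single label for each line.
line_labels = [record["label"] for record in records]

print(json_features, json_labels)
print(line_features[0], line_labels[0])
```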

class sagemaker.clarify.BiasConfig(label_values_or_threshold, facet_name, facet_values_or_threshold=None, group_name=None)

Bases: object

Config object with user-defined bias configurations of the input dataset.

Initializes a configuration of the sensitive groups in the dataset.

Parameters
  • label_values_or_threshold ([int or float or str]) –

    List of label value(s) or threshold to indicate positive outcome used for bias metrics. The appropriate threshold depends on the problem type:

    • Binary: The list has one positive value.

    • Categorical: The list has one or more (but not all) categories which are the positive values.

    • Regression: The list should include one threshold that defines the exclusive lower bound of positive values.

  • facet_name (str or int or list[str] or list[int]) – Sensitive attribute column name (or index in the input data) to use when computing bias metrics. It can also be a list of names (or indexes) for computing metrics for multiple sensitive attributes.

  • facet_values_or_threshold ([int or float or str] or [[int or float or str]]) –

    The parameter controls the values of the sensitive group. If facet_name is a scalar, then it can be None or a list. Depending on the data type of the facet column, the values mean:

    • Binary data: None means computing the bias metrics for each binary value. Or add one binary value to the list, to compute its bias metrics only.

    • Categorical data: None means computing the bias metrics for each category. Or add one or more (but not all) categories to the list, to compute their bias metrics vs. the other categories.

    • Continuous data: The list should include one and only one threshold which defines the exclusive lower bound of a sensitive group.

    If facet_name is a list, then facet_values_or_threshold can be None if all facets are of binary or categorical type. Otherwise, facet_values_or_threshold should be a list, and each element is the value or threshold of the corresponding facet.

  • group_name (str) – Optional column name or index to indicate a group column to be used for the bias metric Conditional Demographic Disparity in Labels (CDDL) or Conditional Demographic Disparity in Predicted Labels (CDDPL).

Raises

ValueError – If the number of facet names does not equal the number of facet values.

get_config()

Returns a dictionary of bias detection configurations, part of the analysis config
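
The "exclusive lower bound" wording for thresholds can be made concrete with a short sketch. The helpers is_positive_label and in_sensitive_group and their regression and continuous flags are assumptions of this example. They only restate the documented meaning of label_values_or_threshold and facet_values_or_threshold and are not part of the SDK.

```python
def is_positive_label(label, label_values_or_threshold, regression=False):
    """Return True if label counts as a positive outcome (hypothetical helper)."""
    if regression:
        # A single threshold acts as an exclusive lower bound.
        return label > label_values_or_threshold[0]
    # Binary or categorical: the list enumerates the positive values.
    return label in label_values_or_threshold


def in_sensitive_group(value, facet_values_or_threshold, continuous=False):
    """Return True if a facet value belongs to the sensitive group (hypothetical helper)."""
    if continuous:
        # A single threshold acts as an exclusive lower bound.
        return value > facet_values_or_threshold[0]
    return value in facet_values_or_threshold


print(is_positive_label(1, [1]))                        # binary label
print(is_positive_label(50, [50], regression=True))     # bound itself is excluded
print(in_sensitive_group(41, [40], continuous=True))    # above the threshold
```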

class sagemaker.clarify.TimeSeriesModelConfig(forecast)

Bases: object

Config object for TimeSeries predictor configuration fields.

Initializes model configuration fields for TimeSeries explainability use cases.

Parameters

forecast (str) – JMESPath expression to extract the forecast result.

Raises

ValueError – when forecast is not a string or not provided

get_time_series_model_config()

Returns TimeSeries model config dictionary

class sagemaker.clarify.ModelConfig(model_name=None, instance_count=None, instance_type=None, accept_type=None, content_type=None, content_template=None, record_template=None, custom_attributes=None, accelerator_type=None, endpoint_name_prefix=None, target_model=None, endpoint_name=None, time_series_model_config=None)

Bases: object

Config object related to a model and its endpoint to be created.

Initializes a configuration of a model and the endpoint to be created for it.
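
The content_template and record_template parameters described below work by placeholder substitution. As a rough illustration, the sketch emulates the $features, $features_kvp, and $records placeholders with Python's string.Template. This is a stand-in chosen for the example, not the mechanism Clarify uses. The render helper is hypothetical, and the headers and features values are the ones used in the parameter examples.

```python
import json
from string import Template

headers = ["A", "B"]
features = [[0, 1], [3, 4]]


def render(content_template, record_template):
    """Emulate the documented placeholders with string.Template (illustrative only)."""
    records = [
        Template(record_template).substitute(
            features=json.dumps(row),                          # array of feature values
            features_kvp=json.dumps(dict(zip(headers, row))),  # name/value pairs
        )
        for row in features
    ]
    # $records is replaced by the list of templated records.
    return Template(content_template).substitute(records="[" + ", ".join(records) + "]")


as_matrix = render('{"instances": $records}', "$features")
as_objects = render("$records", "$features_kvp")

print(as_matrix)
print(as_objects)
```

Parsing each result with json.loads is a quick way to confirm that a template pair produces valid JSON before it is used in a job.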

Parameters
  • model_name (str) – Model name (as created by CreateModel). Cannot be set when endpoint_name is set. Must be set with instance_count and instance_type.

  • instance_count (int) – The number of instances of a new endpoint for model inference. Cannot be set when endpoint_name is set. Must be set with model_name and instance_type.

  • instance_type (str) – The type of EC2 instance to use for model inference; for example, "ml.c5.xlarge". Cannot be set when endpoint_name is set. Must be set with instance_count and model_name.

  • accept_type (str) – The model output format to be used for getting inferences with the shadow endpoint. Valid values are "text/csv" for CSV, "application/jsonlines" for JSON Lines, and "application/json" for JSON. Default is the same as content_type.

  • content_type (str) – The model input format to be used for getting inferences with the shadow endpoint. Valid values are "text/csv" for CSV, "application/jsonlines" for JSON Lines, and "application/json" for JSON. Default is the same as dataset_format.

  • content_template (str) – A template string to be used to construct the model input from dataset instances. It is only used, and required, when model_content_type is "application/jsonlines" or "application/json". When model_content_type is application/jsonlines, the template should have one and only one placeholder, $features, which will be replaced by a features list for each record to form the model inference input. When model_content_type is application/json, the template can have either placeholder $record, which will be replaced by a single record templated by record_template and only a single record at a time will be sent to the model, or placeholder $records, which will be replaced by a list of records, each templated by record_template.

  • record_template (str) –

    A template string to be used to construct each record of the model input from dataset instances. It is only used, and required, when model_content_type is "application/json". The template string may contain one of the following:

    • Placeholder $features that will be substituted by the array of feature values and/or an optional placeholder $feature_names that will be substituted by the array of feature names.

    • Exactly one placeholder $features_kvp that will be substituted by the key-value pairs of feature name and feature value.

    • Or for each feature, if “A” is the feature name in the headers configuration, then placeholder syntax "${A}" (the double-quotes are part of the placeholder) will be substituted by the feature value.

    record_template will be used in conjunction with content_template to construct the model input.

    Examples:

    Given:

    • headers: ["A", "B"]

    • features: [[0, 1], [3, 4]]

    Example model input 1:

    {
        "instances": [[0, 1], [3, 4]],
        "feature_names": ["A", "B"]
    }
    

    content_template and record_template to construct above:

    • content_template: "{\"instances\": $records}"

    • record_template: "$features"

    Example model input 2:

    [
        { "A": 0, "B": 1 },
        { "A": 3, "B": 4 },
    ]
    

    content_template and record_template to construct above:

    • content_template: "$records"

    • record_template: "$features_kvp"

    Or, alternatively:

    • content_template: "$records"

    • record_template: "{\"A\": \"${A}\", \"B\": \"${B}\"}"

    Example model input 3 (single record only):

    { "A": 0, "B": 1 }
    

    content_template and record_template to construct above:

    • content_template: "$record"

    • record_template: "$features_kvp"

  • custom_attributes (str) – Provides additional information about a request for an inference submitted to a model hosted at an Amazon SageMaker endpoint. The information is an opaque value that is forwarded verbatim. You could use this value, for example, to provide an ID that you can use to track a request or to provide other metadata that a service endpoint was programmed to process. The value must consist of no more than 1024 visible US-ASCII characters as specified in Section 3.3.6. Field Value Components of the Hypertext Transfer Protocol (HTTP/1.1).

  • accelerator_type (str) – SageMaker Elastic Inference accelerator type to deploy to the model endpoint instance for making inferences to the model.

  • endpoint_name_prefix (str) – The endpoint name prefix of a new endpoint. Must follow the pattern ^[a-zA-Z0-9](-*[a-zA-Z0-9]).

  • target_model (str) – Sets the target model name when using a multi-model endpoint. For more information about multi-model endpoints, see https://docs.aws.amazon.com/sagemaker/latest/dg/multi-model-endpoints.html

  • endpoint_name (str) – Sets the endpoint name when re-using an existing endpoint. Cannot be set when model_name, instance_count, and instance_type are set.

  • time_series_model_config (TimeSeriesModelConfig) – Optional. A config object for TimeSeries predictor specific fields, required for TimeSeries explainability use cases.

Raises

ValueError – when any of the following holds:

  • endpoint_name_prefix is invalid

  • accept_type is invalid

  • content_type is invalid

  • content_template has no placeholder “features”

  • both [endpoint_name] AND [model_name, instance_count, instance_type] are set

  • both [endpoint_name] AND [endpoint_name_prefix] are set
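The argument constraints listed above can be sketched as a small check. This is an illustration only: check_model_config_args is a hypothetical helper, not part of the SDK, it covers only the content types named in this section, and it assumes that either endpoint_name or the three new-endpoint arguments must be supplied.

```python
VALID_MIME_TYPES = {"text/csv", "application/jsonlines", "application/json"}


def check_model_config_args(model_name=None, instance_count=None, instance_type=None,
                            endpoint_name=None, endpoint_name_prefix=None,
                            accept_type=None, content_type=None):
    """Hypothetical helper: raise ValueError when a documented constraint is violated."""
    new_endpoint_args = [model_name, instance_count, instance_type]
    if endpoint_name is not None:
        # Re-using an existing endpoint excludes every new-endpoint argument.
        if any(arg is not None for arg in new_endpoint_args):
            raise ValueError("endpoint_name cannot be combined with "
                             "model_name, instance_count, or instance_type")
        if endpoint_name_prefix is not None:
            raise ValueError("endpoint_name cannot be combined with endpoint_name_prefix")
    elif any(arg is None for arg in new_endpoint_args):
        # Assumption: without endpoint_name, all three arguments are needed.
        raise ValueError("model_name, instance_count, and instance_type must be set together")
    for name, value in (("accept_type", accept_type), ("content_type", content_type)):
        if value is not None and value not in VALID_MIME_TYPES:
            raise ValueError(f"invalid {name}: {value!r}")


# A new shadow endpoint and a re-used endpoint both pass the check.
check_model_config_args(model_name="my-model", instance_count=1, instance_type="ml.c5.xlarge")
check_model_config_args(endpoint_name="existing-endpoint")
```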

get_predictor_config()

Returns part of the predictor dictionary of the analysis config.
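To make the interplay of content_template and record_template concrete, the following minimal sketch renders the placeholders with plain string replacement. The function names render_record and render_content are hypothetical, and the substitution logic is an assumption based on the parameter descriptions above, not Clarify's actual implementation.

```python
import json


def render_record(record_template, headers, features):
    """Fill one record template from a single row of feature values."""
    kvp = dict(zip(headers, features))
    out = record_template
    # Per-feature placeholders: the surrounding double quotes belong to the placeholder.
    for name, value in kvp.items():
        out = out.replace('"${' + name + '}"', json.dumps(value))
    # Replace "$features_kvp" before "$features", which is a prefix of it.
    out = out.replace("$features_kvp", json.dumps(kvp))
    out = out.replace("$feature_names", json.dumps(headers))
    out = out.replace("$features", json.dumps(features))
    return out


def render_content(content_template, record_template, headers, rows):
    """Build the model input string for a list of rows."""
    records = [render_record(record_template, headers, row) for row in rows]
    # Replace "$records" before "$record", which is a prefix of it.
    out = content_template.replace("$records", "[" + ", ".join(records) + "]")
    # "$record" carries a single record; this sketch uses the first row.
    return out.replace("$record", records[0])


headers = ["A", "B"]
rows = [[0, 1], [3, 4]]
print(render_content('{"instances": $records}', "$features", headers, rows))
print(render_content("$records", "$features_kvp", headers, rows))
print(render_content("$records", '{"A": "${A}", "B": "${B}"}', headers, rows))
print(render_content("$record", "$features_kvp", headers, rows[:1]))
```

String replacement keeps the sketch short; a production implementation would also need to handle escaping and feature names that collide with placeholder names.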

class sagemaker.clarify.ModelPredictedLabelConfig(label=None, probability=None, probability_threshold=None, label_headers=None)

Bases: object

Config object to extract a predicted label from the model output.

Initializes a model output config to extract the predicted label or predicted score(s).

The following examples show different parameter configurations depending on the endpoint:

  • Regression task: The model returns the score, e.g. 1.2. We don’t need to specify anything. For json output, e.g. {'score': 1.2}, we can set label='score'.

  • Binary classification:

    • The model returns a single probability score. We want to classify as "yes" predictions with a probability score over 0.2. We can set probability_threshold=0.2 and label_headers="yes".

    • The model returns {"probability": 0.3}, for which we would like to apply a threshold of 0.5 to obtain a predicted label in {0, 1}. In this case we can set label="probability".

    • The model returns a tuple of the predicted label and the probability. In this case we can set label=0.

  • Multiclass classification:

    • The model returns {'labels': ['cat', 'dog', 'fish'], 'probabilities': [0.35, 0.25, 0.4]}. In this case we would set probability='probabilities', label='labels', and infer the predicted label to be 'fish'.

    • The model returns {'predicted_label': 'fish', 'probabilities': [0.35, 0.25, 0.4]}. In this case we would set the label='predicted_label'.

    • The model returns [0.35, 0.25, 0.4]. In this case, we can set label_headers=['cat','dog','fish'] and infer the predicted label to be 'fish'.

Parameters
  • label (str or int) – Index or JMESPath expression to locate the prediction in the model output. If this is a predicted label of the same type as the label in the dataset, no further arguments need to be specified.

  • probability (str or int) – Index or JMESPath expression to locate the predicted score(s) in the model output.

  • probability_threshold (float) – An optional value for binary prediction tasks in which the model returns a probability, to indicate the threshold to convert the prediction to a boolean value. Default is 0.5.

  • label_headers (list[str]) – List of headers, each for a predicted score in model output. For bias analysis, it is used to extract the label value with the highest score as predicted label. For explainability jobs, it is used to beautify the analysis report by replacing placeholders like 'label0'.

Raises

TypeError – when the probability_threshold cannot be cast to a float

get_predictor_config()

Returns probability_threshold and predictor config dictionary.
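The label-extraction rules from the examples above can be sketched as follows. The helper predicted_label is hypothetical and only mirrors the documented behavior: a single probability is thresholded, and a list of scores is resolved to the header with the highest score. Whether the comparison is strict is an assumption.

```python
def predicted_label(scores, label_headers=None, probability_threshold=0.5):
    """Hypothetical helper mirroring the documented examples, not Clarify's own code."""
    if isinstance(scores, (int, float)):
        # Binary case: a single probability is compared against the threshold.
        return 1 if scores > probability_threshold else 0
    # Multiclass case: pick the index of the highest score.
    best = max(range(len(scores)), key=lambda i: scores[i])
    return label_headers[best] if label_headers else best


print(predicted_label(0.3))                                  # default threshold of 0.5
print(predicted_label(0.3, probability_threshold=0.2))       # lower threshold
print(predicted_label([0.35, 0.25, 0.4], ["cat", "dog", "fish"]))
```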

class sagemaker.clarify.ExplainabilityConfig

Bases: ABC

Abstract config class to configure an explainability method.

abstract get_explainability_config()

Returns config.

class sagemaker.clarify.PDPConfig(features=None, grid_resolution=15, top_k_features=10)

Bases: ExplainabilityConfig

Config class for Partial Dependence Plots (PDP).

PDPs show the marginal effect (the dependence) a subset of features has on the predicted outcome of an ML model.

When PDP is requested (by passing in a PDPConfig to the explainability_config parameter of SageMakerClarifyProcessor), the Partial Dependence Plots are included in the output report and the corresponding values are included in the analysis output.

Initializes PDP config.

Parameters
  • features (None or list) – List of feature names or indices for which partial dependence plots are computed and plotted. When SHAPConfig is provided, this parameter is optional, as Clarify will compute the partial dependence plots for top features based on SHAP attributions. When SHAPConfig is not provided, features must be provided.

  • grid_resolution (int) – When using numerical features, this integer represents the number of buckets that the range of values must be divided into. This decides the granularity of the grid in which the PDPs are plotted.

  • top_k_features (int) – Sets the number of top SHAP attributes used to compute partial dependence plots.

get_explainability_config()

Returns PDP config dictionary.
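As a rough illustration of what a partial dependence curve is, the sketch below fixes one numerical feature to each grid value in turn and averages the model predictions over all rows. The function partial_dependence and the evenly spaced grid are assumptions made for this example; Clarify's own bucketing may differ.

```python
def partial_dependence(predict, rows, feature_index, grid_resolution=15):
    """Return (grid, curve) for one numerical feature; a sketch, not Clarify's code."""
    values = [row[feature_index] for row in rows]
    lo, hi = min(values), max(values)
    if grid_resolution < 2 or lo == hi:
        grid = [lo]
    else:
        step = (hi - lo) / (grid_resolution - 1)
        grid = [lo + i * step for i in range(grid_resolution)]
    curve = []
    for grid_value in grid:
        predictions = []
        for row in rows:
            modified = list(row)
            modified[feature_index] = grid_value  # override the feature of interest
            predictions.append(predict(modified))
        # Marginalize over all other features by averaging across rows.
        curve.append(sum(predictions) / len(predictions))
    return grid, curve


# Toy model: prediction = 2 * x0 + x1.
grid, curve = partial_dependence(lambda r: 2 * r[0] + r[1], [[0, 1], [4, 3]], 0, grid_resolution=3)
print(grid, curve)
```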

class sagemaker.clarify.TextConfig(granularity, language)

Bases: object

Config object to handle text features for text explainability.

SHAP analysis breaks down longer text into chunks (e.g. tokens, sentences, or paragraphs) and replaces them with the strings specified in the baseline for that feature. The SHAP value of a chunk then captures how much replacing it affects the prediction.

Initializes a text configuration.

Parameters
  • granularity (str) – Determines the granularity into which text features are broken down. Accepted values are "token", "sentence", or "paragraph". SHAP values are computed for these units.

  • language (str) – Specifies the language of the text features. Accepted values are one of the following: "chinese", "danish", "dutch", "english", "french", "german", "greek", "italian", "japanese", "lithuanian", "multi-language", "norwegian bokmål", "polish", "portuguese", "romanian", "russian", "spanish", "afrikaans", "albanian", "arabic", "armenian", "basque", "bengali", "bulgarian", "catalan", "croatian", "czech", "estonian", "finnish", "gujarati", "hebrew", "hindi", "hungarian", "icelandic", "indonesian", "irish", "kannada", "kyrgyz", "latvian", "ligurian", "luxembourgish", "macedonian", "malayalam", "marathi", "nepali", "persian", "sanskrit", "serbian", "setswana", "sinhala", "slovak", "slovenian", "swedish", "tagalog", "tamil", "tatar", "telugu", "thai", "turkish", "ukrainian", "urdu", "vietnamese", "yoruba". Use “multi-language” for a mix of multiple languages. The corresponding two-letter ISO codes are also accepted.

Raises

ValueError – when granularity is not in list of supported values or language is not in list of supported values

get_text_config()

Returns a text config dictionary, part of the analysis config dictionary.
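The chunk-and-replace idea behind the granularity setting can be sketched with naive splitting rules. Both split_text and mask_chunk are hypothetical helpers, and the whitespace and punctuation rules are simplifying assumptions; the real analysis uses language-aware processing.

```python
import re


def split_text(text, granularity):
    """Split text into chunks for one of the three documented granularities."""
    if granularity == "token":
        return text.split()
    if granularity == "sentence":
        # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if granularity == "paragraph":
        # Naive rule: paragraphs are separated by blank lines.
        return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    raise ValueError(f"unsupported granularity: {granularity!r}")


def mask_chunk(chunks, index, baseline="[MASK]"):
    """Replace one chunk with the baseline string, as done when perturbing text."""
    return [baseline if i == index else chunk for i, chunk in enumerate(chunks)]


chunks = split_text("Good film. Bad ending!", "sentence")
print(chunks)
print(mask_chunk(chunks, 0))
```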

class sagemaker.clarify.ImageConfig(model_type, num_segments=None, feature_extraction_method=None, segment_compactness=None, max_objects=None, iou_threshold=None, context=None)

Bases: object

Config object for handling images.

Initializes a config object for Computer Vision (CV) Image explainability.

SHAP for CV explainability generates heat maps that visualize feature attributions for input images. These heat maps highlight the image’s features according to how much they contribute to the CV model prediction.

"IMAGE_CLASSIFICATION" and "OBJECT_DETECTION" are the two supported CV use cases.

Parameters
  • model_type (str) – Specifies the type of CV model and use case. Accepted options: "IMAGE_CLASSIFICATION" or "OBJECT_DETECTION".

  • num_segments (None or int) – Approximate number of segments to generate when running SKLearn’s SLIC method for image segmentation to generate features/superpixels. The default is None. When set to None, runs SLIC with 20 segments.

  • feature_extraction_method (None or str) – Method used for extracting features from the image (e.g. “segmentation”). Default is "segmentation".

  • segment_compactness (None or float) – Balances color proximity and space proximity. Higher values give more weight to space proximity, making superpixel shapes more square/cubic. We recommend exploring possible values on a log scale, e.g., 0.01, 0.1, 1, 10, 100, before refining around a chosen value. The default is None. When set to None, runs with the default value of 5.

  • max_objects (None or int) – Maximum number of objects displayed when running SHAP with an "OBJECT_DETECTION" model. The Object detection algorithm may detect more than the max_objects number of objects in a single image. In that case, the algorithm displays the top max_objects number of objects according to confidence score. Default value is None. In the "OBJECT_DETECTION" case, passing in None leads to a default value of 3.

  • iou_threshold (None or float) – Minimum intersection over union for the object bounding box to consider its confidence score for computing SHAP values, in the range [0.0, 1.0]. Used only for the "OBJECT_DETECTION" case, where passing in None sets the default value of 0.5.

  • context (None or float) – The portion of the image outside the bounding box used in SHAP analysis, in the range [0.0, 1.0]. If set to 1.0, the whole image is considered; if set to 0.0, only the image inside the bounding box is considered. Only used for the "OBJECT_DETECTION" case, where passing in None sets the default value of 1.0.

get_image_config()

Returns the image config part of an analysis config dictionary.
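For readers unfamiliar with the quantity behind iou_threshold, the sketch below computes intersection over union for two axis-aligned boxes. The (x1, y1, x2, y2) box layout and the function name iou are assumptions made for this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: 1 / (4 + 4 - 1).
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

With the default iou_threshold of 0.5, this pair of boxes would fall below the threshold.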

class sagemaker.clarify.SHAPConfig(baseline=None, num_samples=None, agg_method=None, use_logit=False, save_local_shap_values=True, seed=None, num_clusters=None, text_config=None, image_config=None, features_to_explain=None)

Bases: ExplainabilityConfig

Config class for SHAP.

The SHAP algorithm calculates feature attributions by computing the contribution of each feature to the prediction outcome, using the concept of Shapley values.

These attributions can be provided for specific predictions (locally) and at a global level for the model as a whole.

Initializes config for SHAP analysis.

Parameters
  • baseline (None or str or list or dict) – Baseline dataset for the Kernel SHAP algorithm, accepted in the form of: S3 object URI, a list of rows (with at least one element), or None (for no input baseline). The baseline dataset must have the same format as the input dataset specified in DataConfig. Each row must have only the feature columns/values and omit the label column/values. If None, a baseline will be calculated automatically on the input dataset using K-means (for numerical data) or K-prototypes (if there is categorical data).

  • num_samples (None or int) – Number of samples to be used in the Kernel SHAP algorithm. This number determines the size of the generated synthetic dataset used to compute the SHAP values. If not provided, the Clarify job chooses a suitable value based on the number of features.

  • agg_method (None or str) – Aggregation method for global SHAP values. Valid values are "mean_abs" (mean of absolute SHAP values for all instances), "median" (median of SHAP values for all instances) and "mean_sq" (mean of squared SHAP values for all instances). If None is provided, then Clarify job uses the method "mean_abs".

  • use_logit (bool) – Indicates whether to apply the logit function to model predictions. Default is False. If use_logit is true then the SHAP values will have log-odds units.

  • save_local_shap_values (bool) – Indicates whether to save the local SHAP values in the output location. Default is True.

  • seed (int) – Seed value to get deterministic SHAP values. Default is None.

  • num_clusters (None or int) – If a baseline is not provided, Clarify automatically computes a baseline dataset via a clustering algorithm (K-means/K-prototypes), which takes num_clusters as a parameter. num_clusters will be the resulting size of the baseline dataset. If not provided, Clarify job uses a default value.

  • text_config (TextConfig) – Config object for handling text features. Default is None.

  • image_config (ImageConfig) – Config for handling image features. Default is None.

  • features_to_explain (Optional[List[Union[str, int]]]) – A list of names or indices of dataset features to compute SHAP values for. If not provided, SHAP values are computed for all features by default. Currently only supported for tabular datasets.

Raises

ValueError – when agg_method is invalid, baseline and num_clusters are provided together, or features_to_explain is specified when text_config or image_config is provided

get_explainability_config()

Returns a shap config dictionary.
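The three agg_method options can be illustrated on a list of local SHAP values for one feature. The helper aggregate_shap is hypothetical; it only restates the documented definitions and the documented fallback to "mean_abs".

```python
from statistics import mean, median


def aggregate_shap(local_values, agg_method=None):
    """Aggregate per-instance SHAP values of one feature into a global value."""
    method = agg_method or "mean_abs"  # documented default when None is passed
    if method == "mean_abs":
        return mean(abs(v) for v in local_values)
    if method == "median":
        return median(local_values)
    if method == "mean_sq":
        return mean(v * v for v in local_values)
    raise ValueError(f"invalid agg_method: {agg_method!r}")


local_values = [-1.0, 2.0, 3.0]
for method in ("mean_abs", "median", "mean_sq"):
    print(method, aggregate_shap(local_values, method))
```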

class sagemaker.clarify.AsymmetricShapleyValueConfig(direction='chronological', granularity='timewise', num_samples=None, baseline=None)

Bases: ExplainabilityConfig

Config class for Asymmetric Shapley value algorithm for time series explainability.

Asymmetric Shapley Values are a variant of the Shapley Value that drop the symmetry axiom [1]. We use these to determine how features contribute to the forecasting outcome. Asymmetric Shapley values can take into account the temporal dependencies of the time series that forecasting models take as input.

[1] Frye, Christopher, Colin Rowat, and Ilya Feige. “Asymmetric shapley values: incorporating causal knowledge into model-agnostic explainability.” NeurIPS (2020). https://doi.org/10.48550/arXiv.1910.06358

Initializes config for time series explainability with Asymmetric Shapley Values.

AsymmetricShapleyValueConfig is used specifically and only for TimeSeries explainability purposes.

Parameters
  • direction (str) – Type of explanation to be used. Available explanation types are "chronological", "anti_chronological", and "bidirectional".

  • granularity (str) – Explanation granularity to be used. Available granularity options are "timewise" and "fine_grained".

  • num_samples (None or int) – Number of samples to be used in the Asymmetric Shapley Value forecasting algorithm. Only applicable when using "fine_grained" explanations.

  • baseline (str or dict) –

    Link to a baseline configuration or a dictionary for it. The baseline config is used to replace out-of-coalition values for the corresponding datasets (also known as background data). For temporal data (target time series, related time series), the baseline value types are “zero”, where all out-of-coalition values will be replaced with 0.0, or “mean”, where all out-of-coalition values will be replaced with the average of a time series. For static data (static covariates), a baseline value for each covariate should be provided for each possible item_id. An example config follows, where item1 and item2 are item ids:

    {
        "target_time_series": "zero",
        "related_time_series": "zero",
        "static_covariates": {
            "item1": [1, 1],
            "item2": [0, 1]
        }
    }
    

Raises

ValueError – when direction or granularity are not valid, num_samples is not provided for fine-grained explanations, num_samples is provided for non fine-grained explanations, or when direction is not "chronological" while granularity is "fine_grained".

get_explainability_config()

Returns an asymmetric shap config dictionary.
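The effect of the "zero" and "mean" baselines on temporal data can be sketched as below. The helper apply_baseline and the boolean in_coalition mask are assumptions made for illustration, not the algorithm's actual data structures.

```python
def apply_baseline(series, in_coalition, baseline="zero"):
    """Replace out-of-coalition time steps with the baseline value (illustrative only)."""
    if baseline == "zero":
        fill = 0.0
    elif baseline == "mean":
        fill = sum(series) / len(series)  # average of the whole time series
    else:
        raise ValueError(f"unsupported baseline: {baseline!r}")
    return [value if keep else fill for value, keep in zip(series, in_coalition)]


series = [2.0, 4.0, 9.0]
mask = [True, False, True]  # the middle time step is out of the coalition
print(apply_baseline(series, mask, "zero"))
print(apply_baseline(series, mask, "mean"))
```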

class sagemaker.clarify.SageMakerClarifyProcessor(role=None, instance_count=None, instance_type=None, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, max_runtime_in_seconds=None, sagemaker_session=None, env=None, tags=None, network_config=None, job_name_prefix=None, version=None, skip_early_validation=False)

Bases: Processor

Handles SageMaker Processing tasks to compute bias metrics and model explanations.

Initializes a SageMakerClarifyProcessor to compute bias metrics and model explanations.

Instance of Processor.

Parameters
  • role (str) – An AWS IAM role name or ARN. Amazon SageMaker Processing uses this role to access AWS resources, such as data stored in Amazon S3.

  • instance_count (int) – The number of instances to run a processing job with.

  • instance_type (str) –

    The type of EC2 instance to use for processing; for example, "ml.c5.xlarge".

  • volume_size_in_gb (int) – Size in GB of the EBS volume to use for storing data during processing (default: 30 GB).

  • volume_kms_key (str) – A KMS key for the processing volume (default: None).

  • output_kms_key (str) – The KMS key ID for processing job outputs (default: None).

  • max_runtime_in_seconds (int) – Timeout in seconds (default: None). After this amount of time, Amazon SageMaker terminates the job, regardless of its current status. If max_runtime_in_seconds is not specified, the default value is 86400 seconds (24 hours).

  • sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the Processor creates a Session using the default AWS configuration chain.

  • env (dict[str, str]) – Environment variables to be passed to the processing jobs (default: None).

  • tags (Optional[Tags]) – Tags to be passed to the processing job (default: None). For more, see https://docs.aws.amazon.com/sagemaker/latest/dg/API_Tag.html.

  • network_config (NetworkConfig) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets.

  • job_name_prefix (str) – Processing job name prefix.

  • version (str) – Clarify version to use.

  • skip_early_validation (bool) – To skip schema validation of the generated analysis_schema.json.

run(**_)

Overriding the base class method but deferring to specific run_* methods.

run_pre_training_bias(data_config, data_bias_config, methods='all', wait=True, logs=True, job_name=None, kms_key=None, experiment_config=None)

Runs a ProcessingJob to compute pre-training bias methods.

Computes the requested methods on the input data. The methods compare metrics (e.g. fraction of examples) for the sensitive group(s) vs. the other examples.
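As an illustration of how such metrics compare the sensitive group with the other examples, the sketch below computes two of them, class imbalance (CI) and difference in proportions of labels (DPL), following their usual Clarify definitions. The function names and the boolean facet encoding are assumptions made for this example.

```python
def class_imbalance(is_sensitive):
    """CI = (n_a - n_d) / (n_a + n_d), where d is the sensitive group."""
    n_d = sum(1 for flag in is_sensitive if flag)
    n_a = len(is_sensitive) - n_d
    return (n_a - n_d) / (n_a + n_d)


def difference_in_proportions_of_labels(is_sensitive, labels, positive=1):
    """DPL = q_a - q_d, the gap in positive-label rates between the two groups."""
    def positive_rate(group):
        rows = [y for flag, y in zip(is_sensitive, labels) if flag == group]
        return sum(1 for y in rows if y == positive) / len(rows)

    return positive_rate(False) - positive_rate(True)


is_sensitive = [False, False, False, True]
labels = [1, 1, 0, 0]
print(class_imbalance(is_sensitive))
print(difference_in_proportions_of_labels(is_sensitive, labels))
```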

Parameters
  • data_config (DataConfig) – Config of the input/output data.

  • data_bias_config (BiasConfig) – Config of sensitive groups.

  • methods (str or list[str]) – Selects a subset of potential metrics: [“CI”, “DPL”, “KL”, “JS”, “LP”, “TVD”, “KS”, “CDDL”]. Defaults to str “all” to run all metrics if left unspecified.

  • wait (bool) – Whether the call should wait until the job completes (default: True).

  • logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).

  • job_name (str) – Processing job name. When job_name is not specified, if job_name_prefix in SageMakerClarifyProcessor is specified, the job name will be the job_name_prefix and current timestamp; otherwise use "Clarify-Pretraining-Bias" as prefix.

  • kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).

  • experiment_config (dict[str, str]) –

    Experiment management configuration. Optionally, the dict can contain three keys: 'ExperimentName', 'TrialName', and 'TrialComponentDisplayName'.

    The behavior of setting these keys is as follows:

    • If 'ExperimentName' is supplied but 'TrialName' is not, a Trial will be automatically created and the job’s Trial Component associated with the Trial.

    • If 'TrialName' is supplied and the Trial already exists, the job’s Trial Component will be associated with the Trial.

    • If both 'ExperimentName' and 'TrialName' are not supplied, the Trial Component will be unassociated.

    • 'TrialComponentDisplayName' is used for display in Amazon SageMaker Studio.
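The job naming rule described for job_name can be sketched as a small function. The helper resolve_job_name and the timestamp format are assumptions; the SDK's actual timestamp suffix may look different.

```python
import time


def resolve_job_name(job_name=None, job_name_prefix=None,
                     default_prefix="Clarify-Pretraining-Bias", now=None):
    """Pick the job name: explicit name, else prefix plus timestamp (hypothetical helper)."""
    if job_name:
        return job_name
    prefix = job_name_prefix or default_prefix
    # Assumed timestamp layout; time.gmtime(None) uses the current time.
    stamp = time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime(now))
    return f"{prefix}-{stamp}"


print(resolve_job_name(job_name="my-job"))
print(resolve_job_name(job_name_prefix="team-a"))
print(resolve_job_name())
```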

run_post_training_bias(data_config, data_bias_config, model_config=None, model_predicted_label_config=None, methods='all', wait=True, logs=True, job_name=None, kms_key=None, experiment_config=None)

Runs a ProcessingJob to compute post-training bias methods.

Spins up a model endpoint and runs inference over the input dataset in the s3_data_input_path (from the DataConfig) to obtain predicted labels. Using model predictions, computes the requested posttraining bias methods that compare metrics (e.g. accuracy, precision, recall) for the sensitive group(s) versus the other examples.
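To show how model predictions enter these comparisons, the sketch below computes three post-training metrics, DPPL, DI, and AD, following their usual Clarify definitions. The function post_training_metrics and the boolean facet encoding are assumptions made for this example.

```python
def post_training_metrics(is_sensitive, labels, predictions, positive=1):
    """Return DPPL, DI and AD for a binary facet (d is the sensitive group)."""
    def group(flag):
        return [(y, p) for s, y, p in zip(is_sensitive, labels, predictions) if s == flag]

    def predicted_positive_rate(rows):
        return sum(1 for _, p in rows if p == positive) / len(rows)

    def accuracy(rows):
        return sum(1 for y, p in rows if y == p) / len(rows)

    rows_a, rows_d = group(False), group(True)
    q_a, q_d = predicted_positive_rate(rows_a), predicted_positive_rate(rows_d)
    return {
        "DPPL": q_a - q_d,                           # difference in predicted positive rates
        "DI": q_d / q_a,                             # disparate impact (ratio); needs q_a > 0
        "AD": accuracy(rows_a) - accuracy(rows_d),   # accuracy difference
    }


print(post_training_metrics(
    is_sensitive=[False, False, True, True],
    labels=[1, 0, 1, 0],
    predictions=[1, 1, 1, 0],
))
```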

Parameters
  • data_config (DataConfig) – Config of the input/output data.

  • data_bias_config (BiasConfig) – Config of sensitive groups.

  • model_config (ModelConfig) – Config of the model and its endpoint to be created. This is required unless predicted_label_dataset_uri or predicted_label is provided in data_config.

  • model_predicted_label_config (ModelPredictedLabelConfig) – Config of how to extract the predicted label from the model output.

  • methods (str or list[str]) – Selector of a subset of potential metrics: [“DPPL”, “DI”, “DCA”, “DCR”, “RD”, “DAR”, “DRR”, “AD”, “CDDPL”, “TE”, “FT”]. Defaults to str “all” to run all metrics if left unspecified.

  • wait (bool) – Whether the call should wait until the job completes (default: True).

  • logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).

  • job_name (str) – Processing job name. When job_name is not specified, if job_name_prefix in SageMakerClarifyProcessor is specified, the job name will be the job_name_prefix and current timestamp; otherwise use "Clarify-Posttraining-Bias" as prefix.

  • kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).

  • experiment_config (dict[str, str]) –

    Experiment management configuration. Optionally, the dict can contain three keys: 'ExperimentName', 'TrialName', and 'TrialComponentDisplayName'.

    The behavior of setting these keys is as follows:

    • If 'ExperimentName' is supplied but 'TrialName' is not, a Trial will be automatically created and the job’s Trial Component associated with the Trial.

    • If 'TrialName' is supplied and the Trial already exists, the job’s Trial Component will be associated with the Trial.

    • If both 'ExperimentName' and 'TrialName' are not supplied, the Trial Component will be unassociated.

    • 'TrialComponentDisplayName' is used for display in Amazon SageMaker Studio.

run_bias(data_config, bias_config, model_config=None, model_predicted_label_config=None, pre_training_methods='all', post_training_methods='all', wait=True, logs=True, job_name=None, kms_key=None, experiment_config=None)

Runs a ProcessingJob to compute the requested bias methods.

Computes metrics for both the pre-training and the post-training methods. To calculate post-training methods, it spins up a model endpoint and runs inference over the input examples in ‘s3_data_input_path’ (from the DataConfig) to obtain predicted labels.

Parameters
  • data_config (DataConfig) – Config of the input/output data.

  • bias_config (BiasConfig) – Config of sensitive groups.

  • model_config (ModelConfig) – Config of the model and its endpoint to be created. This is required unless predicted_label_dataset_uri or predicted_label is provided in data_config.

  • model_predicted_label_config (ModelPredictedLabelConfig) – Config of how to extract the predicted label from the model output.

  • pre_training_methods (str or list[str]) –

    Selector of a subset of potential metrics: [“CI”, “DPL”, “KL”, “JS”, “LP”, “TVD”, “KS”, “CDDL”]. Defaults to str “all” to run all metrics if left unspecified.

  • post_training_methods (str or list[str]) –

    Selector of a subset of potential metrics: [“DPPL”, “DI”, “DCA”, “DCR”, “RD”, “DAR”, “DRR”, “AD”, “CDDPL”, “TE”, “FT”]. Defaults to str “all” to run all metrics if left unspecified.

  • wait (bool) – Whether the call should wait until the job completes (default: True).

  • logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).

  • job_name (str) – Processing job name. When job_name is not specified, if job_name_prefix in SageMakerClarifyProcessor is specified, the job name will be job_name_prefix and the current timestamp; otherwise use "Clarify-Bias" as prefix.

  • kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).

  • experiment_config (dict[str, str]) –

    Experiment management configuration. Optionally, the dict can contain three keys: 'ExperimentName', 'TrialName', and 'TrialComponentDisplayName'.

    The behavior of setting these keys is as follows:

    • If 'ExperimentName' is supplied but 'TrialName' is not, a Trial will be automatically created and the job’s Trial Component associated with the Trial.

    • If 'TrialName' is supplied and the Trial already exists, the job’s Trial Component will be associated with the Trial.

    • If both 'ExperimentName' and 'TrialName' are not supplied, the Trial Component will be unassociated.

    • 'TrialComponentDisplayName' is used for display in Amazon SageMaker Studio.
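How the two method selectors might translate into the methods part of an analysis config is sketched below. The dictionary layout, the helper names, and the validation against the lists above are assumptions for illustration; the SDK may not validate names in this way.

```python
PRE_TRAINING = ["CI", "DPL", "KL", "JS", "LP", "TVD", "KS", "CDDL"]
POST_TRAINING = ["DPPL", "DI", "DCA", "DCR", "RD", "DAR", "DRR", "AD", "CDDPL", "TE", "FT"]


def normalize_methods(methods, allowed):
    """Return "all" unchanged, or a validated list of metric names."""
    if methods == "all":
        return "all"
    if isinstance(methods, str):
        methods = [methods]
    unknown = [m for m in methods if m not in allowed]
    if unknown:
        raise ValueError(f"unknown metrics: {unknown}")
    return list(methods)


def bias_methods_config(pre_training_methods="all", post_training_methods="all"):
    """Hypothetical builder for the bias part of an analysis config."""
    return {
        "pre_training_bias": {"methods": normalize_methods(pre_training_methods, PRE_TRAINING)},
        "post_training_bias": {"methods": normalize_methods(post_training_methods, POST_TRAINING)},
    }


print(bias_methods_config(pre_training_methods=["CI", "DPL"]))
```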

run_explainability(data_config, model_config, explainability_config, model_scores=None, wait=True, logs=True, job_name=None, kms_key=None, experiment_config=None)

Runs a ProcessingJob computing feature attributions.

Spins up a model endpoint.

Currently, only SHAP and Partial Dependence Plots (PDP) are supported as explainability methods. You can request both methods or one at a time with the explainability_config parameter.

When SHAP is requested in the explainability_config, the SHAP algorithm calculates the feature importance for each input example in the s3_data_input_path of the DataConfig, by creating num_samples copies of the example with a subset of features replaced with values from the baseline. It then runs model inference to see how the model’s prediction changes with the replaced features. If the model output returns multiple scores, importance is computed for each score. Across examples, feature importance is aggregated using agg_method.

When PDP is requested in the explainability_config, the PDP algorithm calculates the dependence of the target response on the input features and marginalizes over the values of all other input features. The Partial Dependence Plots are included in the output report and the corresponding values are included in the analysis output.
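The baseline-replacement idea behind SHAP can be illustrated with an exact Shapley computation on a tiny model. Note the swap: this sketch enumerates every feature subset instead of drawing num_samples perturbed copies as Kernel SHAP does, so it is only practical for a handful of features. The function shapley_values is a hypothetical helper.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, example, baseline):
    """Exact Shapley values for one example; features outside a coalition use the baseline."""
    n = len(example)

    def value(coalition):
        row = [example[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(row)

    attributions = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight of a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                attributions[i] += weight * (value(coalition | {i}) - value(coalition))
    return attributions


# Toy linear model: prediction = 2 * x0 + 3 * x1.
print(shapley_values(lambda r: 2 * r[0] + 3 * r[1], example=[1, 1], baseline=[0, 0]))
```

For a linear model and a zero baseline, each attribution equals the feature's coefficient times its value, which makes the result easy to check by hand.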

Parameters
  • data_config (DataConfig) – Config of the input/output data.

  • model_config (ModelConfig) – Config of the model and its endpoint to be created.

  • explainability_config (ExplainabilityConfig or list) – Config of the specific explainability method or a list of ExplainabilityConfig objects. Currently, SHAP and PDP are the two methods supported. You can request multiple methods at once by passing in a list of ExplainabilityConfig objects.

  • model_scores (int or str or ModelPredictedLabelConfig) – Index or JMESPath expression to locate the predicted scores in the model output. This is not required if the model output is a single score. Alternatively, it can be an instance of ModelPredictedLabelConfig to provide more parameters like label_headers.

  • wait (bool) – Whether the call should wait until the job completes (default: True).

  • logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).

  • job_name (str) – Processing job name. When job_name is not specified, if job_name_prefix in SageMakerClarifyProcessor is specified, the job name will be composed of job_name_prefix and current timestamp; otherwise use "Clarify-Explainability" as prefix.

  • kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).

  • experiment_config (dict[str, str]) –

    Experiment management configuration. Optionally, the dict can contain three keys: 'ExperimentName', 'TrialName', and 'TrialComponentDisplayName'.

    The behavior of setting these keys is as follows:

    • If 'ExperimentName' is supplied but 'TrialName' is not, a Trial will be automatically created and the job’s Trial Component associated with the Trial.

    • If 'TrialName' is supplied and the Trial already exists, the job’s Trial Component will be associated with the Trial.

    • If both 'ExperimentName' and 'TrialName' are not supplied, the Trial Component will be unassociated.

    • 'TrialComponentDisplayName' is used for display in Amazon SageMaker Studio.

run_bias_and_explainability(data_config, model_config, explainability_config, bias_config, pre_training_methods='all', post_training_methods='all', model_predicted_label_config=None, wait=True, logs=True, job_name=None, kms_key=None, experiment_config=None)

Runs a ProcessingJob computing bias metrics and feature attributions.

For bias: Computes metrics for both the pre-training and the post-training methods. To calculate post-training methods, it spins up a model endpoint and runs inference over the input examples in ‘s3_data_input_path’ (from the DataConfig) to obtain predicted labels.

For Explainability: Spins up a model endpoint.

Currently, only SHAP and Partial Dependence Plots (PDP) are supported as explainability methods. You can request both methods or one at a time with the explainability_config parameter.

When SHAP is requested in the explainability_config, the SHAP algorithm calculates the feature importance for each input example in the s3_data_input_path of the DataConfig, by creating num_samples copies of the example with a subset of features replaced with values from the baseline. It then runs model inference to see how the model’s prediction changes with the replaced features. If the model output returns multiple scores, importance is computed for each score. Across examples, feature importance is aggregated using agg_method.

When PDP is requested in the explainability_config, the PDP algorithm calculates the dependence of the target response on the input features and marginalizes over the values of all other input features. The Partial Dependence Plots are included in the output report and the corresponding values are included in the analysis output.

Parameters
  • data_config (DataConfig) – Config of the input/output data.

  • model_config (ModelConfig) – Config of the model and its endpoint to be created.

  • explainability_config (ExplainabilityConfig or list) – Config of the specific explainability method or a list of ExplainabilityConfig objects. Currently, SHAP and PDP are the two methods supported. You can request multiple methods at once by passing in a list of ExplainabilityConfig objects.

  • bias_config (BiasConfig) – Config of sensitive groups.

  • pre_training_methods (str or list[str]) –

    Selector of a subset of potential metrics: [“CI”, “DPL”, “KL”, “JS”, “LP”, “TVD”, “KS”, “CDDL”]. Defaults to str “all” to run all metrics if left unspecified.

  • post_training_methods (str or list[str]) –

    Selector of a subset of potential metrics: [“DPPL”, “DI”, “DCA”, “DCR”, “RD”, “DAR”, “DRR”, “AD”, “CDDPL”, “TE”, “FT”]. Defaults to str “all” to run all metrics if left unspecified.

  • model_predicted_label_config (int or str or ModelPredictedLabelConfig) – Index or JMESPath expression to locate the predicted scores in the model output. This is not required if the model output is a single score. Alternatively, it can be an instance of ModelPredictedLabelConfig to provide more parameters like label_headers.

  • wait (bool) – Whether the call should wait until the job completes (default: True).

  • logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).

  • job_name (str) – Processing job name. When job_name is not specified and job_name_prefix is set in SageMakerClarifyProcessor, the job name is composed of job_name_prefix and the current timestamp; otherwise "Clarify-Explainability" is used as the prefix.

  • kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).

  • experiment_config (dict[str, str]) –

    Experiment management configuration. Optionally, the dict can contain three keys: 'ExperimentName', 'TrialName', and 'TrialComponentDisplayName'.

    The behavior of setting these keys is as follows:

    • If 'ExperimentName' is supplied but 'TrialName' is not, a Trial will be automatically created and the job’s Trial Component associated with the Trial.

    • If 'TrialName' is supplied and the Trial already exists, the job’s Trial Component will be associated with the Trial.

    • If both 'ExperimentName' and 'TrialName' are not supplied, the Trial Component will be unassociated.

    • 'TrialComponentDisplayName' is used for display in Amazon SageMaker Studio.

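
As a rough illustration of how the optional arguments above fit together, the following sketch assembles keyword arguments for run_bias_and_explainability in plain Python. The helpers select_methods and build_experiment_config, and the experiment name "my-experiment", are hypothetical and not part of the SDK; the metric names are copied from the parameter descriptions above. Wrapping a single metric name in a list is a choice of this sketch, not documented SDK behavior.

```python
PRE_TRAINING_METRICS = ["CI", "DPL", "KL", "JS", "LP", "TVD", "KS", "CDDL"]
POST_TRAINING_METRICS = [
    "DPPL", "DI", "DCA", "DCR", "RD", "DAR", "DRR", "AD", "CDDPL", "TE", "FT",
]


def select_methods(selected, allowed):
    """Return "all" unchanged, or a validated list of metric names."""
    if selected == "all":
        return "all"
    if isinstance(selected, str):
        selected = [selected]  # sketch-only convenience: wrap a single name
    unknown = [name for name in selected if name not in allowed]
    if unknown:
        raise ValueError(f"unsupported metrics: {unknown}")
    return list(selected)


def build_experiment_config(experiment_name=None, trial_name=None, display_name=None):
    """Build the experiment_config dict, including only the keys that are set."""
    config = {}
    if experiment_name:
        config["ExperimentName"] = experiment_name
    if trial_name:
        config["TrialName"] = trial_name
    if display_name:
        config["TrialComponentDisplayName"] = display_name
    return config or None  # None when no key is supplied


run_kwargs = {
    "pre_training_methods": select_methods(["CI", "DPL"], PRE_TRAINING_METRICS),
    "post_training_methods": select_methods("all", POST_TRAINING_METRICS),
    # Only 'ExperimentName' is set, so a Trial would be created automatically.
    "experiment_config": build_experiment_config(experiment_name="my-experiment"),
    "wait": True,
    "logs": True,
}
```

Given a SageMakerClarifyProcessor instance and the four config objects built elsewhere, the call would then take the form processor.run_bias_and_explainability(data_config, model_config, explainability_config, bias_config, **run_kwargs).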

class sagemaker.clarify.ProcessingOutputHandler

Bases: object

Class to handle the parameters for a SageMaker processor's ProcessingOutput.

class S3UploadMode(value)

Bases: Enum

Enum values for the different modes of upload to an S3 bucket.

CONTINUOUS = 'Continuous'

ENDOFJOB = 'EndOfJob'

classmethod get_s3_upload_mode(analysis_config)

Fetches the s3_upload_mode based on the shap_config values.

Parameters

analysis_config (dict) – Config following the analysis_config.json format.

Returns

The s3_upload_mode type for the processing output.

Return type

str