Processing

This module contains code related to the Processor class, which is used for Amazon SageMaker Processing Jobs. These jobs let users perform data pre-processing, post-processing, feature engineering, data validation, model evaluation, and interpretation on Amazon SageMaker.

class sagemaker.processing.Processor(role, image_uri, instance_count, instance_type, entrypoint=None, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)

Bases: object

Handles Amazon SageMaker Processing tasks.

Initializes a Processor instance.

The Processor handles Amazon SageMaker Processing tasks.

Parameters
  • role (str) – An AWS IAM role name or ARN. Amazon SageMaker Processing uses this role to access AWS resources, such as data stored in Amazon S3.

  • image_uri (str) – The URI of the Docker image to use for the processing jobs.

  • instance_count (int) – The number of instances to run a processing job with.

  • instance_type (str) – The type of EC2 instance to use for processing, for example, ‘ml.c4.xlarge’.

  • entrypoint (list[str]) – The entrypoint for the processing job (default: None). This is in the form of a list of strings that make a command.

  • volume_size_in_gb (int) – Size in GB of the EBS volume to use for storing data during processing (default: 30).

  • volume_kms_key (str) – A KMS key for the processing volume (default: None).

  • output_kms_key (str) – The KMS key ID for processing job outputs (default: None).

  • max_runtime_in_seconds (int) – Timeout in seconds (default: None). After this amount of time, Amazon SageMaker terminates the job, regardless of its current status. If max_runtime_in_seconds is not specified, the default value is 24 hours.

  • base_job_name (str) – Prefix for processing job name. If not specified, the processor generates a default job name, based on the processing image name and current timestamp.

  • sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.

  • env (dict[str, str]) – Environment variables to be passed to the processing jobs (default: None).

  • tags (list[dict]) – List of tags to be passed to the processing job (default: None). For more, see https://docs.aws.amazon.com/sagemaker/latest/dg/API_Tag.html.

  • network_config (NetworkConfig) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets.

run(inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None, kms_key=None)

Runs a processing job.

Parameters
  • inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).

  • outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).

  • arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).

  • wait (bool) – Whether the call should wait until the job completes (default: True).

  • logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).

  • job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.

  • experiment_config (dict[str, str]) – Experiment management configuration. Optionally, the dict can contain three keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’. The behavior of setting these keys is as follows:

      ◦ If ‘ExperimentName’ is supplied but ‘TrialName’ is not, a Trial will be automatically created and the job’s Trial Component associated with the Trial.

      ◦ If ‘TrialName’ is supplied and the Trial already exists, the job’s Trial Component will be associated with the Trial.

      ◦ If neither ‘ExperimentName’ nor ‘TrialName’ is supplied, the Trial Component will be unassociated.

      ◦ ‘TrialComponentDisplayName’ is used for display in Studio.

  • kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).

Raises

ValueError – if logs is True but wait is False.

class sagemaker.processing.ScriptProcessor(role, image_uri, command, instance_count, instance_type, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)

Bases: sagemaker.processing.Processor

Handles Amazon SageMaker processing tasks for jobs using a machine learning framework.

Initializes a ScriptProcessor instance.

The ScriptProcessor handles Amazon SageMaker Processing tasks for jobs using a machine learning framework, which allows for providing a script to be run as part of the Processing Job.

Parameters
  • role (str) – An AWS IAM role name or ARN. Amazon SageMaker Processing uses this role to access AWS resources, such as data stored in Amazon S3.

  • image_uri (str) – The URI of the Docker image to use for the processing jobs.

  • command ([str]) – The command to run, along with any command-line flags. Example: [“python3”, “-v”].

  • instance_count (int) – The number of instances to run a processing job with.

  • instance_type (str) – The type of EC2 instance to use for processing, for example, ‘ml.c4.xlarge’.

  • volume_size_in_gb (int) – Size in GB of the EBS volume to use for storing data during processing (default: 30).

  • volume_kms_key (str) – A KMS key for the processing volume (default: None).

  • output_kms_key (str) – The KMS key ID for processing job outputs (default: None).

  • max_runtime_in_seconds (int) – Timeout in seconds (default: None). After this amount of time, Amazon SageMaker terminates the job, regardless of its current status. If max_runtime_in_seconds is not specified, the default value is 24 hours.

  • base_job_name (str) – Prefix for the processing job name. If not specified, the processor generates a default job name, based on the processing image name and current timestamp.

  • sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.

  • env (dict[str, str]) – Environment variables to be passed to the processing jobs (default: None).

  • tags (list[dict]) – List of tags to be passed to the processing job (default: None). For more, see https://docs.aws.amazon.com/sagemaker/latest/dg/API_Tag.html.

  • network_config (NetworkConfig) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets.

get_run_args(code, inputs=None, outputs=None, arguments=None)

Returns a RunArgs object.

For processors (PySparkProcessor, SparkJarProcessor) that have special run() arguments, this object contains the normalized arguments for passing to ProcessingStep.

Parameters
  • code (str) – This can be an S3 URI or a local path to a file with the framework script to run.

  • inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).

  • outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).

  • arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).

run(code, inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None, kms_key=None)

Runs a processing job.

Parameters
  • code (str) – This can be an S3 URI or a local path to a file with the framework script to run.

  • inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).

  • outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).

  • arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).

  • wait (bool) – Whether the call should wait until the job completes (default: True).

  • logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).

  • job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.

  • experiment_config (dict[str, str]) – Experiment management configuration. Optionally, the dict can contain three keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’. The behavior of setting these keys is as follows:

      ◦ If ‘ExperimentName’ is supplied but ‘TrialName’ is not, a Trial will be automatically created and the job’s Trial Component associated with the Trial.

      ◦ If ‘TrialName’ is supplied and the Trial already exists, the job’s Trial Component will be associated with the Trial.

      ◦ If neither ‘ExperimentName’ nor ‘TrialName’ is supplied, the Trial Component will be unassociated.

      ◦ ‘TrialComponentDisplayName’ is used for display in Studio.

class sagemaker.processing.ProcessingJob(sagemaker_session, job_name, inputs, outputs, output_kms_key=None)

Bases: sagemaker.job._Job

Provides functionality to start, describe, and stop processing jobs.

Initializes a Processing job.

Parameters
  • sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.

  • job_name (str) – Name of the Processing job.

  • inputs (list[ProcessingInput]) – A list of ProcessingInput objects.

  • outputs (list[ProcessingOutput]) – A list of ProcessingOutput objects.

  • output_kms_key (str) – The output KMS key associated with the job (default: None).

classmethod start_new(processor, inputs, outputs, experiment_config)

Starts a new processing job using the provided inputs and outputs.

Parameters
  • processor (Processor) – The Processor instance that started the job.

  • inputs (list[ProcessingInput]) – A list of ProcessingInput objects.

  • outputs (list[ProcessingOutput]) – A list of ProcessingOutput objects.

  • experiment_config (dict[str, str]) – Experiment management configuration. Optionally, the dict can contain three keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’. The behavior of setting these keys is as follows:

      ◦ If ‘ExperimentName’ is supplied but ‘TrialName’ is not, a Trial will be automatically created and the job’s Trial Component associated with the Trial.

      ◦ If ‘TrialName’ is supplied and the Trial already exists, the job’s Trial Component will be associated with the Trial.

      ◦ If neither ‘ExperimentName’ nor ‘TrialName’ is supplied, the Trial Component will be unassociated.

      ◦ ‘TrialComponentDisplayName’ is used for display in Studio.

Returns

The instance of ProcessingJob created using the Processor.

Return type

ProcessingJob

classmethod from_processing_name(sagemaker_session, processing_job_name)

Initializes a ProcessingJob from a processing job name.

Parameters
  • processing_job_name (str) – Name of the processing job.

  • sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.

Returns

The instance of ProcessingJob created from the job name.

Return type

ProcessingJob

classmethod from_processing_arn(sagemaker_session, processing_job_arn)

Initializes a ProcessingJob from a Processing ARN.

Parameters
  • processing_job_arn (str) – ARN of the processing job.

  • sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.

Returns

The instance of ProcessingJob created from the processing job’s ARN.

Return type

ProcessingJob

wait(logs=True)

Waits for the processing job to complete.

Parameters

logs (bool) – Whether to show the logs produced by the job (default: True).

describe()

Prints out a response from the DescribeProcessingJob API call.

stop()

Stops the processing job.

static prepare_app_specification(container_arguments, container_entrypoint, image_uri)

Prepares a dict that represents a ProcessingJob’s AppSpecification.

Parameters
  • container_arguments (list[str]) – The arguments for a container used to run a processing job.

  • container_entrypoint (list[str]) – The entrypoint for a container used to run a processing job.

  • image_uri (str) – The container image to be run by the processing job.

Returns

Represents AppSpecification which configures the processing job to run a specified Docker container image.

Return type

dict
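For reference, the returned dict mirrors the AppSpecification structure of the CreateProcessingJob API; a plain-Python sketch of the expected shape (all values hypothetical):

```python
# Shape of the AppSpecification dict built by prepare_app_specification,
# following the CreateProcessingJob API; values here are hypothetical.
app_specification = {
    "ImageUri": "111122223333.dkr.ecr.us-east-1.amazonaws.com/my-processing:latest",
    "ContainerEntrypoint": ["python3", "/opt/ml/code/process.py"],
    "ContainerArguments": ["--mode", "train"],
}
```

The ContainerEntrypoint and ContainerArguments keys are only present when an entrypoint or arguments were supplied.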

static prepare_output_config(kms_key_id, outputs)

Prepares a dict that represents a ProcessingOutputConfig.

Parameters
  • kms_key_id (str) – The AWS Key Management Service (AWS KMS) key that Amazon SageMaker uses to encrypt the processing job output. KmsKeyId can be an ID of a KMS key, the ARN of a KMS key, a KMS key alias, or the ARN of a KMS key alias. The KmsKeyId is applied to all outputs.

  • outputs (list[dict]) – Output configuration information for a processing job.

Returns

Represents output configuration for the processing job.

Return type

dict

static prepare_processing_resources(instance_count, instance_type, volume_kms_key_id, volume_size_in_gb)

Prepares a dict that represents the ProcessingResources.

Parameters
  • instance_count (int) – The number of ML compute instances to use in the processing job. For distributed processing jobs, specify a value greater than 1. The default value is 1.

  • instance_type (str) – The ML compute instance type for the processing job.

  • volume_kms_key_id (str) – The AWS Key Management Service (AWS KMS) key that Amazon SageMaker uses to encrypt data on the storage volume attached to the ML compute instance(s) that run the processing job.

  • volume_size_in_gb (int) – The size of the ML storage volume in gigabytes that you want to provision. You must specify sufficient ML storage for your scenario.

Returns

Represents ProcessingResources which identifies the resources, ML compute instances, and ML storage volumes to deploy for a processing job.

Return type

dict

static prepare_stopping_condition(max_runtime_in_seconds)

Prepares a dict that represents the job’s StoppingCondition.

Parameters

max_runtime_in_seconds (int) – Specifies the maximum runtime in seconds.

Returns

Represents the job’s StoppingCondition.

Return type

dict

class sagemaker.processing.ProcessingInput(source=None, destination=None, input_name=None, s3_data_type='S3Prefix', s3_input_mode='File', s3_data_distribution_type='FullyReplicated', s3_compression_type='None', s3_input=None, dataset_definition=None, app_managed=False)

Bases: object

Accepts parameters that specify an Amazon S3 input for a processing job.

Also provides a method to turn those parameters into a dictionary.

Initializes a ProcessingInput instance.

ProcessingInput accepts parameters that specify an Amazon S3 input for a processing job and provides a method to turn those parameters into a dictionary.

Parameters
  • source (str) – The source for the input. If a local path is provided, it will automatically be uploaded to S3 under: “s3://<default-bucket-name>/<job-name>/input/<input-name>”.

  • destination (str) – The destination of the input.

  • input_name (str) – The name for the input. If a name is not provided, one will be generated (eg. “input-1”).

  • s3_data_type (str) – Valid options are “ManifestFile” or “S3Prefix”.

  • s3_input_mode (str) – Valid options are “Pipe” or “File”.

  • s3_data_distribution_type (str) – Valid options are “FullyReplicated” or “ShardedByS3Key”.

  • s3_compression_type (str) – Valid options are “None” or “Gzip”.

  • s3_input (S3Input) – Metadata of data objects stored in S3.

  • dataset_definition (DatasetDefinition) – DatasetDefinition input.

  • app_managed (bool) – Whether the input is managed by SageMaker or by the application (default: False).

class sagemaker.processing.ProcessingOutput(source=None, destination=None, output_name=None, s3_upload_mode='EndOfJob', app_managed=False, feature_store_output=None)

Bases: object

Accepts parameters that specify an Amazon S3 output for a processing job.

It also provides a method to turn those parameters into a dictionary.

Initializes a ProcessingOutput instance.

ProcessingOutput accepts parameters that specify an Amazon S3 output for a processing job and provides a method to turn those parameters into a dictionary.

Parameters
  • source (str) – The source for the output.

  • destination (str) – The destination of the output. If a destination is not provided, one will be generated: “s3://<default-bucket-name>/<job-name>/output/<output-name>”.

  • output_name (str) – The name of the output. If a name is not provided, one will be generated (eg. “output-1”).

  • s3_upload_mode (str) – Valid options are “EndOfJob” or “Continuous”.

  • app_managed (bool) – Whether the output is managed by SageMaker or by the application (default: False).

  • feature_store_output (FeatureStoreOutput) – Configuration for processing job outputs of FeatureStore.

class sagemaker.processing.RunArgs(code, inputs=None, outputs=None, arguments=None)

Bases: object

Accepts parameters that correspond to ScriptProcessors.

An instance of this class is returned from the get_run_args() method on processors, and is used for normalizing the arguments so that they can be passed to ProcessingStep.

Parameters
  • code (str) – This can be an S3 URI or a local path to a file with the framework script to run.

  • inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).

  • outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).

  • arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).

Method generated by attrs for class RunArgs.

class sagemaker.processing.FeatureStoreOutput(**kwargs)

Bases: sagemaker.apiutils._base_types.ApiObject

Configuration for processing job outputs in Amazon SageMaker Feature Store.

Init ApiObject.

feature_group_name = None

class sagemaker.processing.FrameworkProcessor(estimator_cls, framework_version, role, instance_count, instance_type, py_version='py3', image_uri=None, command=None, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, code_location=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)

Bases: sagemaker.processing.ScriptProcessor

Handles Amazon SageMaker processing tasks for jobs using a machine learning framework.

Initializes a FrameworkProcessor instance.

The FrameworkProcessor handles Amazon SageMaker Processing tasks for jobs using a machine learning framework, which allows for a set of Python scripts to be run as part of the Processing Job.

Parameters
  • estimator_cls (type) – A subclass of the Framework estimator

  • framework_version (str) – The version of the framework. Value is ignored when image_uri is provided.

  • role (str) – An AWS IAM role name or ARN. Amazon SageMaker Processing uses this role to access AWS resources, such as data stored in Amazon S3.

  • instance_count (int) – The number of instances to run a processing job with.

  • instance_type (str) – The type of EC2 instance to use for processing, for example, ‘ml.c4.xlarge’.

  • py_version (str) – Python version you want to use for executing your model training code. One of ‘py2’ or ‘py3’. Defaults to ‘py3’. Value is ignored when image_uri is provided.

  • image_uri (str) – The URI of the Docker image to use for the processing jobs (default: None).

  • command ([str]) – The command to run, along with any command-line flags, to precede the code script. Example: [“python3”, “-v”]. If not provided, [“python”] will be chosen (default: None).

  • volume_size_in_gb (int) – Size in GB of the EBS volume to use for storing data during processing (default: 30).

  • volume_kms_key (str) – A KMS key for the processing volume (default: None).

  • output_kms_key (str) – The KMS key ID for processing job outputs (default: None).

  • code_location (str) – The S3 prefix URI where custom code will be uploaded (default: None). The code file uploaded to S3 is ‘code_location/job-name/source/sourcedir.tar.gz’. If not specified, the default code location is ‘s3://{sagemaker-default-bucket}’.

  • max_runtime_in_seconds (int) – Timeout in seconds (default: None). After this amount of time, Amazon SageMaker terminates the job, regardless of its current status. If max_runtime_in_seconds is not specified, the default value is 24 hours.

  • base_job_name (str) – Prefix for the processing job name. If not specified, the processor generates a default job name, based on the processing image name and current timestamp (default: None).

  • sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain (default: None).

  • env (dict[str, str]) – Environment variables to be passed to the processing jobs (default: None).

  • tags (list[dict]) – List of tags to be passed to the processing job (default: None). For more, see https://docs.aws.amazon.com/sagemaker/latest/dg/API_Tag.html.

  • network_config (NetworkConfig) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets (default: None).

framework_entrypoint_command = ['/bin/bash']

get_run_args(code, source_dir=None, dependencies=None, git_config=None, inputs=None, outputs=None, arguments=None, job_name=None)

Returns a RunArgs object.

This object contains the normalized inputs, outputs and arguments needed when using a FrameworkProcessor in a ProcessingStep.

Parameters
  • code (str) – This can be an S3 URI or a local path to a file with the framework script to run. See the code argument in sagemaker.processing.FrameworkProcessor.run().

  • source_dir (str) – Path (absolute, relative, or an S3 URI) to a directory with any other processing source code dependencies aside from the entrypoint file (default: None). See the source_dir argument in sagemaker.processing.FrameworkProcessor.run().

  • dependencies (list[str]) – A list of paths to directories (absolute or relative) with any additional libraries that will be exported to the container (default: []). See the dependencies argument in sagemaker.processing.FrameworkProcessor.run().

  • git_config (dict[str, str]) – Git configurations used for cloning files. See the git_config argument in sagemaker.processing.FrameworkProcessor.run().

  • inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).

  • outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).

  • arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).

  • job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.

run(code, source_dir=None, dependencies=None, git_config=None, inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None, kms_key=None)

Runs a processing job.

Parameters
  • code (str) – This can be an S3 URI or a local path to a file with the framework script to run, i.e. a path (absolute or relative) to the local Python source file to be executed as the entry point of the processing job. When code is an S3 URI, source_dir, dependencies, and git_config are ignored. If source_dir is specified, then code must point to a file located at the root of source_dir.

  • source_dir (str) – Path (absolute, relative, or an S3 URI) to a directory with any other processing source code dependencies aside from the entry point file (default: None). If source_dir is an S3 URI, it must point to a tar.gz file. The structure within this directory is preserved when processing on Amazon SageMaker.

  • dependencies (list[str]) – A list of paths to directories (absolute or relative) with any additional libraries that will be exported to the container (default: []). The library folders will be copied to SageMaker in the same folder where the entrypoint is copied. If ‘git_config’ is provided, ‘dependencies’ should be a list of relative locations to directories with any additional libraries needed in the Git repo (default: None).

  • git_config (dict[str, str]) –

    Git configurations used for cloning files, including repo, branch, commit, 2FA_enabled, username, password and token. The repo field is required; all other fields are optional. repo specifies the Git repository where your training script is stored. If you don’t provide branch, the default value ‘master’ is used. If you don’t provide commit, the latest commit in the specified branch is used.

    Example: the following config:

    >>> git_config = {'repo': 'https://github.com/aws/sagemaker-python-sdk.git',
    >>>               'branch': 'test-branch-git-config',
    >>>               'commit': '329bfcf884482002c05ff7f44f62599ebc9f445a'}

    results in cloning the repo specified in ‘repo’, checking out the ‘test-branch-git-config’ branch, and then checking out the specified commit.

    2FA_enabled, username, password and token are used for authentication. For GitHub (or other Git) accounts, set 2FA_enabled to ‘True’ if two-factor authentication is enabled for the account, otherwise set it to ‘False’. If you do not provide a value for 2FA_enabled, a default value of ‘False’ is used. CodeCommit does not support two-factor authentication, so do not provide “2FA_enabled” with CodeCommit repositories.

    For GitHub and other Git repos, when SSH URLs are provided, it doesn’t matter whether 2FA is enabled or disabled; you should either have no passphrase for the SSH key pairs, or have the ssh-agent configured so that you will not be prompted for SSH passphrase when you do ‘git clone’ command with SSH URLs. When HTTPS URLs are provided: if 2FA is disabled, then either token or username+password will be used for authentication if provided (token prioritized); if 2FA is enabled, only token will be used for authentication if provided. If required authentication info is not provided, python SDK will try to use local credentials storage to authenticate. If that fails either, an error message will be thrown.

    For CodeCommit repos, 2FA is not supported, so ‘2FA_enabled’ should not be provided. There is no token in CodeCommit, so ‘token’ should not be provided too. When ‘repo’ is an SSH URL, the requirements are the same as GitHub-like repos. When ‘repo’ is an HTTPS URL, username+password will be used for authentication if they are provided; otherwise, python SDK will try to use either CodeCommit credential helper or local credential storage for authentication.

  • inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).

  • outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).

  • arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).

  • wait (bool) – Whether the call should wait until the job completes (default: True).

  • logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).

  • job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.

  • experiment_config (dict[str, str]) – Experiment management configuration. Optionally, the dict can contain three keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’. The behavior of setting these keys is as follows:

      ◦ If ‘ExperimentName’ is supplied but ‘TrialName’ is not, a Trial will be automatically created and the job’s Trial Component associated with the Trial.

      ◦ If ‘TrialName’ is supplied and the Trial already exists, the job’s Trial Component will be associated with the Trial.

      ◦ If neither ‘ExperimentName’ nor ‘TrialName’ is supplied, the Trial Component will be unassociated.

      ◦ ‘TrialComponentDisplayName’ is used for display in Studio.

  • kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).

This module is the entry point for running Spark processing scripts.

This module contains code related to Spark Processors, which are used for Processing jobs. These jobs let customers perform data pre-processing, post-processing, feature engineering, data validation, and model evaluation on SageMaker using Spark and PySpark.

class sagemaker.spark.processing.PySparkProcessor(role, instance_type, instance_count, framework_version=None, py_version=None, container_version=None, image_uri=None, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)

Bases: sagemaker.spark.processing._SparkProcessorBase

Handles Amazon SageMaker processing tasks for jobs using PySpark.

Initializes a PySparkProcessor instance.

The PySparkProcessor handles Amazon SageMaker processing tasks for jobs using SageMaker PySpark.

Parameters
  • framework_version (str) – The version of SageMaker PySpark.

  • py_version (str) – The version of Python.

  • container_version (str) – The version of the Spark container.

  • role (str) – An AWS IAM role name or ARN. Amazon SageMaker Processing uses this role to access AWS resources, such as data stored in Amazon S3.

  • instance_type (str) – Type of EC2 instance to use for processing, for example, ‘ml.c4.xlarge’.

  • instance_count (int) – The number of instances to run the Processing job with. Defaults to 1.

  • volume_size_in_gb (int) – Size in GB of the EBS volume to use for storing data during processing (default: 30).

  • volume_kms_key (str) – A KMS key for the processing volume.

  • output_kms_key (str) – The KMS key ID for all ProcessingOutputs.

  • max_runtime_in_seconds (int) – Timeout in seconds. After this amount of time, Amazon SageMaker terminates the job, regardless of its current status.

  • base_job_name (str) – Prefix for the processing job name. If not specified, the processor generates a default job name, based on the processing image name and current timestamp.

  • sagemaker_session (sagemaker.session.Session) – Session object which manages interactions with Amazon SageMaker APIs and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.

  • env (dict) – Environment variables to be passed to the processing job.

  • tags ([dict]) – List of tags to be passed to the processing job.

  • network_config (sagemaker.network.NetworkConfig) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets.

get_run_args(submit_app, submit_py_files=None, submit_jars=None, submit_files=None, inputs=None, outputs=None, arguments=None, job_name=None, configuration=None, spark_event_logs_s3_uri=None)

Returns a RunArgs object.

This object contains the normalized inputs, outputs and arguments needed when using a PySparkProcessor in a ProcessingStep.

Parameters
  • submit_app (str) – Path (local or S3) to Python file to submit to Spark as the primary application. This is translated to the code property on the returned RunArgs object.

  • submit_py_files (list[str]) – List of paths (local or S3) to provide for spark-submit --py-files option

  • submit_jars (list[str]) – List of paths (local or S3) to provide for spark-submit --jars option

  • submit_files (list[str]) – List of paths (local or S3) to provide for spark-submit --files option

  • inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).

  • outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).

  • arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).

  • job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.

  • configuration (list[dict] or dict) – Configuration for Hadoop, Spark, or Hive. List or dictionary of EMR-style classifications. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html

  • spark_event_logs_s3_uri (str) – S3 path to which Spark application events will be published.
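
The EMR-style ‘configuration’ accepted here is a list of classification dicts (or a single dict). A minimal sketch of one such classification, with a small helper (hypothetical, for illustration only) that lists the classification names:

```python
# An EMR-style classification list, as accepted by the `configuration`
# parameter; the property values shown here are illustrative.
configuration = [
    {
        "Classification": "spark-defaults",
        "Properties": {
            "spark.executor.memory": "2g",
            "spark.executor.cores": "2",
        },
    }
]

def classification_names(config):
    """Return the classification names from an EMR-style configuration list."""
    return [entry["Classification"] for entry in config]

print(classification_names(configuration))  # ['spark-defaults']
```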

run(submit_app, submit_py_files=None, submit_jars=None, submit_files=None, inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None, configuration=None, spark_event_logs_s3_uri=None, kms_key=None)

Runs a processing job.

Parameters
  • submit_app (str) – Path (local or S3) to Python file to submit to Spark as the primary application

  • submit_py_files (list[str]) – List of paths (local or S3) to provide for spark-submit --py-files option

  • submit_jars (list[str]) – List of paths (local or S3) to provide for spark-submit --jars option

  • submit_files (list[str]) – List of paths (local or S3) to provide for spark-submit --files option

  • inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).

  • outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).

  • arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).

  • wait (bool) – Whether the call should wait until the job completes (default: True).

  • logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).

  • job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.

  • experiment_config (dict[str, str]) – Experiment management configuration. Optionally, the dict can contain three keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’. The behavior of setting these keys is as follows: if ‘ExperimentName’ is supplied but ‘TrialName’ is not, a Trial will be created automatically and the job’s Trial Component will be associated with it; if ‘TrialName’ is supplied and the Trial already exists, the job’s Trial Component will be associated with that Trial; if neither ‘ExperimentName’ nor ‘TrialName’ is supplied, the Trial Component will remain unassociated; ‘TrialComponentDisplayName’ is used for display in Studio.

  • configuration (list[dict] or dict) – Configuration for Hadoop, Spark, or Hive. List or dictionary of EMR-style classifications. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html

  • spark_event_logs_s3_uri (str) – S3 path to which Spark application events will be published.

  • kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).
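
A sketch of calling run on a PySparkProcessor; the role ARN and S3 URIs are illustrative placeholders:

```python
from sagemaker.spark.processing import PySparkProcessor

# Placeholders throughout: role ARN and S3 URIs are illustrative.
spark_processor = PySparkProcessor(
    base_job_name="sm-spark",
    framework_version="3.1",
    role="arn:aws:iam::111122223333:role/MySageMakerRole",
    instance_count=2,
    instance_type="ml.c4.xlarge",
)

spark_processor.run(
    submit_app="s3://my-bucket/code/preprocess.py",
    arguments=["--input", "s3://my-bucket/raw", "--output", "s3://my-bucket/clean"],
    spark_event_logs_s3_uri="s3://my-bucket/spark-event-logs",
    wait=True,
    logs=True,
)
```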

class sagemaker.spark.processing.SparkJarProcessor(role, instance_type, instance_count, framework_version=None, py_version=None, container_version=None, image_uri=None, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)

Bases: sagemaker.spark.processing._SparkProcessorBase

Handles Amazon SageMaker processing tasks for jobs using Spark with Java or Scala Jars.

Initialize a SparkJarProcessor instance.

The SparkJarProcessor handles Amazon SageMaker processing tasks for jobs using SageMaker Spark.

Parameters
  • framework_version (str) – The version of SageMaker PySpark.

  • py_version (str) – The version of Python.

  • container_version (str) – The version of the Spark container.

  • role (str) – An AWS IAM role name or ARN. Amazon SageMaker Processing uses this role to access AWS resources, such as data stored in Amazon S3.

  • instance_type (str) – Type of EC2 instance to use for processing, for example, ‘ml.c4.xlarge’.

  • instance_count (int) – The number of instances to run the Processing job with. Defaults to 1.

  • volume_size_in_gb (int) – Size in GB of the EBS volume to use for storing data during processing (default: 30).

  • volume_kms_key (str) – A KMS key for the processing volume (default: None).

  • output_kms_key (str) – The KMS key ID for all ProcessingOutputs (default: None).

  • max_runtime_in_seconds (int) – Timeout in seconds. After this amount of time, Amazon SageMaker terminates the job, regardless of its current status.

  • base_job_name (str) – Prefix for the processing job name. If not specified, the processor generates a default job name, based on the processing image name and current timestamp.

  • sagemaker_session (sagemaker.session.Session) – Session object which manages interactions with Amazon SageMaker APIs and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.

  • env (dict) – Environment variables to be passed to the processing job.

  • tags ([dict]) – List of tags to be passed to the processing job.

  • network_config (sagemaker.network.NetworkConfig) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets.

get_run_args(submit_app, submit_class=None, submit_jars=None, submit_files=None, inputs=None, outputs=None, arguments=None, job_name=None, configuration=None, spark_event_logs_s3_uri=None)

Returns a RunArgs object.

This object contains the normalized inputs, outputs and arguments needed when using a SparkJarProcessor in a ProcessingStep.

Parameters
  • submit_app (str) – Path (local or S3) to Jar file to submit to Spark as the primary application. This is translated to the code property on the returned RunArgs object

  • submit_class (str) – Java class reference to submit to Spark as the primary application

  • submit_jars (list[str]) – List of paths (local or S3) to provide for spark-submit --jars option

  • submit_files (list[str]) – List of paths (local or S3) to provide for spark-submit --files option

  • inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).

  • outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).

  • arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).

  • job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.

  • configuration (list[dict] or dict) – Configuration for Hadoop, Spark, or Hive. List or dictionary of EMR-style classifications. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html

  • spark_event_logs_s3_uri (str) – S3 path to which Spark application events will be published.

run(submit_app, submit_class=None, submit_jars=None, submit_files=None, inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None, configuration=None, spark_event_logs_s3_uri=None, kms_key=None)

Runs a processing job.

Parameters
  • submit_app (str) – Path (local or S3) to Jar file to submit to Spark as the primary application

  • submit_class (str) – Java class reference to submit to Spark as the primary application

  • submit_jars (list[str]) – List of paths (local or S3) to provide for spark-submit --jars option

  • submit_files (list[str]) – List of paths (local or S3) to provide for spark-submit --files option

  • inputs (list[ProcessingInput]) – Input files for the processing job. These must be provided as ProcessingInput objects (default: None).

  • outputs (list[ProcessingOutput]) – Outputs for the processing job. These can be specified as either path strings or ProcessingOutput objects (default: None).

  • arguments (list[str]) – A list of string arguments to be passed to a processing job (default: None).

  • wait (bool) – Whether the call should wait until the job completes (default: True).

  • logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).

  • job_name (str) – Processing job name. If not specified, the processor generates a default job name, based on the base job name and current timestamp.

  • experiment_config (dict[str, str]) – Experiment management configuration. Optionally, the dict can contain three keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’. The behavior of setting these keys is as follows: if ‘ExperimentName’ is supplied but ‘TrialName’ is not, a Trial will be created automatically and the job’s Trial Component will be associated with it; if ‘TrialName’ is supplied and the Trial already exists, the job’s Trial Component will be associated with that Trial; if neither ‘ExperimentName’ nor ‘TrialName’ is supplied, the Trial Component will remain unassociated; ‘TrialComponentDisplayName’ is used for display in Studio.

  • configuration (list[dict] or dict) – Configuration for Hadoop, Spark, or Hive. List or dictionary of EMR-style classifications. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html

  • spark_event_logs_s3_uri (str) – S3 path to which Spark application events will be published.

  • kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).
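
A sketch of a Spark jar job; the role ARN, jar location, and main class name are illustrative placeholders:

```python
from sagemaker.spark.processing import SparkJarProcessor

# The role ARN, jar location, and class name are illustrative placeholders.
jar_processor = SparkJarProcessor(
    base_job_name="sm-spark-java",
    framework_version="3.1",
    role="arn:aws:iam::111122223333:role/MySageMakerRole",
    instance_count=2,
    instance_type="ml.c4.xlarge",
)

jar_processor.run(
    submit_app="s3://my-bucket/jars/my-spark-app.jar",
    submit_class="com.example.MySparkApp",
    arguments=["--input", "s3://my-bucket/raw"],
)
```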

class sagemaker.spark.processing.FileType(value)

Bases: enum.Enum

Enum of file types.

JAR = 1
PYTHON = 2
FILE = 3

This module configures the SageMaker Clarify bias and model explainability processor job.

class sagemaker.clarify.DataConfig(s3_data_input_path, s3_output_path, label=None, headers=None, features=None, dataset_type='text/csv', s3_data_distribution_type='FullyReplicated', s3_compression_type='None')

Bases: object

Config object related to configurations of the input and output dataset.

Initializes a configuration of both input and output datasets.

Parameters
  • s3_data_input_path (str) – Dataset S3 prefix/object URI.

  • s3_output_path (str) – S3 prefix to store the output.

  • label (str) – Target attribute of the model required by bias metrics (optional for SHAP). Specified as column name or index for CSV datasets, or as JSONPath for JSONLines.

  • headers (list[str]) – A list of column names in the input dataset.

  • features (str) – JSONPath for locating the feature columns for bias metrics if the dataset format is JSONLines.

  • dataset_type (str) – Format of the dataset. Valid values are “text/csv” for CSV, “application/jsonlines” for JSONLines, and “application/x-parquet” for Parquet.

  • s3_data_distribution_type (str) – Valid options are “FullyReplicated” or “ShardedByS3Key”.

  • s3_compression_type (str) – Valid options are “None” or “Gzip”.
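
A minimal DataConfig sketch for a CSV dataset; the S3 paths and column names are illustrative placeholders:

```python
from sagemaker.clarify import DataConfig

# S3 paths and column names are illustrative placeholders.
data_config = DataConfig(
    s3_data_input_path="s3://my-bucket/clarify/train.csv",
    s3_output_path="s3://my-bucket/clarify/output",
    label="target",
    headers=["age", "income", "gender", "target"],
    dataset_type="text/csv",
)
```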

get_config()

Returns part of an analysis config dictionary.

class sagemaker.clarify.BiasConfig(label_values_or_threshold, facet_name, facet_values_or_threshold=None, group_name=None)

Bases: object

Config object related to bias configurations of the input dataset.

Initializes a configuration of the sensitive groups in the dataset.

Parameters
  • label_values_or_threshold (Any) – List of label values or threshold to indicate positive outcome used for bias metrics.

  • facet_name (str or [str]) – String or list of strings of sensitive attribute(s) in the input data for which we would like to compare metrics.

  • facet_values_or_threshold (list) – Optional list of values to form a sensitive group or threshold for a numeric facet column that defines the lower bound of a sensitive group. Defaults to considering each possible value as sensitive group and computing metrics vs all the other examples. If facet_name is a list, this needs to be None or a List consisting of lists or None with the same length as facet_name list.

  • group_name (str) – Optional column name or index to indicate a group column to be used for the bias metric ‘Conditional Demographic Disparity in Labels - CDDL’ or ‘Conditional Demographic Disparity in Predicted Labels - CDDPL’.
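
A BiasConfig sketch for a single sensitive attribute; the column names and values are illustrative placeholders:

```python
from sagemaker.clarify import BiasConfig

# Column names and values are illustrative placeholders.
bias_config = BiasConfig(
    label_values_or_threshold=[1],         # label value indicating a positive outcome
    facet_name="gender",                   # sensitive attribute to analyze
    facet_values_or_threshold=["female"],  # values forming the sensitive group
    group_name="age",                      # group column used for CDDL/CDDPL
)
```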

get_config()

Returns part of an analysis config dictionary.

class sagemaker.clarify.ModelConfig(model_name, instance_count, instance_type, accept_type=None, content_type=None, content_template=None, custom_attributes=None, accelerator_type=None, endpoint_name_prefix=None)

Bases: object

Config object related to a model and its endpoint to be created.

Initializes a configuration of a model and the endpoint to be created for it.

Parameters
  • model_name (str) – Model name (as created by ‘CreateModel’).

  • instance_count (int) – The number of instances of a new endpoint for model inference.

  • instance_type (str) – The type of EC2 instance to use for model inference, for example, ‘ml.c5.xlarge’.

  • accept_type (str) – The model output format to be used for getting inferences with the shadow endpoint. Valid values are “text/csv” for CSV and “application/jsonlines”. Default is the same as content_type.

  • content_type (str) – The model input format to be used for getting inferences with the shadow endpoint. Valid values are “text/csv” for CSV and “application/jsonlines”. Default is the same as dataset format.

  • content_template (str) – A template string used to construct the model input from dataset instances. It is only used when “model_content_type” is “application/jsonlines”. The template should have one and only one placeholder, $features, which will be replaced by a features list to form the model inference input.

  • custom_attributes (str) – Provides additional information about a request for an inference submitted to a model hosted at an Amazon SageMaker endpoint. The information is an opaque value that is forwarded verbatim. You could use this value, for example, to provide an ID that you can use to track a request or to provide other metadata that a service endpoint was programmed to process. The value must consist of no more than 1024 visible US-ASCII characters as specified in Section 3.2.6 (Field Value Components) of the Hypertext Transfer Protocol (HTTP/1.1): https://tools.ietf.org/html/rfc7230#section-3.2.6.

  • accelerator_type (str) – The Elastic Inference accelerator type to deploy to the model endpoint instance for making inferences to the model, see https://docs.aws.amazon.com/sagemaker/latest/dg/ei.html.

  • endpoint_name_prefix (str) – The endpoint name prefix of a new endpoint. Must follow the pattern “^[a-zA-Z0-9](-*[a-zA-Z0-9])*”.
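
A ModelConfig sketch; the model name below is an illustrative placeholder for a model previously created via ‘CreateModel’:

```python
from sagemaker.clarify import ModelConfig

# The model name is an illustrative placeholder.
model_config = ModelConfig(
    model_name="my-model",
    instance_count=1,
    instance_type="ml.c5.xlarge",
    content_type="text/csv",
    accept_type="text/csv",
)
```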

get_predictor_config()

Returns part of the predictor dictionary of the analysis config.

class sagemaker.clarify.ModelPredictedLabelConfig(label=None, probability=None, probability_threshold=None, label_headers=None)

Bases: object

Config object to extract a predicted label from the model output.

Initializes a model output config to extract the predicted label or predicted score(s).

The following examples show different parameter configurations depending on the endpoint:
  • Regression task: The model returns a score, e.g. 1.2; nothing needs to be specified. For JSON output, e.g. {‘score’: 1.2}, we can set label=’score’.

  • Binary classification:
    • The model returns a single probability and we would like to classify as ‘yes’ those with a probability exceeding 0.2. We can set probability_threshold=0.2 and label_headers=’yes’.

    • The model returns {‘probability’: 0.3}, to which we would like to apply a threshold of 0.5 to obtain a predicted label in {0, 1}. In this case we can set label=’probability’.

    • The model returns a tuple of the predicted label and the probability. In this case we can set label=0.

  • Multiclass classification:
    • The model returns {‘labels’: [‘cat’, ‘dog’, ‘fish’], ‘probabilities’: [0.35, 0.25, 0.4]}. In this case we would set probability=’probabilities’ and label=’labels’ and infer the predicted label to be ‘fish’.

    • The model returns {‘predicted_label’: ‘fish’, ‘probabilities’: [0.35, 0.25, 0.4]}. In this case we would set label=’predicted_label’.

    • The model returns [0.35, 0.25, 0.4]. In this case, we can set label_headers=[‘cat’, ‘dog’, ‘fish’] and infer the predicted label to be ‘fish’.

Parameters
  • label (str or int) – Index or JSONPath location in the model output for the prediction. If this is a predicted label of the same type as the label in the dataset, no further arguments need to be specified.

  • probability (str or int) – Index or JSONPath location in the model output for the predicted score(s).

  • probability_threshold (float) – An optional value for binary prediction tasks in which the model returns a probability, to indicate the threshold to convert the prediction to a boolean value. Default is 0.5.

  • label_headers (list) – List of label values - one for each score of the probability.
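
For the binary-classification case above, where the model returns a single probability and a custom threshold should separate the classes, a sketch:

```python
from sagemaker.clarify import ModelPredictedLabelConfig

# The model returns a single probability; predictions above 0.8 count as positive.
predictions_config = ModelPredictedLabelConfig(probability_threshold=0.8)
```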

get_predictor_config()

Returns probability_threshold, predictor config.

class sagemaker.clarify.ExplainabilityConfig

Bases: abc.ABC

Abstract config class to configure an explainability method.

abstract get_explainability_config()

Returns config.

class sagemaker.clarify.SHAPConfig(baseline, num_samples, agg_method, use_logit=False, save_local_shap_values=True, seed=None)

Bases: sagemaker.clarify.ExplainabilityConfig

Config class of SHAP.

Initializes config for SHAP.

Parameters
  • baseline (None or str or list) – None, an S3 object URI, or a list of rows (at least one) to be used as the baseline dataset in the Kernel SHAP algorithm. The format should be the same as the dataset format. Each row should contain only the feature columns/values and omit the label column/values. If None, a baseline will be calculated automatically by using K-means or K-prototypes on the input dataset.

  • num_samples (int) – Number of samples to be used in the Kernel SHAP algorithm. This number determines the size of the generated synthetic dataset to compute the SHAP values.

  • agg_method (str) – Aggregation method for global SHAP values. Valid values are “mean_abs” (mean of absolute SHAP values for all instances), “median” (median of SHAP values for all instances) and “mean_sq” (mean of squared SHAP values for all instances).

  • use_logit (bool) – Indicator of whether the logit function is to be applied to the model predictions. Default is False. If “use_logit” is true then the SHAP values will have log-odds units.

  • save_local_shap_values (bool) – Indicator of whether to save the local SHAP values in the output location. Default is True.

  • seed (int) – Seed value to obtain deterministic SHAP values. Default is None.
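
A SHAPConfig sketch; the baseline row and sample count are illustrative placeholders:

```python
from sagemaker.clarify import SHAPConfig

# The baseline row values and num_samples are illustrative placeholders.
shap_config = SHAPConfig(
    baseline=[[35, 45000, 1]],  # one baseline row: feature columns only, no label
    num_samples=100,
    agg_method="mean_abs",
    save_local_shap_values=True,
)
```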

get_explainability_config()

Returns config.

class sagemaker.clarify.SageMakerClarifyProcessor(role, instance_count, instance_type, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, max_runtime_in_seconds=None, sagemaker_session=None, env=None, tags=None, network_config=None, job_name_prefix=None, version=None)

Bases: sagemaker.processing.Processor

Handles SageMaker Processing task to compute bias metrics and explain a model.

Initializes a Processor instance, computing bias metrics and model explanations.

Parameters
  • role (str) – An AWS IAM role name or ARN. Amazon SageMaker Processing uses this role to access AWS resources, such as data stored in Amazon S3.

  • instance_count (int) – The number of instances to run a processing job with.

  • instance_type (str) – The type of EC2 instance to use for processing, for example, ‘ml.c4.xlarge’.

  • volume_size_in_gb (int) – Size in GB of the EBS volume to use for storing data during processing (default: 30).

  • volume_kms_key (str) – A KMS key for the processing volume (default: None).

  • output_kms_key (str) – The KMS key ID for processing job outputs (default: None).

  • max_runtime_in_seconds (int) – Timeout in seconds (default: None). After this amount of time, Amazon SageMaker terminates the job, regardless of its current status. If max_runtime_in_seconds is not specified, the default value is 24 hours.

  • sagemaker_session (Session) – Session object which manages interactions with Amazon SageMaker and any other AWS services needed. If not specified, the processor creates one using the default AWS configuration chain.

  • env (dict[str, str]) – Environment variables to be passed to the processing jobs (default: None).

  • tags (list[dict]) – List of tags to be passed to the processing job (default: None). For more, see https://docs.aws.amazon.com/sagemaker/latest/dg/API_Tag.html.

  • network_config (NetworkConfig) – A NetworkConfig object that configures network isolation, encryption of inter-container traffic, security group IDs, and subnets.

  • job_name_prefix (str) – Processing job name prefix.

  • version (str) – The version of Clarify to be used.

run(**_)

Overriding the base class method but deferring to specific run_* methods.

run_pre_training_bias(data_config, data_bias_config, methods='all', wait=True, logs=True, job_name=None, kms_key=None, experiment_config=None)

Runs a ProcessingJob to compute the pre-training bias methods of the input data.

Computes the metrics requested in ‘methods’, which compare (for example) the fraction of examples in the sensitive group vs. the other examples.

Parameters
  • data_config (DataConfig) – Config of the input/output data.

  • data_bias_config (BiasConfig) – Config of sensitive groups.

  • methods (str or list[str]) – Selector of a subset of potential metrics: [“CI”, “DPL”, “KL”, “JS”, “LP”, “TVD”, “KS”, “CDDL”]. Defaults to computing all.

  • wait (bool) – Whether the call should wait until the job completes (default: True).

  • logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).

  • job_name (str) – Processing job name. When job_name is not specified, if job_name_prefix in SageMakerClarifyProcessor is specified, the job name will be composed of job_name_prefix and the current timestamp; otherwise ‘Clarify-Pretraining-Bias’ is used as the prefix.

  • kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).

  • experiment_config (dict[str, str]) – Experiment management configuration. Optionally, the dict can contain three keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’. The behavior of setting these keys is as follows: if ‘ExperimentName’ is supplied but ‘TrialName’ is not, a Trial will be created automatically and the job’s Trial Component will be associated with it; if ‘TrialName’ is supplied and the Trial already exists, the job’s Trial Component will be associated with that Trial; if neither ‘ExperimentName’ nor ‘TrialName’ is supplied, the Trial Component will remain unassociated; ‘TrialComponentDisplayName’ is used for display in Studio.
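
Putting the pieces together, an end-to-end sketch of a pre-training bias job; the role ARN, S3 paths, and column names are illustrative placeholders:

```python
from sagemaker.clarify import BiasConfig, DataConfig, SageMakerClarifyProcessor

# All identifiers below (role ARN, S3 paths, columns) are placeholders.
clarify_processor = SageMakerClarifyProcessor(
    role="arn:aws:iam::111122223333:role/MySageMakerRole",
    instance_count=1,
    instance_type="ml.c4.xlarge",
)

data_config = DataConfig(
    s3_data_input_path="s3://my-bucket/clarify/train.csv",
    s3_output_path="s3://my-bucket/clarify/bias-report",
    label="target",
    headers=["age", "gender", "target"],
    dataset_type="text/csv",
)

bias_config = BiasConfig(
    label_values_or_threshold=[1],
    facet_name="gender",
)

clarify_processor.run_pre_training_bias(
    data_config=data_config,
    data_bias_config=bias_config,
    methods=["CI", "DPL"],  # or 'all' to compute every metric
)
```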

run_post_training_bias(data_config, data_bias_config, model_config, model_predicted_label_config, methods='all', wait=True, logs=True, job_name=None, kms_key=None, experiment_config=None)

Runs a ProcessingJob to compute the post-training bias methods of the model predictions.

Spins up a model endpoint and runs inference over the input examples in the ‘s3_data_input_path’ to obtain predicted labels. Computes the metrics requested in ‘methods’, which compare (for example) accuracy, precision, and recall for the sensitive group vs. the other examples.

Parameters
  • data_config (DataConfig) – Config of the input/output data.

  • data_bias_config (BiasConfig) – Config of sensitive groups.

  • model_config (ModelConfig) – Config of the model and its endpoint to be created.

  • model_predicted_label_config (ModelPredictedLabelConfig) – Config of how to extract the predicted label from the model output.

  • methods (str or list[str]) – Selector of a subset of potential metrics: [“DPPL”, “DI”, “DCA”, “DCR”, “RD”, “DAR”, “DRR”, “AD”, “CDDPL”, “TE”, “FT”]. Defaults to computing all.

  • wait (bool) – Whether the call should wait until the job completes (default: True).

  • logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).

  • job_name (str) – Processing job name. When job_name is not specified, if job_name_prefix in SageMakerClarifyProcessor is specified, the job name will be composed of job_name_prefix and the current timestamp; otherwise ‘Clarify-Posttraining-Bias’ is used as the prefix.

  • kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).

  • experiment_config (dict[str, str]) – Experiment management configuration. Optionally, the dict can contain three keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’. The behavior of setting these keys is as follows: if ‘ExperimentName’ is supplied but ‘TrialName’ is not, a Trial will be created automatically and the job’s Trial Component will be associated with it; if ‘TrialName’ is supplied and the Trial already exists, the job’s Trial Component will be associated with that Trial; if neither ‘ExperimentName’ nor ‘TrialName’ is supplied, the Trial Component will remain unassociated; ‘TrialComponentDisplayName’ is used for display in Studio.

run_bias(data_config, bias_config, model_config, model_predicted_label_config=None, pre_training_methods='all', post_training_methods='all', wait=True, logs=True, job_name=None, kms_key=None, experiment_config=None)

Runs a ProcessingJob to compute the requested bias methods.

It computes the metrics for both the pre-training and the post-training methods. To calculate the post-training methods, it spins up a model endpoint and runs inference over the input examples in the ‘s3_data_input_path’ to obtain predicted labels.

Parameters
  • data_config (DataConfig) – Config of the input/output data.

  • bias_config (BiasConfig) – Config of sensitive groups.

  • model_config (ModelConfig) – Config of the model and its endpoint to be created.

  • model_predicted_label_config (ModelPredictedLabelConfig) – Config of how to extract the predicted label from the model output.

  • pre_training_methods (str or list[str]) –

    Selector of a subset of potential metrics: [“CI”, “DPL”, “KL”, “JS”, “LP”, “TVD”, “KS”, “CDDL”]. Defaults to computing all.

  • post_training_methods (str or list[str]) –

    Selector of a subset of potential metrics: [“DPPL”, “DI”, “DCA”, “DCR”, “RD”, “DAR”, “DRR”, “AD”, “CDDPL”, “TE”, “FT”]. Defaults to computing all.

  • wait (bool) – Whether the call should wait until the job completes (default: True).

  • logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).

  • job_name (str) – Processing job name. When job_name is not specified, if job_name_prefix in SageMakerClarifyProcessor is specified, the job name will be composed of job_name_prefix and the current timestamp; otherwise ‘Clarify-Bias’ is used as the prefix.

  • kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).

  • experiment_config (dict[str, str]) – Experiment management configuration. Optionally, the dict can contain three keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’. The behavior of setting these keys is as follows: if ‘ExperimentName’ is supplied but ‘TrialName’ is not, a Trial will be created automatically and the job’s Trial Component will be associated with it; if ‘TrialName’ is supplied and the Trial already exists, the job’s Trial Component will be associated with that Trial; if neither ‘ExperimentName’ nor ‘TrialName’ is supplied, the Trial Component will remain unassociated; ‘TrialComponentDisplayName’ is used for display in Studio.

run_explainability(data_config, model_config, explainability_config, model_scores=None, wait=True, logs=True, job_name=None, kms_key=None, experiment_config=None)

Runs a ProcessingJob that computes feature importance for each example in the input.

Currently, only SHAP is supported as the explainability method.

Spins up a model endpoint. For each input example in the ‘s3_data_input_path’, the SHAP algorithm determines feature importance by creating ‘num_samples’ copies of the example with a subset of features replaced with values from the ‘baseline’. Model inference is run to see how the prediction changes with the replaced features. If the model output returns multiple scores, importance is computed for each score. Across examples, feature importance is aggregated using ‘agg_method’.

Parameters
  • data_config (DataConfig) – Config of the input/output data.

  • model_config (ModelConfig) – Config of the model and its endpoint to be created.

  • explainability_config (ExplainabilityConfig) – Config of the specific explainability method. Currently, only SHAP is supported.

  • model_scores (str|int|ModelPredictedLabelConfig) – Index or JSONPath location in the model output for the predicted scores to be explained. This is not required if the model output is a single score. Alternatively, an instance of ModelPredictedLabelConfig can be provided.

  • wait (bool) – Whether the call should wait until the job completes (default: True).

  • logs (bool) – Whether to show the logs produced by the job. Only meaningful when wait is True (default: True).

  • job_name (str) – Processing job name. When job_name is not specified, if job_name_prefix in SageMakerClarifyProcessor is specified, the job name will be composed of job_name_prefix and the current timestamp; otherwise ‘Clarify-Explainability’ is used as the prefix.

  • kms_key (str) – The ARN of the KMS key that is used to encrypt the user code file (default: None).

  • experiment_config (dict[str, str]) – Experiment management configuration. Optionally, the dict can contain three keys: ‘ExperimentName’, ‘TrialName’, and ‘TrialComponentDisplayName’. The behavior of setting these keys is as follows: * If ExperimentName is supplied but TrialName is not a Trial will be automatically created and the job’s Trial Component associated with the Trial. * If TrialName is supplied and the Trial already exists the job’s Trial Component will be associated with the Trial. * If both ExperimentName and TrialName are not supplied the trial component will be unassociated. * TrialComponentDisplayName is used for display in Studio.
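
An end-to-end sketch of a SHAP explainability job; the role ARN, S3 paths, model name, and feature values are illustrative placeholders:

```python
from sagemaker.clarify import (
    DataConfig,
    ModelConfig,
    SageMakerClarifyProcessor,
    SHAPConfig,
)

# All identifiers below (role ARN, S3 paths, model name) are placeholders.
clarify_processor = SageMakerClarifyProcessor(
    role="arn:aws:iam::111122223333:role/MySageMakerRole",
    instance_count=1,
    instance_type="ml.c4.xlarge",
)

data_config = DataConfig(
    s3_data_input_path="s3://my-bucket/clarify/train.csv",
    s3_output_path="s3://my-bucket/clarify/explainability",
    headers=["age", "income"],
    dataset_type="text/csv",
)

model_config = ModelConfig(
    model_name="my-model",
    instance_count=1,
    instance_type="ml.c5.xlarge",
    content_type="text/csv",
    accept_type="text/csv",
)

shap_config = SHAPConfig(
    baseline=[[35, 45000]],  # one baseline row: feature columns only
    num_samples=100,
    agg_method="mean_abs",
)

clarify_processor.run_explainability(
    data_config=data_config,
    model_config=model_config,
    explainability_config=shap_config,
)
```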